Nowadays, there is data available for almost everything, and it can be utilized to answer various questions. Can clinical trials truly demonstrate the efficacy of a drug? Can surveys accurately predict the outcome of an election? Can a financial manager accurately forecast a winning portfolio?
However, research by the University of Bonn’s Joachim Freyberger and Björn Höppner (a PhD student there), Washington University in St. Louis’s Andreas Neuhierl, and Chicago Booth’s Michael Weber suggests that the results of predictive models can be significantly skewed by the adjustments made for missing information: dropouts from drug trials, unanswered survey questions, and incomplete corporate financial reports.
These researchers propose an enhanced method for handling missing data, which they have tested against two commonly used approaches in the practical application of data, specifically in predicting stock returns. The results indicate that their method consistently provides an advantage.
To compare the three methods, the researchers obtained a database of US stock and balance-sheet data from 1978 to 2021. The data set initially consisted of 2.4 million observations, or rows, each with 82 variables covering trading volume, accounting information, momentum indicators, and similar factors. But like many data sets, it was incomplete: not every row contained values for all 82 variables.
The first widely used method, known as the “complete cases” approach, eliminates every incomplete observation, even though this contradicts a fundamental rule of data analysis: “Do not discard data.” Any row missing any information had to go. For instance, if a stock lacked trading-volume data for a particular month, the complete cases method required discarding all the data collected for that stock in that month. Only 10 percent of the data survived this process, and most of the discarded rows were missing values for just five variables or fewer.
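As an illustration (not the researchers’ code), the complete-cases filter amounts to dropping any row with a missing value. A minimal sketch in pandas, using a made-up four-stock panel:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly stock panel; in practice there would be 82 variables
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "volume": [1.2e6, np.nan, 8.0e5, 3.1e6],        # BBB lacks trading volume
    "momentum": [0.05, 0.02, np.nan, -0.01],        # CCC lacks a momentum value
    "book_to_market": [0.8, 1.1, 0.9, np.nan],      # DDD lacks accounting data
})

# Complete cases: discard every row that has any missing variable
complete = df.dropna()
print(f"{len(df)} rows before, {len(complete)} rows after")  # 4 rows before, 1 rows after
```

Even though each incomplete row here is missing only one of three variables, three-quarters of the observations are thrown away, mirroring how the researchers’ 2.4 million rows shrank to about 10 percent of the original.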
The other well-known method, “mean imputation,” retains all observations but introduces biases. It replaces missing values with the average of all existing data points for a given variable and month in the data set. However, the missing data may include extreme values that could significantly impact prediction models. For example, consider a housing database where most high-end houses are sold by a realtor who never lists the square footage. If analysts replaced the missing data with the average square footage of all houses, their model’s predictions of market values would likely be underestimated and distorted.
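The housing example above can be sketched the same way. Here is a minimal mean-imputation example with invented numbers, in which the missing square footages belong to large high-end houses, so filling them with the column mean understates them badly:

```python
import numpy as np
import pandas as pd

# Hypothetical listings: the two NaN entries are large homes whose realtor
# never reports square footage (their true sizes might be 4000+ sq ft)
sqft = pd.Series([1200.0, 1500.0, np.nan, 1100.0, np.nan])

# Mean imputation: replace each missing value with the mean of observed values
imputed = sqft.fillna(sqft.mean())
print(imputed.tolist())
```

The imputed value is (1200 + 1500 + 1100) / 3 ≈ 1266.7 square feet for both missing homes. A price model trained on these figures would treat the high-end houses as mid-sized ones, biasing its valuations downward, which is exactly the distortion the researchers warn about.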