It gets better with time: the benefits and challenges of predictive data modeling

Eduardo Sanchez, M.D., M.P.H., Chief, Center for Health Metrics and Evaluation, American Heart Association; Yosef Khan, M.D., Ph.D., Director, Health Informatics and Analytics, Center for Health Metrics and Evaluation, American Heart Association

The myriad data that surround us strengthen our understanding of public health problems and inform our response strategies by offering insights into past, present and future trends of disease progression and outcomes. Sometimes, public health problems demand quick action (e.g., COVID-19), even in the face of limited data to inform a response. But as data accumulate over time, they help refine our understanding of the problem and our corresponding action strategies, to the extent that we accurately interpret and wisely apply the data.

A key method for predicting future disease trends and outcomes is predictive data modeling. In public health, a primary purpose of predictive data modeling is to help project potential outcomes of different policy and intervention strategies that are intended to curb the spread of a disease and its consequences. Uses of predictive data modeling go beyond health and include forecasting financial, economic and market risks; weather patterns; election results; consumer behavior; and sporting event outcomes.

Predictive data modeling and projections involve aggregating the appropriate data, applying the proper analytic approaches, establishing assumptions (e.g., for every 100 cases of COVID-19, one person will die) and developing “models” that integrate diverse data sources, assumptions and statistics to map potential future trends. This process relies on principles of data science but is also a skillful, iterative process of choosing the right data and applying correct and valid assumptions, methodologies, applications and interpretations. Each data set comes with its own assumptions, collection methods and potential biases. Consequently, outcome projections are only as good as the data and critical assumptions that are fed into the models, and in the early stages of a situation those inputs may be based on limited information.
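To make the role of an assumption concrete, the short Python sketch below applies the example assumption above (one death per 100 cases) to a case count and then varies it. The case count and the alternative ratios are hypothetical values chosen only to illustrate the arithmetic, and the function name `project_deaths` is ours; this is not how any particular forecasting group builds its model.

```python
def project_deaths(projected_cases, deaths_per_100_cases):
    """Project deaths from a case count under an assumed deaths-per-100-cases ratio."""
    return projected_cases * deaths_per_100_cases / 100


# Hypothetical inputs, for illustration only.
projected_cases = 50_000
assumed_ratios = (0.5, 1.0, 2.0)  # 1.0 matches the "one death per 100 cases" example

for ratio in assumed_ratios:
    deaths = project_deaths(projected_cases, ratio)
    print(f"{ratio} deaths per 100 cases -> {deaths:,.0f} projected deaths")
```

With the same case count, halving or doubling the assumed ratio halves or doubles the projection, which is why a small change in an early assumption can move a headline number so much.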

Models can be sensitive to rapidly changing data, so they need to be revised as data accumulate, data sources change and assumptions evolve based on new insights. A model that is not revised will be only marginally predictive or just plain wrong. Some have historically characterized data modeling and projections as simplified descriptions of complex systems and thus always “wrong,” yet acknowledge that they may provide useful approximations. Indeed, models can help compare the relative impact of different response options, but they should not claim to make 100% accurate predictions. Perhaps unrealistic expectations of models are the cause of the public’s annoyance when a model’s projections change.

Comparing a model’s projections with real-world experience and adjusting its assumptions and methods can improve the model’s predictive “accuracy.” Thus, repeated testing and evaluation to ensure that theoretical models fit our observations are important.
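The cycle of testing a model against observations and then adjusting its assumptions can be sketched in a few lines of Python. The weekly counts below are hypothetical, the starting ratio is the article's illustrative one death per 100 cases, and the helper name `updated_ratio` is ours. Re-estimating the ratio from cumulative counts is a deliberately naive stand-in for real recalibration: it ignores reporting delays and other biases that an actual model would need to handle.

```python
def updated_ratio(cumulative_cases, cumulative_deaths):
    """Re-estimate deaths per 100 cases from observed cumulative counts."""
    return 100 * cumulative_deaths / cumulative_cases


# Starting assumption: one death per 100 cases (the article's example).
ratio = 1.0

# Hypothetical (cumulative cases, cumulative deaths) at the end of each week.
observations = [(1_000, 20), (5_000, 90), (20_000, 300)]

for week, (cases, deaths) in enumerate(observations, start=1):
    projected = cases * ratio / 100   # what the current assumption predicts
    error = deaths - projected        # gap between observation and projection
    print(f"week {week}: projected {projected:.0f}, observed {deaths}, gap {error:+.0f}")
    ratio = updated_ratio(cases, deaths)  # adjust the assumption to fit the data

print(f"assumption after three weeks: {ratio} deaths per 100 cases")
```

Each pass compares the projection with what was observed and then revises the assumption, so the model's output changes from week to week by design rather than by failure.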

A critical and timely example of how data models and projections could be perceived as inaccurate is the COVID-19 pandemic, with multiple trackers and modelers using different methods and assumptions to predict a range of outcomes. Among the forecasters are the University of Washington’s Institute for Health Metrics and Evaluation (IHME), the University of Texas at Austin and the University of Geneva.

The IHME model was designed to be a planning tool for hospital administrators and government officials who need to anticipate the timing of the greatest demand on health system resources. The model’s outcomes include deaths, hospital bed occupancy, intensive care unit occupancy and ventilator use.

The IHME model’s initial mortality rate projections were much higher than its current projections. This is more likely a result of refinements in the model (based on collected data) than a failure of the model itself. The revised projections take into account real-world data, updated assumptions, behavioral changes, and the duration and impact of worldwide shelter-in-place policies and physical distancing practices on COVID-19 deaths. As more up-to-date, robust and diverse data are integrated, underlying assumptions can be adjusted, new assumptions can be added, and models can be modified, refined and enhanced. Such evolution is the norm for predictive data modeling.

Predictive data modeling has applications for the American Heart Association: broadly, in realizing its goals to extend healthy life years, and more narrowly, in achieving science-based, cost-effective improvements to cardiovascular and brain health in the United States. The AHA Center for Health Metrics and Evaluation strives to implement dynamic predictive modeling, incorporating historical data and trends and evaluating today’s experiences to make plausible assumptions and methodically refine predictive capacity, promoting a continuous cycle of improvement.