# COVID-19: Can We Estimate Infection Speed and Fatalities?

Two characteristics are most important for the evaluation of any pandemic: how fast it spreads and how deadly it is.  The first one is characterized, among other parameters, by the reproduction number (R0), which shows how many people on average are infected by one sick person.  The second one, the case fatality rate (CFR), shows what percent of infected people will die from the infection.

One of the authors previously estimated R0 and showed that the current pandemic started to decline in the second half of March.  In that article, the "pool" from which new cases appear was estimated as the average number of cases over the preceding 14 days.  It is more precise to recall that a "case" is the result of a particular process: in most instances, people are registered only if they show some symptoms of the illness.  The interval between catching the virus and developing symptoms is called the incubation period; it varies from person to person.

The study estimates the shape of the incubation period distribution (approximately lognormal) and its parameters (the mean incubation period is 5.5 days; 13 days covers 98% of all cases).  With this estimate, one can recalculate the "pool size."  The adjusted data are presented in Chart 1.  The three states combined are included because they form, in effect, one agglomeration with a source of infection.
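To make the incubation adjustment more concrete, here is a minimal sketch that recovers lognormal parameters from the two figures quoted above (mean of 5.5 days, 98% of cases within 13 days) and turns them into day-by-day onset probabilities.  How exactly the pool was reweighted is not spelled out here, so the function names and the idea of using these probabilities as daily weights are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def lognormal_params(mean, quantile_value, quantile_prob):
    """Solve for (mu, sigma) of a lognormal from its mean and one quantile.

    Uses mean = exp(mu + sigma^2 / 2) and quantile = exp(mu + z * sigma),
    which gives sigma^2 / 2 - z * sigma + ln(quantile / mean) = 0.
    The smaller root is taken as the plausible one.
    """
    z = NormalDist().inv_cdf(quantile_prob)
    c = math.log(quantile_value / mean)
    disc = z * z - 2.0 * c
    if disc < 0:
        raise ValueError("no lognormal matches this mean and quantile")
    sigma = z - math.sqrt(disc)
    mu = math.log(mean) - sigma ** 2 / 2.0
    return mu, sigma


def incubation_cdf(days, mu, sigma):
    """Probability that symptoms appear within `days` days of infection."""
    if days <= 0:
        return 0.0
    return NormalDist(mu, sigma).cdf(math.log(days))


# Figures quoted in the text: mean 5.5 days, 98% of cases within 13 days.
mu, sigma = lognormal_params(mean=5.5, quantile_value=13.0, quantile_prob=0.98)

# Assumed use: probability that symptoms start during day d, for d = 1..13.
weights = [
    incubation_cdf(d, mu, sigma) - incubation_cdf(d - 1, mu, sigma)
    for d in range(1, 14)
]

for day, w in enumerate(weights, start=1):
    print(f"day {day:2d}: {w:.3f}")
```

Weights of this kind could replace the flat 14-day average, so that days near the typical incubation time count more than very recent or very old ones.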

Chart 1. Estimation of the reproduction number with incubation period adjustment (here and below, "3 states" include New York, New Jersey, and Connecticut; data source: Johns Hopkins University).

The graphs in Chart 1 retain the same shape as earlier, but they suggest more conservative estimates.  The main point is that the values in the second half of March fall below 1, which means that the intensive phase of the spread of the virus has passed (mainly due to social distancing).

In other words, we have entered the contraction zone.

The daily increase in case numbers continues, but in percentage terms it has fallen sharply (see Chart 2), which, together with R0 below 1, contradicts "exponential growth."

This means that the curve of infections is likely to "flatten" soon.
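The percentage argument can be sketched in a few lines.  Under true exponential growth the day-over-day percent increase stays constant, so a steadily falling series of increments is evidence against it.  The case counts below are hypothetical, chosen only to show the calculation; the 3-day window matches the smoothing named in the Chart 2 caption.

```python
def chain_increments(cumulative):
    """Percent change of cumulative cases from each day to the next."""
    return [100.0 * (b - a) / a for a, b in zip(cumulative, cumulative[1:])]


def moving_average(values, window=3):
    """Trailing moving average; the first window - 1 points are dropped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical cumulative case counts (not real data).
cumulative_cases = [100, 150, 210, 273, 328, 377, 415]

increments = chain_increments(cumulative_cases)
smoothed = moving_average(increments, window=3)

print([round(x, 1) for x in increments])
print([round(x, 1) for x in smoothed])
```

In this toy series the absolute daily increase stays around 40 to 60 cases, yet the percent increase falls every day, which is the pattern described above.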

Chart 2. Chain increments of the number of cases (3-day moving average).

Estimating fatality is not a simple matter.  A fatality rate is the number of deaths due to a cause divided by the number of people with the infection.  If we calculate, for example, 10,783/366,566 = 2.94%, this would be the CFR for the USA as of April 6, 2020 (it is often called the "naïve CFR").  When the director general of the World Health Organization, Dr. T. Ghebreyesus, reported a 3.4% fatality rate from COVID-19 on March 3, he did just that.  It immediately triggered sharp criticism from the medical community.  Why?

There are several obstacles to proper calculations.

1. An underestimated number of real cases, mainly because many people do not notice the illness and, accordingly, do not get tested and registered.
2. Insufficient quantity and poor quality of the tests.
3. Misclassification of deaths in two respects: lumping all pneumonia cases into the virus category, and failing to distinguish deaths caused by the virus alone from deaths caused by the virus in combination with other potentially fatal conditions.
4. Bureaucratic and political considerations that suppress correct information about both infection cases and fatalities, in unknown proportions (doubts about the validity of Chinese data, etc.).
5. The accurate determination of the "number of the infected people" in the denominator of CFR.

Points 1–4 above are important but cannot be adequately discussed here.  The last point is more constructive, although also tricky.  Imagine an ideal scenario for the calculation: a cruise liner leaves New York for Tokyo.  The next day, the captain gets a message that an unknown virus has appeared on the ship and that all passengers should be immediately isolated in their cabins for the whole seven-day trip.  Test kits are delivered by helicopter the following day, and everyone is tested daily.  During that week, 700 people out of a total of 3,000 test positive.  In Tokyo, all 3,000 are quarantined and tested again; an additional 300 are found to be infected.  Each of the 1,000 infected people is treated; by 20 days after arrival, 50 of them have died, and all the rest have been released as healthy.

In this case, the CFR of 50/1,000 = 5% is correct because (a) the number of all infected people is known and (b) enough time has passed for every case to reach an outcome (death or recovery).  In reality, we do not have these conditions until the very end of the pandemic.  For that reason, some special models have been developed, but they need additional assumptions and complex data.
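The two ratios discussed so far (the naïve USA figure and the cruise-liner figure) are the same arithmetic applied to very different denominators.  A minimal sketch, using only the numbers given in the text; the function name is a hypothetical helper:

```python
def case_fatality_rate(deaths, cases):
    """Deaths as a percentage of registered cases."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return 100.0 * deaths / cases


# Naive CFR for the USA as of April 6, 2020 (figures from the text).
naive_usa = case_fatality_rate(10_783, 366_566)

# Cruise-liner scenario: 700 + 300 infected, 50 deaths, all outcomes known.
cruise = case_fatality_rate(50, 700 + 300)

print(f"naive USA CFR: {naive_usa:.2f}%")
print(f"cruise CFR:    {cruise:.2f}%")
```

The formula is identical in both cases; only in the cruise scenario are the numerator and denominator both final, which is why only that number can be read as a true CFR.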

We do not estimate the "general unknown CFR."  Still, we have tried empirical estimates and tested them in a predictive modeling fashion.  This simple approach yields satisfactory results.

This approach has three steps.

1. The number of deaths each day is estimated via a linear univariate regression model without an intercept, as a function of the number of infected people on a previous day, with lags from 1 to 10 days.
2. For each state, the best lag is the one with the smallest errors when forecasts are compared with actual data.  The regression coefficient represents the CFR for a given lag.
3. The best models are presented by a series of moving regressions that do not use any future data.  If stability and good forecast quality are confirmed, the coefficients of regressions are used in the real forecast for a short period of time.
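The three steps above can be sketched as follows.  This is an illustration under stated assumptions, not the authors' code: it assumes cumulative series for both cases and deaths, uses mean absolute percentage error as the error measure, refits on an expanding window for each one-day-ahead forecast, and runs on synthetic data built with a known lag of 6 days and a known rate of 5% so that the selection can be checked.

```python
from itertools import accumulate


def fit_through_origin(x, y):
    """Least-squares slope b for y = b * x with no intercept."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("predictor is identically zero")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def walk_forward_error(cases, deaths, lag, first_forecast_day):
    """Mean absolute percentage error of one-day-ahead forecasts for one lag.

    For each day t the slope is fitted on days lag..t-1 only, so no
    future data enters the forecast for day t.
    """
    errors = []
    for t in range(first_forecast_day, len(deaths)):
        x_train = cases[: t - lag]   # cases on days 0..t-lag-1
        y_train = deaths[lag:t]      # deaths on days lag..t-1
        slope = fit_through_origin(x_train, y_train)
        forecast = slope * cases[t - lag]
        if deaths[t] > 0:
            errors.append(abs(forecast - deaths[t]) / deaths[t])
    return sum(errors) / len(errors)


def select_best_lag(cases, deaths, lags=range(1, 11), first_forecast_day=16):
    """Return (best lag, slope fitted on all data at that lag, all scores)."""
    scores = {
        lag: walk_forward_error(cases, deaths, lag, first_forecast_day)
        for lag in lags
    }
    best = min(scores, key=scores.get)
    slope = fit_through_origin(cases[: len(cases) - best], deaths[best:])
    return best, slope, scores


# Synthetic data (not real): irregular daily counts, accumulated.
daily_new_cases = [5, 8, 13, 30, 22, 60, 45, 90, 150, 120, 200, 310, 260,
                   400, 380, 520, 610, 480, 700, 650, 800, 720, 900, 850, 760]
cases = list(accumulate(daily_new_cases))

TRUE_LAG, TRUE_RATE = 6, 0.05  # assumptions used to build the toy series
deaths = [0.0] * TRUE_LAG + [TRUE_RATE * c for c in cases[:-TRUE_LAG]]

best_lag, slope, scores = select_best_lag(cases, deaths)
print(f"best lag: {best_lag}, implied fatality rate: {100 * slope:.2f}%")
```

Note that the case series is deliberately irregular.  If cases grew at a perfectly constant rate, every lag would fit equally well and the lag could not be identified from the data.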

In steps 1 and 2, we found consistently that the best lags are between 5 and 7 days for the most affected states and for the USA.  (Even Minnesota, with a very short history of infection, shows a best lag of 5.)  The forecasting errors with those lags are remarkably small: 3–7%.  The data for the USA are the most convincing, since all deaths come from people infected within the country (which cannot be said of separate states or even of the three states combined); they point to a lag of 6 days between a registered infection and death.  The corresponding fatality rate for the country is 5.84%.  It is a very high number.  It is still lower than the estimate of 7.2% for Italy but much higher than the naïve estimates quoted earlier.  Paradoxically, the real number may still be much smaller for the reasons listed above.

Testing on new data points (not used in fitting the model) confirms the conclusion about small errors.  We even dare to predict the number of fatalities for the next week (see Chart 3).

Chart 3.  Forecasting fatalities.

In essence, it shows two estimates, with the preferable lag of 6 days.  Such a prediction provides a "benchmark," even if only for a short time.  If the actual value is noticeably higher than the benchmark, more aggressive preventive measures are urgently needed.  If it is lower, the applied measures are working well to suppress the infection and reduce its consequences.
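A short sketch of how such a benchmark could be produced and used.  With a lag of 6 days, the cases already observed over the last 6 days are enough to forecast the next 6 days of deaths.  The rate of 5.84% and the lag of 6 come from the text; the case counts, the use of cumulative series, and the 7% tolerance for "noticeably" (taken from the upper end of the reported 3–7% error range) are assumptions for illustration.

```python
def forecast_deaths(cumulative_cases, rate, lag):
    """Forecast deaths for the next `lag` days from already-observed cases.

    Deaths on future day h (h = 1..lag) are rate * cases observed
    lag - h days before today, i.e. the last `lag` case values in order.
    """
    if len(cumulative_cases) < lag:
        raise ValueError("need at least `lag` days of cases")
    return [rate * c for c in cumulative_cases[-lag:]]


def compare_to_benchmark(actual, benchmark, tolerance=0.07):
    """Classify an observed value against the forecast benchmark."""
    relative_gap = (actual - benchmark) / benchmark
    if relative_gap > tolerance:
        return "above benchmark"
    if relative_gap < -tolerance:
        return "below benchmark"
    return "near benchmark"


# Hypothetical cumulative case counts for the last 7 days (not real data).
recent_cases = [1000, 1200, 1500, 1700, 2000, 2200, 2500]

benchmark = forecast_deaths(recent_cases, rate=0.0584, lag=6)
print([round(b, 1) for b in benchmark])
print(compare_to_benchmark(actual=90, benchmark=benchmark[0]))
```

A reading of "above benchmark" would correspond to the case where stronger measures are called for, and "below benchmark" to the case where current measures appear to be working.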

Igor Mandel, Ph.D., Dr. Sc. is president at Redviser Inc.  He authored numerous papers in statistics, sociology, and marketing research.

Stan Lipovetsky, Ph.D. is an independent consultant.  He authored many articles in applied statistics, mathematics, economics, and marketing research.
