How Reliable Are the Coronavirus Numbers?

Any pandemic or epidemic is a complicated process, and the current one is no exception.  Besides the practical problems (how to treat people, how to organize logistics, how to inform the media, and so on), there are more basic ones: how to measure different aspects of the process.  That has never been easy; it is very hard now.  Here are some often-repeated criticisms of the widely published and discussed indicators.

  1. The real number of people with the virus is drastically higher than reported (according to some authors, by a factor of 10), because many people have had the virus but were never tested (they got through the illness easily, without noticing).
  2. The number of available test kits is growing over time, and the number of discovered cases is increasing as a result, which makes the data hard to compare over time.
  3. Test quality (the ability to detect lower or higher concentrations of the virus in the body) differs from country to country and even from state to state; it also changes over time, which further complicates comparability.
  4. Test availability varies a lot across regions, countries, and states, which makes any comparison even harder.
  5. Registered deaths are considered a more reliable indicator than the number of cases, but there is a suspicion that too many deaths are attributed to the virus without proper reason.  If, say, a patient tested positive but also had many other complications (often fatal), the death will most likely be counted in the virus category.

All these and other concerns are valid and allow one to say that, very likely, the real lethality rate is considerably lower than the reported 1%–5% (as was discussed, in particular, by Prof. J. Ioannidis at the outset of the crisis).  But regardless of all those considerations, two things are clear: the same or similar measurement problems existed in the past, and decision-makers have to work with the data available, whatever its quality.  Hopefully, even if the data do not reflect the size accurately, they reflect the tendency, and this is what is important to look at.

Chart 1 was calculated by dividing the daily new cases by the average number of cases over the preceding 14 days.  The idea is that new cases can arise only from existing ones; 14 days is a typical period during which symptoms are not apparent.  So those infected earlier generate the new increment.  If, say, there were 100 new cases today and the average number of cases over the preceding 14 days was 200, then the statistic is 100/200 = 0.5; that is, each of the 100 people could have caught the virus from any of the 200.  This indicator is one of the critical characteristics of virulence and is known in epidemiology as the reproduction number (R0).  Of course, calculating it correctly is a complex procedure, but the proposed approach is sufficient for rough estimates.
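The calculation described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the code behind the charts: the function name `new_case_ratio` is hypothetical, "cases" is taken to mean daily new cases, and the 14-day window is assumed to exclude the current day.

```python
from typing import List, Optional


def new_case_ratio(new_cases: List[float], window: int = 14) -> List[Optional[float]]:
    """Ratio of each day's new cases to the mean of the preceding `window` days.

    Assumption: the window covers the days strictly before the current one.
    Days without a full window (or with a zero average) yield None.
    """
    ratios: List[Optional[float]] = []
    for i, today in enumerate(new_cases):
        if i < window:
            ratios.append(None)  # not enough history yet
            continue
        previous = new_cases[i - window:i]
        average = sum(previous) / window
        ratios.append(today / average if average > 0 else None)
    return ratios


# The worked example from the text: an average of 200 over the
# preceding days and 100 new cases today.
series = [200.0] * 14 + [100.0]
print(new_case_ratio(series)[-1])
```

With the numbers from the example in the text, the last value is 100/200.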

Chart 1. Estimation of the reproduction numbers over time (how many people may catch the virus from one infected person), four states (values are three-day moving averages; data presented from the day when the state had more than 20 cases; data source: Johns Hopkins University).
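The two preprocessing steps named in the caption, the three-day moving average and the 20-case starting point, might look like the sketch below.  The caption does not say whether the average is trailing or centered, or whether the threshold applies to cumulative cases; a trailing average and a cumulative count are assumed here, and both function names are hypothetical.

```python
from typing import List, Optional


def trailing_mean(values: List[float], window: int = 3) -> List[Optional[float]]:
    """Trailing moving average; None until a full window is available."""
    smoothed: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed


def start_index(cumulative_cases: List[int], threshold: int = 20) -> Optional[int]:
    """Index of the first day on which cumulative cases exceed the threshold."""
    for i, total in enumerate(cumulative_cases):
        if total > threshold:
            return i
    return None


# Made-up numbers, for illustration only.
print(trailing_mean([1.0, 2.0, 3.0, 4.0]))
print(start_index([5, 18, 20, 21, 40]))
```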

As is quite clear, the virulence is dropping substantially (the last observed day was April 1).  For Florida and especially Illinois, there is a peak about a week after the emergency was announced.  For Washington, the state where it all began, the decline started earlier.  For relaxed California, only a small decrease is taking place.

The declining transmission effect is much more pronounced in the most affected states and in the United States as a whole (Chart 2).  By averaging data across three and six states, I tried to better reflect the "pool" from which new cases appear and to minimize random fluctuations.  Indeed, they all show similar patterns.  The intensity of infection has been dropping steadily since the peak, which was reached on March 19–20, 2020.  Most likely, social distancing began to work about a week after it was vigorously enforced.

Chart 2.  Estimation of the reproduction numbers over time, USA and East Coast (data presented from the day when the states had more than 20 cases; data source: Johns Hopkins University; the three states are New York, New Jersey, and Connecticut; the six states are these three plus Massachusetts, Pennsylvania, and Delaware).

These charts show that the concept of exponential growth, which is still often invoked, is wrong.  If it were right, the indicator would hold steady or grow rather than decline.  A reproduction number below 1 means that the pandemic is on course to die out.  Earlier estimates of it spanned a wide interval, from 1.5 to 5.5 (quite similar to the values in the left part of Chart 2).  Those days are over.
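A quick check of this reasoning on synthetic data: if daily new cases really grew by a constant factor every day, the ratio of today's cases to the average of the preceding 14 days would not fall at all; it would stay flat above 1.  The series below is made up for illustration (a hypothetical 20% daily increase), and the helper repeats the assumed definition of the indicator, so this is a sketch rather than the calculation behind the charts.

```python
from typing import List


def ratio_to_previous_mean(new_cases: List[float], window: int = 14) -> List[float]:
    """Today's new cases divided by the mean of the preceding `window` days.

    Only days with a full window of history are returned.
    """
    ratios = []
    for i in range(window, len(new_cases)):
        average = sum(new_cases[i - window:i]) / window
        ratios.append(new_cases[i] / average)
    return ratios


# Hypothetical pure exponential growth: 20% more new cases every day.
growth_factor = 1.2
synthetic = [10.0 * growth_factor ** day for day in range(40)]

ratios = ratio_to_previous_mean(synthetic)
spread = max(ratios) - min(ratios)
print(f"first ratio: {ratios[0]:.3f}, last ratio: {ratios[-1]:.3f}, spread: {spread:.2e}")
```

A curve that declines day after day, as in the charts, is therefore not what constant exponential growth would produce under this definition.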

We shouldn't be overly optimistic about predictions based on this; too many other factors are in play.  But the universal slowing of transmission cannot be random.  At least one kind of threshold has been passed.  It definitely looks as though we've passed a point of diminishing returns.

Igor Mandel, Ph.D., Dr. Sc. in statistics, president at Redviser Inc.  
