One of my favorite mathematical thingamajigs is Benford’s Law, the weirdly counter-intuitive finding about which digits are most likely to appear in most data sets.

Basically, it says that in many naturally occurring collections of numbers, no matter how random they seem, the first digit is going to be 1 about 30% of the time, 2 about 18% of the time, 3 about 12% of the time and so on down to 9, which is the first digit about 5% of the time. This pattern is found in collections of data ranging from population statistics to heights of buildings to rainfall totals, especially when the numbers span several orders of magnitude. The pattern was famously noted back in the days when tables of logarithms were printed in books; in library copies, the pages in the front of the book (the ones covering numbers that start with 1) were much more worn than those in the back.
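For the curious, those percentages come from a simple formula: the chance that the first digit is d is log10(1 + 1/d). Here is a minimal Python sketch of that formula. The function name `benford_expected` is just an illustrative choice, not something from a library.

```python
import math

def benford_expected(digit: int) -> float:
    """Probability that the leading digit equals `digit` (1-9) under Benford's Law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)

# Print the expected share of each leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_expected(d):.1%}")
```

The nine probabilities add up to 1, and the values for 1 and 9 come out at roughly 30% and 5%, matching the figures quoted above.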

There’s a long Wikipedia article about it, if you want to know more. If you’re wondering why this happens, there’s no good quick-and-easy explanation.

I bring this up because Benford’s Law can be used to spot faked data: If the numbers in a data set don’t fit the expected pattern (only 15% of them start with a 1, perhaps), at least some have probably been made up. Benford’s Law has become a standard analysis tool in financial fraud cases. And now a Venezuelan researcher has tried it out on nationally reported COVID-19 data and found some numbers that seem suspicious. This is from the abstract, which reads like it was machine-translated into English:
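To make the idea of "fitting the expected pattern" concrete, here is a rough Python sketch that tallies first digits and sets them beside the Benford percentages. It is an illustration only, not the researcher's method: the helper names are made up for this example, and the sample data is a stand-in (the first 100 powers of 2, a sequence known to follow Benford's Law closely), not real COVID-19 figures.

```python
import math
from collections import Counter

def first_digit(n: int) -> int:
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def first_digit_shares(values):
    """Share of values starting with each digit 1-9 (non-positive values are skipped)."""
    digits = [first_digit(v) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Stand-in data: the first 100 powers of 2. NOT real case counts.
sample = [2 ** n for n in range(1, 101)]
observed = first_digit_shares(sample)

# Compare observed shares with the Benford expectation log10(1 + 1/d).
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed[d]:.1%}, expected {expected:.1%}")
```

A large gap between the two columns is only a reason to look closer, since plenty of honest data sets do not follow the pattern.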

The results indicated that results from Italy, Portugal, Netherlands, United Kingdom, Denmark, Belgium and Chile are suspicions of data manipulation because the numbers fail the Benford’s Law according to the results obtained until April 30, 2020.

The temptation to fudge COVID-19 data is huge, given the financial cost of bad results. That’s why so many people were alarmed when the Trump Administration, which has a terrible record in dealing fairly with unpleasant facts, took the COVID-19 reporting task away from the Centers for Disease Control. Benford’s Law is one way to spot any fakery (although by now I suspect that most data-fakers know enough to incorporate it in their fudging).

As an academic and social liberal, I immediately began compiling COVID data to test my theory that the data being released by the White House was less than accurate. I am still working on numerous data subsets, but what I can tell you is that I am currently shocked by the results I have seen to date. The data being released by the Trump administration seems to follow Benford’s Law on a national level. Every large Democratic-run state that I have looked at so far has NOT. This includes NY, NJ & CA. The 3 Republican-run states that I analyzed follow Benford’s Law: OH, FL, GA. Obviously this has me confused, curious, and even a bit concerned. Are Democratic governors exaggerating this disease? Maybe COVID just doesn’t follow Benford’s Law. At this point I’m unsure but rest assured I am not a Russian bot, Trump supporter OR attempting to troll anyone. I’m only sharing my findings to date.

Hi Kelly – I wonder if this could explain your findings? Seems it would apply since the states you cited (NY, NJ, CA) have dramatically flattened their COVID curves… https://www.sciencedirect.com/science/article/pii/S0378437120305719

Covid numbers only follow Benford’s law when they are increasing / when proper measures are not taken / when those measures are not followed. When the covid numbers stop growing exponentially, or when they start to decline, due to effective measures / people adhering to those rules, the covid numbers no longer follow Benford’s law. You have to check if that’s what you see! … that republican states are less able to fight the disease
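A toy simulation can illustrate the point in this comment. It is only a sketch with made-up parameters (10% daily growth in one series, a plateau just under 50,000 in the other), not a model of any real outbreak, and the helper name is hypothetical.

```python
import math
from collections import Counter

def leading_digit_shares(values):
    """Share of positive integers starting with each digit 1-9."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Hypothetical series 1: unchecked exponential growth, 10% per day for 200 days.
exponential = [int(100 * 1.1 ** day) for day in range(200)]

# Hypothetical series 2: a logistic curve that flattens out just under 50,000.
flattened = [int(50_000 / (1 + math.exp(-0.1 * (day - 100)))) for day in range(400)]

exp_shares = leading_digit_shares(exponential)
flat_shares = leading_digit_shares(flattened)

for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    print(f"{d}: Benford {benford:.1%}, "
          f"exponential {exp_shares[d]:.1%}, flattened {flat_shares[d]:.1%}")
```

In this toy setup the exponential series stays close to the Benford percentages, while the flattened series piles up on whatever digit the plateau happens to start with.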

Hi Kelly, would you be so kind as to share a bit about your findings? Are the data you’re using the deaths by month, or number of cases by month, etc.? I’m extremely interested in this research. Thanks much! My email is dgumie where it’s never cold. thanks

I read that the law works better if the numbers in purview span several orders of magnitude. Do you mind throwing some light on which numbers in the COVID-19 reporting this test could have been applied to?

I guess statistics like daily deaths or daily reported cases would not be very different (in terms of order of magnitude) than they were a month ago, let’s say. Also, these numbers would tend to follow an increasing (and eventually a decreasing) pattern.

I checked coronavirus deaths by US county data as of 8/6/20 for fit to Benford. I found the data does not fit based on a Goodness of Fit chi-square test.
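For anyone wanting to try a similar check, here is a minimal sketch of a chi-square goodness-of-fit test against Benford's Law, written in plain Python. It is not the commenter's actual analysis: the function name is made up for illustration and the two data sets are stand-ins (powers of 2, and the integers 1 to 999), not the county death figures. The cutoff 15.507 is the standard chi-square critical value for 8 degrees of freedom at the 5% level.

```python
import math
from collections import Counter

# Critical value of the chi-square distribution for 8 degrees of freedom
# (9 digit categories minus 1) at the 5% significance level.
CHI2_CRITICAL_DF8_P05 = 15.507

def benford_chi_square(values):
    """Chi-square statistic comparing first-digit counts with Benford's Law."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count for digit d
        stat += (counts[d] - expected) ** 2 / expected
    return stat

# Stand-in data sets (NOT the county death figures from the comment):
benford_like = [2 ** n for n in range(1, 101)]  # powers of 2
uniform_like = list(range(1, 1000))             # every integer from 1 to 999

for name, data in [("powers of 2", benford_like), ("1 to 999", uniform_like)]:
    stat = benford_chi_square(data)
    verdict = "fits" if stat < CHI2_CRITICAL_DF8_P05 else "does not fit"
    print(f"{name}: chi-square = {stat:.1f} -> {verdict} Benford at the 5% level")
```

One caveat: with very large samples, a chi-square test will flag even tiny departures from the pattern, so a "does not fit" result does not by itself mean the data was tampered with.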

Looking at the data set a little differently, I plotted and rationalized the following columns from worldometers.info for US states

total cases(1) deaths(2) active cases(3); then combined deaths + active cases(4)

Plotting (1) vs (4) I find Texas far out of line, especially over time.

Is Texas withholding data? I have seen Texas health scientists report in interviews that deaths are not faithfully recorded.

COVID-19, flattening the curve, and Benford’s law.

Please see the link below:

https://www.researchgate.net/publication/343657736_COVID-19_flattening_the_curve_and_Benford's_law

I saw Connected too. I’m also responsible for tracking our COVID cases at my hospital.

I thought I would find it in Medical Record Numbers…but no. Those numbers are computationally assigned. In fact, I was surprised to learn that the pattern didn’t show any time that I included any numbers at all that were auto-assigned by a computer.

The only numbers that worked, that followed the pattern organically, were the “seemingly random” or even “insignificant” data. Like birth month, day of birth, but then…the street number of each patient address.

The pattern was most reliable applied to the “randomness” of the street addresses. When I looked at each of our patients, I found it in the data sets of…

All COVID test records (even when I include duplicated tests); PCR-only COVID test records; COVID-positive cases; unique patients tested; and all patients who walked in to the hospital during that same time frame.

My graphing also has a dip like yours in “7”… but also in “9”. Nevertheless, this is more than enough for me to start a research project. If we can use this for verification, how can we apply Benford’s law to prevention methods and as a starting point for the health task forces that have formed in all of our cities…

*my brain is like a million miles an hour with this*

Randomly assigned digits do not follow Benford’s Law, so it looks like you’re analyzing it correctly!

Benford’s law reminds me of another phenomenon in nature, the Fibonacci sequence, where the number of petals on the flowers of various plants tends to be one of only certain numbers. Each number in the pattern is the sum of the previous two: 3, 5, 8, 13, 21, etc. It is also found in other patterns in nature. Kind of strange how so much that seems totally random is not.