Gimme Shelter, Gimme Testing Data

On Day 2 of our Alameda County Shelter-in-place order, I am creating graphs, mostly for my sanity. Today’s topic is data, in particular, Covid-19 testing data. If you’re a data geek like me, this is for you.

I have blathered on for days (or is it weeks? I’ve lost track…days seem like weeks) that our biggest problem right now is lack of testing. We don’t know what we don’t know. Because the U.S. didn’t roll out testing capacity early on, people who feel sick or at risk for Covid haven’t been able to get tested. We’ve heard that for weeks and are still hearing it. Because people who know they’re sick can’t get tested, we have no idea who is sick or how many would test positive. Without knowing that, everyone has to STOP moving. That’s the problem right now.

Yes, it’s definitely a problem that hospitals are starting to be overwhelmed and might soon be swamped. It’s definitely a problem that travel is cancelled and that there is a black market for toilet paper and sanitizers. (Anybody know where we can get some ramen? That turns out to be a big concern in our house.) It’s an even bigger problem that we don’t know how long this will last, and we won’t know until there’s a robust testing structure in place. South Korea put an excellent testing structure in place early on, and they seem to be moving into a better part of the pandemic curve. We can learn something from their experience, and we can learn something by looking at the data.

The Most Important Data Is Under-Reported

The problem has been a lack of good data, and good testing data is still hit and miss. In a world that’s used to hitting the “refresh” button every minute and seeing numbers update, having data that is only reported every few days or not at all is killer to the psyche. Up until about a week ago, data on how many people were being tested was nearly impossible to find. This was partly because few had been tested; I might also speculate that some didn’t want the public to know just how few that was.

I can illustrate this by looking at Daily Case data compared with Daily Testing data. Here is the number of cases in California, shown per day and total to date. By the way, note that the red bars (daily cases) are tied to the numbers on the left axis and the purple line (cases to date) to the numbers on the right. Showing the data on different axes is important because if you plotted cumulative and daily on the same scale, the cumulative total would make the daily increases too small to see. You would have no sense of the underlying infection curve.

Graph of California Covid cases
Graph by kajmeister based on data sources in COVID Tracking project.

What public health officials are trying to do is get the daily cases to drop, although at the moment, daily cases are still increasing. The growth factor, which is new cases today divided by new cases yesterday, is still running over 1, meaning the daily increase is itself still growing. Which is bad but probably not surprising to any of us. (Since I posted this, the NY Times actually ran a “slightly good news” article pointing out that Italy’s growth factor might finally be under 1.)
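For the fellow data geeks, the growth factor is just a ratio of consecutive daily counts. A minimal sketch in Python, using made-up case numbers for illustration (not the actual California figures):

```python
# Hypothetical daily new-case counts (illustrative only, not the
# actual California figures from the graphs above).
new_cases = [50, 60, 78, 95, 120]

# Growth factor: today's new cases divided by yesterday's new cases.
# Values above 1 mean the daily increase is itself still growing.
growth_factors = [
    today / yesterday
    for yesterday, today in zip(new_cases, new_cases[1:])
]

print([round(g, 2) for g in growth_factors])  # [1.2, 1.3, 1.22, 1.26]
```

Every value here is above 1, so in this toy series, new cases are still accelerating; the day the ratio dips below 1 is the day the curve starts to bend.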

Testing data reported is not nearly as clean. Testing capacity has been increasing significantly in the past week, which means there have been huge spikes in the number of people tested. But the data also jumps around for no reason other than gaps in reporting. For example, after a few test numbers popped up in articles here and there, California reported on Monday night that 8316 people had been tested. They surely did not test 7000 people in one day after barely testing anyone. So not only has the testing itself been spotty; the reporting has been spotty as well.

Graph by kajmeister based on data sources in COVID Tracking project.

Let’s say for the sake of argument that this is a reporting issue on top of a testing issue. Assuming that this most recent testing spike really reflects the last several days (not just one day), I smoothed the data. Why am I allowed to do that? Call it occupational expertise, thirty years of working with other people’s lousy data. Plus, it’s my blog.
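A minimal sketch of that kind of smoothing, assuming the catch-up number actually accrued over the preceding flat days. The cumulative counts below are invented to mirror the pattern (a flat stretch, then one implausible single-day jump), not the real reports:

```python
# Invented cumulative test counts with a reporting spike: a flat
# stretch of barely-any reporting, then one big catch-up number.
cumulative_tested = [500, 500, 500, 500, 8316]

def smooth_catchup(cumulative):
    """Spread each jump in a cumulative series evenly across the flat
    days before it, so the daily series reflects gradual testing
    instead of one implausible single-day spike."""
    daily = [cumulative[0]]
    flat_run = 0  # consecutive days with no reported change
    for prev, curr in zip(cumulative, cumulative[1:]):
        jump = curr - prev
        if jump == 0:
            flat_run += 1
            daily.append(0)
        else:
            share = jump / (flat_run + 1)
            # Rewrite the flat days with their share, then record today's.
            for i in range(flat_run):
                daily[-flat_run + i] = share
            daily.append(share)
            flat_run = 0
    return daily

print(smooth_catchup(cumulative_tested))
# [500, 1954.0, 1954.0, 1954.0, 1954.0] -- same total, no fake spike
```

The total tested stays the same; only the shape of the daily series changes, which is exactly the adjustment the next graph reflects.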

Anyway, if you’re OK with a little adjustment of the data, then the view below is probably what it reflects. We have had a recent spike in testing and have ramped up to testing 1500 or so people per day. The LA Times also recently reported that our local labs are working toward a capacity of 20,000 tests per day, so these numbers will probably spike again. That would be a good thing, because we know there is pent-up demand for testing.

Graph by kajmeister based on data sources in COVID Tracking project.

Why do we care, again? Because we need to know what we don’t know. We need to know, at a minimum, how many people who feel sick, were in contact with a sick person, or were potentially exposed for some other reason like travel to a hot spot, are themselves sick. One epidemiologist called it “blocking and tackling.” Once you can test anyone quickly, then anyone who is positive stays home, isolated from everyone. Once you can determine who is positive, then you could allow for some limited interaction instead of Community Freeze Tag. Because then, whenever anyone displays any symptoms at all, you can immediately test and isolate them. That’s what South Korea did; they did not have to institute a countrywide shelter-in-place.

This is where South Korea is, with their 10,000 or more tests per day. This is what we want our graph to look like. Even more, what we fervently want is to soon reach the day when we hit that top spike. Until we have sufficient daily testing, we’re probably not there yet. (My guess is at least ten days away, in California.)

Graph of South Korea new cases, reported by

The Least Important Data Is Over-Reported

What is additionally frustrating, on top of not being able to see widely reported, consistent testing numbers, is seeing widely reported useless information. A hot-spot world map for a pandemic doesn’t tell us anything. If I know that there are a lot of cases in Iran or Italy, other than feeling sad, what does that tell me? If I know that there are now cases in all fifty states, does that matter? It’s curious that it took West Virginia longer than Alaska to report a case, but it doesn’t matter to the general public unless we’re planning to go to West Virginia.

Interesting, but not data we can act on. Reported by BING COVID tracker.

Snapshot data is not helpful, even if it’s put in a fancy picture. Death tolls are not necessarily helpful, either. They’re useful in projecting things like fatality rates, but since those don’t change much day to day, knowing the individual number each day isn’t useful. The media reports them anyway because, as Don Henley once sang: it’s interesting when people die, give us Dirty Laundry…

What we really need is longitudinal data: where we’re trending over time.

Luckily, We Can Get Some Data

County and state health departments, “print” media (i.e. online newspapers), and the CDC have begun to report numbers, which is a really good thing. The Atlantic has instituted a large-scale tracking effort called The COVID Tracking Project, which has identified volunteers who are pulling and submitting data for every state. Bless you, data collectors! (I was too late to volunteer.) Shout out also to the LA Times, which found a source to consistently report daily cases a week or two before the COVID Tracking Project got underway.

When you have detailed, consistent, daily data, you can use it to better understand what’s going on. Here’s an example. Given the high number of positive cases and the low numbers of testing up until a week ago, most of the people being tested were already running fevers and high-risk to begin with. So those “early” high percentages of positive tests out of the puny number of total tests would scare the hair off anyone. But once you start ramping up testing, you start to see a lower and lower percentage of people testing positive. When we finally ramp up to testing 20,000 people a day, both the daily and cumulative positive rates will drop like a rock.
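The effect can be sketched in a couple of lines of Python. The daily test and positive counts below are invented to show the shape of the curve, not real data:

```python
# Invented (tests, positives) pairs per day: early on only the sickest
# get tested, so the positive rate starts scary-high, then falls as
# testing ramps up and a broader, healthier group gets tested.
daily_results = [(50, 30), (200, 90), (1500, 300), (5000, 500)]

positive_rates = [positives / tests for tests, positives in daily_results]
print([f"{rate:.0%}" for rate in positive_rates])  # ['60%', '45%', '20%', '10%']
```

The raw count of positives keeps growing in this toy series, but the rate falls steadily, which is why the percentage graphs are the ones worth watching as capacity expands.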

Graph by kajmeister based on data sources in COVID Tracking project.

When you have good data, you can start to make comparisons and ask questions. That was California’s data. Here are also New York and Washington, two other locales hit hard early by Covid.

Graph by kajmeister based on data sources in COVID Tracking project.
Graph by kajmeister based on data sources in COVID Tracking project.

New York has seen higher positive rates even as they seem to be testing more people, and their positive rate started moving upward again rather than improving. Washington state had a huge spike (an outbreak in a nursing home) but is now on a sharp downward trend. Why? That’s a whole ‘nother blog.

If anyone sitting at home is twiddling their thumbs and would like to talk data rather than watch yet another program repeating the same death toll as yesterday or showing empty pictures of the toilet paper shelves, let me know. In the absence of March Madness and sitting down in a coffee shop, we can at least get a good daily dose of data.

Next post, I may try some pivot tables.

Anyone have other data ideas? Let me know.

Posted in partial response to Fandango’s Provocative Question, related to a CNN anchor’s thoughts about the current state of the U.S. under Covid.
