No Data, No Nothing

65,536

Unless you are a number nerd, this number probably means nothing to you. If you are a number nerd, you've likely grasped its significance. It is, of course, 2^16. (Thanks for pointing this out, Bhushan Mehendale. I'd earlier written it as 2^15.)

And if you are an Excel nerd, you've probably let out a wistful little sigh. For 65,536 is the maximum possible number of rows in the old versions of MS-Excel – you know, the ones that end in .xls rather than .xlsx.

And if you are a British citizen, you’ve likely gritted your teeth – for chances are that you’ve become painfully familiar with this number over the last year.

Because, you see, the British government was using the old version of MS-Excel[1] to record covid-19 cases last year. Yes, really. And one thing led to another, and well, things got messed up.

The BBC has confirmed the missing Covid-19 test data was caused by the ill-thought-out use of Microsoft’s Excel software. Furthermore, PHE was to blame, rather than a third-party contractor.
The issue was caused by the way the agency brought together logs produced by the commercial firms paid to carry out swab tests for the virus.
They filed their results in the form of text-based lists, without issue.
PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.
The problem is that the PHE developers picked an old file format to do this – known as XLS.
As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.
And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off.
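
The failure mode the BBC describes can be sketched in a few lines of Python. To be clear, this is an illustrative toy, not PHE's actual pipeline: the 46-rows-per-case figure and the case IDs are my assumptions, chosen so that the numbers line up with the BBC's "about 1,400 cases" estimate.

```python
# Illustrative sketch (assumed numbers, not PHE's real pipeline): an automated
# collation step writes several rows per test result into an XLS-style
# template, and anything past the format's hard row limit is silently dropped.
XLS_MAX_ROWS = 65_536   # hard row limit of the legacy .xls format
ROWS_PER_CASE = 46      # assumption: rows generated per positive case

def collate_into_template(cases):
    """Flatten cases into rows, truncating at the XLS row limit."""
    rows = []
    for case in cases:
        rows.extend(f"{case}-row{i}" for i in range(ROWS_PER_CASE))
    return rows[:XLS_MAX_ROWS]   # the silent truncation: no error, no warning

cases = [f"case{i}" for i in range(2_000)]   # more cases than one template holds
rows = collate_into_template(cases)

cases_kept = len(rows) // ROWS_PER_CASE      # 65,536 // 46 = 1,424 cases
cases_lost = len(cases) - cases_kept
print(cases_kept, cases_lost)                # → 1424 576
```

The nasty part is the slice in the last line of the function: nothing crashes, nothing warns – the surplus cases simply never make it into the template.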

That particular story, as you can imagine, doesn’t end well for anybody. I was reminded of this story when I read (yet another) excellent column written by Tim Harford:

The pandemic has taught us the same lesson, the hard way. Weaknesses in our information systems have been telling. The tragic failure to produce enough accurate Covid-19 tests swiftly — particularly shocking in the US — is well known.
Subtler failures have received too little attention. Consider this paragraph about social care, in a new report from UK fact checkers Full Fact: “Basic information, such as the number of people receiving care in each area, was not known to central government departments, and local authorities only knew about those people whose care they paid for.” Patchy data cannot have made protecting care homes any easier — nor, more recently, vaccinating them.
Alexis Madrigal, co-founder of The Covid Tracking Project in the US, attests that the UK is not alone. At the beginning of the crisis, he says, “We didn’t even know how many hospitals there were in the United States.” (emphasis added)
That may seem surprising. Yet useful statistics do not simply arrange themselves neatly in a spreadsheet somewhere, waiting to be downloaded. They must be collected: someone must set the standards, link up the systems, hire the personnel. If not, there are gaps.

In their excellent book, In Service of the Republic, Ajay Shah and Vijay Kelkar talk about approaching policymaking as a siege-style assault.

In mountaineering, the climbers choose from two strategies. In the siege-style assault, a large team establishes a base camp, which sets up the second camp and establishes the logistics for resupplying it, and so on. In the case of Mount Everest, there is a base camp at 5400 metres, camp 1 at 6100 metres, camp 2 at 6400 metres, camp 3 at 6800 metres and camp 4 at 8000 metres. Finally, from here, a few climbers try to get up to the top, which is at an altitude of 8848 metres. The siege-style assault is slow, expensive and reliable.

Kelkar, Vijay; Shah, Ajay. In Service of the Republic. Penguin Random House India Private Limited. Kindle Edition.

Base camp, they say, is good data.

Stage 1 of the policy pipeline is the establishment of the statistical system. Facts need to be systematically captured. Without facts, the entire downstream process breaks down. Our only hope for truth to matter is for truth to be recorded and widely disseminated. In the modern world, few actors in the economy have an incentive to do a good job of measurement. As an example, academic economists are quite comfortable doing research with faulty data, because the academic economists who will review their work do not ask questions about data quality.

Kelkar, Vijay; Shah, Ajay. In Service of the Republic. Penguin Random House India Private Limited. Kindle Edition.

I'd go a step further and say that the sentence "Facts need to be systematically captured" can and should be expanded:

  1. Facts need to be systematically captured.
  2. Where capturing them is not possible, they need to be systematically sampled, so as to arrive at some plausible conclusion about the overall picture.
  3. They need to be stored efficiently.
  4. They need to be made widely available for research.
  5. Retrieval of facts[2] needs to be as simple as possible.
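
Point 2 deserves a quick illustration. Here is a minimal Python sketch – the population size, sample size and "true rate" are all made-up numbers – of how a modest random sample recovers the overall picture at a tiny fraction of the cost of enumerating every unit:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Made-up population: a million units, ~30% of which have some attribute
# (say, households receiving care). Full enumeration would mean checking
# every one of them.
population = [random.random() < 0.30 for _ in range(1_000_000)]

# Instead, survey a random sample of just 2,000 units...
sample = random.sample(population, 2_000)

# ...and use the sample proportion as the estimate for the whole population.
estimate = sum(sample) / len(sample)

# With n = 2,000, the standard error is sqrt(p(1-p)/n) ≈ 0.01, so the
# estimate typically lands within a percentage point or two of the truth.
print(f"estimated rate: {estimate:.3f}")
```

That is the entire logic behind Mahalanobis's bet, discussed below: a carefully drawn sample of a few thousand units tells you almost as much as a census of a million, for a fraction of the money and time.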

And the bad news is, India’s ability to do this well has been steadily worsening over time.

In December 1956, Zhou En Lai, the Chinese premier and, after Mao, the second most powerful man in China, created much consternation by refusing to leave his meeting at the National Sample Survey Organisation (NSSO) office at the Indian Statistical Institute (ISI) in Kolkata. He was talking to Prasanta Chandra Mahalanobis, the founder of the institute, and one of the pioneers in the field of survey methods.
Zhou was frustrated by his country's inability to produce usable data on time. China at the time collected data in every single economic unit, which generated more data than they could process. By contrast, India, under Mahalanobis, had opted to use carefully designed random samples of the economic units to infer what was going on for the entire population, which was cheaper and quicker.

Thus begins an article written four years ago by Abhijit V Banerjee, Pranab Bardhan, Rohini Somanathan & TN Srinivasan in the Economic Times. Back in the day, it really was the case that India was looked up to for its strength in efficient data collection. That’s a story waiting to be told – there really is a book in here!

Hints of this story are available online, and make for interesting reading:

Only after India became independent did the Government of India establish a Central Statistical Unit (1949), which was later (1951) converted into the Central Statistical Organization and the Department of Statistics, which constitute presently the National Statistical Organisation (NSO) of the Ministry of Statistics and Programme Implementation.
Professor P.C. Mahalanobis, who is regarded as a pioneer in both theoretical and professional statistics, was appointed as the first statistical adviser to the Cabinet, Government of India in January 1949. He was the architect of the statistical system of independent India. Professor P. V. Sukhatme, as Statistical Adviser to the Ministry of Agriculture, was responsible for the development of Agricultural Statistics.

The Banerjee et al article also has useful snippets about this:

The National Sample Survey (NSS), when launched by the NSSO in 1949, was the most ambitious household survey in the world, covering over 1,800 villages and over 100,000 households across India. The methods used by the NSS became the standard for household surveys the world over.
For example, the use of inter-penetrating samples — essentially, two independent samples drawn from the same population — to test the reliability of the survey results, was developed by Mahalanobis in a 1936 paper and remains a standard tool for survey design. The Living Standard Measurement Surveys the World Bank still carries out in many countries are a direct descendent of the NSS.
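
The inter-penetrating samples idea can be sketched in code. This is a toy under made-up numbers, not Mahalanobis's actual procedure: two independent random samples are drawn from the same population, and if the survey machinery is sound, their estimates should agree to within sampling error.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Made-up population: 100,000 units with some measured quantity
# (mean 100, standard deviation 15 – assumed numbers for illustration).
population = [random.gauss(100, 15) for _ in range(100_000)]

# Draw two INDEPENDENT samples from the same population, as if fielded
# by two separate survey teams.
n = 1_000
sample_a = random.sample(population, n)
sample_b = random.sample(population, n)

mean_a = sum(sample_a) / n
mean_b = sum(sample_b) / n

# Each mean has standard error ~15/sqrt(1000) ≈ 0.47, so a gap much larger
# than a couple of standard errors would flag a problem somewhere in the
# survey process – a built-in reliability check.
print(f"team A: {mean_a:.1f}, team B: {mean_b:.1f}, gap: {abs(mean_a - mean_b):.2f}")
```

The elegance of the check is that it audits the whole pipeline – enumerators, forms, processing – without needing any outside source of truth: the two samples validate each other.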

But alas, as the Banerjee et al article goes on to show, a bit of institutional decay has set in over the years:

If you believe the NSS, GDP could be just about half of what it is according to the CSO. There are occasional academic debates about which one is correct, which no one in power pays any attention to. And yet, it is almost surely true that both estimates (and their growth rates) are off by a huge margin. More worrying, this divergence has been known for nearly 50 years (though it has grown a lot).
And though we are occasionally told that the NSS is understaffed, or that no one knows where the CSO got a particular number, there is absolutely no political interest in improving things. From being the world leader in surveys, we are now one of the countries with a serious data problem while people talk about the really good data you can get in Indonesia or Brazil or even Pakistan.

The story of how India's statistical capabilities have steadily worsened over time needs to be more widely known, and more needs to be done to improve the situation.

But the rather more important point that I want to make is this: you cannot – I repeat, cannot – be a student of economics without understanding data capture, data storage, data retrieval and data cleaning.

Analysis, the glamorous part of the job, begins after this – and trust me, it is the easiest part of the job in comparison.

  1. Older versions of MS-Excel, the ones before MS Office 2007, could only store 65,536 rows of information. The ones since then can go up to a million rows.
  2. For the purpose of this essay, I'm going to use "facts" and "data" interchangeably. That's not, strictly speaking, accurate, but I'll sacrifice accuracy for ease of reading.