Playing Around With Data

In yesterday’s post, I spoke about data collection, and a teeny-tiny bit about the history of the institutions behind data collection exercises in India.1

In today’s post, I’ll compare two websites – one American and one Indian – to show you how both countries allow researchers to use the data that has been collected. Spoiler alert: the American website does a way better job. The idea isn’t to run down the Indian website, but to see how much distance we need to cover in terms of improvement.

And I think it is a worthwhile question to ask – why is the American website so much better? What is it about us that we cannot come up with a website of a similar quality? Is it a question of capacity, of bureaucratic inertia, of not enough demand from the research community in India or something else altogether? This is a topic worth thinking about… but not today.


The American website is FRED, hosted by the St Louis branch of the Federal Reserve. FRED stands for Federal Reserve Economic Data, and it is a magnificent resource. It really and truly is.

Federal Reserve Economic Data (FRED) is a database maintained by the Research division of the Federal Reserve Bank of St. Louis that has more than 765,000 economic time series from 96 sources. The data can be viewed in graphical and text form or downloaded for import to a database or spreadsheet, and viewed on mobile devices. They cover banking, business/fiscal, consumer price indexes, employment and population, exchange rates, gross domestic product, interest rates, monetary aggregates, producer price indexes, reserves and monetary base, U.S. trade and international transactions, and U.S. financial data. The time series are compiled by the Federal Reserve and many are collected from government agencies such as the U.S. Census and the Bureau of Labor Statistics.

The economic data published on FRED are widely reported in the media and play a key role in financial markets. In a 2012 Business Insider article titled “The Most Amazing Economics Website in the World”, Joe Weisenthal quoted Paul Krugman as saying: “I think just about everyone doing short-order research — trying to make sense of economic issues in more or less real time — has become a FRED fanatic.”

https://en.wikipedia.org/wiki/Federal_Reserve_Economic_Data

I’ve been using the website for years now in classes that I teach, but I’m sure there are features of the website that I have not been able to use. It’s got the ability to create charts on the fly, it has embeddable widgets, it even has a functional Excel add-in.

If you’re looking at this website for the first time, try going through these exercises. Or, if you are a video kind of person, try this playlist on YouTube.

It is, all things considered, a wonderful way to take a look at data – mostly American, naturally, but it does have a whole host of other data series as well.
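If you would rather pull a FRED series into code than click around the website, here is a minimal sketch of one way to do it. A few assumptions, loudly stated: the CSV download URL pattern below is simply the one the "download" option on a FRED chart has used in the past, not a documented contract, and the function names are my own inventions for illustration. The parser only assumes a two-column CSV (a date, then a value).

```python
import csv
import io
import urllib.request

# Assumption: this URL pattern is what FRED's chart download has used;
# it is not an official, guaranteed API.
FRED_CSV_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"


def build_url(series_id: str) -> str:
    """Return the (assumed) CSV download URL for one FRED series ID."""
    return FRED_CSV_URL.format(series_id=series_id)


def parse_fred_csv(text: str) -> list[tuple[str, float]]:
    """Parse two-column CSV text into (date, value) pairs.

    The header row is skipped. Rows whose value is not a number
    (for example a lone '.') are dropped rather than guessed at.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    rows = []
    for row in reader:
        if len(row) < 2:
            continue
        try:
            rows.append((row[0], float(row[1])))
        except ValueError:
            continue
    return rows


def fetch_series(series_id: str) -> list[tuple[str, float]]:
    """Download one series and parse it. Needs an internet connection."""
    with urllib.request.urlopen(build_url(series_id)) as response:
        return parse_fred_csv(response.read().decode("utf-8"))
```

Calling fetch_series("GDP") should, if the URL pattern still holds, give you a list of dates and values that you can hand over to Excel, Datawrapper or a plotting library of your choice.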


The Indian website is our comparable offering: the database on the Indian economy. As you will see once you click on the link, it isn’t nearly as user-friendly as FRED, and in my experience, the website itself isn’t always “up”. There isn’t, to the best of my knowledge, a YouTube channel that explains how to use the website, and while there is a brochure about DBIE, it isn’t quite as helpful as it ought to be.

Indian researchers will also visit the MOSPI website often. That is the Ministry of Statistics and Programme Implementation. If you read the link supplied in the first footnote of today’s blogpost, you will know that MOSPI is the culmination of India’s data collection exercises – these have been ongoing since at least 1881.

The MOSPI website itself is a bit problematic, because there are two now. One is mospi.nic.in, which is the one I have linked to above, and the other is mospi.gov.in. The latter does not seem to be fully functional just yet, and the data is far from complete. Gratifyingly, what little data there is on the new website is made available in Excel formats.

That is actually a major problem, because on the old (but current, if you see what I mean) MOSPI, data is given in PDF format. There is an army of Indian researchers who have fought the Great PDF Wars, as a consequence, and therefore have learnt about Chrome extensions, and about Tabula. If you are planning on researching the Indian economy, you will have to acquire these skills sooner or later, for MOSPI and DBIE are the best we have on offer in terms of data portals2.


I said I won’t speak about the “why” regarding data portal quality, but I would like to offer a suggestion about the “how” in terms of improving it.

Appoint an educational institute to be the nodal agency3, and get them to work on a report about what needs to change, and why and how, for the DBIE website to become better than it is right now. That doesn’t mean (at all) a blind copy of FRED, awesome though FRED definitely is.

And if the team that does end up working on this is also allowed to come up with a beta version of the new website, well, that would just be the proverbial cherry on top.

I mean, why not?

  1. Really teeny-tiny bit. Please read the whole thing
  2. that are free and government run. There are other data portals available, but of course one must pay for them
  3. IGIDR would be a good pick for obvious reasons

The Indian FRED

So from yesterday’s post, this is where you need to go to get the data about India’s agricultural exports. There may be more than one correct answer, of course, but the Excel file that I generated came from here. The DGCIS website also offered to give me the data, but only after telling me that I would need to pay the princely sum of Rs. 169 for it. Why Rs. 169? They charge Rs. 1 for each row of data in MS Excel. Nope, I’m not making this up.

A dummy query that I ran on http://ftddp.dgciskol.gov.in/

I can go on and on about the theme of working with data in India. Anybody who works with, or has worked with, data published by the Indian government over the last twenty years can do the same. We make it really difficult to access data in India, in the following ways:

  • It is not clear which site to access to get the data that you want
  • That data may not have been updated for a while
  • That data will probably only be available in PDF format (which is a whole separate level of hell)
  • The website may often be down (looking at you, dbie!)

To give you just one, already painfully familiar example: to download CPI data, should one go to the RBI website or the MOSPI website? If the MOSPI website (which is the correct answer), which MOSPI website? There have been two for a while now: this one, and this one.

And when you eventually do reach what may be the correct page, this is what you get:

http://164.100.34.62:8080/TimeSeries_2012.aspx

For the record, I know you can get CPI data from the old MOSPI website. But the point I am trying to make here is this: surely we can get (and surely we deserve) better data portals? For a country with the kind of software talent that India possesses, surely this is not the best way to design a UI?

I’ve written about this before here on EFE, but every time I write a post about data in India, I get frustrated enough to write about it all over again.

Appoint an educational institute to be the nodal agency, and get them to work on a report about what needs to change, and why and how, for the DBIE website to become better than it is right now. That doesn’t mean (at all) a blind copy of FRED, awesome though FRED definitely is.

https://econforeverybody.com/2021/03/16/playing-around-with-data/

Is there anybody in India working on trying to figure out ways to make Indian data more easily accessible? On documenting what data sources are needed, and how to arrange for their capture, storage and easy retrieval? And this across all three levels of government1? And not for private profit, but so that data is open to all?

If there is such a project, I would be most grateful if you could point me towards it. And if there isn’t one, why are those of us in Indian academia not working towards figuring out how to get this done? This is India, and this is 2021. Surely we can do a better job of making data more accessible to ourselves?

  1. state level data is a whole different problem. And data below that level of government is, well, let’s leave it be for the moment

Learn Macro by Reading the Paper

Macro, and I’ve said this before, is hard.

But a useful way to start understanding it, at least in an Indian context, is by:

  • carefully reading a well written article
  • understanding and noting for oneself key concepts within that article
  • recreating the charts from that article
    • That includes figuring out the source of the data…
    • … as well as acquiring the ability to build out these charts
  • And most important of all, creating a piece of your own (could be a YouTube video/short, a blog, an Instagram story, a Twitter thread) that helps simplify the article you’ve read.1

Now, Arvind Subramanian and Josh Felman have generously obliged us by writing a well written article. I’ll oblige you by carefully reading it and annotating it, including pointing out key concepts, sources for data and recommendations for building out the charts.

That just leaves the last point for you, dear reader. We’ll call that homework.

Now, the well written article:

For more than a decade, India’s fiscal problem has been on the back-burner, acknowledged as a concern, but excluded from the ranks of pressing issues. Now, however, the problem is back with a vengeance. COVID has upended the fiscal position, and fixing it will require considerable time and effort, even if the economy recovers. This worrisome prospect has prompted calls for the Fiscal Responsibility and Budget Management Act (FRBM) to be dusted off, reintroduced, and implemented — this time, strictly and faithfully. But before we heed them, we need to understand why the previous FRBM strategy failed and how to prevent a repeat. We argue below that the new strategy will look nothing like the current FRBM.

https://indianexpress.com/article/opinion/columns/coronanvirus-india-economy-gdp-growth-post-covid-7261915/

First things first, what is FRBM?

The Fiscal Responsibility and Budget Management Act, 2003 (FRBMA) is an Act of the Parliament of India to institutionalize financial discipline, reduce India’s fiscal deficit, improve macroeconomic management and the overall management of the public funds by moving towards a balanced budget and strengthen fiscal prudence. The main purpose was to eliminate revenue deficit of the country (building revenue surplus thereafter) and bring down the fiscal deficit to a manageable 3% of the GDP by March 2008.

https://en.wikipedia.org/wiki/Fiscal_Responsibility_and_Budget_Management_Act,_2003

Think of it as a one-person Alcoholics Anonymous club. It is of the government, for the government and by the government, and the idea is to wean the government off a dangerous addiction that it is hopelessly hooked on: debt.


By the way, there are many reasons this is a good essay, not the least of which is how well structured it is. The first three sentences in the very first paragraph, excerpted above, point out the problem that is going to be addressed, without using any difficult words or jargon. Then they point out the tool that will be used to address the problem. Then they point out that the tool itself has problems. Finally, they explain that the essay is about fixing those problems. And then the essay follows. You might want to keep this in mind when writing your own essays (or indeed creating your own podcasts/videos etc.)


Now, back to the essay:

  1. What is general government debt? Where can I access the data?
    Note the second hyperlink above: I’ve linked to the FRED page about India’s debt, which itself gets the data from the IMF. Here is the page from the Ministry of Finance’s own website titled Public Finance Statistics. It has not been updated since September 2015. Here is a Motilal Oswal report on the subject that pegs general government debt at INR 157,227 billion (Exhibit 1 in the report). If you read footnote 3 of that exhibit, two things happen. The first is that you realize that tracking down general government debt might take a while. The second is that you feel a rather large twinge of sympathy for the folks who have tried to do this exercise.
    Figure 1 in the well-written article that we are analyzing in today’s post doesn’t mention a source, unfortunately. So recreating that chart will involve a rather large part of our day – but I would strongly recommend that you do the exercise. If you want to analyze Indian macroeconomic data for a living, this will be a good initiation. And indeed, a write-up about this exercise alone is a worthy addition to your CV!
  2. Second, r-g: what is r, and what is g?
    1. “r” is the policy rate, which in our case will be the repo rate. This is available on the homepage of the RBI, top-left, under current rates.
    2. Time series data? Available on the DBIE page, under key rates.
    3. “g” is the nominal growth rate of the economy, and can be found at MOSPI.
    4. A useful thing to do as a student is to try and recreate the chart in the well-written article.
    5. Points 1 and 2 here will help you get most of the data. Use either Microsoft Excel or Datawrapper to recreate the chart.2
  3. Next, what is primary balance?3 Where does one get that data in India?4
  4. Next, this sentence from the article: “Simple fiscal arithmetic shows that debt does not explode when the former (primary balance) is greater than the latter (interest-growth differential)”. What is this “simple fiscal arithmetic”? They’ve explained it in equations 1 and 2 in this paper.5
  5. The next three paragraphs after Figure 1 in the article point out how precarious India’s situation is when it comes to government debt, and why. It is one thing to read about the equation in a textbook, it is quite another to “run” the numbers in practice. Give it a shot, please, and see if it makes sense.
  6. Next, this paragraph from the article:
    “First, India should abandon multiple fiscal criteria for guiding fiscal policy. The current FRBM sets targets for the overall deficit, the revenue deficit and debt. This proliferation of targets impedes the objective of ensuring sustainability, since the targets can conflict with each other, creating confusion about which one to follow and thereby obfuscating accountability.”
    This paragraph is a good way to understand the importance of reading In The Service of the Republic, by Kelkar and Shah (and also to read up about the Tinbergen Rule).
  7. The next three paragraphs after that are a good way to understand what Goodhart’s Law means in practice.
  8. And finally, see if you can explain to yourself why targeting the primary balance is better than other options. Personally, I agree that it is a better target, and I agree that rather than setting down a concrete number to reach, averaging out half a percentage point worth of reduction is better. In essence, what they’re saying is that you shouldn’t try to reach x kilos of weight on a diet, but lose x% body weight every month. As our ex-captain might have put it, process over results. One of our gods advocates this too, as Navin Kabra points out.
    My reservation comes from the fact that sticking to a diet is hard, and that is true whether you’re targeting a process or an outcome. In other words, it is the ongoing implementation of the plan that is the challenge, not its design!
  9. One last point: without creating something that you are willing to put up for public consumption, and highlighting on your CV as an exercise you have done – you haven’t really learnt. Reading either that article or this blog is the easy part – explaining it to somebody else is the much more difficult (and causally speaking, therefore meaningful) bit.
  10. Please, do it!
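
To make "running the numbers" from points 4 and 5 a little less abstract, here is a minimal sketch of the textbook debt dynamics equation. Please treat it as an illustration under stated assumptions: this is the standard textbook form, d(t+1) = d(t) × (1 + r) / (1 + g) − pb, which may or may not match equations 1 and 2 in the paper term for term. Here d is debt as a share of GDP, r is the nominal interest rate, g is nominal GDP growth and pb is the primary balance as a share of GDP (positive means surplus). The function names are my own, and every number below is hypothetical, not India's actual data.

```python
def next_debt_ratio(debt_ratio, r, g, primary_balance):
    """One year of debt dynamics: d' = d * (1 + r) / (1 + g) - pb."""
    return debt_ratio * (1 + r) / (1 + g) - primary_balance


def debt_stabilising_primary_balance(debt_ratio, r, g):
    """Primary balance that keeps the debt ratio exactly where it is.

    Setting d' = d in the equation above gives pb* = d * (r - g) / (1 + g).
    """
    return debt_ratio * (r - g) / (1 + g)


def simulate(debt_ratio, r, g, primary_balance, years):
    """Return the debt ratio path, starting value included."""
    path = [debt_ratio]
    for _ in range(years):
        path.append(next_debt_ratio(path[-1], r, g, primary_balance))
    return path


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the mechanics.
    d0, r, g = 0.90, 0.07, 0.10
    pb_star = debt_stabilising_primary_balance(d0, r, g)
    print("debt-stabilising primary balance:", round(pb_star, 4))
    for pb in (pb_star, -0.05):
        path = simulate(d0, r, g, pb, years=5)
        print("pb =", round(pb, 4), "->", [round(d, 3) for d in path])
```

The thing to notice is the comparison the article talks about: if the primary balance is above the debt-stabilising level, the debt ratio drifts down, and if it is below, the ratio drifts up. Once you have tracked down actual values for r, g, debt and the primary balance, swap them in and see whether the article's worry makes sense to you.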

  1. Skipping this last point is missing the point altogether, rascalla!
  2. Document your learnings as you go along.
  3. Read the whole article, please. It’s a good way to clear your understanding of this topic, and it is free
  4. The Excel link under Deficit Statistics was down when I tried to access the data. Your mileage may vary.
  5. Page 3

Reproducibility and Replicability

A colleague and I conducted a small behavioral economics and experimental economics workshop for our students at the Gokhale Institute. It was a very small, very basic workshop, but one of the things that came up was the reproducibility problem, or as Wikipedia puts it, the replication crisis.

The replication crisis (also called the replicability crisis and the reproducibility crisis) is an ongoing methodological crisis in which it has been found that many scientific studies are difficult or impossible to replicate or reproduce. The replication crisis most severely affects the social sciences and medicine. The phrase was coined in the early 2010s as part of a growing awareness of the problem. The replication crisis represents an important body of research in the field of metascience.

https://en.wikipedia.org/wiki/Replication_crisis

And further on in that same article:

A 2016 poll of 1,500 scientists reported that 70% of them had failed to reproduce at least one other scientist’s experiment (50% had failed to reproduce one of their own experiments). In 2009, 2% of scientists admitted to falsifying studies at least once and 14% admitted to personally knowing someone who did. Misconducts were reported more frequently by medical researchers than others.

https://en.wikipedia.org/wiki/Replication_crisis

The basic idea behind replicability is very simple: you should be able to take the data and the code from the paper you are reading/reviewing, and replicate the results obtained. You don’t have to agree with the choice of method, or with the results or with anything – you should be able to replicate the results, that’s all.

One basic standard of economic research is surely that someone else should be able to reproduce what you have done. They don’t have to agree with what you’ve done. They may think your data is terrible and your methodology is worse. But as a minimal standard, they should be able to reproduce your result, so that the follow-up research can then be in a position to think about what might have been done differently or better. This standard may seem obvious, but during the last 30 years or so, the methods for reproducibility have been transformed.

https://conversableeconomist.blogspot.com/2021/01/the-reproducibility-challenge-with.html
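
What might "data plus code gives the same result" look like in practice? Here is a minimal sketch, and only a sketch: the idea is to record where the data came from, a checksum of the exact bytes you used, and the random seed, so that somebody else can check they are starting from the same place. The function names, the toy "analysis" (a bootstrapped mean) and the sample data are all hypothetical, made up for illustration.

```python
import hashlib
import random
import statistics


def sha256_of_bytes(data: bytes) -> str:
    """Checksum of the exact bytes used, so others can verify their copy."""
    return hashlib.sha256(data).hexdigest()


def run_analysis(values, seed):
    """Toy analysis: a bootstrapped mean.

    It uses its own seeded random generator, so the same data and the
    same seed always give the same answer.
    """
    rng = random.Random(seed)
    draws = [
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(200)
    ]
    return round(statistics.mean(draws), 6)


def build_manifest(raw_data: bytes, source_note: str, seed: int) -> dict:
    """Bundle the source, data checksum, seed and result in one record."""
    values = [float(x) for x in raw_data.decode("utf-8").split()]
    return {
        "source": source_note,
        "data_sha256": sha256_of_bytes(raw_data),
        "seed": seed,
        "result": run_analysis(values, seed),
    }


if __name__ == "__main__":
    # Hypothetical data and source note, for illustration only.
    raw = b"1.0 2.0 3.0 4.0"
    print(build_manifest(raw, "hypothetical table, page and issue", seed=42))
```

If two people run this on the same bytes with the same seed, the two manifests should match exactly. If the checksums differ, they are not working with the same data, and that is worth knowing before arguing about the results.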

Now (to me, at any rate) this is interesting enough in and of itself, but at the risk of becoming a little meta, reading the rest of Tim Taylor’s post is worth it because it raises so many interesting issues.

The first is a link to a lovely overview of the problem by Lars Vilhuber, published in the Harvard Data Science Review. It is relatively simple to read, and is recommended reading. For example, Vilhuber draws a careful distinction between replicability and reproducibility, and the piece is full of interesting nuggets of information. I’ll list out the major ones (major to me) here. Note that I have simply copy-pasted from the link:

  1. Publication of research articles specifically in economics can be traced back at least to the 1844 publication of the Zeitschrift für die Gesamte Staatswissenschaft (Stigler et al., 1995).
  2. As the first editor of Econometrica, Ragnar Frisch noted, “the original data will, as a rule, be published, unless their volume is excessive […] to stimulate criticism, control, and further studies” (Frisch, 1933)
  3. …only 17.4% of articles in Econometrica in 1989–1990 had empirical content (Stigler et al., 1995)
  4. As Dewald et al. (1986) note: “Many authors cited only general sources such as Survey of Current Business, Federal Reserve Bulletin, or International Financial Statistics, but did not identify the specific issues, tables, and pages from which the data had been extracted.”
  5. Among reproducibility supplements posted alongside articles in the AEA’s journals between 2010 and 2019, Stata is the most popular (72.96% of all supplements), followed by Matlab (22.45%; Vilhuber et al., 2020) (Note: Do check figure 2 at the link. Fascinating stuff.)
  6. It was concluded that “there is no tradition of replication in economics” (McCullough et al., 2006).
  7. The extent of the use of replication exercises in economics classes is anecdotally high, but I am not aware of any study or survey demonstrating this.
  8. The most famous example in economics is, of course, the exchange between Reinhart and Rogoff, and graduate student Thomas Herndon, together with professors Pollin and Ash (Herndon et al., 2014; Reinhart & Rogoff, 2010). (Note to students: this is a fascinating tale. Read up about it!)

There is much more at the link of course, but Tim Taylor’s post does a good job of extracting the key points. I’m noting them here in bullet point fashion, but you really should read the entire thing.

  1. Economic data – our understanding of the phrase needs to change, because a lot of it is in fact not publicly available today.
  2. “Vilhuber writes: “In 1960, 76% of empirical AER [American Economic Review] articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use …””
  3. Restricted Access Data Environments is a new thing that I discovered while writing this blogpost. “…where accredited researchers can get access to detailed data, but in ways that protect individual privacy. For example, there are now 30 Federal Statistical Data Research Centers around the country, mostly located close to big universities.” We could do with something like this in India. Actually, we would be a lot happier with just dbie working the way it was supposed to, but that’s for another day.
  4. Data that is given by creating a sub-sample, data that is ephemeral (try researching Instagram stories, for example) and data that you need to pay for are all challenging, and relatively recent, developments.
  5. I worked for four years in the analytics industry, so believe me when I say this. Data cleaning is a huge issue.
  6. Tim Taylor writes five paragraphs after this one, but this is a glorious para, worth quoting in full:
    “As a final thought, I’ll point out that academic researchers have mixed incentives when it comes to data. They always want access to new data, because new data is often a reliable pathway to published papers that can build a reputation and a paycheck. They often want access to the data used by rival researchers, to understand and to critique their results. But making access available to details of their own data doesn’t necessarily help them much.”

If there are those amongst you who are considering getting into academia, and are wondering what field to specialize in, reproducibility and replicability are fields worth investigating, precisely because they are relatively underrated today, and are only going to get more important tomorrow.

That’s a good investment to make, no?