On Specificity and Sensitivity

Before the pandemic came along, it was rather difficult to get students to be truly interested in the topic of specificity and sensitivity. And in a sense, understandably so. By that I do not mean the topic is not important – it absolutely is – but rather that I can understand why eyes may glaze over just a little bit:

Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition. If individuals who have the condition are considered “positive” and those who don’t are considered “negative”, then sensitivity is a measure of how well a test can identify true positives and specificity is a measure of how well a test can identify true negatives.

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

But when we’ve all got skin in the game, it’s a whole other story.

“We’re going to learn all about specificity and sensitivity today” is one way to begin a class on the topic.

“Let’s say you self-administered a Rapid Antigen Test in 2020, and the test came back positive. Do you have Covid or not?” is another.

Incentives matter!
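And if you want to see why that second opening lands so much harder, here is a minimal sketch of the Covid version of the question, worked through Bayes’ theorem. All three numbers below are assumptions picked for illustration, not figures for any real test:

```python
# A minimal sketch of the "positive RAT, do I have Covid?" question.
# Sensitivity, specificity and prevalence are assumed numbers,
# picked for illustration -- not figures for any real test.

def prob_infected_given_positive(sensitivity, specificity, prevalence):
    """Bayes' theorem: P(infected | positive test)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Say the test catches 80% of infections (sensitivity), correctly
# clears 98% of the uninfected (specificity), and 1% of the
# population is currently infected (prevalence).
print(prob_infected_given_positive(0.80, 0.98, 0.01))  # ~0.29
```

A positive result, and still only a 29% chance of actually being infected – that is the sort of counterintuitive answer that makes students sit up.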


I’ve linked to this thread before, but it is worth sharing once again, for it remains the best way to quickly grok not only what specificity and sensitivity are, but also how to untangle the two in your own head:


Why do I bring this up today? Because now that we’re past the pandemic, how do we motivate students to learn about specificity and sensitivity?

By asking, as it turns out, if we’d prefer detection systems to pick up on more objects in the sky (sensitivity), or get better at picking up only the relevant objects in the sky (specificity)!

After the transit of the spy balloon this month, the North American Aerospace Defense Command, or NORAD, adjusted its radar system to make it more sensitive. As a result, the number of objects it detected increased sharply. In other words, NORAD is picking up more incursions because it is looking for them, spurred on by the heightened awareness caused by the furor over the spy balloon, which floated over the continental United States for a week before an F-22 shot it down on Feb. 4.

https://nyti.ms/3HUWnGD
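What NORAD did, in effect, is what a statistician would call moving the detection threshold. Here is a toy simulation of that trade-off – every number in it is invented, purely to show the direction in which sensitivity and specificity move:

```python
# A toy radar: real objects tend to return stronger signals than
# clutter, but the two distributions overlap. All numbers invented.
import random

random.seed(0)
real_objects = [random.gauss(5, 2) for _ in range(1000)]
clutter = [random.gauss(2, 2) for _ in range(1000)]

# Lowering the threshold = "making the radar more sensitive".
for threshold in (4.0, 3.0, 2.0):
    sensitivity = sum(s > threshold for s in real_objects) / len(real_objects)
    specificity = sum(s <= threshold for s in clutter) / len(clutter)
    print(f"threshold={threshold}: sensitivity={sensitivity:.2f}, "
          f"specificity={specificity:.2f}")
```

Lower the bar and you catch more of what is really up there – and also a lot more weather balloons.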

To a statistician, it doesn’t matter if it’s objects in the sky or objects in your body. The principle remains the same, and it is the principle that you should internalize as a student. But it is equally important that you ask yourself a very simple, and very underrated, question once you’ve learned the principle in question:

Where else is this applicable?

I cannot begin to tell you how much more interesting things become when you ask and answer this question. UFOs and viruses in your body – what a class in statistics this would be!

No?

Signal, Noise and Glenn McGrath

Don’t miss Tony Greig’s explanation at the end.

Accuracy is awesome – accuracy with minor variations is terrifying!

JASP as a Way to Teach Statistics

I’ve long been in search of a statistical tool that, well, does statistical analysis, but does so in a way that doesn’t require much by way of coding, and whose output is intuitive. Above all, using the software should make the idea behind distributions, and the ability to play around with distributions, clearer – and not the other way around.

If, as a bonus, it could give a visual representation of the solutions to your typical ‘normal distribution’ type problems from undergrad level textbooks – well, that would be awesome.
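For concreteness, here is the sort of textbook problem I mean, worked in Python (the numbers are made up). JASP’s promise is that a student gets this same answer point-and-click, with the relevant region of the curve shaded:

```python
# "Heights are normally distributed with mean 165 cm and standard
# deviation 8 cm. What fraction of people are taller than 180 cm?"
# (Made-up numbers, purely for illustration.)
from scipy.stats import norm

mu, sigma = 165, 8
print(1 - norm.cdf(180, loc=mu, scale=sigma))  # ~0.03
```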

And it is early days yet, but JASP seems to be just that software. I learnt about it because of the magic of Twitter:

And much as I dislike the circus around the social media company these days, I still remain hopelessly addicted to the app. The benefits that I get by being a part of it still make it entirely worthwhile.

And well, I downloaded it (it’s free, do go ahead and give it a try), kicked the tires for a bit, and it seems to be very good as a teaching tool for undergrad students. Not just undergrad students, I suppose – anybody who is relatively new to statistics will enjoy this software more than the alternatives when it comes to developing an intuitive feel for the subject.

This was one of the first videos that I saw, and it is reassuring to note that the learning curve is not steep.

I will be trying more things on this in the days to come, but in the meantime, if any of you have tried JASP in the past, and have resources to share, please do send them my way – I’ll update this blogpost.

Thank you!

But Why 68?

Doing mental math has been my way of avoiding boredom for years. I would dream up mental math games to stay awake in classes in school and college, and I often play around with number patterns in my head when I’m out driving or riding. It doesn’t make me the world’s best travel companion, I suppose, but on the flip side, I never get bored on long solo trips.

But even on extremely short trips, such as those between different floors of the same building, I can’t help but think about math. And I’ve always wondered why lifts (elevators, if you want to be all fancy) assume that the average body weight is 68 kilograms.

Try it the next time you get into a lift anywhere in India. Take a look at the maximum number of people the lift says it can carry, and at the maximum permissible weight. The second number divided by the first will always be 68.

Why?


This question has been asked (and answered) on Quora, but the answer isn’t very satisfactory. Nor is the link used within that answer very informative. It simply states, in the case of a tragic accident involving elevators, that one should assume an average weight of 68 kg, and given this assumption, it goes on to state that the accident was because of excess weight.

But that, of course, still leaves our question unanswered – why should one assume that the average weight of a human being is 68 kg?

The internet is a wonderful thing, and its long tail is a sight to behold. A Google search took me to a website that seems to be – best as I can tell – for fans of elevators. Yes, really – and they had this interesting factoid:

In Europe Standard, every passengers are assigned to at least 75 kg by default[1]. However, the suggested people can carry by the elevators are related how many spaces can the elevator allocated. Which is the Available Car Area[2]. (sic)

https://elevation.fandom.com/wiki/Capacity

That first footnote takes you to a Facebook page (of all things), and even there you’ll need to use Google Translate to get at what the text (which is in Chinese) is saying. But that just turns out to be a Hong Kong manual of some sort that simply says the same thing again. And no matter whether it is 68 kgs or 75 kgs, there still seems to be no answer to the question: why?

Wikipedia has an entry on the issue, and here is a table from that entry:

| Region | Adult population (millions) | Average weight | Overweight population / total population | Source |
| --- | --- | --- | --- | --- |
| Africa | 535 | 60.7 kg (133.8 lb) | 28.9% | [12] |
| Asia | 2,815 | 57.7 kg (127.2 lb) | 24.2% | [12] |
| Europe | 606 | 70.8 kg (156.1 lb) | 55.6% | [12] |
| Latin America and the Caribbean | 386 | 67.9 kg (149.7 lb) | 57.9% | [12] |
| North America | 263 | 80.7 kg (177.9 lb) | 73.9% | [12] |
| Oceania | 24 | 74.1 kg (163.4 lb) | 63.3% | [12] |
| World | 4,630 | 62.0 kg (136.7 lb) | 34.7% | [12] |
https://en.wikipedia.org/wiki/Human_body_weight#Average_weight_around_the_world

Note that the data is from 2005. There are lots of interesting data points to ponder over in that table, not the least of which is the overweight population to total population column – North America is at an eye-popping 74%! But the average weight for Asia was only 57.7 kgs. Does that mean we can get more people into lifts in India? Let me be perfectly clear: I am not for even a single second suggesting that we carry out crazy experiments like these ourselves. But as a statistician and an economist, I cannot help but ask if there is slack in the system.

Before we move on to the next table from Wikipedia, a slight digression based on [12] in the last column of that table. That footnote refers to a paper titled “The weight of nations: an estimation of adult human biomass“, and it is quite a read in its own right. This paragraph in particular caught my eye:

The average BMI in USA in 2005 was 28.7. If all countries had the same age-sex BMI distribution as the USA, total human biomass would increase by 58 million tonnes, a 20% increase in global biomass and the equivalent of 935 million people of world average body mass in 2005. This increase in biomass would increase energy requirements by 261 kcal/day/adult, which is equivalent to the energy requirement of 473 million adults. Biomass due to obesity would increase by 434%.

Walpole SC, Prieto-Merino D, Edwards P, Cleland J, Stevens G, Roberts I. The weight of nations: an estimation of adult human biomass. BMC Public Health. 2012 Jun 18;12:439. doi: 10.1186/1471-2458-12-439. PMID: 22709383; PMCID: PMC3408371.

Back to the next table in the Wikipedia entry, however. Here is where it gets even more fascinating. This table claims that the average weight of Indians is 65 kgs – but that’s for men. Women, on the other hand, are at 55 kgs. The table is too large to paste over here, but you can click here.

So where do they get that number from? The table links to an out-of-date webpage, but a little searching around on that page takes you here.

https://drive.google.com/file/d/1j3umH5zcJAGNR_WUFwl3-0rBiemYw8DR/view

But does this mean that the average weight of Indian men is 65kgs, and that of women is 55 kgs? Or does this mean (which is what I think) that the reference Indian adult man and woman have a fixed body weight of 65 and 55 kgs respectively?

The NFHS factsheet simply tells us the percentage of people in India who are adult and overweight (30% in urban areas for both sexes, and about 20% in rural areas, in case you were wondering) – but we still don’t know the average weight of Indians!

My best guess is that the number doesn’t come so much from the average weight of Indians, but rather from the nice round nature of the number itself. Now, you might not think 68 kilograms to be a round number, but try converting it into pounds, and see if my little theory makes any sense.
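If you’d rather not do the arithmetic yourself, here is the spoiler:

```python
# The conversion the previous paragraph hints at:
lb_per_kg = 2.20462
print(68 * lb_per_kg)  # ~149.9 -- almost exactly 150 pounds
```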


If anybody knows anything about this, and is willing to help, it would be much appreciated. But more importantly, to any student of statistics reading this, you can bring the subject magically alive by just looking around you, and by asking questions of a stubborn nature – such as the one that I just tried to answer here!

Just one Object

When I teach courses in introductory statistics, my focus isn’t so much on helping students memorize definitions and formulas as it is on helping them understand the point of the core statistical concepts.

I often ask a student in class to tell us about their favorite movie, for example. Let’s assume that the student in question says “Dulhe Raja”.* Ok, I might say, rate the movie for us. And let’s assume that the student says 9.

I then ask the student if every single aspect of the movie is 9/10. All the songs, all of the fight sequences, all of the dialogues, every single directorial decision – is everything a 9/10? And the usual answer, of course, is no. Parts of the movie do much worse, and there might be some that are a perfect 10. But all in all, if the entire movie had to be summarized in just one number, that number would be nine (in that student’s opinion). Which, of course, is one way to think about averages. It’s a great way to summarize, distill or boil down a dataset into just one data point.

Of course, you would want to worry about whether each dimension of the movie has been given equal importance or not. Dilli-6, for example, gets a score of 6/10 from me, but that’s because the music is just so utterly fantastic. I’m giving much more importance to the music, and not much importance to anything else (which, for me, was almost uniformly meh). And then, of course, we start to talk about weighted averages. This is also a great way to segue into what standard deviation is all about. Then come the formulas and the problem solving, but that’s a whole other story.
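For the curious, here is that whole classroom conversation in code. The scores and weights below are invented; the point is only to see the three summaries side by side:

```python
# Ratings one might give individual aspects of a film, out of 10.
# Scores and weights are invented, purely for illustration.
from statistics import mean, stdev

scores = {"music": 10, "story": 4, "acting": 5, "direction": 5}

# The plain average treats every aspect as equally important...
print(mean(scores.values()))  # 6.0

# ...while a weighted average lets the music dominate:
weights = {"music": 0.7, "story": 0.1, "acting": 0.1, "direction": 0.1}
print(sum(weights[k] * scores[k] for k in scores))  # 8.4

# And the standard deviation tells you how much that one number
# hides -- the higher the spread, the less representative the mean:
print(stdev(scores.values()))  # ~2.71
```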


So why am I speaking about this right now? Because I read an article in The Print the other day, which asked an interesting question that reminded me of all of what I’ve written about above:

If there was a cultural artefact that truly represents everything that is India today, what would it be?

https://theprint.in/opinion/pov/rasgulla-taj-mahal-sanskrit-what-if-i-told-you-to-pick-one-object-that-represents-india/1175979/

What a question to think about, no? Read the rest of the article to find out the author’s own answer, but in what follows, I want to try and think through my own answer to this question.

First, the recognition that we’re talking about a truly multi-dimensional problem. India is diverse in terms of her geography, her languages, her dance forms, her religions, her architecture, her food, her music – I can go on and on. As, I’m sure, can you!

So should we try and come up with an artefact that covers all (or as many dimensions as possible) at once? Maybe a movie, maybe a song, maybe an epic? But can (and should) a movie or a song encompass all of what India is across space and time?

The Mahabharata, maybe? A saga told in multiple languages, in various forms, from the viewpoint of many different protagonists, interpreted in a variety of ways over the centuries, and containing innumerable references to music, dance, food, sport and architecture, besides so much else.

The only other artefact that might qualify must have something to do with food. We might have different ingredients, different techniques and different methods of preparing our food, but we all love a good meal, no? So might there be a dish, or a drink, that truly represents everything that India is today?

Tea? Nimbu paani? Khichdi?

Or does this dataset have so much variance that the average isn’t really a good representation?

I haven’t yet found an answer that satisfies me – which is a good thing! – but I do think I have found a good question to teach statistics better. No?

*This is a fantastic movie, and I will not be taking any questions.

Statistics 110 from Harvard University

Knock yourself out, and join me as you do so! And do follow @stat110 on Twitter.

Thinking Probabilistically

In this past Monday’s post, I spoke about how I disagree with the idea that economics is about putting numbers on everything.

But a conversation I had over the weekend helped me think about how I might be wrong in this regard. If economics is about getting the most out of life, and if there are opportunity costs to everything in life – and I think both of these ideas are central to thinking like an economist – then thinking probabilistically is a skill more of us should pick up.

Why? Because life is uncertain.


Getting the most out of life requires us to make decisions. These decisions are based on information that is almost always going to be incomplete, because acquiring all possible information relevant to the decision-making process is an expensive, time-consuming exercise. How do we know if we have collected “enough” information? We don’t – we make, at best, an educated guess.

And regardless of how much information we have collected, we live in an uncertain world. Our best laid plans are likely to go awry. And so we make decisions on the basis of incomplete information, and the outcome of these decisions is uncertain.

For example: which college should I enroll in? I’ll decide this by taking a look at the college website, speaking to folks within the college, speaking to some of its alumni, and maybe visiting the college to speak to some professors. I could do this for as long as I like, and as thoroughly as I possibly can. But I will never be able to acquire all the information relevant to this decision. And so my decision to enroll in a college is made on the basis of incomplete information.

And say you did all the background research, and attended all the introductory seminars, and visited the college, and enrolled in it in, say, 2019. Six months into your course, the pandemic hits. Not your fault, not the college’s fault, and maybe we will never know for sure whose fault it is – but the outcome of your decision to enroll certainly wasn’t one you were expecting.

But this also applies to which chai tapri to visit after having bunked one of the lectures in the first six months. And whether to have a second cutting chai at that tapri. The point is, every single decision needs an evaluation on your part. And this evaluation is always with incomplete information, and the outcomes are always uncertain.

But can you evaluate the probability that things will work out reasonably well? What if, in 2019, you had asked yourself about the chance that there would be a major pandemic that would disrupt college life completely? Well, an entirely reasonable approach would have been to take a look at the past one hundred years and ask if something along these lines had taken place. And you would have to conclude, quite reasonably, that the chance of something like this happening in any given year was one in a hundred, at most. You should therefore have bet on something like this not happening.
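A back-of-the-envelope version of that bet, assuming (as the hundred-year lookback suggests) a one-in-a-hundred chance of a major pandemic in any given year:

```python
# What were the odds of a major pandemic hitting during a
# three-year degree, if the chance in any one year is 1 in 100?
# (The 1-in-100 figure is the rough assumption from the text.)
p_per_year = 1 / 100
years = 3
print(1 - (1 - p_per_year) ** years)  # ~0.03, i.e. about 3%
```

A 3% chance of disruption. Enrolling was the right bet – it just happened to be the bet that lost.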

The point I’m making is not that you would have been wrong in this particular case – the point is the process of thinking probabilistically. I’m not saying you should whip out paper and pencil every time you decide to have chai at the tapri. But for most major decisions where probability based calculations are possible, it helps to put a probability based estimate on things working out. You will still end up making the wrong bet every now and then, and that’s fine. The point is that you know the odds (more or less) going into battle, and this helps you decide whether or not to engage in battle in the first place.


Again, I have not read Russ Roberts’ book just yet, but I think the point he is making in it is that some problems are beyond the reach of this probability-based approach to life. And I still stand by what I said in Monday’s post: economics ought to be about more than putting a number on most things.

But that being said, thinking probabilistically is very underrated, and I would encourage you to get started.

Two final points: my thanks to Amit Varma for a conversation about this that inspired this post. And trust me, he knows more than a thing or two about thinking probabilistically.

And second, via Samarth Bansal, a website that seems very promising in terms of helping all of us learn about statistics and probability.

What is common to Butyrylcholinesterase and Vitamin D, or why English is an underrated skill in statistics

Today’s blog post title is in the running for the longest title that I have come up with, but let’s ignore this particular bit of potential trivia and get on with it.

Today’s story really begins with the tragic tale of Sally Clark. What follows is a very lengthy extract from a piece I wrote along with a friend some months ago. Lengthy, but fascinating:

In November of the year 1999, an English Solicitor named Sally Clark was convicted on two charges of murder, and sentenced to life imprisonment. This tragic case is notable for many reasons — one of those reasons was the fact that her alleged victims were her own sons. Another was the fact that both were toddlers when they died.
The cause of death in both cases was initially attributed to sudden infant death syndrome (SIDS), also known as cot death in the United Kingdom. We did not know then, and do not know until this day, about the specific causes of SIDS. But suspicion grew on account of the fact that two children from the same family had died due to unspecified causes, and shortly after the death of her second child, Sally Clark was arrested, tried and convicted.
One of the clinching pieces of evidence was expert testimony provided by the pediatrician Professor Sir Roy Meadow. He put the odds of two children from the same family dying of SIDS at 1 in 73 million — in other words, an all but impossible eventuality. On the back of this testimony, and others, Sally Clark was convicted of the crime of murdering her own sons, and sent to prison for life.
One cannot help but ask the question: how did Sir Roy Meadow arrive at this number of 1 in 73 million? Succinctly put, here is the theory: for the level of affluence that Sally Clark’s family possessed, the chance of one infant dying of SIDS was 1 in 8543. This was simply an empirical observation. What then, were the chances that two children from the same family would die of SIDS?
The answer to this question, statisticians tell us, depends on whether the two deaths are independent of each other. If one assumes that they are, then the probability of two deaths in the same family is simply the multiplicative product of the two probabilities. That is, 1 in 8543 multiplied by itself, which is 1 in 73 million and that would be enough to convince any “reasonable man” that the deaths were deliberate and could not have been just coincidence.
But on the other hand, if the two events are not independent of each other — say, for example, that there are underlying genetic or environmental reasons that we simply are not aware of just yet — then it is entirely possible that multiple children from the same family may die of SIDS. In fact, given a SIDS death in a family, research shows that the likelihood of a second SIDS death goes up.
Sally Clark’s convictions were overturned on her second appeal, and she was released from prison. She died four years later due to alcohol poisoning.

https://www.scconline.com/blog/post/2021/12/14/data-analysis-an-essential-skill-for-the-legal-community/
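It is worth seeing the arithmetic at the heart of the case laid out. The 1-in-8543 figure is from the testimony; the dependence multiplier below is purely illustrative, a stand-in for whatever shared genetic or environmental risk a family might carry:

```python
# Sir Roy Meadow's calculation assumed the two deaths were independent.
p_one_sids = 1 / 8543
p_two_independent = p_one_sids ** 2
print(f"1 in {1 / p_two_independent:,.0f}")  # 1 in ~73 million

# But if a first SIDS death makes a second one, say, ten times more
# likely than the baseline (an assumed multiplier, for illustration),
# the number changes by an order of magnitude:
p_two_dependent = p_one_sids * (10 * p_one_sids)
print(f"1 in {1 / p_two_dependent:,.0f}")  # 1 in ~7.3 million
```

The entire case, in other words, turned on one word: independent.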

We’ll get back to this truly tragic tale, but let’s go off on a tangent for a second.


Today’s a day for extracts from my own earlier work, it would seem, for I have another one for your consideration:

Us teaching type folks love to say that correlation isn’t causation. As with most things in life, the trouble starts when you try to decipher what this means, exactly. Wikipedia has an entire article devoted to the phrase, and it has occupied space in some of the most brilliant minds that have ever been around.
Simply put, here’s a way to think about it: not everything that is correlated is necessarily going to imply causation.
For example, this one chart from this magnificent website (and please, do take a look at all the charts):

https://econforeverybody.com/2021/05/19/correlation-causation-and-thinking-things-through/
https://www.tylervigen.com/spurious-correlations

Hold on to this line of thinking, and let’s get back to the tragic Sally Clark story, but with a twist towards the rather more optimistic side of things.


Great news, right? We’ve found what causes SIDS!

Well, that’s where it gets tricky, and we go off on yet another tangent.


Do Vitamin D supplements help? We know that sunlight gives us Vitamin D, and that’s A Good Thing. So if we don’t get enough sunlight, hey, let’s get Vitamin D injections or supplements:

In interpreting vitamin D-related study results, correlation should not be understood as causation. Diets composed of vitamin D–rich foods such as dairy products and salmon also contain high levels of other healthy nutrients. Those who have a high vitamin D level are likely to participate in active outdoor activities and exercises, to be interested in health issues, and to have a healthy lifestyle. Without considering these confounders, misleading results can be obtained. In the study by Kim et al.,4) a univariate analysis revealed a correlation between a low vitamin D level and a low quality of life score; however, its significance was lost when age, sex, income, education level, and disease state were considered.
Sometimes, correlations shown in cross-sectional studies are used as evidence for requiring vitamin D supplements. A recent increasing trend of taking vitamin D supplements may be due to these effects.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4961851/

What if Vitamin D is just a marker? That is, what if sunlight causes a lot of good things in our bodies, and also causes Vitamin D levels in our body to go up? Then it isn’t vitamin D that causes the increase in our wellbeing – it is sunlight causing both the uptick in our wellbeing and the increase in our Vitamin D levels. The two (health and vitamin D levels) may just be correlated, without there being any causation.
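You can watch this happen in a simulation. In the toy world below, vitamin D has no effect on wellbeing whatsoever – sunlight drives both – and yet the two come out strongly correlated. (This is a sketch of the confounding story, not a claim about the actual biology.)

```python
# Sunlight as a confounder: it causes both vitamin D levels and
# wellbeing, which are otherwise causally unrelated.
import random

random.seed(42)
n = 10_000
sunlight = [random.gauss(0, 1) for _ in range(n)]
vitamin_d = [s + random.gauss(0, 0.5) for s in sunlight]  # caused by sunlight
wellbeing = [s + random.gauss(0, 0.5) for s in sunlight]  # also caused by sunlight

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

print(corr(vitamin_d, wellbeing))  # ~0.8, with zero causation between them
```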

What I’m about to say is important: I’m not a doctor. All I’m saying is, I’ve been confused often enough about correlation and causation to wonder about whether vitamin D causes good health. It is correlated, there’s no arguing with that. But causation? Ah, that’s another (very tricky) thing altogether.


And now that we have the mise-en-place of this blogpost done, let’s get the dish together.

Butyrylcholinesterase doesn’t necessarily cause SIDS in infants. Infants who die of SIDS stop breathing (for reasons that are still not clearly understood), and these infants have low levels of butyrylcholinesterase. But butyrylcholinesterase may not even be what causes breathing to stop in infants. It may just be a marker – there is correlation there, but we don’t know if there is causation.

In fact, the paper’s title itself says as much:

“Butyrylcholinesterase is a potential biomarker for Sudden Infant Death Syndrome”

But the tweet above speaks about how we’ve found the cause, and that’s not quite right.

Again, please don’t misunderstand me – the fact that this has been discovered is awesome, it is fantastic, and the joy, the relief and the euphoria should absolutely be there.

But:

Sally Clark lost her life at least in part to a fundamental misunderstanding of statistical theory, and we still don’t know what causes SIDS. We understand it better, but there is a ways to go.


The most underrated skill in statistics is the English language.

Words matter, and we all (myself included!) need to be more careful about what exactly we mean when we speak about statistics.

And thank god we’re closer to figuring out how to deal with the horrible, horrible thing that SIDS is.

But if you’re teaching or learning statistics, tread very, very carefully.

Reflections on Whole Numbers and Half Truths

Single narratives have never been able to explain all of India.

S, Rukmini. Whole Numbers and Half Truths: What Data Can and Cannot Tell Us About Modern India (p. 220). Kindle Edition.

There is this line that is often quoted when big picture discussions about India take place, and it is only a matter of time before it comes up: whatever you say about India, the opposite is also true. The quote is attributed to Joan Robinson, and I can’t help but wonder if I will end up creating a paradox of sorts by agreeing wholeheartedly with it.

But I do agree with the spirit of the quote, which is why that one line extract from Rukmini S’s book, Whole Numbers and Half Truths, resonated so much with me. All countries are complex and complicated, but India takes the game to giddying heights.

Take a look at this map, a version of which is present in Rukmini’s book:

https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_fertility_rate

What is India’s TFR? First, for those uninitiated in the art and science of demography: TFR stands for Total Fertility Rate or, as Hans Rosling used to put it, babies per woman. Well, it’s 2.0, which is good, because roughly speaking, two parents giving birth to two children means we’re at the replacement rate (note that this is a very basic way of thinking about it, but useful as a rough approximation).

But as any student of statistics ought to tell you, that’s only half the story (or half the truth). Uttar Pradesh, Bihar and Jharkhand are well above the so-called replacement rate, and that will have implications for labor mobility, taxation, political representation and so, so much more in the years to come.

Data, then, is only half the story. How is the data collected? If it is a sampling exercise rather than a census, how was the sampling done? Has the sampling method changed over time? If so, are earlier data collection exercises comparable with current ones?

How should one think about the data that has been collected? What does it mean, and how much does context matter? For example:

‘That’s data about marriage, madam,’ he said—not about love. ‘I think if your data asked people if they have ever fallen in love with someone from another caste or religion, many will say yes. I see that all around me among my friends. But when it comes to getting married, most of us are not yet ready to leave our families. That’s why your data looks like that,’ he said. As for the rest? ‘There is a lot we will not admit to someone doing a survey. But things are changing. At least for some of us,’ he said.

S, Rukmini. Whole Numbers and Half Truths: What Data Can and Cannot Tell Us About Modern India (pp. 127-128). Kindle Edition.

Rukmini’s excellent book is, in one sense, a deep reflection on the data that we have, have had, and would like to have where India is concerned. It speaks about how data has been collected, which are the agencies and institutions involved, how these have changed (and been changed) over time, and with what consequences.

But it also is a reflection on a truism that many economists and statisticians underrate: data can only take you so far. As the subtitle of her book puts it, it is an analysis of what data can and cannot tell you about modern India.

And what data leaves out is often as fascinating as what it includes:

Yet, most people know little about the NCRB’s processes and methodology. For instance, the NCRB follows a system known as the ‘principal offence rule’. Instead of all the Indian Penal Code (IPC) sections involved in an alleged crime making it to the statistics, the NCRB only picks the ‘most heinous’ crime from each FIR for their statistics. I stumbled upon this then unknown fact in an off-the-record conversation with an NCRB statistician in the months after the deadly sexual assault of a physiotherapy student in Delhi in September 2012. In the course of that conversation, I learnt that the crime that shook the country would have only made it to the NCRB statistics as a murder, and not as a sexual assault, because murder carries the maximum penalty. This, I was told, was to prevent the crime statistics from being ‘artificially inflated’: ‘If the FIR is for theft, there will be a[n IPC] section for assault also, causing hurt also. If you include all the sections, people will think these are separate crimes and the numbers will seem too huge,’ he told me. After I reported this,2 the NCRB for the first time began to include the ‘principal offence rule’ in its disclaimer.3

S, Rukmini. Whole Numbers and Half Truths: What Data Can and Cannot Tell Us About Modern India (p. 13). Kindle Edition.

The paragraph that follows this one is equally instructive in this context, but the entire book is full of such Today-I-Learnt (TIL) moments. Even for those of us involved in academia, there is much to learn in terms of nuance and context by reading this book. If you are not in academia, but are interested in learning more about this country, recommending this book to you is even easier!

Rukmini’s book spans ten chapters on ten different (but obviously related) aspects of India. In the first three chapters, we get to learn how Indians tangle with the cops and the courts (or quite often choose not to!), how we perceive the world around us, and why Indians vote the way they do. The next three are about how (and with whom) we live our lives, and how we earn and spend our money. The next trio is about how and where we work, how we grow and age, and where Indians live. The final chapter is about India’s healthcare system.

Each chapter makes us familiar with the data associated with each of these topics, but each chapter is also a reflection on the fact that data can only take us so far. When you throw into the mix the fact that the data will always (and sometimes necessarily) be imperfect, we’re left with only one conclusion – analyze the data carefully, but bear in mind that reality will always be more complex. Data is, at the end of the day, an abstraction, and it will never be perfect.


One reason I liked the book so much is its brevity. Each of these chapters could (and should) be a separate book, and condensing them into chapters can’t have been an easy task. But not only has she managed it, she has managed to do so in a way that is lucid, thought-provoking and informative. Two out of these three would have been a good achievement; to achieve all three, and that across ten chapters, is a rare ol’ achievement.

If I’m allowed to be greedy, I would have liked a chapter on the world of data that the RBI collects and, to its credit, does share with us via its website – though it does so in a way that is best described as unintuitive. In fact, a book on how data sharing practices with the citizenry need to improve dramatically across government portals – at all levels, and across all verticals – would be a great sequel (hint, hint!).


I’d strongly recommend this book to you, and I hope you enjoy reading it as much as I did.

We will be hosting Rukmini on the Gokhale Institute campus this coming Friday, the 29th of April. The event will be from 5.30 pm to 7.00 pm at the Kale Hall. She and I will speak about the book for about an hour, followed by a Q&A session with the audience.

If you are in Pune, please do try and make it!

Should students of law be taught statistics?

I teach statistics (and economics) for a living, so I suppose asking me this question is akin to asking a barber if you need a haircut.

But my personal incentives in this matter aside, I would argue that everybody alive today needs to learn statistics. Data about us is collected, stored, retrieved, combined with other data sources and then analyzed to reach conclusions about us – at a pace that is now incomprehensible to most of us.

This is done by governments and by private businesses, and it is unlikely that we’re going to revert to a world where this is no longer the case. You and I may have different opinions about whether this is intrusive or not, desirable or not, good or not – but I would argue that this ship has sailed for the foreseeable future. We (and that’s all of us) are going to be analyzed, like it or not.

And conclusions are going to be made about us on the basis of that analysis, like it or not. This could be, for example, a computer in a company analyzing us as a high value customer and according us better service treatment when we call their call center. Or it could be a computer owned by a government that decides that we were at a particular place at a particular time on the basis of the footage from a security camera.

In both of these cases (and there are millions of other examples besides), there is no human being who makes these decisions about us. Machines do. This much is obvious, because it is now beyond the capacity of our species to deal manually with the amount of data that we generate on a daily basis. And so the machines have taken over. Again, you and I may differ on whether this is a good thing or a bad thing, but the fact is that it is a trend that is unlikely to be reversed in the foreseeable future.

Are the conclusions that these machines reach infallible in nature? Much like the humans these machines have replaced, no, they are not infallible. They process information much faster than we humans can, so they are definitively better at handling much more data, but machines can make errors in classification, just like we can. Here, have fun understanding what this means in practice.

Say this website asks you to draw a sea turtle. And so you start to draw one. The machine “looks” at what you’ve drawn, and starts to “compare” it with its rather massive data bank of objects. It identifies, very quickly, those objects that seem somewhat similar in shape to those that you are drawing, and builds a probabilistic model in the process. And when it is “confident” enough that it is giving the right answer, it throws up a result. And as you will have discovered for yourself, it really is rather good at this game.
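That “confident enough” logic is worth making concrete. Here is a sketch of it, with made-up classes and scores – this is not how Google’s model actually works, just the shape of the decision rule:

```python
# Answer only when the model's best guess clears a confidence bar.
def guess(probabilities, threshold=0.75):
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else None  # None = keep watching

# Early strokes: could be anything, so the machine stays quiet...
print(guess({"sea turtle": 0.40, "tortoise": 0.35, "frog": 0.25}))  # None

# ...more strokes, more confidence, and out comes the answer.
print(guess({"sea turtle": 0.80, "tortoise": 0.15, "frog": 0.05}))  # sea turtle
```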

But is it infallible? That is, is it perfect every single time? Much like you (the artist) are not, so also with the machine. It is also not perfect. Errors will be made, but so long as they are not made very often, and so long as they aren’t major bloopers, we can live with the trade-off. That is, we give up control over decision making, and we gain the ability to analyze and reach conclusions about volumes of data that we cannot handle.

But what, exactly, does “very often” mean in the previous paragraph? One error in ten? One in a million? One in an impossibly-long-word-that-ends-in-illion? Who gets to decide, and on what basis?

What does the phrase “major blooper” mean in that same paragraph? What if a machine places you at the scene of a crime on the basis of security camera footage when you were in fact not there? What if that “fact” is used to convict you of a crime? If this major blooper occurs once in every impossibly-long-word-that-ends-in-illion times, is that ok? Is that an acceptable trade-off? Who gets to decide, and on what basis?
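Here is one reason the question is so hard, sketched with invented numbers: even a tiny error rate, multiplied across enough comparisons, produces real victims.

```python
# Suppose a face-matching system has a one-in-a-million false
# positive rate, and is run against fifty million comparisons a
# year. (Both numbers invented, purely for illustration.)
false_positive_rate = 1 / 1_000_000
comparisons_per_year = 50_000_000
print(false_positive_rate * comparisons_per_year)  # 50 wrong matches a year
```

Fifty people a year wrongly “placed” somewhere. The error rate alone doesn’t answer the question; the scale matters too.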


If you are a lawyer with a client who finds themselves in such a situation, how do you argue this case? If you are a judge listening to the arguments being made by this lawyer, how do you judge the merits of this case? If you are a legislator framing the laws that will help the judge arrive at a decision, how do you decide on the acceptable level of probabilities?

It needn’t be something as dramatic as a crime, of course. It could be a company deciding to downgrade your credit score, or a company that decides to shut off access to your own email, or a bank that decides that you are not qualified to get a loan, or any other situation that you could come up with yourself. Each of these decisions, and so many more besides, are being made by machines today, on the basis of probabilities.

Should members of the legal fraternity know the nuts and bolts of these models, and should we expect them to be experts in neural networks and the like? No, obviously not.

But should members of the legal fraternity know the principles of statistics, and have an understanding of the processes by which a probabilistic assessment is being made? I would argue that this should very much be the case.

But at the moment, to the best of my knowledge, this is not happening. Lawyers are not trained in statistics. I do not mean to pick on any one college or university in particular, and I am not reaching a conclusion on the basis of just one data point. A look at other universities’ websites, and conversations with friends and family who are practicing lawyers or are currently studying law, yield the same result. (If you know of a law school that does teach statistics, please do let me know. I would be very grateful.)


But because of what little I know about the field of statistics, and for the reasons I have outlined above, I argue that statistics should be taught to students of law. It should be a part of the syllabus of law schools in this country, and the sooner this happens, the better it will be for us as a society.