Rukmini S discusses data journalism in India, reporting on opinion surveys, and the latest census figures

Rukmini S is the National Data Editor at The Hindu, India’s second-most circulated English-language newspaper. She studied in Pune, Mumbai, and London and has worked at The Times of India previously. Her interests lie in politics, gender issues, and caste, and she is passionate about integrating data and story-telling into everyday news.

On Friday afternoon, I met with Rukmini at The Hindu office in Delhi to talk about data journalism in India, reporting on opinion surveys, the recent release of religion data from the 2011 census, and the 2011 Socio Economic and Caste Census. The transcript of the conversation below has been lightly edited for length and clarity.

On data journalism in India

Sam Solomon: In the United States, data journalism is a growing phenomenon. Last year was really significant in that Ezra Klein left Wonkblog to start up Vox, Nate Silver left The New York Times to move FiveThirtyEight to ESPN, and The New York Times started their own data vertical with The Upshot. I’d like to know what the status of data journalism is in India.

Rukmini S: I definitely feel much more positive about it now than I did, say, a year ago. There’s been a lot of growth in the last year or so. The way things stand right now is that there are just two mainstream organizations that have in-house capacity, which would be us at The Hindu and The Times of India. Nobody else has in-house data journalists.

But there’s a bunch of startups that do data stuff whose work gets syndicated by the mainstream media. Very good ones. They’re all of excellent quality. There’s one called How India Lives, then there’s another called IndiaSpend. There’s one called Factly, and there’s Gramener. I think I’ve got the main ones covered. They are all of very high quality, and they get syndicated and used. So the end result is that there is on most days some quality data stuff in most newspapers.

TV, no. TV is just elections. It doesn’t do data stuff otherwise. They do data only for elections. They outsource it to a company that does pre-polling and then on the day of the elections they usually outsource some analysis. Usually to the same company.

I think there’s a problem both in talent and investment for in-house stuff in the media. But there’s exciting startups whose work can be used by other people.

SS: And these organizations that you just listed, are they independent journalists?

RS: Yes. So all of them have some — not all, Gramener doesn’t — but the rest have all got some journalistic presence in them. How India Lives, for example, is set up entirely by three former journalists who used to do some data stuff with The Economic Times, then set this up.

So their quality is very good, but I still feel there’s an in-house data need. For example, on any given day we [The Hindu] are able to respond very fast to things that happen. We are able to use data in our regular stories. We are able to get other journalists who don’t necessarily do data stuff to incorporate it into their reporting. We’re hiring more people. This is something we’re building. So I do think that organizations that don’t do that are losing out on talent. They’ll get special things that these people do for them, but they’re not going to get sustained data integration into their journalism.

SS: You describe yourself as a “data geek” on your Twitter profile. Who should data geeks who are trying to understand India better be reading?

RS: All of these organizations that I have mentioned. Who else? Nobody else that I can think of in the data journalism space. But if it’s not journalism, if it’s data that you’re interested in, then of course there’s a bunch of organizations, some government, some not. There’s CSDS (the Centre for the Study of Developing Societies). There’s the National Council of Applied Economic Research (NCAER), which puts out great socioeconomic data. They’re the only people with a large enough nationally representative sample. They do a household survey every five years, so they’re the only non-government people [with such samples].

And then there’s smaller groups. There’s an organization called Janaagraha in Bangalore, which is also very rigorous with their sampling, which is often invaluable. They’re good on urban issues.

It’s a bit limited. I’d say that all of these organizations that I mentioned and I have similar interests, so that just means that the world of data journalism is restricted to our interests. Gramener is a bit broader. But we don’t have things like sports data journalism or a lot of entertainment stuff. There’s a lot of space.

SS: Using data, you write about political issues, social issues, economic issues, gender issues. Which sources of data do you generally rely on when you’re writing about these issues?

RS: My first preference is always to go with official sources, not because I have some particular faith in the government, but because the size and scope of the datasets that the government produces are unparalleled.

So the big thing would be the census and the National Sample Survey Office (NSSO). That covers a broader range of issues than most people know about. I was able to do something a month or two ago on Indians’ domestic holidaying activities. That was from the National Sample Survey Office. Most people wouldn’t even know that they survey holidaying.

Then there’s NCAER. CSDS, whenever I can get– as you know, not all of their data is publicly available. It’s easier to get insights from them than to get the data and do your own insights, but whenever I can use them.

Then on specific things, like on crime, there’s a National Crime Records Bureau. Some of the ministries collect their own data. There’s a big national family health survey, which is part of the global DHS (Demographic and Health Survey), so whenever that comes out, that’s valuable. And then occasionally things happen like the caste census, so those become very valuable. The election commission is a very valuable source for data as well.

On non-official data, I tend to end up writing more about individual studies done by academics, many of which are secondary analyses of primary data collected by the government. Other primary data sources besides NCAER… it’s not much fun to write about any of the rest, because they’re just not representative enough.

SS: In your view, what issues in India are most in need of data? More data or better data?

RS: Caste, for sure.

SS: This is because the last comprehensive caste census was done in the 1930s, right?

RS: Yeah. So the new caste census, whenever the numbers come out, does address some of the things that you will want to know, but… I suppose when better primary data comes out, there will be possibilities for better analysis. There’s a lot that’s said anecdotally but not really backed up by evidence about the relative socioeconomic levels of specific castes, and people talk extraordinarily confidently about things that there just isn’t data for. It would be interesting to be able to actually hold those commonly-held beliefs to data.

Then there’s not enough good polling on opinions, views, the Gallup / Pew world of things. Sometimes Pew’s India stuff sounds very wonky, so I don’t know. I trust Pew and its systems, but it’s not satisfactory enough.

Simple things like how many Indians are vegetarian, for example. The Hindu commissioned something from CSDS ten years ago, I don’t think there’s been any opinion polling after that on it. Or things like inter-caste marriage. Again, CSDS occasionally asks this in its surveys, but it shouldn’t come as a byproduct of one election. We should be having much more polling on this stuff.

One of the big problems with our crime data is that, as in most countries, it’s officially recorded crime statistics, not experience or victimization [statistics]. So the UK, for example, has moved to a victimization survey as its primary crime statistics metric. Victimization surveys would be useful here.

And in general, I think better integration of gender-specific issues into general polling would also be really useful. People deduce from electoral surveys things like, “It appears as if more women would vote for this or that.” But there isn’t too much direct polling about that.

SS: What types of gender-specific issues?

RS: On violence, for example. NCAER asks a good set of questions on intra-family decision-making. But it would be useful to have more of that. On wealth-related things. Or on why women’s participation in India’s labor force is so low. It would be good to have more actual questions.

SS: Going back to caste, because the last caste census was done in the 1930s, what sources of data do you or other analysts use to look at caste-related trends in India?

RS: There is some amount of broad grouping available on SC (Scheduled Castes), ST (Scheduled Tribes), and others.

SS: SC, ST, that’s how the census reports them.

RS: Not only the census. The NSSO as well. That’s quite valuable actually, because the NSSO has such a wide range of surveys, so you have SC/ST-specific data on a whole range of issues. But within SCs and STs, there is no disaggregation of the data, and there’s very little OBC (Other Backward Classes) data as well.

I feel like we’re really only able to say broad-brush things about these groups, though we’re still able to say valuable things, and I think things that don’t get enough play. It is still valuable that we’re able to say what a large gap there still is in educational outcomes and socioeconomic outcomes between SCs, STs, and others. And since we have it for years, because the NSSO goes way back, we are able to also talk about where things are diverging or converging. There is some amount of valuable stuff we are able to say.

The conversation about caste is so disaggregated, and for so much of it we’re talking about specific caste groups. That’s stuff we don’t really have data on. And now there is more and more a feeling that the benefits of affirmative action are not being equally enjoyed by all of the groups within the particular groupings, so it would also be valuable to be able to find which are the groups that require extra emphasis.

SS: You said there’s very little data available on OBCs. Is there data available on OBCs?

RS: I don’t feel like they do it in every survey but I’ve seen an OBC column occasionally. You should let me get back to you on that because I’ll figure out exactly what I’m talking about. The NCAER are able to say some things about OBCs, and also on some NSSO surveys, but I can’t at the moment remember which.

On reporting on opinion surveys

SS: My research project is looking at the measurement of public opinion in India. When you are reporting on survey data for an article, what are the different things that you are looking at to assess the quality of the data?

RS: Do you mean electoral opinion polling surveys only?

SS: Maybe you could answer for both.

RS: So one of the things — I tell everybody I can about this — is that I’m very concerned with how little attention is paid to the sample description before we write about surveys. It’s absolutely an epidemic in the media, which is writing about surveys without explaining anything about the sample and, I fear, having bothered to find out anything about the sample either.

So with opinion polling, I find that sometimes before a big election, I’ll need to report what the big opinion polls have said and I might not invest too much into that. But if I’m writing something detailed, as I have in the past, I don’t write much about those polling agencies which do not give me a detailed sample description. This is why I write much more about CSDS and occasionally about CVoter than the rest, because the rest do not give me sample details. And this applies to all sorts of surveys in India.

This is the problem with not having in-house data journalism in newspapers. Sometimes I can tell from the reporting that the reporters have some sense that a representative sample matters and they might even ask about it. But the way they sort it out is that the person will tell them, “Yeah, we administered this survey to these people we know, and a majority of them were poor, and half were women.” This sort of post facto description of the sample. There’s a lot of that as well. All of that is really problematic and I do try to tell my colleagues in other cities as well: even if you don’t necessarily describe it all in the article, you need to figure it out before you write on the survey. So I avoid writing about surveys that I don’t have a full sample description of, or where I’m not confident that they’re not just inventing it.

Janaagraha, they’re a small organization and they work on select issues. They make serious attempts to have, even if they’re just doing some polling in Bangalore, they’ll go down to the ward level, and it will be representative on all these demographic indicators. Booster sampling too, for specific groups–women, Muslims, these groups. I would just give them more play. I would have more time for this sort of thing.

The other problem is that there is now a lot of Internet opinion polling and smartphone-based opinion polling. Since we know the description of India’s Internet-connected population, we know very clearly what biases that’s going to throw up. But none of that gets reflected in the coverage of it.

More than anything else in writing about data, I think writing about surveys is the most problematic part.

SS: Has journalistic coverage of opinion surveys changed over your time in journalism at all?

RS: No. I’ve only been doing data journalism for five years now, and I know I will always ensure that one paragraph refers to methodology. We were tied up with CSDS once, for example. And CSDS is very happy to put an entire methodological note themselves if you give them the space for it. I still don’t see that anywhere else, and I still don’t see that from any other organization either. No other media organizations nor other polling agencies are doing this. No, that’s not really changed.

The thing that still matters for people is knowing the number. They don’t even want a margin. People put out projections which are like a range, and then Times of India will take the midpoint of the range and put that in it. So no, I don’t feel like that’s getting better. I don’t know if there’s appetite for it.

I do feel like–if I point it out on Twitter, for example–there was a terrible opinion poll for which I pointed out that there was no methodology described in it. And it later turned out that it was a smartphone-based one. So then there will be other people who then, you know, tweet at the editor about the methodology.

SS: Trolls?

RS: (laughs) No, I think in this case nerds, really, more than trolls.

SS: They’re not trolls. They’re nerds. Okay.

RS: So maybe there’s some appetite for it, but if you can get by without doing it… All that will push you to actually do it is your own sense of integrity and I don’t know if that’s come.

On the August 2015 release of the religion data from the 2011 census

SS: You wrote a piece on how the religion data that were released in August were released in a way that was very different from the release of the 2001 religious data in 2004. No press conference. No technical briefings for journalists. No panelists with an expert demographer to contextualize the numbers. How did you interpret these data when they were first released?

RS: We all knew that the data was going to come for some time now, so I had with me my spreadsheet with the past numbers ready, because I knew that on the day it came I was going to have to scramble. I didn’t expect that none of that would be put out. Maybe not all the way back to 1951, but I assumed at least the last two [censuses], 2001 and 1991, would be put out whenever the numbers were released. I had it with me; I speak to demographers often so it was easy to get through to them.

I wasn’t very surprised that it came out the way it did because about six months ago the highlights from it had already been leaked to select journalists, none of whom are journalists who ever really cover census or data issues. So there was every indication that this wasn’t going to be released in the regular way. Census data are very, very rarely leaked. Almost never gets leaked. So there was something clearly up.

SS: What do you think was up? Why do you think the data were released in this way?

RS: I know there’s a lot of conspiracy theorizing about this, but I just don’t find the numbers controversial enough for any conspiracy theories to make sense. So I cannot say that I know why this happened. If there was something wildly shocking that was coming out of these numbers, then you could have some theory that– I do not understand why the UPA government, who had these numbers, did not release it. It does not conflict with their politics in any way. I just don’t see why they didn’t release it earlier because I don’t see how the numbers conflict with their politics. They just aren’t controversial numbers. They’re completely in line with everything that was expected. So I am unable to come up with theories for why. I do not feel like it had anything to do with the Bihar elections either.

The dangerous part has been in the media reporting of it, and I’m not sure if even the people releasing it anticipated just how terrible the media coverage could be. I’m not able to figure out what the grand plan, if there was one behind this, was. To me, the most surprising and disturbing part was just how poorly it was reported by most of the media.

SS: Your headline for The Hindu on the data, “Muslim population growth slows,” was quite different from that of, say, The Hindustan Times, which was “Hindus less than 80% of country’s population” [editor’s note: the online headline reads “Muslim population grows marginally faster: 2011 census data,” but here’s the headline from the print edition]. It was also pretty different from a lot of Hindi-language press coverage that focused on the Hindu growth rate as being lower than the Muslim growth rate. Why did you choose to focus on the relative convergence between the Hindu and the Muslim growth rates since 2001?

RS: The census, especially when it comes to these things, is primarily a source of demographic information. And demographic information is not a dot in the graph disconnected from everything else. It has a history, and that history is clearly documented and freely available. So whatever census numbers I cover, I always compare them to past numbers. It just seems irrelevant and of no use to present census numbers with no reference to the past. These numbers–the religion numbers for the past–are particularly easily available because they’ve been widely studied by a large number of demographers. And fertility rates are of great interest to Indian demographers. So it was no trouble to find any of these numbers.

To give some leeway to the media, I’d say that I don’t blame them for some of it, which is that most of them essentially went by the press release that was put out by the Home Ministry — it was not put out by the census itself, but by the Home Ministry — which was not a very well-worded release and it had no context in it at all.  So if you wanted to do the bare minimum and not make much extra effort, most of them faithfully, largely reproduced that press release. And if you don’t have the resources, you don’t know who to contact, you don’t have these numbers with you, and the numbers came out at 7 PM that day… I don’t blame them for part of it, for their inability to do any better. But because it was so easy to put some context in it, maybe the media should have put some effort into it as well.

And to me the thing is that clearly the first point in it was that the Muslim growth rate is faster than the Hindu growth rate. But if I had given that as the headline, I could have given that headline every year for the last sixty years. That seems like the least interesting thing anybody could say about the census. What’s interesting is what’s changed. And this was what has changed. So it was not any sort of political decision on our part to go with that. As journalists, what’s interesting is what has changed and this is what has changed.

SS: But it seems they did sort of contextualize it, just at a simpler level of analysis, in that they said, “Before the Hindu share was 80.5%, and now it’s 79.8%. And that’s the change we’re looking at. That’s the context.”

RS: That’s another thing that I found mystifying. What makes 80 a special number? For the previous census, should the headline have been “Hindu proportion now under 81%?” Just because it’s a round number doesn’t make it a particularly important number. It seems like a very juvenile impulse to go with that as a newspaper. It’s an irrelevant number. And people can call it — what did the RSS (Rashtriya Swayamsevak Sangh) call it? — a “psychological point” or a “psychological number.” But that makes no sense to me.

On the 2011 Socio Economic and Caste Census (SECC)

SS: Do you think the data from the Socio Economic and Caste Census of 2011 will be released any time soon? It certainly would be of great benefit to researchers.

RS: So I would say that it seems like an extremely hard task to do. It’s been handed over now to an expert group within the NITI Aayog, which I’m not sure is the right home for it.

SS: NITI Aayog?

RS: The Planning Commission has now been re-named the NITI Aayog, which is the planning and research organization under the government. A committee has been set up under Arvind Panagariya, who heads it, to figure out what to do with it. I think that the expertise for it lies with the census. So I’m not sure how far this committee’s going to get.

It seems incredibly hard to categorize basically every last name or the people who have given their caste name, to sort out all the different ways people call them. Technically, it seems like an extremely hard thing to do, and that might take time apart from whether they want to put it out or not.

And then the other thing is it bothers me that multiple things are being sought. The SECC sought to do multiple things. Not all of the jati [caste] data were its job or should be its job. One of the main things the SECC was meant to do was to accurately identify poor families. All that we have in India at the moment is the poverty line and identifying who lies below that poverty line is on the basis of an old and outdated BPL census.

SS: The BPL census is…?

RS: I think ‘95, maybe.

SS: What does BPL stand for?

RS: Below Poverty Line. In India, experts draw a poverty line, then you have to find the people under it and those are the poor people.

SS: And they haven’t updated it since 1995?

RS: No. I’ll check the data again, but I think it’s 1995. Part of what the SECC was supposed to do is identify a range of socioeconomic indicators for each family, then decide what combination of socioeconomic indicators you wanted to, as a government, add up to poverty. And then you would know specifically whether this household is poor or not. It seemed like a much more direct and useful way of identifying the poor than we’ve had for a while.

That has been combined with the caste census part of it. Both seem like very difficult, complicated exercises which are now expected to be done from one census. That seems a bit fraught to me, to try to do two such important jobs from one thing. I suppose if they’re able to order and group all of them properly, it seems like a valuable source of information.

Again, the political implications of it are complicated. The argument never was that affirmative action was meant to be a purely economic activity. It was always meant to have a socially redistributive aspect to it. So I think it’s genuinely worth discussing the concerns about whether this socioeconomic data on caste groups is going to be used to rethink caste and rethink affirmative action. I see the concerns.

To me, it sounds like it’s going to take ages.

SS: And you don’t think the holdup is political at all?

RS: The holdup was political for a while, because there’s no reason for it… It was ready some time ago.

SS: It was?

RS: The caste part of it was not ready. The socioeconomic part of it was ready, and the caste part, no work was done on it for a while. The information was there but it wasn’t grouped.

I’m not comfortable with immediately jumping to conspiracy theorizing about this because politicians in particular make really wise — surprisingly wise and thoughtful statements about socioeconomic status and caste. I think politicians more than anybody else are very well aware of the dimensions of caste, and not just the socioeconomic thing alone. So I don’t think that there’s an immediate worry about showing somebody as richer than the other, and then what will we do about it? I don’t think it would immediately become something explosive and that politicians couldn’t manage the conversation.

So I don’t actually know, and I don’t have a conspiracy theory that I am particularly fond of on why it has taken this long. It does seem very hard to do.

SS: To code all those jatis, yeah.

RS: And especially because people aren’t ticking boxes. They’re answering things. They don’t necessarily answer things in the proper format.

SS: Using different spellings for the same name, things like that.

RS: Right. Things like that.

SS: Thank you very much.


