Access to better data about social media and other Internet activities can teach scholars and the public a lot about how to confront the ongoing coronavirus crisis as well as future pandemics. With effective structures and incentives, policymakers and major Internet companies can use the current crisis to improve scholarship and aid public health. These efforts should start with an initiative to standardize data formats and make them more accessible to more researchers.

Quite simply, the profusion of data that exists around social media and other aspects of online life provides insight into human behavior unparalleled in human history. From the geotags attached to photographs to trending hashtags, there’s a lot that could shed light on the pandemic. Existing tools, including certain public data sets, readouts of individual body temperatures around the country, common search terms and frequent topics of discussion, already help track the pandemic and direct resources.

And, once the pandemic subsides and life returns to normal, hindsight will provide even more important information. There might be surprising trends that correlate with better post-pandemic outcomes and evidence of places where behavior before the pandemic made things worse. Social media may provide evidence as to how well (or poorly) people obey quarantine orders. Searches may reveal what people miss the most about daily human contact. The causes and impacts of the pandemic are sure to be a major topic of study for at least a generation, and social media will be a key source of data for this research.

But right now, high-quality social media data can be difficult to come by. Legitimate issues of user privacy, commercial risk and data costs combine to render such data difficult or impossible to obtain. Even when researchers can acquire relevant data, tasks as simple as grouping anonymized user accounts by geography across platforms require significant data cleaning and computational prowess. This difficulty is due in large part to a lack of interoperable standards. Much of the best research we have on social media deals with Twitter, for example, largely because it has made its data more accessible than other platforms have, not because it’s the most important platform. There’s no point in assigning blame for this set of circumstances, but urgent public health needs provide a reason to do something about it soon.

This is where standards come in. The entire Internet is built on a series of interoperable, widely used standards — SMTP for email, JPEG for many types of graphics and RTF for many text files. While many of these standards originated at a specific business or university, they’re open for public use and have some sort of independent oversight. Similar standards for some aspects of social media data make sense and, obviously, the wide range of platforms indicates we’ll need more than one. The development of interoperable standards for medical records was helped by limited antitrust safe harbors for IT and medical providers to collaborate.


Internet companies that want similar standards should have the same assurances. Some have proposed their own versions of interoperability. There are good reasons why many companies might fear such standards, particularly if they involve government mandates: Los Angeles’ messy fight over a mobile data standard shows how contentious a mandate can become. But the best Internet standards have typically emerged from long dialogues. Better data standards, including standard ways of anonymizing data and formatting common elements such as hyperlinks and screen names, could both give companies more tools and stave off high-handed government mandates.

Likewise, data access for researchers needs improvement. Although social media giants have made progress in recent years, their policies can be inconsistent and have changed frequently and unpredictably. Measures taken to protect individual privacy may be effective but tend to greatly restrict the ways that data can be used.

While it’s probably impractical to create an entirely new data access structure and procedure in the middle of a crisis, it should be a priority soon afterward. There are real pitfalls, of course: Companies may be understandably reluctant to take part, and past experience shows that big data can be misused when it becomes a substitute for, rather than a supplement to, traditional methods. Whatever happens, voluntary, collaborative standards should rule the day.

Better social media data access could make a difference in fighting future pandemics. Researchers should care about data access and, once the crisis subsides, it’s something that university administrators, social media giants and policymakers should consider carefully.
