Bigger Data Sets Don’t Necessarily Yield Better Results

December 8, 2022
By Stephanie Desmon

Working with a data set of more than 38 million produces more than a few insights, insights important to the larger global health community. But two researchers from the Johns Hopkins Center for Communication Programs caution those using large data sets not to assume that the size of the data set is all that matters.

CCP’s Marla Shaivitz, director of digital strategy, and Tuo-Yen Tseng, PhD, an assistant scientist at the Johns Hopkins Bloomberg School of Public Health, presented their findings, “Lessons Learned About Big Data, Transparency and Advocacy from a Global Survey on Behavior,” at the 2022 International Social and Behavior Change Communication Summit in Marrakech, Morocco.

They discussed how they interpreted data about COVID-19 knowledge, attitudes and practices gleaned from the enormous COVID-19 Trends and Impact Survey (CTIS) to populate the COVID Behaviors Dashboard.

The dashboard was designed to give policymakers and public health practitioners answers to important questions such as who was getting vaccinated in their countries, whether the population believed COVID was a real risk and whether people were regularly wearing masks. The survey was conducted over Facebook from May 2021 to June 2022. It was the largest survey of its kind and provided policymakers in many countries crucial data they would have otherwise gone without.

The researchers discussed how they decided to present their data and how the team dealt with issues related to bias. They also spoke about a phenomenon called the “Big Data Paradox” and how it related to the dataset.

“There is the belief that larger datasets provide more accurate estimates because of their size,” Shaivitz said. But, she cautioned, that’s not always the case.

In the case of the CTIS, Tseng said, much of the data provided periodic snapshots of COVID behaviors that corresponded to reports on the ground. It soon became clear, however, that there were data gaps within demographic groups and in many subnational regions. In many countries, respondents were majority male, more urban and better educated than the general population, and the poor were largely left out because they didn’t have the digital access needed to take the survey.
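The gap Tseng describes can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population shares, the vaccination rates and the function name `simulate` are hypothetical and are not drawn from the CTIS. It compares a very large survey that reaches only people with internet access against a much smaller random sample of the whole population.

```python
import random

def simulate(seed=0):
    """Compare a large online-only survey with a small random sample.

    All numbers below are made-up assumptions for illustration.
    """
    rng = random.Random(seed)
    # Hypothetical population: 60% have internet access, 40% do not.
    # Assumed vaccination rates: 70% in the online group, 40% in the offline group.
    p_online, rate_online, rate_offline = 0.6, 0.7, 0.4
    true_rate = p_online * rate_online + (1 - p_online) * rate_offline

    def draw(online):
        # One respondent's answer: True if vaccinated.
        rate = rate_online if online else rate_offline
        return rng.random() < rate

    # Large survey: 200,000 respondents, all from the online group.
    big = [draw(True) for _ in range(200_000)]
    # Small survey: 1,000 respondents sampled from the whole population.
    small = [draw(rng.random() < p_online) for _ in range(1_000)]

    return true_rate, sum(big) / len(big), sum(small) / len(small)

true_rate, big_estimate, small_estimate = simulate()
print(f"true rate:          {true_rate:.3f}")
print(f"large online-only:  {big_estimate:.3f}")
print(f"small random:       {small_estimate:.3f}")
```

Under these assumptions the large survey settles very precisely on the online group's rate rather than the population's, while the small random sample is noisier but centered on the true value. Adding more online respondents would not close that gap.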

Tseng emphasized the need for future studies to promote equitable and inclusive representation with data. “It’s important because our data inform policy and practices and we need to think about who we are serving,” she said. “This means that we need to be strategic and tailor recruitment approaches and consider ways to reach hard-to-reach communities.”

As the team worked with the data, inequalities became clear, particularly in the areas of gender, age and geographic location (urban vs. rural), as well as in representation from low- and middle-income countries. Shaivitz illustrated the point by showing data from Mozambique in June 2022, the last data period visualized on the COVID Behaviors Dashboard.

She noted it’s important to be transparent about what the data do and do not reveal. It was intentional, she said, to include the phrase “insufficient data” when the number of respondents was under 100. For example, in June, Mozambique’s sample for women wasn’t large enough to display on the dashboard, nor were data from the less educated segments of the population.
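A display rule like the one described above can be sketched in a few lines. This is a hypothetical illustration, not the dashboard's actual code: the function name `display_value` and the sample figures are assumptions, and only the threshold of 100 and the label “insufficient data” come from the presentation.

```python
MIN_RESPONDENTS = 100  # threshold mentioned in the presentation

def display_value(n_respondents, estimate):
    """Return the text to show for one subgroup (hypothetical helper).

    Subgroups with fewer than MIN_RESPONDENTS respondents are labeled
    instead of showing an estimate that rests on too little data.
    """
    if n_respondents < MIN_RESPONDENTS:
        return "insufficient data"
    return f"{estimate:.0%}"

# Made-up example figures, not real survey results.
print(display_value(850, 0.62))  # large enough subgroup: shows a percentage
print(display_value(40, 0.75))   # too few respondents: shows the label
```

Showing the label rather than hiding the row keeps the gap itself visible, which is the kind of transparency Shaivitz describes.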

It is important to ensure transparency in data use and interpretation, Shaivitz said. There needs to be clear communication about the quality and limitations of the data, as well as how best to use them.

In the case of CTIS, it is important to recognize the biases that exist in an online population and understand that surveys that recruit respondents through social media are not well-suited for estimating percentages and values among the general population. But they are efficient, cost-effective tools that work well for surveillance: tracking changes over time and making comparisons across places or groups.

A large sample size doesn’t automatically make a data set representative. Tseng noted, “We should not be blindly pursuing big sample sizes. However, it doesn’t mean that we should trash data that are imperfect; everything has its limitation. And there’s no single data set that can answer all questions.”