Behavioural scientists use social media to quickly and cheaply gather huge amounts of data about what people are thinking and doing, but researchers at Carnegie Mellon University in the US and McGill University in Canada have found that those massive datasets may be misleading. Carnegie Mellon's Juergen Pfeffer and McGill's Derek Ruths said that scientists need to find ways of correcting for the biases inherent in the information gathered from Twitter and other social media, or at least to acknowledge the shortcomings of that data. It is not an insignificant problem: the researchers noted that thousands of research papers each year are now based on data gleaned from social media, a source of data that barely existed even five years ago.
"Not everything that can be labelled as 'Big Data' is automatically great," Pfeffer said. He said that many researchers think - or hope - that if they gather a large enough dataset, they can overcome any biases or distortions that might lurk there.
Researchers often try to generalise their study results to a broad population, but social media sites tend to have substantial population biases. Generating the random samples that give surveys their power to accurately reflect attitudes and behaviour is problematic, the scientists said.
Instagram, for instance, has special appeal to adults between the ages of 18 and 29, African-Americans, Latinos, women and urban dwellers, while Pinterest is dominated by women between the ages of 25 and 34 with average household incomes of $100,000. Yet Ruths and Pfeffer said researchers seldom acknowledge, much less correct, these built-in sampling biases.
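One common way to correct for a sampling bias of this kind is post-stratification: average the measurement within each demographic group, then weight each group by its share of the target population rather than its share of the site's users. The sketch below is only an illustration of that general idea, not the method Ruths and Pfeffer propose. The function names, the group labels and the toy numbers are all hypothetical, and the approach assumes the population shares are known and that every group has at least one observation.

```python
from collections import defaultdict


def raw_mean(samples):
    """Unweighted mean of the values in a list of (group, value) pairs."""
    return sum(value for _, value in samples) / len(samples)


def poststratified_mean(samples, population_shares):
    """Mean of the values, reweighted to match known population shares.

    samples: list of (group, value) pairs drawn from a biased source.
    population_shares: dict mapping group -> share of the target population
        (assumed known from an outside source such as a census).
    Groups that appear in samples but not in population_shares are ignored.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in samples:
        totals[group] += value
        counts[group] += 1

    # A group with no observations cannot be reweighted at all.
    missing = [g for g in population_shares if counts[g] == 0]
    if missing:
        raise ValueError(f"no samples for groups: {missing}")

    # Normalise the shares so they sum to 1 even if given as percentages.
    share_total = sum(population_shares.values())
    return sum(
        (share / share_total) * (totals[g] / counts[g])
        for g, share in population_shares.items()
    )


if __name__ == "__main__":
    # Made-up toy data: group "a" is heavily over-represented in the sample.
    toy_samples = [("a", 1.0)] * 8 + [("b", 0.0)] * 2
    toy_shares = {"a": 0.5, "b": 0.5}  # hypothetical true population split
    print(raw_mean(toy_samples))                         # 0.8
    print(poststratified_mean(toy_samples, toy_shares))  # 0.5
```

In the toy data the over-represented group pulls the raw average up to 0.8, while the reweighted estimate is 0.5. Reweighting of this sort only addresses the biases a researcher can measure; it does nothing about the proprietary filtering and bogus accounts that the article also describes.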
Other questions about data sampling may never be resolved because social media sites use proprietary algorithms to create or filter their data streams, and those algorithms are subject to change without warning.
Most researchers are left in the dark, though others with special relationships to the sites may get a look at the sites' inner workings. The rise of these "embedded researchers," Ruths and Pfeffer said, is in turn creating a divided social media research community.
In an article published in the journal Science, the researchers also noted that not all "people" on these sites are even people. Some are professional writers or public relations representatives who post on behalf of celebrities or corporations; others are simply phantom accounts. Some "followers" can be bought.
The social media sites try to hunt down and eliminate such bogus accounts - half of all Twitter accounts created in 2013 have already been deleted - but a lone researcher may have difficulty detecting those accounts within a dataset, according to Ruths and Pfeffer.