Social media data leak highlights murky world of data scraping

A data brokerage left its database of 235 million Instagram, TikTok and YouTube profiles exposed to anybody who cared to access it

A company that sells data on social media influencers to marketers left an unsecured database of information pulled from 235 million Instagram, TikTok and YouTube accounts exposed on the web, with no password or other authentication required to access it, raising questions over the ethics of scraping publicly available data.

This is according to Bob Diachenko of Comparitech’s cyber security research team, who discovered three identical copies of the datasets accessible from the public internet at the beginning of August.

The data comprised almost 200 million Instagram records in two separate sets, 42 million TikTok records, and four million YouTube records. It included profile names, real names, profile photos, account descriptions, profile status, follower engagement statistics, and the age and gender of the account holder. Diachenko said that a significant number of the records also contained contact details such as phone numbers and email addresses.

The incident raises serious questions about the ethics of data brokers, and how the data that social media users put on their accounts is scraped, used and shopped around.

Diachenko’s investigation at first appeared to suggest that the data came from a company called Deep Social, which was banned from Facebook and Instagram’s marketing APIs two years ago. It was also threatened with legal action if it continued to copy data and information from social media profiles, a practice that is against the terms of service of all the platforms concerned.

However, when Deep Social was contacted, its admins forwarded the disclosure to a different company called Social Data, whose chief technology officer acknowledged the exposure and removed the servers within a couple of hours.

In emails to Diachenko, Social Data insisted that it had not obtained the information surreptitiously, and that the data concerned was freely available to anybody with internet access, regardless of its own activities, because the information was public on the social media platforms themselves.

Opening the floodgates

Nevertheless, wrote Comparitech’s Paul Bischoff in a disclosure blog, the information is still vulnerable to spam and marketing campaigns, and users of the platforms should be on the lookout for scams or phishing messages.

“Even though the information is publicly available, the size and scope of an aggregated database makes it more vulnerable to mass attack than it would be in isolation,” he said.

Besides providing useful information for phishing campaigns, said Bischoff, there are other risks to affected users. For example, he said, the images and data of high-profile influencers could be used to create imitation accounts to lure in followers and promote scams or misinformation, or their photos could be used to train facial recognition algorithms, as was done by a company called Clearview AI, which is facing legal action over its unethical practices.

Comforte AG’s Mark Bower, senior vice-president and data security specialist, said that even though the data exposed was for the most part publicly available, if it had fallen into the hands of cyber criminals it could be used as an accelerant for targeted attacks to obtain more valuable information.

“Specific personal data enables more effective spear phishing to attack an enterprise with higher risk, higher value data,” he said. “The bottom line here is enterprises need to be both protecting their own personal data to neutralise it from risk of theft and scraping, and ensuring employees don’t become the vector of exploits from attackers who have more socially-exploitable data on them than the businesses they report to.”

Chris DeRamus, vice-president of technology at Rapid7’s cloud security unit, added: “While most of the user data in this leak was publicly available on user profiles, the risk of phishing is amplified due to the large accumulation of user data collected in the exposed databases. 235 million social media users are at risk of their information being sold on the dark web because of unsecured databases, one of the most common yet easily preventable security risks.

“Companies must employ security tools that are capable of detecting and remediating misconfigurations (such as databases left unsecured without a password) in real time, or better yet – preventing them from ever happening in the first place.”

Usability versus security

Gurucul CEO Saryu Nayyar said this incident spoke to an age-old conundrum for social media users – the challenge of striking a balance between their ability to use the platform effectively and their own cyber security hygiene.

“We have to assume our information will escape from 3rd parties, so how little information can we expose and still use the social media services we've come to rely on? At the very least, it's worth separating the addresses and information we associate with our critical accounts, such as banking or health, from our strictly social activities. That keeps a compromise of one from leading to a direct compromise of the other,” said Nayyar.

Chloé Messdaghi, Point3 Security strategy vice-president, said the incident showed how it was important for people to understand how data scraping works and how it puts them at risk.

“It’s essentially the use of personal information without permission, for profit,” she said. “It is an act against the individual’s privacy rights and it puts all of those whose data is scraped at sharply increased risk of attack from phishers. Data scraping companies, perhaps unintentionally, support malicious actors and enable cyber criminals to do the things they do. 

“Hackers respect the terms and conditions of social media sites, but data scraping companies and malicious actors do not – yet these companies are unregulated and face no consequences,” said Messdaghi.

“Data scrapers conveniently say the data they’re scraping is public but disregard that social media sites have terms and conditions that scrapers tend to ignore…. Clearly, when scraping is involved, the personal data we entrust to one platform doesn’t stay on that platform – despite the site’s own policies.”

Ultimately, the best way to avoid putting your data at risk on a social media platform is not to use the platform at all. If that is not an option you can face, the next best option is to lock down your profile as tightly as possible, as Social Data, the firm at the centre of this incident, itself said in its response to Comparitech.

“Social networks themselves expose the data to outsiders – that is their business – open public networks and profiles. Those users who do not wish to provide information, make their accounts private [sic],” the firm said.
