Social media data leak highlights murky world of data scraping

A data brokerage left its database of 235 million Instagram, TikTok and YouTube profiles exposed to anybody who cared to access it

Alex Scroxton, Security Editor

Published: 20 Aug 2020 13:15

A company that sells data on social media influencers to marketers left an unsecured database of information pulled from 235 million Instagram, TikTok and YouTube accounts exposed on the web, with no password or other authentication required to access it, raising questions over the ethics of scraping publicly available data.

This is according to Bob Diachenko of Comparitech’s cyber security research team, who discovered three identical copies of the datasets accessible from the public internet at the beginning of August.

The data comprised almost 200 million Instagram records in two separate sets, 42 million TikTok records, and four million YouTube records. It included profile names, real names, profile photos, account descriptions, profile status, follower engagement statistics, and the age and gender of the account holder. Diachenko said that a significant number of the records also contained contact details such as phone numbers and email addresses.

The incident raises serious questions about the ethics of data brokers, and how the data that social media users put on their accounts is scraped, used and shopped around.

Diachenko’s investigation at first appeared to suggest that the data came from a company called Deep Social, which was banned from Facebook and Instagram’s marketing APIs two years ago. It was also threatened with legal action if it continued to copy data and information from social media profiles, a practice that is against the terms of service of all the platforms concerned.

However, when Deep Social was contacted, its admins forwarded the disclosure to a different company called Social Data, whose chief technology officer acknowledged the exposure and removed the servers within a couple of hours.

In emails to Diachenko, Social Data insisted that it had not obtained the information surreptitiously, and that the data concerned would have been freely available to anybody with internet access regardless of its own activities, because the information was publicly available on the social media platforms themselves.

Opening the floodgates

Nevertheless, wrote Comparitech’s Paul Bischoff in a disclosure blog post, the information could still be exploited in spam and marketing campaigns, and users of the platforms should be on the lookout for scams or phishing messages.

“Even though the information is publicly available, the size and scope of an aggregated database makes it more vulnerable to mass attack than it would be in isolation,” he said.

Besides providing useful information for phishing campaigns, said Bischoff, there are other risks to affected users. For example, he said, the images and data of high-profile influencers could be used to create imitation accounts to lure in followers and promote scams or misinformation, or their photos could be used to train facial recognition algorithms – as was done by a company called Clearview AI, which is facing legal action over its unethical practices.

Comforte AG’s Mark Bower, senior vice-president and data security specialist, said that even though the data exposed was for the most part publicly available, if it had fallen into the hands of cyber criminals it could be used as an accelerant for targeted attacks to obtain more valuable information.

“Specific personal data enables more effective spear phishing to attack an enterprise with higher risk, higher value data,” he said. “The bottom line here is enterprises need to be both protecting their own personal data to neutralise it from risk of theft and scraping, and ensuring employees don’t become the vector of exploits from attackers who have more socially-exploitable data on them than the businesses they report to.”

Chris DeRamus, vice-president of technology at Rapid7’s cloud security unit, added: “While most of the user data in this leak was publicly available on user profiles, the risk of phishing is amplified due to the large accumulation of user data collected in the exposed databases. 235 million social media users are at risk of their information being sold on the dark web because of unsecured databases, one of the most common yet easily preventable security risks.

“Companies must employ security tools that are capable of detecting and remediating misconfigurations (such as databases left unsecured without a password) in real time, or better yet – preventing them from ever happening in the first place.”

Usability versus security

Gurucul CEO Saryu Nayyar said this incident spoke to an age-old conundrum for social media users – the challenge of striking a balance between their ability to use the platform effectively and their own cyber security hygiene.

“We have to assume our information will escape from 3rd parties, so how little information can we expose and still use the social media services we've come to rely on? At the very least, it's worth separating the addresses and information we associate with our critical accounts, such as banking or health, from our strictly social activities. That keeps a compromise of one from leading to a direct compromise of the other,” said Nayyar.
