How organisations can right-size their data footprint

More data is not always better, says Gartner, which is calling for organisations to focus on metadata and synthetic data to reduce their data liability and address privacy challenges

Rather than collect as much data as they can, organisations should focus on metadata and synthetic data to reduce their data liability and address privacy issues, according to data and analytics experts from Gartner.

During the keynote address at the Gartner Data and Analytics Summit in Sydney, Australia, Peter Krensky, director and analyst at the research firm’s business analytics and data science team, said that although organisations have been integrating volumes of big data, not all of that data is useful.

Krensky said that such “big data thinking” – that more data is always better – is outdated and counterproductive. He noted that the issue is being exacerbated with the cloud, which “gives us unlimited space to store everything we thought we wanted”.

Sally Parker, research director in Gartner’s chief data officer leadership team, said that being “data plumbers” who lay data pipelines from source to storage is not a good strategy either. She called instead for data concierges who can guide the business towards the data sources that yield the right insights.

And while data has been described as an asset, it can also be a liability because it costs money to mine and keep data, said Parker, who advised organisations to focus on data that drives business results.

This can be done in a few ways, said Krensky, starting with generating metadata, which organisations do not have enough of today. “Without metadata, we do not have meaningful data and we don’t know what we have, what it means and where it came from,” he added.

Krensky said that traditionally, it required a lot of effort to maintain metadata, but by applying machine learning techniques, organisations can transform it into active metadata.

“Active metadata continually detects and adjusts to the patterns in our data, and this is going to enable self-organising and self-optimising concepts such as the data fabric,” he said. “This is not only a more efficient data management environment, but it also drives usage.”

With a data fabric powered by active metadata, Krensky said, data drift caused by changing customer behaviour could be detected automatically, and data scientists, for example, could be given a nudge that it is time to refresh their predictive models.
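As a rough illustration of the kind of drift check such a nudge could rest on, the sketch below compares a model's training data with more recent data using the population stability index (PSI). It is a minimal example under stated assumptions, not a description of any Gartner or vendor tooling: the function names, the simulated customer spend figures and the 0.2 alert threshold (a common rule of thumb) are all hypothetical.

```python
import math
import random


def population_stability_index(baseline, current, bins=10):
    """PSI between two numeric samples, using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


def needs_refresh(baseline, current, threshold=0.2):
    """Hypothetical nudge rule: flag the model when PSI exceeds the threshold."""
    return population_stability_index(baseline, current) > threshold


# Simulated (not real) customer spend: the data the model was trained on,
# a later sample with unchanged behaviour, and one where spending has shifted.
rng = random.Random(42)
training_spend = [rng.gauss(100, 15) for _ in range(5000)]
same_behaviour = [rng.gauss(100, 15) for _ in range(5000)]
changed_behaviour = [rng.gauss(130, 15) for _ in range(5000)]

print(needs_refresh(training_spend, same_behaviour))
print(needs_refresh(training_spend, changed_behaviour))
```

In this sketch only the shifted sample trips the threshold. An active metadata layer would presumably run checks of this sort continuously across many datasets rather than on demand.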

It can also notify data engineers that certain use cases are generating new categories of data, he said. “And if the same data is being used by multiple people across the organisation, active metadata can tell us that they are probably making interrelated decisions.”

Organisations could also look for “small data” that could be more accurate, safer, cheaper and more accessible. Krensky said such data could be more insightful than the big data organisations tend to collect by habit.

“Just like the minimum viable product is enough to get the job done, we should aspire to have minimum viable datasets,” he said.

For example, a hotel that is looking to launch a health and wellness programme just needs to examine data about guests who use the hotel’s gym and what they ordered for room service, rather than perform complex demographic and psychographic analyses.

Krensky added: “Going on a data diet can be healthy. Cutting out all that junk data that bloats our systems, costs us money, raises our data risks and distracts us from the nutritious data that will help us grow. Sometimes, less is truly more.”

To reduce data risks and identify useful data, organisations can create synthetic data, which is artificially created data with similar attributes to the original data. According to Gartner, synthetic data will enable organisations to avoid 70% of privacy violation sanctions.
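To make the idea of “similar attributes” concrete, the sketch below fits a few summary statistics to a small two-column dataset and then draws entirely artificial rows that share them. It is a deliberately simple illustration, assuming the data is well described by a bivariate normal distribution; the column meanings, function names and the stand-in “real” data are hypothetical, production synthetic data tools are considerably more sophisticated, and nothing here assesses how much privacy protection the output actually provides.

```python
import math
import random
import statistics


def fit_profile(rows):
    """Summarise two numeric columns by mean, standard deviation and correlation."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(profile, n, rng):
    """Draw n artificial rows from a bivariate normal with the fitted statistics."""
    p = profile
    # Weight on the independent noise term; clamped to guard against rounding.
    residual = math.sqrt(max(0.0, 1 - p["rho"] ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = p["mx"] + p["sx"] * z1
        y = p["my"] + p["sy"] * (p["rho"] * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Stand-in for sensitive records (hypothetical): nights stayed and total spend.
rng = random.Random(7)
real = []
for _ in range(5000):
    nights = rng.uniform(1, 7)
    real.append((nights, 80 * nights + rng.gauss(0, 40)))

real_profile = fit_profile(real)
synthetic = sample_synthetic(real_profile, 5000, rng)
synthetic_profile = fit_profile(synthetic)

print(real_profile)
print(synthetic_profile)
```

The two printed profiles should be close to each other, which is the sense in which the synthetic rows preserve the original's attributes. Anything the fitted profile does not capture, such as the shape of each column's distribution, is lost in this simple approach.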

Parker said: “If you have sensitive customer data that you want to use but you can’t, you could replace it with synthetic data without losing any of the insights it can deliver.” She added that this could also facilitate data sharing across countries and in industries such as healthcare and financial services.

In the UK, for example, the Nationwide Building Society used its transaction data to generate synthetic datasets that could be shared with third-party developers without risking customer privacy, she said.

Parker said synthetic data will also enable organisations to plug gaps in the actual data used by artificial intelligence (AI) models. Gartner estimates that synthetic data will completely overshadow real data in AI models by 2030.

Parker noted that Amazon is already using synthetic data to accelerate the Alexa voice assistant’s ability to support more languages, such as Brazilian Portuguese. This data would have taken longer and been more expensive to collect in real life.

Waymo, the self-driving subsidiary of Google’s parent Alphabet, is also using synthetic data to simulate driving 20 million miles a day. It has also been able to simulate unusual driving circumstances, such as getting out of the way of an ambulance.

Parker added: “Start investigating synthetic data, because today’s use cases of reduced risk will enable tomorrow’s use cases and better predictive models.”
