Misunderstanding of data lakes is causing a number of risks to be overlooked or underplayed by vendors eager to align the emerging concept with Big Data opportunities.
"In broad terms, data lakes are marketed as enterprise wide data management platforms for analyzing disparate sources of data in its native format," said Nick Heudecker, research director at Gartner.
"The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organization," he added.
"However, while commercial hype suggests customers across an enterprise will leverage data lakes, it assumes that all are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata," added Heudecker.
"The need for increased agility and accessibility for data analysis is the primary driver for data lakes," said Andrew White, vice president and distinguished analyst at Gartner. "Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise wide data management has yet to be realized."
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured. In doing so, they aim to cut the cost and resources needed to manage many independent data collections. They may also resolve a problem facing Big Data projects, which require a large amount of varied information that could be difficult to analyse sufficiently if placed in structured storage.
"Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used — data is simply dumped into the data lake," said White. "However, getting value out of the data remains the responsibility of the business end user.”
The risks are substantial. The biggest is that a data lake accepts any data regardless of quality or value, and so risks becoming an unusable swamp. Without metadata, every subsequent use of the data means analysts start from scratch.
This acceptance of all data is clearly also a risk to both security and success. The security capabilities of central data lake technologies are still embryonic, and the possibility of sensitive data being exposed is too high for comfort. Performance issues are also unavoidable with such an unstructured collection of data.
"The concept assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without 'a prior knowledge' and that they understand the incomplete nature of datasets, regardless of structure,” concluded Mr Heudecker
This is certainly not true of all users in an enterprise, and this is where the hype over data lakes is likely to cause problems in the channel.