EMC World 2014: what is a data lake?

EMC says that the so-called “data lake” is the foundation for the next generation of data warehouses, which (again according to EMC) will be specified in software as part of a software-defined datacentre.

What is a data lake?

The concept of the data lake comes about as a result of the mainstream use of Hadoop – and it goes by several names:

• Data Lake

• Bit Bucket

• Landing Zone

So is it all PR spin and buzzword bingo?

Well yes, obviously, to a degree, but there is also something of interest here if we look at the suggestion that companies have huge ‘lakes’ of information that they will want to analyse and gain insights from …

… but much of that data will exist in multiple formats, and so it becomes too costly to act upon.

The data lake, then, is a location where firms can store “practically unlimited” amounts of data that exist in any format, schema and type.

It is relatively inexpensive, and cheaper than previous notions of a data store.

It is, of course, also massively scalable.
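The “any format, schema and type” idea can be illustrated with a small Python sketch. It is not EMC's or Hadoop's implementation, only a toy under stated assumptions: the lake is a plain local directory, the helper names `land`, `read_records` and `demo` are hypothetical, and the three sample files are invented for illustration. The point it shows is that nothing is validated when data is written, and a structure is only applied when the data is read.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def land(lake: Path, name: str, payload: bytes) -> Path:
    """Store raw bytes as-is; no schema or format is enforced on write."""
    target = lake / name
    target.write_bytes(payload)
    return target


def read_records(path: Path) -> list[dict]:
    """Interpret a stored file only at read time, based on its extension."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        # Assumed layout: one JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".csv":
        # The first row is treated as the header.
        return list(csv.DictReader(io.StringIO(text)))
    # Unknown formats are still readable; each line is wrapped as raw text.
    return [{"raw": line} for line in text.splitlines()]


def demo() -> dict[str, list[dict]]:
    """Land three differently shaped files, then read them all back."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # hypothetical stand-in for the lake's storage
        land(lake, "clicks.json",
             b'{"user": "a", "page": "/home"}\n{"user": "b", "page": "/buy"}\n')
        land(lake, "sales.csv", b"sku,qty\nX1,3\nY2,5\n")
        land(lake, "server.log", b"12:00 started\n12:01 ready\n")
        return {p.name: read_records(p) for p in sorted(lake.iterdir())}


if __name__ == "__main__":
    for name, records in demo().items():
        print(name, records)
```

A traditional warehouse would reject the log file at load time because it fits no table definition; here it simply sits alongside the other two until someone decides how to read it.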


The word on this from EMC is that Pivotal HD offers a wide variety of data processing technologies for Hadoop – real-time, interactive and batch.

The EMC Hadoop Starter Kit, ViPR Edition, is all about the firm’s approach to creating data lakes.

According to the official EMC blog, add EMC Isilon scale-out NAS to Pivotal HD as integrated data storage and you have a shared data repository with multi-protocol support, including HDFS, to service a wide variety of data processing requests.

“This smells like a data lake to me,” says EMC.

“A general-purpose data storage and processing resource centre where big data applications can develop and evolve. Add EMC ViPR software defined storage to the mix and you have the smartest data lake in town, one that supports additional protocols/hardware and automatically adapts to changing workload demands to optimize application performance.”

The company insists that EMC Hadoop Starter Kit, ViPR Edition, now makes it easier to deploy this ‘smart’ data lake with Pivotal HD and other Hadoop distributions such as Cloudera and Hortonworks.