Amr Awadallah is the chief technology officer and co-founder of Hadoop distributor Cloudera. Previously, he was the vice-president of product engineering at Yahoo. On a recent visit to London, he briefed Computer Weekly on how the supplier is evangelising the concept of an ‘enterprise data hub’ in counterpoint to the established enterprise data warehouse.
There are two audiences for big data: those who are interested in the concept and those who are doing it. How do you approach these?
There are people at all stages: some not sure what to do and dabbling, and customers who are all in. It’s normal with any new technology to have an adoption cycle.
Is there less of a need to evangelise with big data technologies? Aren’t the problems more obvious than usual?
No, we are still at the beginning. There are some use cases that are about operational efficiency, just saving money. People do get these right away. But to sell the full vision of what we are calling an ‘enterprise data hub’ – that does require more evangelising, though customers have been receptive.
An enterprise data hub?
Cloudera’s mission is to enable customers to use all their data to ask bigger questions. ‘All’ is a key word. It’s not big data, but all data. It’s having a holistic view of your customers.
The example I like to give of ‘all data’ is the ATM machine. Ten years ago the only thing recorded was the explicit transaction. Today, we can collect implicit information, such as your face, how you interact with the touchscreen, whether you have a mobile device with the bank’s app, and the information around scanning pictures of cheques. This all makes fraud detection better.
‘Asking bigger questions’ is important. Traditional software has been focused on using SQL to ask questions. Now, SQL is powerful, but there are a lot of questions you can’t ask with it. You can’t do image processing or voice recognition in SQL. You cannot scan PDFs using it.
The ultimate use case for us is a ‘customer 360’, having a 360-degree view of the customer. That solves the data silos problem, data from different channels. Our platform allows the breaking down of those silos.
Cloudera is a Hadoop distributor. What makes this positioning a development?
It’s not a departure from what we have been doing. But it’s a better language for business. Eighty per cent of Hadoop distributions are ours. But we have technologies alongside Hadoop. Also, Hadoop itself is morphing, as with YARN opening it up. Five years ago, all you could do with Hadoop was a MapReduce operation. YARN allows other applications to run on top of the data, such as interactive SQL, which [Cloudera’s] Impala allows you to do.
We also now have a natively integrated search function. We have integration with SAS, and Splunk – with Hunk running natively on Hadoop. Also, Informatica’s ETL engine runs natively inside the Cloudera platform.
The analogy we like to use is that we are the smartphone of data, as opposed to the SLR digital camera. Enterprise data warehouses are the SLR digital cameras of the data world. They are expensive and they only do one thing – in the case of the data warehouse, run queries over structured data. The ‘enterprise data hub’ is like a smartphone. The smartphone is convenient and applications can all share the data. It is the same with us. The model is that the applications come to your data instead of your farming out your data to silos, which prevents a 360 view.
Our approach is more economical than traditional enterprise data warehousing. The cost for a terabyte of data with us is around $1,000. You can pay $100,000 per terabyte to store data that you don’t use in traditional data warehouses, let’s say data you haven’t looked at for six months. We offer an active archiving system for that.
We do work with Teradata on the integration front. And we have partnerships with Oracle, with its Big Data Appliance, and with HP on the Vertica system. There will be use cases for which an SLR camera is the right device.
A phrase you often hear attributed to big data projects in large companies and organisations is ‘science project’. Are they getting beyond that to enterprise deployments?
First, 60% of the Fortune 500 are using Cloudera, in production, not in science projects. Three of the top four credit card companies in the world are using us for fraud detection. Now, these production use cases don’t necessarily add up to the 360-degree view. About 20 of our 300 paying customers are doing that, though none in the UK or Europe as yet.
Europe is where the US was two years ago. In the US, there is the Federal government (by which I do not just mean Intelligence) and there is Monsanto. Monsanto use the platform to collect experimental data from sensors in fields. They measure temperature, soil composition, humidity, the rate of growth of the plants. They are looking to come up with more efficient seeds for different environments across the world. They reckon that over the next ten years humans will consume more than over the last 100 years. I would never have envisaged the Monsanto use case for our technology when we started out five years ago.
Sectorally, what is the customer base like?
The top sectors for us are retail, web companies (including eBay), telecoms (both infrastructure and the mobile device manufacturers Nokia, Motorola Mobility and RIM), oil and gas, genomics, smart energy, automotive and construction equipment makers.
It is a large organisation thing. This is not a small business technology, save for the web startups, such as Box.net, King.com, and so on. Anywhere where there is a data explosion.
How would you sum up the business value of what you are trying to do?
We are trying to provide agility, to lower the cost of curiosity. There is a high cost of curiosity for organisations today. For example, at Yahoo I ran the IT infrastructure. The business would come to me wanting, say, a new column for a data model. That’s hard work with an enterprise data warehouse. It takes weeks, months.
So, I’d say: “how much value will this create for the business?” And they would say: “we can’t tell you the value till you add the column”. That prevents the business from innovating. You need a system that is much more flexible, so you can add new columns and data types quickly. Hadoop gives you that. You can experiment more easily. The SLR camera will not go away, but it will be kept for the right use cases.