At the recent Strata conference in London, Doug Cutting, Hadoop co-creator and chief architect at Hadoop distributor Cloudera, took time to talk to Computer Weekly about the state of play in big data software.
Cutting is well known as the founder of Hadoop at Yahoo, where he and his colleagues took Google’s MapReduce idea of parcelling out data workloads and then reducing the results back, and applied it more widely in a software framework named after his child’s toy, Hadoop.
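For readers unfamiliar with the pattern, the idea of parcelling out work and then reducing the results can be sketched in a few lines of plain Python. This is an illustration only, not Hadoop code: the word-count task, the sample text and the function names `map_phase`, `shuffle` and `reduce_phase` are all assumptions made for the example, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the mapped values by key, as the framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, already split into chunks that could be handled separately
chunks = ["big data big ideas", "data at scale"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real deployment each chunk would be mapped on a different machine and the grouping would happen over the network; the sketch only shows the shape of the computation.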
On this occasion, he spoke about a new cyber security application of his company’s technology, the role of Spark, and of open source more generally. What follows is an edited transcript of that interview.
Computer Weekly: What are you working on?
Cutting: I’ve been helping Cloudera and Intel with the Apache Spot project, which is an open source, big data style of doing cyber security. This is instead of the classic approach of having filters that are scanning for particular kinds of behaviour that someone has manually coded in terms of prior attacks. It’s hard to catch new attacks that way. Whereas if you build models that define usual behaviour, you can catch anomalies.
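The contrast Cutting draws, between hand-coded filters and a model of usual behaviour, can be illustrated with a deliberately simple sketch. It is not how Apache Spot works: the z-score rule, the threshold of 3.0, the sample connection counts and the function names `build_baseline` and `is_anomalous` are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarise usual behaviour as a mean and a sample standard deviation."""
    return mean(history), stdev(history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value that sits more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No variation in the history: anything different is unusual
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical hourly outbound connection counts for one host
usual_counts = [98, 102, 101, 97, 103, 99, 100, 100]
baseline = build_baseline(usual_counts)

# Check two new observations against the baseline
flags = {count: is_anomalous(count, baseline) for count in (101, 450)}
print(flags)
```

No rule here describes a known attack; the second observation is flagged only because it departs from what the host usually does, which is the property Cutting describes.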
Computer Weekly: But that is an old information security approach – anomaly detection. How has it moved on?
Cutting: We now have the horsepower to store and process a lot more data, with Hadoop and [parallel processing framework] Spark. Also, we are trying to have a standard format for network data, so that different firms can build different applications that are detecting intrusions, so that we can have a cyber security ecosystem, an open data model for cyber security. We have been a horizontal play as Cloudera, but in this case we do want to support industry-specific data, and there could be opportunities to do that for other industries, such as telcos, or in the IoT [internet of things].
Computer Weekly: Open source might be a force for good, but is it a force for business? CIOs have an interest in their open source suppliers not going under.
Cutting: No, open source is a requirement for business. Companies are more and more reluctant to adopt technology that is not open source for their basic storage and processing of data. But it is also a better model for developing software because you have more people participating in the process. When you get technology controlled by a single institution, it becomes a cash cow. The company can’t make fundamental changes easily without threatening its existing business. For example, with Cloudera we have had the MapReduce element of Hadoop as a core component from the beginning. But Spark has come along, and is a better tool.
Computer Weekly: Has Spark now eclipsed MapReduce?
Cutting: In many cases, it has. And the interesting thing is that it does not threaten our business; rather, it makes it stronger, even though it is a technology from outside. Oracle would find that very hard – to replace its database with Spark, and convince customers to replace it. We saw a lot slower progress in database technology when it was proprietary than we are seeing now.
Computer Weekly: How much, then, of the original Hadoop technology stack is in Cloudera?
Cutting: HDFS, MapReduce and YARN are still used heavily. For example, Uber uses MapReduce. It is not dead, but doing, say, machine learning algorithms with MapReduce is clumsy. There are libraries for doing machine learning in Spark. Or if you are doing streaming, you might use [messaging system] Kafka or Spark Streaming.
Computer Weekly: Where are we now with Hadoop’s evolution, in 2017? We’ve spoken before about shifting from taking out costs, using Hadoop, to facilitating more innovative business models. Is it still mainly about taking out costs from storage?
Cutting: In the first year of use, it is mainly about taking out costs from storage. Or combining data sources that you were unable to combine before – that is another way to get started. Fairly rapidly, we see people having two or three applications, using the platform to experiment and innovate. That will be the future. It used to be that you built an application to satisfy a business need, and you ran it for 20 years. You didn’t deploy a platform whose purpose was innovation. Now you want to get a win first, and then start exploring.
Computer Weekly: Back in 2012, I was asking you about big data technologies “crossing the chasm” of adoption, from Californian software engineering companies through Wall Street and the City to more mainstream companies. You said then it would be steady growth.
Cutting: And that is what we have seen. I am very optimistic long-term, but it is deceptive when you look at it short-term. You’ll see various analysts saying “people have used Hadoop and it has failed”. It is not easy to see the progress unless you are in the business of working with it.
Computer Weekly: Coming back to the cyber security effort, does that come under the heading of machine learning? What is your take on that area, which is all the rage?
Cutting: It does. My take on it is that there is real stuff there. But there is also a lot of value to be had from using simpler methods. If you look at the industry over the next decade, I think machine learning will be a smaller part of our business and of the industry than more conventional data management methods. Just being able to get more data more integrated and being able to count things you could not count easily before. Most companies still can’t do that, and when they can, they get a lot of value out of that. There is a lot of room there to deploy ML and AI, but it won’t eclipse more traditional database, search and analytics technologies.