The battle for supremacy in data platforms is heating up amid growing adoption of artificial intelligence (AI), with Cloudera, Databricks, Snowflake and major cloud suppliers all eyeing a slice of the pie.
That’s hardly surprising, given that data platforms are where data is aggregated from multiple sources and managed to power enterprise-wide analytics and increasingly AI workloads.
Cloudera, which merged with long-time rival Hortonworks in 2018, believes it has what it takes to differentiate itself from the pack with its open-source pedigree and heritage in helping enterprises build data pipelines. The company was listed on the New York Stock Exchange in 2017 at the peak of the big data boom, but went private in 2021 after years of sluggish profits and revenues.
In an interview with Computer Weekly in Singapore, Cloudera’s chief revenue officer Frank O’Dowd and chief product officer Sudhir Menon outline the company’s business and technology strategy to help enterprises harness generative AI and other capabilities through its work in the open source community and other areas.
How is Cloudera setting itself apart from other data platforms in the market as well as hyperscalers that are getting into the same space, particularly in Asia-Pacific (APAC)?
Menon: We’ve been an open-source company and we are one of the few companies that started out as a data platform, which means that we can add new technologies and capabilities to our platform in a very uniform manner.
Our strategy all along, from the time we came together with Hortonworks, has been to make sure that we are an independent entity that can offer our customers choice. So, in the world of hyperscalers, we want to make sure that we give customers the ability to avoid cloud lock-in and concentration risk by going with open-source technologies.
With generative AI, companies like OpenAI spend billions of dollars building foundational models. But what’s happening with open source is that in a matter of six to eight weeks, the foundation models coming out of open source can meet or beat the accuracy of other models. So, open source is a very important element that drives innovation in data and AI.
As far as point solution providers are concerned, they can let you take small elements of data and answer questions against that data, but that is not a platform. With AI, you need a platform to create models and make decisions that impact your transactional systems.
Customers also want to have the ability to train and deploy models. Singapore’s OCBC Bank, for example, has 30 use cases in generative AI that they’ve built on our platform. We see ourselves as a full platform, completely built from the ground up for AI, with massive data gravity and the ability to run on-premise and in all the clouds. That’s a significant advantage for our customers.
We also have more data than most of our competitors, one of which claimed to have more than 6,400 customers and manage over 250PB of data. We have one customer with over 300PB of data, and we have hundreds of customers with 250PB of data or more. Each of our customers is doing more mission-critical work, dealing with more data and creating more value out of AI than those of one of our primary competitors.
O’Dowd: The APAC market is a microcosm of the global market where we have different competitors in each country and the market share of hyperscalers can vary throughout the region. In those instances, our competition can be different, and what the complete solution is will vary by market within APAC as well.
But one thing I would point out is that APAC is a very mature market in terms of embracing hybrid cloud. Companies here have also done a phenomenal job of embracing machine learning, which sets them up to do well with AI, beyond what we see in other regions. It’s exciting and it’s not viewed as an anomaly.
Cloudera traditionally has had dedicated teams focused on specific industries such as financial services. Has there been any reorganisation within the company to capture the AI opportunity?
O’Dowd: We will aggressively pursue the market. We are trying to align our sales teams by industry, but also by how we’re doing things in each geography. Obviously, there are nuances between different countries in the region, so we’re trying to structure things geographically, as well as by industry and solutions, where we’ve seen triple-digit growth in our cloud solutions. But we’re also finding that it’s the hybrid approach that leverages both on-premise and public cloud that will drive our success.
Menon: With AI, we have the opportunity to take our successes in one market, package them and replicate them in other markets. Typically, how the world works is that we do something in Silicon Valley which takes six months, then we do something in London, and that takes another six months. But the AI market is different – it’s almost like what happened in India with telephones. I grew up in India where we had no landlines. I used to walk three-and-a-half kilometres to the nearest payphone when I was growing up as India had no telecoms infrastructure at the time.
And then cellphones came along and now 1.3 billion people have them, with India adopting mobile payment applications long before the US. I think AI, which is about having volume and accuracy of data, is going to be the same. And so, we’re going to leverage our successes here, particularly in markets like India, China, Indonesia, and Thailand, to win big in the US. We’re seeing the same thing in Latin America, where there’s a lot of AI adoption because they don’t have legacy and can move directly to new systems.
The hyperscalers have been talking about making large language models available to customers. How is Cloudera helping customers that might want to bring some of those models into the Cloudera platform where their data is?
Menon: We have data gravity today and it’s very important that we don’t lose that. We have a product called Cloudera Machine Learning (CML) that gives you a choice of model deployments. You can self-host open-source models or you can use purpose-built models from the internet and license them.
There are also models from OpenAI that you can use through application programming interfaces (APIs). For instance, if your workload is primarily running in Azure, and OpenAI executes in Azure, there are no data egress costs and we have customers who do that. We can support those customers with CML because our platform runs natively on Azure.
Finally, you can also use the foundational models that are being developed by startups through the hyperscalers. So, whether you’re using your own models, self-hosted models running on-premise and in the cloud, or using APIs to access large language models from OpenAI and others, we offer customers a full array of capabilities out of the box. That’s our strategy. We’ve always been about choice. As new technologies become available, the platform gives you a secure way of accessing them with a shared data experience.
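The choice Menon describes is increasingly practical because most model-serving stacks expose the same OpenAI-style chat-completion interface, so switching between a self-hosted open-source model and a hosted API is largely a matter of changing the endpoint. A minimal sketch of that idea, with illustrative endpoint URLs and model names that are assumptions rather than Cloudera product details:

```python
# Sketch: the same chat-completion request can target a self-hosted
# open-source model or a hosted API; only the endpoint and model differ.
# The URLs and model names below are illustrative assumptions.
import json

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-style chat-completion request for any compatible endpoint."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# A self-hosted open-source model behind an internal gateway...
req_self = build_chat_request("http://models.internal:8000", "llama-3-8b",
                              "Summarise this quarter's risk report")
# ...or a hosted API, with no code change beyond endpoint and model name.
req_api = build_chat_request("https://api.openai.com", "gpt-4",
                             "Summarise this quarter's risk report")
```

Keeping the request shape identical is what lets a platform offer deployment choice without rewriting applications for each model provider.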
But even before that, customers would have to manage their data pipelines – is Cloudera doubling down on that, given that many companies are still struggling with getting the data right before they can leverage those models?
Menon: Half of our business is about helping customers build data pipelines, which is about efficiency and agility. Ensuring that these pipelines can be built, scheduled and deployed in a more agile manner is what we’ve focused on, independent of AI. We’ve made a ton of progress in that area and it’s our bread-and-butter business.
Cloudera made a few recent announcements related to Apache Iceberg and the move into the data lakehouse space. What’s the thinking around data lakehouses and how they fit into Cloudera’s overall technology strategy?
Menon: You may or may not know that Hive Metastore, the metastore engine that powers big data both in the cloud and on-premise, was created by us. And when data proliferated, the Hive Metastore started to become a bottleneck. A former Cloudera employee who was part of the Hive Metastore team moved over to Netflix to solve the problem.
In traditional big data, data and their metadata sit in separate locations, and so you access the metadata to manipulate the data. That’s how it works. What he did was to store the metadata with the data, so you can scale the data and metadata without having a single point of contention. This distinction led to the creation of Apache Iceberg. At a technical level, it’s ensuring that metadata is partitioned with the data instead of being separate.
Today, we run one of the largest data lakes in the world. Initially with big data, it was all about read-only data. But what has happened is that transactional systems now feed a lot of data into data lakes. And transactional data can mutate and change. That means you want to be able to mutate the data, change the data and time-travel to see what’s happening.
To do all of that in the old days, you had to change your pipelines and storage, which was painful. Now with Apache Iceberg, you can do the things I talked about with no scalability limitations. You can mutate your data, do schema evolution and time-travel, making the data lakehouse a very powerful concept for working with data.
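The mechanism behind mutation and time travel in a table format like Iceberg is that every write commits a new immutable snapshot, and readers pick which snapshot to see. A toy sketch of that snapshot idea in plain Python, which is not Iceberg’s actual API but illustrates why old versions stay queryable after data changes:

```python
# Toy model of snapshot-based "time travel": each commit creates a new
# immutable version of the table, so readers can query any past state.
# This is an illustration of the concept, not Iceberg's real implementation.
class ToyTable:
    def __init__(self):
        self.snapshots = []  # each snapshot is an immutable view of the rows

    def commit(self, rows):
        """Write a new version of the table; returns its snapshot id."""
        self.snapshots.append(list(rows))
        return len(self.snapshots) - 1

    def read(self, as_of=None):
        """Read the latest snapshot, or a specific one for time travel."""
        if not self.snapshots:
            return []
        idx = len(self.snapshots) - 1 if as_of is None else as_of
        return self.snapshots[idx]

table = ToyTable()
v0 = table.commit([{"id": 1, "amount": 10}])
v1 = table.commit([{"id": 1, "amount": 25}])  # a mutation is just a new snapshot
```

In the real format, each snapshot’s metadata is stored alongside the data files themselves, which is exactly the “no single point of contention” property Menon describes.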
The most important thing about the Iceberg evolution is that it’s in a table format. That means Cloudera can support Amazon, Snowflake, and others. When we started, we were the biggest ones in town, but today in the cloud ecosystem, there’s Amazon Redshift, Amazon Athena, Google BigQuery and there’s Cloudera. Customers want the ability to work with data across multiple properties and this has not been a big challenge for us because today you can use Iceberg with Flink, Spark and other products in the ecosystem.
What we bring to the table is that we give you the same security, governance, lineage, and data gravity on your data. We will do the ETL [extract, transform, load], build the data pipelines in the Iceberg format and we’ll give you a catalogue that we are building in the community. Anybody will be able to publish into that catalogue. Any product will be able to read and write from it, which is a huge leap for interoperability in the cloud. We’re excited to be leading that effort and we’re not about closed ecosystems. That’s why data lakehouses are important to us.
You started talking about Cloudera as an open-source company. Can you tell us about the work you’re doing in the open-source community?
Menon: This is a topic that’s near and dear to my heart because I run R&D [research and development] as chief product officer. Building a community, sustaining the community, and creating innovations that are broadly applicable is one of our superpowers. We are a humble company, but this is something we know how to do. We know how to bring companies together, create meetups, and we know how to set up a committed programme. We know how to ensure adoption of the product by ISVs [independent software vendors] and customers. And open source is about those things and identifying areas of innovation.
We have more than 950 engineers on my team, and close to 300 of them are open-source committers in the Apache Foundation. Last year, we also started a programme to increase our contributions to the community.
Take HDFS, for example. It’s the most widely used storage system today and we built it from the ground up. The most widely used object storage today is Amazon S3 and as the market moved from HDFS to S3 in the cloud, we looked at file systems and saw the need to build an object store for the community.
Seven years ago, we started an initiative called Apache Ozone. We have about 60 committers on the project, and we created a new protocol for consensus management – the old one was ZooKeeper – because Ozone stores data at seven times the density of HDFS. Ozone is compatible with S3 and more scalable than any object store. We worked with Tencent, which has hundreds of petabytes of data deployed in their datacentre running Ozone. Today, Tencent, along with others like GE Research, also deliver features into the project. We are now in the process of getting large banks here in Singapore to adopt Ozone. So, bringing technology innovations together for the benefit of the community, and making them enterprise-ready for our customers is really what we do with open source.
We also do some work around data pipelines. Most people don’t know that when you build data pipelines, everything you do is important. One person’s pipeline should be able to pre-empt another person’s pipeline to get the job done. You would think that Kubernetes, which is a more agile way of doing things, should allow for this, right? The scheduler in Kubernetes should allow for this, but it does not.
So, who’s building pre-emption into the Kubernetes scheduler? It’s Cloudera and we’re working with Amazon to make this happen. Similarly with Spark, there’s Apache Livy, an open source product that lets you submit Spark jobs at scale. A couple of years ago, Livy went into decline and the Apache Foundation didn’t see the usage for it because it did not support high availability. So, I pulled together a bunch of people, reached out to Amazon and revived Apache Livy. We have committers as do Amazon and Microsoft. And Livy has become the de facto way of submitting Spark jobs. This is what I meant by our superpower when it comes to being able to create and sustain communities and drive adoption.
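Livy’s role as a way of submitting Spark jobs at scale comes from its REST interface: a client posts a JSON job description to the server’s `/batches` endpoint rather than shelling out to `spark-submit`. A hedged sketch, where the host name and file path are illustrative assumptions:

```python
# Sketch of submitting a Spark batch job through Apache Livy's REST API.
# Livy accepts a JSON job description via POST /batches; the endpoint
# host and HDFS path below are illustrative assumptions.
import json
import urllib.request

def build_livy_batch_request(livy_url, app_file, class_name=None, args=None):
    """Build a Livy batch-submission request (not yet sent)."""
    payload = {"file": app_file}
    if class_name:
        payload["className"] = class_name  # needed for JVM applications
    if args:
        payload["args"] = args
    return urllib.request.Request(
        f"{livy_url}/batches",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example: submitting a PySpark script on a hypothetical cluster.
req = build_livy_batch_request("http://livy.example.com:8998", "hdfs:///jobs/etl.py")
# urllib.request.urlopen(req) would actually submit the job; omitted here.
```

Because submission is a plain HTTP call, many clients can queue work against one cluster, which is what makes high availability in the Livy server itself matter.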
Increasingly, we are focused on securing the open-source supply chain in the aftermath of the Log4j vulnerability. We are working with the community to ensure what gets put out in the open has safeguards when it comes to open-source, third-party dependencies. Our customers get the best in terms of fixing issues, and anybody who’s trying to self-support an on-premise installation without us is running blind. Something like the Log4j vulnerability will hit, but our customers will get fixes for vulnerabilities that have yet to be disclosed.
It’s been almost two years since Cloudera was taken private. Could you share your thoughts about what you’ve been able to do now that you couldn’t before as a publicly listed company?
O’Dowd: We’ve been able to make decisions and invest in areas without the worry or concern about public markets. We are investing in development, sales leaders, and things that we need for our go-to-market to be successful. And without the glare of the public market on you, you’re able to do what’s best for the business in the long term versus having to solve short-term problems.
What about the impact on your R&D investment priorities?
Menon: When we were a public company, if Frank needed to meet his numbers, he would have to take a deal whether it was good for the company or not, because otherwise the stock price would tank. I was forced to spend my R&D investments on things that were not strategic. We don’t have that worry anymore. So, Frank now has the liberty to say that if something is not strategic to the future direction of the company, we don’t need to do it.
Today, we’ve been able to make very strategic investments in things like Iceberg and Ozone. We also have significant investments in hybrid cloud. When we were public, I was basically running our private cloud data services, but I had to borrow resources. Now, I have a lot of engineers working on that.
When generative AI emerged, we went to the board to ask for additional investment and we are in the process of making sure that we can unleash that investment. We’re also able to move quickly and make changes to our products in a nimbler manner. And as an enterprise company, we make sure that we carry our customers with us, but we are able to innovate and meet customers where they are. That’s been the benefit of not having the shot clock winding down every quarter.
Read more about data management in APAC
- MongoDB’s engineering team in Australia has built a database migration tool to help customers migrate traditional relational databases to its document database.
- A move to shed its back office image and a laser focus on data migration has been fuelling Syniti’s growth in data management.
- Globe, the largest telco in the Philippines, has moved its on-premise data warehouse to Snowflake to address scalability challenges and improve customer experience.
- Databricks is making it easier for organisations to adopt a data lakehouse architecture through support for industry-specific file formats, data sharing and stream processing, among other areas.