Feature

What lies beyond the data warehouse?

Has the traditional data warehouse finally reached the end of its life? If so, what will follow it? Will it be a hybrid? We find out

Stephen Pritchard

Published: 07 Oct 2021

Since the 1990s, organisations have gathered, processed and analysed business information in data warehouses.

The term “data warehouse” was introduced to the IT mainstream by American computer scientist Bill Inmon in 1992, and the concept itself dates back further, with the founding of Teradata in 1979 and work carried out by IBM in the early 1980s.

Their goal was to allow enterprises to analyse business data to improve decision making, without the need to interrogate perhaps dozens of different business databases.

Since then, the technology has evolved, allowing organisations to process data at greater scale, speed and precision.

But some commentators now believe the data warehouse has reached the end of its useful life.

Ever greater volumes of data, along with the need to process and analyse information more quickly, including potentially in real time, are putting stress on conventional data warehouse architectures.

And data warehouse suppliers face competition from the cloud. An on-premise data warehouse can cost millions of dollars, take months to implement, and, critically, more months to reconfigure for new queries and new data types. CIOs are looking at the cloud as a more flexible home for analytics tools.

Exponential growth in business data

Conventional data warehouses are struggling with exponential growth in business data, says Richard Berkley, a data and analytics expert at business advisory firm PA Consulting.

“The cloud now provides much more scalability and agility than conventional data warehouses,” he says.

“Cloud technologies can scale dynamically, pulling in the processing power needed to complete queries quickly just for the processing time. You’re no longer paying for infrastructure that sits idle and you can get far better performance as the processing for individual queries is scaled far beyond what is feasible in on-premise services.”

Nor are data volumes the only challenge facing the data warehouse. Organisations want to avoid being locked into one database, or data warehouse technology.

Increasingly, businesses want to draw insights from data streams – from social media, e-commerce, or sensors and the internet of things (IoT). Data warehouses, with their carefully crafted data schemas and extract, transform and load (ETL) processes, are not nimble enough to handle this type of data.

“The market has evolved,” says Alex McMullan, chief technology officer for Europe, the Middle East and Africa at storage supplier Pure Storage.

“It is no longer about an overnight batch report which you then give to the CEO as a colour printout. People are doing real-time analytics and making money in the space.” Applications, he says, run from “black box” financial trading to security monitoring.

Lakeside view

At one point, data lakes appeared set to take over from data warehouses. In a data lake, information is stored in its raw form, on object storage, mostly in the cloud.

Data lakes are quicker to set up and operate, as there is no prior processing or data cleansing, and the lake can hold structured and unstructured data. The processing, and ETL, takes place when an analyst runs a query.
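The difference between preparing data on the way in and preparing it at query time can be shown with a minimal sketch. It is illustrative only: the sample events, field names and function names below are hypothetical and do not come from any supplier's product. The warehouse-style path cleans and types the records once, before storing them, while the lake-style path keeps the raw lines and does the same work each time a query runs.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (as a data lake would).
RAW_EVENTS = [
    '{"source": "web", "order_id": 1, "amount": "19.99"}',
    '{"source": "sensor", "device": "t-7", "temp_c": 21.5}',
    '{"source": "web", "order_id": 2, "amount": "5.01"}',
    'not valid json',
]


def to_pence(amount):
    """Convert a decimal amount given as text into whole pence."""
    return round(float(amount) * 100)


def load_into_warehouse(raw_lines):
    """Warehouse-style path: parse, filter and type the data once, at load time."""
    table = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is rejected before it is stored
        if record.get("source") == "web":
            table.append({
                "order_id": int(record["order_id"]),
                "amount_pence": to_pence(record["amount"]),
            })
    return table


def total_sales_from_warehouse(table):
    """Query against the prepared table: no parsing or cleansing needed."""
    return sum(row["amount_pence"] for row in table)


def total_sales_from_lake(raw_lines):
    """Lake-style path: the parsing and cleansing happen inside the query."""
    total = 0
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is skipped on every query
        if record.get("source") == "web":
            total += to_pence(record["amount"])
    return total


warehouse_table = load_into_warehouse(RAW_EVENTS)
print(total_sales_from_warehouse(warehouse_table))
print(total_sales_from_lake(RAW_EVENTS))
```

Both paths give the same answer here. The trade-off is where the work happens: the warehouse-style path pays once up front and discards the sensor reading, while the lake-style path keeps everything and repeats the parsing on each query.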

Data lakes are increasingly used outside of traditional business intelligence, in areas such as artificial intelligence and machine learning, and, because they move away from the rigid structure of the data warehouse, they are sometimes cited as democratising business intelligence.

They do, however, have their own drawbacks. Data warehouses used their structure to build performance, and that discipline can be lost with a data lake.

“Organisations can accumulate more data than they know what to do with,” says Tony Baer, analyst at dbInsight. “They don’t have that discipline of an enterprise architecture approach. We gather more data than we need, and it is not being fully utilised.”

To deal with this, enterprises throw more resources at the problem – all too easy to do with the cloud – and end up with performance “almost as good as a data warehouse, through brute force”, he says.

Controlling queries and costs

This can be inefficient, and costly. Baer points out that cloud analytics suppliers such as Snowflake are building in more “guardrails” to control queries and costs. “They are moving in that direction, but it is still easy to keep adding VMs [virtual machines],” he says.

Data warehouses and data lakes also exist to support different enterprise requirements. The data warehouse is good for repeatable and repeated queries using high-quality, cleaned data, often run as a batch. The data lake supports a more ad-hoc – even speculative – approach to interrogating business information.

“If you are doing ‘what if’ queries, we are seeing data lakes or document management systems being used,” says Pure Storage’s McMullan. He describes this as “hunter gatherer” analytics, while data warehouses are used for “farming” analytics. “Hunter gatherer analytics is looking for the questions to ask, rather than repeating the same question,” he says.

The goal for the industry, though, is to combine elasticity, speed, the ability to handle streamed data and efficient query processing in one platform.

New architectures

This points to a number of new and emerging categories, including the data lakehouse (the approach taken by Databricks), Snowflake’s cloud-based, multi-cluster architecture, and Amazon’s Redshift Spectrum, which connects the supplier’s Redshift data warehouse to its S3 storage.

And, although the industry has largely moved away from trying to build data lakes around Hadoop, other open-source tools, such as Apache Spark, are gaining traction in the market.

Change is being prompted less by technology than by changes in businesses’ analytics needs.

“Data requirements differ from those of five or 10 years ago,” says Noel Yuhanna, an analyst covering data management and data warehousing at Forrester. “People are looking at customer intelligence, change analysis and IoT analytics.

“There is a new generation of data sources, including sensor and IoT data, and data warehouses have evolved to address this, [by handling] semi-structured and unstructured data.”

The cloud adds elasticity and scale, and cost savings of at least 20%, with 50% or even 70% cost reductions possible in some situations, Yuhanna says. However, he cautions that few companies genuinely operate their analytics systems at petabyte scale: Forrester calculates that fewer than 3% do.

Those that do are mostly in manufacturing and other highly instrumented businesses. They might, for their part, turn to edge processing and machine learning to cut down data flows and speed decision making.

Back to the future

By no means everyone believes the data warehouse has had its day, however. As Databricks’ chief executive Ali Ghodsi concedes, some systems will carry on as long as they are useful. And there are risks inherent in moving to new platforms, however great their promise. “Data lakes, and new infrastructure models, can be too simplistic and do not fix the real complexity challenge of managing and integrating data,” says PA Consulting’s Berkley.

Much will depend on the insights organisations need from their data. “Data warehouses and data lakes are very complementary,” says Jonathan Ellis, chief technology officer of DataStax. “We don’t serve Twitter or Netflix out of a data warehouse, but we don’t serve a BI dashboard out of Cassandra. [We] run live applications out of Cassandra and do analytics in the data warehouse. What is exciting in the industry is the conjunction of streaming technology and the data warehouse.

“Databases are sticky and although everybody in the data warehousing space broadly supports SQL, the devil is in the detail,” he says. “How you design schemas for optimum performance differs from supplier to supplier.”

He predicts a hybrid model, comprising on-premise and cloud, open source and proprietary software, to create a “deconstructed data warehouse” that is more flexible than conventional offerings, and more able to handle real-time data.

Others in the industry agree. We are likely to see a more diverse market, rather than one technology replacing all others, even if this poses a challenge for CIOs.

The data warehouse is likely to carry on, for some time at least, as the “gold copy” of enterprise data.

Pure Storage’s McMullan predicts that organisations will use warehouses, lakes and hubs to view different sets of data through different lenses. “It will be a lot harder than it used to be, with modern data sets and the requirements to go with it,” he says. “It is no longer about what you can do in your 42U, 19-inch rack.”
