This is a guest blog post by Ravi Shankar, chief marketing officer, Denodo.
For over two decades, the traditional data warehouse has been the tried-and-true, single source of truth in support of BI. However, BI is rapidly evolving, so the traditional data warehouse will have to evolve as well, to keep pace. Traditionally, data warehouses have required data to be replicated from source systems and stored within the warehouse, in a format that enables the data to be readily consumed by BI applications. These might seem like reasonable requirements, but they're stunting the growth of this venerable technology.
BI analysts are now seeing the potential for delving into new kinds of data, such as machine-generated readings (from vehicles, packages, temperature sensors, manufacturing equipment, etc.) and output from myriad social-media platforms. Traditional data warehouses cannot support these new forms of data, as they appear "structureless" to the ETL processes dedicated to extracting, transforming, and loading the data into the warehouse. These processes would have to be rewritten every time a new data source is introduced, which is neither practical nor sustainable, and quickly becomes costly. More importantly, batch-oriented ETL processes are simply not set up to accommodate dynamic, real-time data streams.
Also, data sources are getting exponentially larger, which puts a strain on replication and storage, not to mention security, and further contributes to steadily rising expenses.
If data warehouses could accommodate streaming data, they could re-establish themselves as the single source of truth, but at what cost?
Adapting to the limitations
Companies are using open-source frameworks like Kafka and Spark to accommodate the new machine-generated, streaming, and social media data sources, and they are using distributed storage systems like Hadoop to offload data from the data warehouse. These solutions work well, and Hadoop is an extremely cost-effective, scalable alternative to physically expanding the storage capacity of a data warehouse. However, such companies are now saddled with a new problem: Data cannot be queried across the data warehouse, the Hadoop cluster, and the Spark system, severely limiting BI potential.
In this all-too-familiar scenario, the data warehouse can no longer serve as the single source of truth, simply because of its physical limitations: its need to physically replicate data to a central repository, its finite physical storage capacity, and its need for a programmer to manually update ETL scripts to accommodate every new source.
For this scenario, the solution is clear: not a physical data warehouse but a logical data warehouse.
Logical data warehouse
A logical data warehouse doesn’t physically move any data. Instead, it provides real-time views of the data across all of its myriad sources, and these can be modern platforms such as Kafka, Spark, and Hadoop, as well as traditional databases of any stripe.
This means, of course, that logical data warehouses can easily accommodate traditional data warehouses or data marts as sources, to support all of an organization’s standard reporting needs. In this way, logical data warehouses are perfectly capable of fulfilling Gartner’s ideal of a bimodal IT infrastructure, one mode characterized by predictability, and the other by exploration. The first mode can be met by the production of standard, highly audited reports, facilitated by the traditional data warehouse, while the second can be met by the ad-hoc, experimental capabilities of self-serve analytics, facilitated by the logical data warehouse.
As far as BI analysts are concerned, all of the company’s data, along with select external sources, sits in a single, logical repository. They do not need to know where different sets of the data may be stored, which data sets need to be joined to create a view, or what structures define the various source data sets. They only see the data that they need, when they need to see it.
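The "single logical repository" idea can be sketched with a toy federation layer. The sketch below is purely illustrative, not any vendor's API: two in-memory SQLite databases stand in for physically separate systems (a warehouse table and an offloaded store), and a "virtual view" joins them at query time without replicating anything.

```python
import sqlite3

# Two stand-ins for physically separate sources (names are illustrative):
# a traditional warehouse table and an offloaded sensor-event store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?)",
                      [(1, "Acme"), (2, "Globex")])

offload = sqlite3.connect(":memory:")
offload.execute("CREATE TABLE sensor_events (customer_id INTEGER, reading REAL)")
offload.executemany("INSERT INTO sensor_events VALUES (?, ?)",
                    [(1, 20.5), (1, 21.0), (2, 19.8)])

def customer_readings():
    """A 'virtual view': joins the two sources on demand.

    No data is copied to a central store; the join happens at query
    time, and the caller never learns where each table lives.
    """
    names = dict(warehouse.execute("SELECT id, name FROM customers"))
    rows = offload.execute(
        "SELECT customer_id, AVG(reading) "
        "FROM sensor_events GROUP BY customer_id")
    return {names[cid]: avg for cid, avg in rows}

print(customer_readings())  # {'Acme': 20.75, 'Globex': 19.8}
```

A real logical data warehouse would push the join down to the sources where possible rather than pulling rows into memory, but the consumer-facing contract is the same: one view, many backends.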
For the IT team, the BI infrastructure built around a logical data warehouse is much easier to manage than one that is built around a physical data warehouse. Since the logical data warehouse doesn’t actually “house” any data, merely the necessary metadata for accessing the various sources, there is no replication or storage to manage, and no ETL processes to maintain. If one source needs to be replaced by another, data consumers will not know the difference; they will experience no downtime, and the IT team can proceed with the migration at their own pace.
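The claim that a source can be replaced with no consumer downtime follows from the fact that the logical layer holds only metadata, a mapping from view names to sources. A minimal sketch, with entirely hypothetical names, might look like this:

```python
# A sketch of metadata-driven source indirection: the logical layer
# maps a view name to whatever connector currently serves it.
# All functions and names below are hypothetical illustrations.

def legacy_source():
    # Old system currently serving the 'sales' view.
    return [{"region": "EMEA", "total": 100}, {"region": "APAC", "total": 80}]

def migrated_source():
    # Replacement system exposing the same logical schema.
    return [{"region": "EMEA", "total": 100}, {"region": "APAC", "total": 80}]

# The only thing the logical warehouse "houses": source metadata.
catalog = {"sales": legacy_source}

def query(view_name):
    """Consumers call this; they never see which backend answers."""
    return catalog[view_name]()

before = query("sales")
catalog["sales"] = migrated_source  # migration: swap the connector
after = query("sales")
assert before == after              # consumers notice no difference
```

Because consumers bind to the view name rather than to a physical system, the IT team can repoint the catalog entry whenever the new source is ready, with no change on the consuming side.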
A logical data warehouse is the only logical choice for a data warehousing solution that serves as an organization’s single source of truth. It provides seamless, real-time access to virtually any source, including traditional data warehouses and data marts, and it’s easy to introduce new ones into the mix, without affecting users and with minimal impact on IT. Logical data warehouses can scale to accommodate any volume of data, across any number of sources, to meet current and future needs.