A four-phased approach to building an optimal data warehouse

This is a guest post for the Computer Weekly Developer Network by Barry Devlin in his capacity as founder and principal of 9sight Consulting.

A TechTarget contributor is his own right, Devlin is a specialist and founding voice in data warehousing, having published the first architectural paper on the topic in 1988.

Dr Barry is a regular blogger, writer and commentator on information and its use – he is based in Cape Town, South Africa and operates worldwide.

Devlin writes as follows:

If you’re planning to create a data warehouse, make sure you create one that is cross-functional and provides a long-life foundation for data provision and decision support. This means covering the entire enterprise and satisfying the needs of multiple projects and groups for several years. The foundation must provide consistent, reconciled, legally binding data to your business clients.

Dr Barry Devlin

Dr Barry Devlin

Easier said than done. Right?

Think of your project in these four steps: Design, Build, Implement and Maintain.

Designing your data warehouse

Let’s start at the design phase. When planning your design, the vision for your new data warehouse is best laid out over an enterprise data model (EDM), which consists of high-level entities including customers, products and orders.

A traditional design approach involves mapping entities to “loosely normalized” tables based on third normal form (3NF) or based on a dimensional or star-schema model.

Another approach uses the Data Vault Model (DVM), which is a hybrid of the 3NF and star-schema forms.   First introduced by Dan Linstedt, the Data Vault is a detail-oriented, history-tracking, linked set of normalized tables designed to support multiple functional business areas.

The DVM consists of three specialized types of entities/tables: hubs based on rarely changed business keys, links that describe associations or transactions between business keys, and satellites that hold all temporal and descriptive attributes of business keys and their associations. A new version introduced in 2013 consists of a data model, methodology, and systems architecture, which provides a design basis for data warehouses to emphasize core data quality, consistency, and agility to support enterprise-wide data provision requirements.

In May 2017, data warehouse automation specialist, WhereScape announced automation software to enable rapid and agile Data Vault 2.0 development, cutting delivery time of Data Vault-based analytics solutions by two-thirds.

Get busy building

Once you set your design, now comes the hard work of building your data warehouse. But before you start, accept the fact that no matter how nicely you’ve designed your model, you will face the reality of imperfect data source systems. Data warehouse builders struggle with missing data in source systems, poorly defined data structures, incorrect content and missing relationships. Implementation is a delicate balancing act between the vision of the model and the constraints of the sources.

The building process comes down to five steps:

  1. Understand the data sources. Keep in mind that legacy systems might be “bent to fit” emerging and urgent requirements. And modern big data sources might lack documentation.
  2. Compare the data available to the data warehouse model and define appropriate transformations to convert the former to the latter.
  3. Where transformations are too difficult, modify the data warehouse model to accommodate the reality of the data sources. Changing the data sources is usually impossible for reasons of cost and politics.
  4. Test performance of load/update processes and check the ability of the modified model to deliver the data the business requires.
  5. If successful, declare victory. Otherwise, rinse and repeat.

Improved approaches to automating the process have emerged in stages over the history of data warehousing: extract, transform, load (ETL) tools, data integration systems and, finally, data warehouse automation (DWA). In essence, each stage on this journey depicts an increasing level of automation, using DWA to address the entire process of designing, building, operating and maintaining a data warehouse.

Companies such as WhereScape offer useful tools to automate the data source discovery, design and prototyping phases of projects. Additionally, advanced automation solutions with an integrated development environment (IDE) targeted to your data platform can eliminate the majority of traditional hand-coding required and dramatically streamline and accelerate the development, deployment, and operation of data infrastructure projects.

A DWA tool automates the transformation of the data structures of the various sources to the optimized model of the Data Vault and populates the target tables with the appropriate data. This approach creates necessary indexes and cleanses and combines source data to create the basis for the analysis to address the business need.

Shifting to operations

The clear aim here is to automate and speed deployment in an agile environment to reduce human error across the full lifecycle.

Having deployed the system to production, the next—and ongoing—task is to schedule, execute, and monitor the continuing process of loading and transforming data into the data warehouse. In this phase, jobs consist of a sequence of interdependent tasks. To ensure that data consistency is maintained, if a task fails during execution, then all

downstream dependent tasks are halted. When the problem has been resolved, the job is restarted and will pick up from where it left off and continue through to completion. From an operational point of view, given potential interdependencies of data across these systems, it makes sense to manage this ensemble as a single, logical environment.

The smooth, ongoing daily operation of the entire data warehouse environment is a fundamental prerequisite to its acceptance by users and its overall value to the business.

Maintaining with agility

In more traditional IT projects, when a successful system is tested, deployed and running daily, its developers can sit back and take a well-deserved rest. Developers of today’s data warehouses have no such luxury. To make life easier, leverage and apply agility whenever possible.

Now, ongoing digitalization of business is driving ever-higher demands for new and fresh data. Some people think a data lake filled with every conceivable sort of raw, loosely managed data will address these needs. That approach may work for non-critical, externally sourced social media and Internet of Things data. However, it doesn’t help with historical and real-time data.

Fortunately, the agile and automated characteristics of the Data Vault / DWA approach applies also to the maintenance phase. In fact, it may be argued that these characteristics are even more important in this phase.

Automation = agility

At this point, widespread automation is essential for agility because it increases developer productivity, reduces cycle times, and eliminates many types of coding errors. Another key factor in ensuring agility in the maintenance phase is the ongoing and committed involvement of business people. An automated, template approach to the entire design, build and deployment process allows business users to be involved continuously and intimately during every stage of development and maintenance of the data warehouse and marts.

With maintenance, we come to the end of our journey through the land of automating warehouses, marts, lakes, and vaults of data. At each step of the way, combining the use of the Data Vault approach with DWA tools simplifies technical procedures and eases the business path to data-driven decision-making.