A star schema is built in a database in order to enable faster retrieval from the database. The query is run over a select set of data rather than over an OLTP setup. Thus, segregation between an OTLP and online analytical processing (OLAP) is required for building a star schema.
This star schema in database case study at Birla SunLife Insurance involves the handling of information related to policy and customer in the system. The core OLTP system, Ingenieum, handles day to day transactions as a master system along with the receipt writer and channel management system. The consolidated data sets come together into the database that hosts different schemas including the star schema.
Identify the source
One needs to know the source of the OLTP systems to build a robust architecture that can support schemas and run queries. For our star schema in database scenario we have DB2, SQL Server and Oracle as OLTP databases and Teradata as an OLAP system.
At the end of the timeline that has been fixed, there will be no more transactions in the OLTP. Data transfer from OLTP to landing schema is achieved using an ETL tool or a script that processes batch files of data, which are targeted to the data warehouse where the multiple schema structures will work with these files.
The star schema in database will depend on the deliverables that are expected in terms of reporting, analysis and performance trending. The core structure of the star schema comprises basic fact and dimension tables.
Build fact and dimension tables
The fact table holds the measures (facts). These are numeric and additive across some or all of the dimensions. For example, sales figures are numeric and users can look at total sales for a product, or category, or subcategory, and by any time period.
While the dimension tables are short and fat, the fact tables are generally long and skinny. They are long because they can hold the number of records represented by the product of the counts in all the dimension tables.
For example, consider the simplified star schema in database depicted in Figure 1. In this schema, there are product, time and store dimensions. With 10 years of daily data, 200 stores and 500 products, there is a potential of 365,000,000 records (3650 days * 200 stores * 500 products). Adding another dimension, such as 10,000 customers, will increase the number of records by 10,000.
Figure 1. Star schema in database.
The primary key in the fact table is made up of foreign keys that have migrated from the dimension tables. These fields are typically integer values. In addition, the measures are also numeric. Therefore, the size of each record is generally much narrower than those in the dimension tables.
The granularity, or frequency, of the data is determined by the lowest level of granularity of each dimension table. For example, a fact table may store weekly or monthly totals for individual products. The lower the granularity, the more records that will exist in the fact table. The granularity also determines how far users can drill down without returning to the base, transaction-level data.
A major benefit of the star schema in the database is that the low-level transactions may be summarized to the fact table grain. This speeds the queries performed as part of the decision support process. However, the aggregation or summarization of the fact table is not always done if cubes are being built.
Fact table size
The previous section discussed how 500 products sold in 200 stores and tracked for 10 years could produce 365,000,000 records in a fact table with a daily grain. This is the maximum size for the table. A star schema in a database cannot store a zero value unless it has some significance. So, if a product did not sell at a particular store for a particular day, the system would not store a zero value.
Even though the fact table is sparsely populated, it still holds the vast majority of the records in the database and is responsible for almost all of the disk space used. The lower the granularity, the larger the fact table. In the previous example, moving from a daily to weekly grain would reduce the potential number of records to just about 52,000,000 records.
The data types for the fields in the fact table help keep it as small as possible. In most fact tables, all of the fields are numeric, which can require less storage space than the long descriptions we find in the dimension tables.
Finally, be aware that each added dimension can greatly increase the size of the fact table. If just one dimension was added to the previous example that included 20 possible values, the potential number of records would reach 7.3 billion.
With a star schema in database, changing of database attributes is an issue. For example, in the StoreDimension in Figure 1, if regions, territories or zones of stores are changed for any reason, the numbers will not be accurate for historical sales. By simply updating the region for a store, the total sales for that region will appear as if the current structure has always been true. Thus, the business would have "lost" history.
In fact, the company might want to see what the sales would have been had this store been in that other region in prior years. More often, businesses do not want to change the historical data. The typical approach is to create a new record for the store, which contains the new region, but leaves the old store record (and therefore the old regional sales data) intact. This prevents companies from comparing this store’s current sales to its historical sales unless the previous StoreID is preserved. The answer is to keep the existing StoreName (the primary key from the source system) on both records and instead add BeginDate and EndDate fields to indicate when each record is active. The StoreID is a surrogate key, so each record has a different StoreID but the same StoreName, allowing data to be examined for the store across time regardless of its reporting structure.
The data in the fact table is already aggregated to the fact table's grain. Users often ask for aggregated values at higher levels. For example, they may want to sum sales to a monthly or quarterly number. In addition, users may be looking for a total at a product or category level.
These numbers can be calculated using a standard SQL statement, but such calculations takes time. Hence, with star schema in database, it is better to store higher-level aggregations by pre-calculating them and storing them in the fact table. This requires that the lowest-level records have special values. For example, a TimeDimension record that actually holds weekly totals might have a 9 in the DayOfWeek field to indicate that this particular record holds the total for the week.
A second approach is to build another fact table but at the weekly grain. This works well for storing data summarized at various levels, but the problem comes into play when examining the number of possible tables needed. To summarize at the weekly, monthly, quarterly, and yearly levels by product, four tables are needed in addition to the "real", or daily, fact table. However any other combination, such as weekly totals by product subcategory or monthly totals by store, would each require its own table.
The above approach has been used in the past, but better alternatives exist. These alternatives usually consist of building a cube structure to hold pre-calculated values. Going beyond star schema in database, cubes were designed to address the issues of calculating aggregations at a variety of levels and responding to queries quickly.
There are pre-built schemas available. While selecting a data warehouse keep this in mind along with functionalites such as running multiple queries and usage by multiple users. Multiple engines and joining of a large number of tables to run a single query must be supported, if such requirements exist in the organization. You may want to test the schemas of your data warehouse to ensure they are in line with your business needs, before making any purchase decision.
About the author: Mohit Tayal is Senior Consultant - Technology at Mindcraft Software Pvt Ltd. Currently he is building a BI solution for Birla Sunlife Insurance Ltd. He has around 4 years of work experience in BI and OLTP databases. Previously he has also worked with Polaris Software Lab Ltd.
(As told to Sharon D'Souza)