Snowflake schema illustrated
Snowflake schema has emerged as a popular schema amongst database and data warehouse design professionals. Learn Snowflake in depth in this tutorial.
In database design, there are several data modeling techniques such as conceptual, logical, and physical. But there are two schema-types that are currently popular amongst software developers and the enterprise data warehouse (EDW) professionals. They are:
In this tutorial, we will look at snowflake schema in detail, explaining how it has evolved from star and exactly where it differs.
Snowflake schema is an extension of star schema in a way; it separates itself from Star when it comes to handling large dimension tables. A star schema focuses on a centralized design with a fact table in it connecting to different dimension tables end to end.
>> Read our tutorial on Star schema
As shown in Figure 1, a Policy sales table connects to different dimension tables following the design pattern of a centralized fact table. If you want to arrive at agent wise policy- sales in terms of sum assured, it will have a ‘join’ to agent table, where we can see that agent table with channel information. If we want to arrive at zonal sales count of Policies, a ‘join’ to branch table is required. (See Figure 1: Star schema explained)
Where does the problem arrive?
The following two things can be observed:
- You need to connect the whole branch table even when you are asking for zone level information only.
- For single zone, for example, say ‘North’, there may be 200 or 300 branches. Therefore, to get North’s count of policies, we need to process all 300 rows!
Deficiency occurs when you are connecting to large dimension tables, resulting in consuming increased execution time of queries in a data warehouse. A large group of users executing similar kind of statements may result in low performance, and the purpose of having an online analytical processing (OLAP) in place fails because of this very reason. Also see: Quick facts on snowflake schema.
Figure 1: Star schema explained (Click on image to enlarge)
What leads to the problem?
If a dimension is sparse, that is, most of the possible values for the dimension have no data, and/ or a dimension has a long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and ‘snow flaking’ may be appropriate.
A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will hence be easier to implement.
>> Also see Quick facts on snowflake schema
Resolution
A snowflake schema applies normalization over a star schema, in which very large dimension tables are normalized into multiple tables. Dimensions with hierarchies can be decomposed into a snowflake structure when you want to avoid joins to big dimension tables when you are using an aggregate of the fact table.
If a fact table contains a ‘1 to many’ relationship with each of the dimensions in a data warehouse schema, then it is appropriate to use a star schema.
Although if a fact table has a ‘many-to-many relationship’ with its dimensions, i.e. many rows in a fact equal many rows in a dimension, then you must resolve this using a snow flake schema where the bridge table contains a unique key to each row in the fact table.
For example, zonal sales or count of policy can easily be derived by having a join between zone table and Zonal sales table avoiding three way join.
Snowflake and Star Schema resources
OLAP and data mining: What’s the difference?
Star schema vs. snowflake schema: Which is better?
Star schema in database: Guide to construction and composition
What are the differences between fact tables and dimension tables in star schemas?
think of a retail system where you need to maintain large dimension tables like ‘Customers’ and join of sales table is taking place with customers for customer id. Having millions of records in ‘sales’ and ‘customers’ will adversely affect the performance when generating sales reports.
Generally, in snowflake, the third normal form is applied to get the separated dimension tables.
Creation of aggregate tables with such small dimension table solves the purpose of having a snowflake in place.
The only disadvantage of snowflake is that the design becomes complicated and increases the number of dimension tables leading to overheads for maintenance.
Figure 1 can be modified following a snowflake structure as shown in Figure 2: Snowflake structure below.
Figure 2: Snowflake schema structure (Click on image to enlarge)
About the author: Mohit Tayal is Senior Consultant - Technology at Mindcraft Software. He has more than four years of experience in BI and OLTP database design and deployment. Currently he is working on a project for BNP Paribas India Services at its Global Equity and Commodity Derivatives unit. Prior to this, he worked on projects at Birla Sunlife Insurance and Polaris Software Lab.