Star schemas and snowflake schemas are the two predominant types of data warehouse schemas.
A data warehouse schema refers to the shape your data takes - how you structure your tables and their mutual relationships within a database or data warehouse.
Since the primary purpose of a data warehouse (and other Online Analytical Processing (OLAP) databases) is to provide a centralized view of all the enterprise data for analytics, data warehouse schemas help us achieve superior analytic results.
How do schemas help analytics? And what are the differences and trade-offs between star and snowflake schemas?
In this article, we compare the two dominant variants of data warehouse schemas and contrast their advantages and disadvantages.
To understand the data modeling behind a star schema, let us look at a retail example. Imagine you are running an international shopping brand and you want to analyze purchases across your physical locations.
You pull out data from your database as an Excel file:
But you soon realize there are too many rows, and the data needs to be cleaned before you can analyze it.
You decide to turn the data into a star schema.
A star schema is a data model that stores information in multiple table types: a single fact table and multiple dimensional tables.
In contrast to the classical database design of normalizing tables, star schemas connect dimensional data with fact data in a shape resembling a star (hence the name), as can be seen from the following diagram:
In the diagram, we see a central fact table (holding all the facts of the sales) and four dimension tables - a separate table describing the customer, date (of purchase), store where the purchase happened, and product purchased.
The fact table is linked via a foreign key relationship to the primary key of each dimension (aka, the id in each dimension table, for example, the customer_id links the customer from the dim_customer table to the fact_sales table).
This type of data modeling allows us to query data faster and with simpler queries than the normalized database design.
A snowflake schema is very similar to the simple star schema above. The main difference is that snowflake schemas split dimensional tables into further dimensional tables (also called lookup tables).
For example, the above diagram would show the customer_country field being split into further dimensional tables:
Each dimension is split until it is normalized - aka, there is no redundancy in the dimensional table, no repetition of values (except for identifier values, such as id’s).
The snowflake schema is a fully normalized data structure. Dimensional hierarchies (such as city > country > region) are stored in separate dimensional tables.
On the other hand, star schema dimensions are denormalized. Denormalization refers to the repeating of the same values within a table.
Star schema stores redundant data in dimension tables, while snowflake schema fully normalizes dimension tables and avoids data redundancy.
For example, a star schema would repeat the values in field customer_address_country for each order from the same country.
The redundancy, or duplicated entries, occurs because of the denormalization vs normalization schema design.
A simple star schema leads to simple query writing. Because the fact table is joined to only one level of dimensional tables, analysts do not need to write multiple joins.
On the other hand, snowflake schemas require a more complex query design. Because of complex relationships between the fact table and its dimensional tables, more joins are needed to link the additional tables. This causes an additional overhead when writing analytical queries.
The query execution time is faster in star schemas. Because they require a single join between a fact and its set of attributes in dimensional tables, a star schema acts almost as a single table for query lookups.
In contrast, snowflake schemas require complex joins of dimensional tables with their own sub-dimensional or supra-dimensional tables. This slows down query processing and can affect other OLAP products such as cube processing.
Star schemas might run queries faster, but they require more storage space than snowflake schemas because of their data redundancy.
Data integrity is more at risk in star schemas than snowflake schemas. Because data is stored redundantly, multiple copies of the same data exist in the star schema’s dimensional tables. This means new inserts, updates, or deletes can compromise the integrity of data.
In contrast, the snowflake schema is less prone to data integrity issues, because it fully normalizes dimensional tables, storing dimension data only once in the appropriate table.
Star schemas are easier to design and set up. Because they are represented by simple relationships, it is easy for a data engineer or data architect to set up an appropriate star schema.
On the other hand, star schemas are harder to maintain than snowflake schemas. As new data is ingested into the data warehouse, star schemas become harder to maintain and check against data integrity violations.
Out of the two types of data warehouse schema, which one should you choose?
On one hand, star schemas are simpler, run queries faster, and are easier to set up.
On the other hand, snowflake schemas are less prone to data integrity issues, are easier to maintain, and utilize less space.
Based on the tradeoffs above, it depends on which advantage (or disadvantage) best suits your business use cases.
Keboola is an end-to-end data operation tool that helps you automate your end-to-end data pipelines.
Designing the right data warehouse schema is hard enough.
Lessen the burden on your engineers by automating all the data extraction, transformation, and loading into databases and data warehouses of your choice that follow schema design.
Try it out.
Keboola offers a no-questions-asked, always-free tier, so you can play around and build your pipelines with a couple of clicks.