Understanding the Snowflake Schema in Data Warehousing: A Detailed Guide

In the realm of data warehousing, choosing the right schema for your database is a fundamental step. Two primary designs are often used in data warehouses: star schema and snowflake schema. While the star schema is simpler and often more efficient for certain queries, the snowflake schema is normalized, meaning it removes redundancy and can result in better performance in some cases. In this blog, we will delve into the details of the snowflake schema, its structure, benefits, and drawbacks.

What is a Snowflake Schema?

link to this section

The snowflake schema is a variant of the star schema used in data warehousing and business intelligence. The term "snowflake" is derived from the schema's resemblance to a snowflake's shape when visualized. The structure extends the star schema through the use of additional normalization.

Normalization is a database design technique that reduces redundancy and improves integrity by organizing fields and tables in a database. It involves dividing a database into two or more tables and defining relationships between the tables.

Structure of a Snowflake Schema

link to this section

In the snowflake schema, each point of the star schema is explored further and broken down into more parts, adding additional dimensions. This makes the diagram resemble a snowflake with more complex branch patterns.

Like the star schema, the snowflake schema also consists of a fact table in the middle, surrounded by dimension tables. However, these dimension tables are further split into additional tables. Each dimension table is normalized, split into more tables, and these tables are then linked by relationships.

Deeper Understanding of Snowflake Schema Components

link to this section

Let's delve deeper into the specific components that make up a snowflake schema.

Fact Table

At the heart of a snowflake schema is the fact table. It contains the business data that is being analyzed, along with foreign keys to the dimension tables. This is where measurements, metrics, and facts of business processes are stored.

Dimension Tables

Surrounding the fact table are multiple dimension tables, which provide the context for the data stored in the fact table. Each dimension table contains a primary key, which corresponds to the foreign keys in the fact table.

In a snowflake schema, these dimension tables are normalized. This means that the data is organized to reduce redundancy, and related data is separated into additional sub-dimension tables. For example, a "Location" dimension table could be split into "City", "State", and "Country" tables.

Relationships

The relationships between the fact table and the dimension tables (and between the dimension tables themselves) are a key feature of the snowflake schema. These relationships allow for complex querying across multiple tables, enabling a wide range of business analytics.

Example of Snowflake Model

link to this section

Let's consider an example of a retail business to design a snowflake schema.

Fact Table: Sales

The Sales table is the central fact table. Each entry represents a unique sale, with details about the sale and references to the associated dimensions: product, location, and time. It's where we keep measures (or facts) we are interested in, such as QuantitySold, TotalSales, and ProfitMargin.

The foreign keys in this table (ProductID, LocationID, TimeID) create connections with the dimension tables, allowing for detailed and varied analysis of the sales data.

SalesID ProductID LocationID TimeID QuantitySold TotalSales ProfitMargin
1 P1 L1 T1 100 500 50
2 P2 L2 T2 200 1000 100

Dimension Table: Product

The Product table contains details about each product sold by the business. ProductID serves as the primary key for unique identification of each product. This table is normalized, with the Category and Manufacturer details being stored in separate tables and linked via foreign keys (CategoryID, ManufacturerID).

ProductID ProductName CategoryID ManufacturerID
P1 Product1 C1 M1
P2 Product2 C2 M2

    Dimension Tables: Category & Manufacturer

    These tables are further details of the Product dimension. They store unique categories and manufacturers associated with the products, reducing redundancy in the Product table.

    CategoryID CategoryName
    C1 Category1
    C2 Category2
    ManufacturerID ManufacturerName
    M1 Manufacturer1
    M2 Manufacturer2

    Dimension Table: Location

    The Location table holds information about where each sale has taken place. Like the Product table, it also includes a foreign key (CityID), which links to the City table, providing more detailed information and reducing redundancy in the Location table.

    LocationID StoreName CityID StoreType
    L1 Store1 C1 Type1
    L2 Store2 C2 Type2

      Dimension Tables: City, Region, and Country

      These tables store hierarchically related information about the physical location of each sale. Each table connects to the next, forming a chain that allows for analysis at different geographical levels. This is a great example of how snowflake schema handles hierarchical data.

      CityID CityName RegionID
      C1 City1 R1
      C2 City2 R2
      RegionID RegionName CountryID
      R1 Region1 CO1
      R2 Region2 CO2
      CountryID CountryName
      CO1 Country1
      CO2 Country2

      Dimension Table: Time

      The Time table serves as a date dimension. It breaks down the date of each sale into logical parts such as Day, Month, Quarter, and Year, which allows for time-based analysis, like monthly sales reports or yearly comparisons.

      TimeID Date Day Month Quarter Year
      T1 01-01-23 1 1 1 2023
      T2 01-02-23 1 2 1 2023

      In the snowflake schema, these tables are all interconnected through the use of foreign keys, which allows for complex queries that span multiple tables. Each dimension table is normalized, meaning that related data is grouped into separate tables, resulting in fewer redundancies and more efficient storage. This also results in a schema that closely mirrors the business processes being modelled, making it easier to understand for those familiar with the business.

      Handling Hierarchies in Snowflake Schema

      link to this section

      The snowflake schema is particularly adept at handling hierarchical data. Because each level of the hierarchy is typically represented by a separate table, it's straightforward to navigate up and down the hierarchy through simple joins. This makes the snowflake schema a good choice for datasets with complex, multi-level hierarchies.

      Impact on Data Warehouse Performance

      link to this section

      The increased number of tables and joins in a snowflake schema can have implications for the performance of the data warehouse. While the reduced redundancy can save on storage and potentially improve query performance for specific scenarios, the added complexity of the schema can slow down queries.

      Furthermore, the normalization of the dimension tables can lead to more disk I/O operations, which could impact performance. Therefore, it's crucial to thoroughly test the performance of a snowflake schema under realistic workloads before deciding to use it.

      Schema Evolution

      link to this section

      Another point to consider is schema evolution. In a snowflake schema, changes to the schema can be more complex to implement, due to the normalization of the dimension tables. For example, adding a new level to a hierarchy can require the creation of a new table, and potentially complex changes to the existing relationships.

      Conclusion

      link to this section

      While the snowflake schema comes with numerous benefits like detailed analysis, reduced redundancy, and efficient storage use, it also presents challenges such as increased complexity and potential performance issues. Whether you should opt for a snowflake schema or not highly depends on your specific business requirements, data size, and the nature of queries your system will handle.