
Data Modeling Guide: Treating Data as a Product

Diagram: the BI/data journey, from ingesting source data into a data lake, through transforming it into a dimensional model, to providing business insights.

Versatile Data Kit provides data engineers with an all-inclusive toolkit for managing data. It covers capabilities from ingestion to processing and warehousing.

This guide aims to help users convert the raw data they acquire into useful, polished, and functional data products that can deliver key insights.

A basic understanding of SQL, data pipelines, and Versatile Data Kit is assumed for this guide.

What is a Data Lake?

Data Lakes serve as the initial landing zone for ingested data. Data stored here is structured and can be accessed through SQL, yet remains 'raw.' It's stored in a model closely mirroring the source system and at the highest level of granularity.

It lacks any cleansing, business logic, or aggregations. To transform this raw data into cleaned and business-oriented data products, we utilize the data warehouse and data marts.
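For illustration, a raw table in the data lake might mirror the source system one-to-one: every column as delivered, one row per source record, and no casting or cleansing applied. The sketch below is hypothetical; it reuses the history_staging.store_sales_example table queried later in this guide, with assumed columns:

CREATE TABLE IF NOT EXISTS history_staging.store_sales_example(
    order_id        STRING    COMMENT 'Order identifier, exactly as delivered by the source system',
    customer_id     STRING    COMMENT 'Customer identifier, uncleansed',
    total_sale      STRING    COMMENT 'Sale amount; not yet cast to a numeric type',
    quantity        STRING    COMMENT 'Quantity sold; not yet cast to a numeric type',
    pa__arrival_ts  TIMESTAMP COMMENT 'The time at which a record arrived in the data lake'
)
COMMENT 'Raw sales records mirroring the source system; no cleansing, business logic, or aggregations'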

Data Warehouse, Data Marts, and why use them?

What is a Data Warehouse?

A Data Warehouse serves as a repository for refined datasets. These datasets are processed and ready to be easily utilized by businesses for specific use-cases.

What are Data Marts?

Data Marts are essentially subsets of the data warehouse. Each data mart is owned by different teams and oriented towards different business lines, allowing each team to focus on their specific data needs.

Data Lake vs. Data Mart

The fundamental difference between a data lake and a data mart lies in their structure and purpose.

  • Data Lake: This is where vast amounts of raw data are stored. The data in a data lake, such as telemetry data for products or services, is unrefined and is not modeled for any specific business purpose.
  • Data Mart: In contrast, a data mart contains a smaller amount of data, which is built from the raw source data stored in the lake. Here, transformations and aggregations are applied to the raw data with a specific use-case in mind. Consequently, data marts provide a single, easily accessible source that can be used later for various purposes including reporting, analytics, data science, and driving data-powered applications.

In the context of a data mart, datasets are treated as products. Like any product, these datasets should be tested, documented, made available, monitored, and operated upon. This is an essential, yet resource-intensive activity. Hence, making strategic decisions regarding the data mart model is of paramount importance.

How to Model Data Marts?

The short answer is that it all depends on the team's use case. Below, we outline some common ways.

Simple tables

You can save smaller (filtered or aggregated) versions of your data lake tables in a data mart. This way your reports will be based on smaller tables, thus improving efficiency and speed. This is a straightforward choice if your tables are unrelated to each other and are frequently used for reporting.

CREATE TABLE IF NOT EXISTS mart_model_example.fact_sales(
    order_id                  INT       COMMENT 'Primary key; Foreign key to mart_model_example.dim_time',
    product_id                INT       COMMENT 'Foreign key to mart_model_example.dim_product',
    customer_id               STRING    COMMENT 'Foreign key to mart_model_example.dim_customer',
    employee_id               INT       COMMENT 'Foreign key to mart_model_example.dim_employee',
    total_sales               BIGINT    COMMENT 'Total sales per order',
    total_quantity            BIGINT    COMMENT 'Total quantity per order',
    pa__arrival_ts            TIMESTAMP COMMENT 'The time at which a record arrived in the data lake'
)
COMMENT 'UK: order_id, provides sales information per order id'

Simple Views

If you are frequently writing the same script, it may be useful to save it as a view. A view is essentially a virtual table produced by a pre-defined (and stored) SQL script. This SQL script can join multiple tables, filter, aggregate, or perform other logic on one or more tables. The results of a view are recomputed each time you open/call the view, and the results themselves are not physically stored in the database. Views are typically used when working with smaller tables (fewer than 1 million records/rows). Otherwise, you might consider materializing the view by storing the result as a table.

CREATE VIEW IF NOT EXISTS mart_model_example.vw_fact_sales
AS
    SELECT CAST(order_id AS INT) AS order_id,
        TRIM(LOWER(customer_id)) AS customer_id,
        SUM(CAST(total_sale AS INT)) AS total_sales,
        SUM(CAST(quantity AS INT)) AS total_quantity
    FROM history_staging.store_sales_example
    GROUP BY
        CAST(order_id AS INT),
        TRIM(LOWER(customer_id))

Normalisation

DataCamp defines normalisation as a database design technique that divides tables into smaller tables and connects them via relationships. The goal is to reduce redundancy and increase data integrity. There are different levels of normalisation, depending on the desired degree of storage optimisation and data integrity. However, with intense normalisation come query complexity and computational burdens: queries become much more complex to write (multiple joins) and require more compute. For a more thorough dive into normalisation, refer to dedicated resources on database normal forms.
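As a minimal, hypothetical sketch of normalisation: a denormalised sales table that repeats the customer's name and city on every row can be split into two related tables, so that customer details are stored once and referenced by key. The table and column names below are illustrative:

-- Before: customer_name and customer_city are repeated on every sales row.
-- After: customer details live in their own table, referenced via customer_id.
CREATE TABLE IF NOT EXISTS mart_model_example.customers(
    customer_id    STRING COMMENT 'Primary key',
    customer_name  STRING COMMENT 'Stored once per customer instead of once per sale',
    customer_city  STRING COMMENT 'Stored once per customer'
);

CREATE TABLE IF NOT EXISTS mart_model_example.orders(
    order_id       INT    COMMENT 'Primary key',
    customer_id    STRING COMMENT 'Foreign key to mart_model_example.customers',
    total_sales    BIGINT COMMENT 'Total sales per order'
);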

Star Schema

Oftentimes the raw data or the business use-case requires the use of multiple tables, which are related in nature. In addition, we would like our data to be useful to other teams, so it should be modeled in a straightforward and logical manner. To accomplish this, we can utilize the dimensional model and the star schema.

Fact table

A fact table contains measures of a certain metric. For example, a store's sales would be a fact table, where each record/row would be a sale made by that store. This type of table changes regularly and is designed to connect to one or more dimension tables.

Dimension table

Dimension tables connect to the fact table to provide more descriptive information about a record. This descriptive information does not change very often or at all. For example, a dimension table can add descriptive information about a product.
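To complement the fact_sales table shown earlier, a product dimension could look like the hypothetical sketch below; the descriptive columns are assumptions for illustration:

CREATE TABLE IF NOT EXISTS mart_model_example.dim_product(
    product_id        INT       COMMENT 'Primary key; joined from mart_model_example.fact_sales',
    product_name      STRING    COMMENT 'Descriptive attribute; changes rarely, if at all',
    product_category  STRING    COMMENT 'Descriptive attribute; changes rarely, if at all',
    pa__arrival_ts    TIMESTAMP COMMENT 'The time at which a record arrived in the data lake'
)
COMMENT 'UK: product_id, provides descriptive information per product'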

In a Star schema, dimension tables extend only one level out from the fact table. This means that a dimension table relates only to the fact table and is not further normalised into other dimension tables. Below is a simple example of such a Star schema.

Diagram: a fact table linked to four dimension tables.

Using the store sales example, a possible Star schema could look something like the following. In this example, you have a fact table that contains the store's sales and IDs for each attribute (product, order, customer, and employee), which are used to link to the dimension tables to acquire more information if needed. Everything is one left join away, as the query sketch after the diagram shows.

Diagram: Star schema example with fact_sales linked to dim_time, dim_product, dim_customer, and dim_employee.
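To make "one left join away" concrete, here is a hypothetical report query against this schema; it assumes the dim_product columns sketched earlier:

SELECT d.product_name,
       SUM(f.total_sales) AS total_sales
FROM mart_model_example.fact_sales AS f
LEFT JOIN mart_model_example.dim_product AS d
    ON f.product_id = d.product_id
GROUP BY d.product_name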

Snowflake schema

The easiest way to think about the Snowflake schema is as an extension of the Star schema. This is because the dimension tables are further normalized to save storage by reducing redundancy (and to optimize data integrity).

However, the Snowflake model can also greatly increase the complexity of modeling and querying. In addition, when working with an object storage DB, compute is more expensive than storage. Thus, the Snowflake schema is more computationally burdensome than the Star schema. As such, it is often avoided. Below is a simple example of a Snowflake schema.

Diagram: Snowflake schema with a fact table, dimensions, and sub-dimensions.
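For example, a snowflaked variant of the dim_product sketch could itself be normalised: its category attributes move out into a sub-dimension, trading one extra join for less redundancy. The tables and columns below are hypothetical:

-- dim_product no longer stores category details; it references a sub-dimension.
CREATE TABLE IF NOT EXISTS mart_model_example.dim_product(
    product_id    INT    COMMENT 'Primary key; joined from the fact table',
    product_name  STRING COMMENT 'Descriptive attribute',
    category_id   INT    COMMENT 'Foreign key to mart_model_example.dim_category'
);

CREATE TABLE IF NOT EXISTS mart_model_example.dim_category(
    category_id    INT    COMMENT 'Primary key',
    category_name  STRING COMMENT 'Stored once per category instead of once per product'
);

Reaching category_name from the fact table now takes two joins instead of one, which is exactly the extra query complexity and compute cost described above.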

Video Webinar on Practical Dimensional Modeling 101

Check out the video from one of our community meetings here.

Data Mart Loading Strategies

With the theory taken care of, it's important to talk about the logistics of building the tables in a data mart. The Versatile Data Kit (VDK) provides users with templates that easily load data into a data mart, based on a table's characteristics.

Typically, users apply the append or overwrite strategies to fact tables and the overwrite or versioned strategies to dimension tables, but that need not be the case for everyone. It depends on the intricacies of your data and use-case.
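As a rough sketch of what these strategies mean in plain Hive-style SQL (the VDK templates encapsulate this logic for you; the statements below are an illustration, not the templates' actual implementation, and the staging tables are hypothetical):

-- Append: add newly arrived records to a fact table.
INSERT INTO mart_model_example.fact_sales
SELECT * FROM history_staging.new_sales_batch;

-- Overwrite: fully rebuild a dimension table from the latest source data.
INSERT OVERWRITE TABLE mart_model_example.dim_product
SELECT * FROM history_staging.latest_products;

The versioned strategy keeps history by closing out old records and inserting new versions, which is more involved and best left to the corresponding template.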

To get the full list of available Versatile Data Kit templates, visit SQL Data Processing Templates Examples.

Wrap-up

Congratulations on making it to the end of this tutorial! We've traversed a wide range of topics from understanding the difference between Data Lakes, Data Warehouses, and Data Marts, to learning about various data modeling techniques such as Star and Snowflake schemas.

Conclusion

It is important to ensure that data is ingested into a Data Lake with little to no modification. Data Marts are then used to create well-modeled datasets that can be treated as products. Furthermore, the decision to use a specific data modeling technique or a particular data mart loading strategy depends heavily on the specific requirements of your use case. The key is to find a balance between computational efficiency, ease of use, and maintenance that best suits your unique situation.

What's Next?

If you want to go a step further, you can use the Versatile Data Kit to ingest this data into a data lake and then transform and load it into a data mart using the different loading strategies discussed. You can refer to the Ingestion Examples and SQL Data Processing Templates Examples for assistance.
