please note: I will refer to two kinds of tables: a “table” (lowercase) is a general term for a data frame that contains data in rows and column. A “Table” (uppercase) is a specific term for a materialization, a concept that is explored below.
Databases often follow a star schema, where there is a main table, known as a fact table, that can be joined to one or more tables that provide additional information about one of the fields in the fact table. These additional tables are known as dimension tables. I have been working with a data lake that generally follows a star schema.
In this example, I have simplified the situation using a set of fake tables that can be joined on an id
column. The specifics of the tables are unimportant.
I recently had an issue that was hard to investigate. I had three models that I believed all contained data for the same set of records. There was a main model (a fact table) that filtered the data specific to the analysis and two dimensional tables which structured specific components of the data contained in the fact table. The below SQL statements describe the models I created. I use dbt to execute the models. dbt uses references (denoted by ref()
in the below models) to interprete the relationships and dependencies between models. When one model file includes a ref()
statement, this indicates that the model being referenced should be run before the model containing the reference statement to ensure data is transformed in the order the developer intended.
The fact table was created as:
-- fact.sql
SELECT *
FROM database.src_fact
LIMIT 10
The dimension tables are described in the SQL statements below. Each dimension table joined a source dimension table to the fact table and filtered for parcels contained in the fact table. I assumed the JOIN and WHERE statements would ensure that the fact table and dimension tables all contained records pertaining to the same ten records. In the below sections, I will be referring to these three models by the name of the SQL file (i.e., when I reference fact
I am referring to the above model stored in fact.sql
).
-- dimension1.sql
SELECT *
FROM database.src_dimension1
LEFT JOIN {{ ref('fact') }}
ON src_dimension1.id = fact.id
WHERE fact.id IS NOT NULL
-- dimension2.sql
SELECT *
FROM database.src_dimension2
LEFT JOIN {{ ref('fact') }}
ON src_dimension1.id = fact.id
WHERE fact.id IS NOT NULL
What we expect to happen:
In the fact
model, we are selecting a sample of ten rows from the fact table. In the subsequent dimension1
and dimension2
models, we are selecting the data in the dimension tables for the ten rows in the fact model. Because we are matching on parcel ID and only selecting for rows in the dimension table that match a parcel ID in fact
, we expect that every parcel ID in fact
will show up in dimension1
and dimension2
and vis versa. This expectation is known as referential integrity.
What actually happens:
If we extracted the data for these three models into data files and compared the parcel IDs, we would likely find that there are parcel IDs in fact
that do not appear in the dimension models and vis versa. But how can this be? We used the JOIN
and the WHERE
statements to ensure we only selected for parcel IDs that show up in the fact
table?
The answer is model materializations. When you transform data in a database or data lake, you can control how the tables are created; this creation process is called a materialization, which is usually either a Table or a View. a Table is a physical table that is stored as an actual database table (or objects in a data lake). The important part is that there is an actual entity containing the data in a Table. A View is a description of a transformation that has been stored in your database. After creating the examples models above as views, you could then reference them in subsequent database queries to build upon the transformations they describe. However, there are no files underlying the view. When you reference the view in another query, the database engine will execute the transformation stored in the view and return the result.
This execution of the view’s transformation is the crux of the issue in this example. Each time you reference a view by name, the view’s transformation gets run. In the example above, when dimension1
references fact
, the transformation stored in fact
gets run and then dimension1
queries the resulting data. Similarly, when dimension2
is executed, it first executes fact
and queries the resulting data. However, these two executions of fact
are run independent of one another so dimension1
and dimension2
are executing on different versions of the data produced by fact
.
Comparatively, if we had created fact
using a Table materialization, then there would be a single set of files created when the model is executed. All references to fact
would then be executed on those files; dimension1
and dimension2
would have used the same set of ten records (as produced by the LIMIT 10
statement) and we would have referential integrity across the three models.
In dbt, all models are materialized as views by default. To materialize as a table, you set the configuration at the top of the model file. In this example, we could materialize fact
as a Table using the below model configuration.
{{ config(materialized='table') }}
SELECT *
FROM database.table
LIMIT 10
After executing the above model for the fact table and executing the dimension models, all datasets for the three models all contained data for the same set of ten records.