SCD datawarehouse
""" | |
This function is a parameterized SCD merging function. Its capabilities are defined using the Kimball Data Warehousing toolkit (Third Edition) | |
It is common for different attributes of a given dimension to have different tracking techniques | |
## SCD TYPE DEFINITIONS ## | |
Type 0: Retain original | |
----------------------- | |
With type 0, the dimension attribute value never changes, so facts are always grouped | |
by this original value. Type 0 is appropriate for any attribute labeled “original,” such | |
as a customer's original credit score or a durable identifier. It also applies to most | |
attributes in a date dimension. | |
Example | |
------- | |
>>> df_source >>> df_target | |
customer credit_score customer original_credit_score | |
0 C00000001 730 0 C00000001 630 | |
1 C00000001 480 1 C00000001 520 | |
>>> scd_merge(df_source, df_target) | |
customer original_credit_score | |
0 C00000001 630 | |
1 C00000001 520 | |
Note: Though df_source has new values for credit_score, function retains the value found in df_target. Function also relabels the column as it is passed from df_source, signifying it is SCD0 | |
Type 1: Overwrite | |
----------------- | |
With type 1, the old attribute value in the dimension row is overwritten with the new | |
value; type 1 attributes always reflects the most recent assignment, and therefore | |
this technique destroys history. Although this approach is easy to implement and | |
does not create additional dimension rows, you must be careful that aggregate fact | |
tables and OLAP cubes affected by this change are recomputed. | |
Example | |
------- | |
>>> df_source >>> df_target | |
customer credit_score customer current_credit_score | |
0 C00000001 730 0 C00000001 630 | |
1 C00000001 480 1 C00000001 520 | |
>>> scd_merge(df_source, df_target) | |
customer current_credit_score | |
0 C00000001 730 | |
1 C00000001 480 | |
Note: New values for credit score in df_source overwrite the value found in df_target. merge function assigns a tag to the column signifying SCD1 cahracter 'current_' | |
Type 2: Add new row | |
------------------- | |
Type 2 changes add a new row in the dimension with the updated attribute values. | |
This requires generalizing the primary key of the dimension beyond the natural or | |
durable key because there will potentially be multiple rows describing each member. | |
When a new row is created for a dimension member, a new primary surrogate key is | |
assigned and used as a foreign key in all fact tables from the moment of the update | |
until a subsequent change creates a new dimension key and updated dimension row. | |
A minimum of three additional columns should be added to the dimension row | |
with type 2 changes: 1) row effective date or date/time stamp; 2) row expiration | |
date or date/time stamp; and 3) current row indicator. | |
Example | |
------- | |
>>> df_source >>> df_target | |
customer credit_score customer effective_from effective_to is_current credit_score | |
0 C00000001 730 0 C00000001 2022-01-01 9999-12-31 1 630 | |
1 C00000001 480 1 C00000001 2022-01-01 9999-12-31 1 520 | |
>>> scd_merge(df_source, df_target) // date of merge in example is 2022-01-31 | |
customer effective_from effective_to is_current credit_score | |
0 C00000001 2022-01-01 2022-01-30 0 630 | |
1 C00000001 2022-01-31 9999-12-31 1 730 | |
2 C00000001 2022-01-01 9999-12-31 0 520 | |
3 C00000001 2022-01-01 9999-12-31 1 480 | |
Type 3: Add New Attribute | |
------------------------- | |
Type 3 changes add a new attribute in the dimension to preserve the old attribute | |
value; the new value overwrites the main attribute as in a type 1 change. This kind of | |
type 3 change is sometimes called an alternate reality. A business user can group and | |
filter fact data by either the current value or alternate reality. This slowly changing | |
dimension technique is used relatively infrequently. | |
Type 4: Add Mini-Dimension | |
-------------------------- | |
The type 4 technique is used when a group of attributes in a dimension rapidly | |
changes and is split off to a mini-dimension. This situation is sometimes called a | |
rapidly changing monster dimension. Frequently used attributes in multimillion-row | |
dimension tables are mini-dimension design candidates, even if they don't fre- | |
quently change. The type 4 mini-dimension requires its own unique primary key; | |
the primary keys of both the base dimension and mini-dimension are captured in | |
the associated fact tables. | |
Type 5: Add Mini-Dimention and Type 1 Outrigger | |
----------------------------------------------- | |
The type 5 technique is used to accurately preserve historical attribute values, | |
plus report historical facts according to current attribute values. Type 5 builds on | |
the type 4 mini-dimension by also embedding a current type 1 reference to the | |
mini-dimension in the base dimension. This enables the currently-assigned mini- | |
dimension attributes to be accessed along with the others in the base dimension | |
without linking through a fact table. Logically, you'd represent the base dimension | |
and mini-dimension outrigger as a single table in the presentation area. The ETL | |
team must overwrite this type 1 mini-dimension reference whenever the current | |
mini-dimension assignment changes. | |
Type 6: Add Type 1 attributes to Type 2 Dimension | |
------------------------------------------------- | |
Like type 5, type 6 also delivers both historical and current dimension attribute | |
values. Type 6 builds on the type 2 technique by also embedding current type | |
1 versions of the same attributes in the dimension row so that fact rows can be | |
filtered or grouped by either the type 2 attribute value in effect when the measure- | |
ment occurred or the attribute's current value. In this case, the type 1 attribute is | |
systematically overwritten on all rows associated with a particular durable key | |
whenever the attribute is updated. | |
Type 7: Dual Type 1 and Type 2 Dimensions | |
----------------------------------------- | |
Type 7 is the final hybrid technique used to support both as-was and as-is report- | |
ing. A fact table can be accessed through a dimension modeled both as a type 1 | |
dimension showing only the most current attribute values, or as a type 2 dimen- | |
sion showing correct contemporary historical profiles. The same dimension table | |
enables both perspectives. Both the durable key and primary surrogate key of the | |
dimension are placed in the fact table. For the type 1 perspective, the current flag | |
in the dimension is constrained to be current, and the fact table is joined via the | |
durable key. For the type 2 perspective, the current flag is not constrained, and the | |
fact table is joined via the surrogate primary key. These two perspectives would be | |
deployed as separate views to the BI applications. | |
""" | |
def scd_merge(df_source, df_target, table_params: dict): | |
return df_merged |
Comments