Wednesday, October 5, 2022

SCD datawarehouse

 

"""
This function is a parameterized SCD merging function. Its capabilities are defined using the Kimball Data Warehousing toolkit (Third Edition)
It is common for different attributes of a given dimension to have different tracking techniques
## SCD TYPE DEFINITIONS ##
Type 0: Retain original
-----------------------
With type 0, the dimension attribute value never changes, so facts are always grouped
by this original value. Type 0 is appropriate for any attribute labeled “original,” such
as a customer's original credit score or a durable identifier. It also applies to most
attributes in a date dimension.
Example
-------
>>> df_source >>> df_target
customer credit_score customer original_credit_score
0 C00000001 730 0 C00000001 630
1 C00000001 480 1 C00000001 520
>>> scd_merge(df_source, df_target)
customer original_credit_score
0 C00000001 630
1 C00000001 520
Note: Though df_source has new values for credit_score, function retains the value found in df_target. Function also relabels the column as it is passed from df_source, signifying it is SCD0
Type 1: Overwrite
-----------------
With type 1, the old attribute value in the dimension row is overwritten with the new
value; type 1 attributes always reflects the most recent assignment, and therefore
this technique destroys history. Although this approach is easy to implement and
does not create additional dimension rows, you must be careful that aggregate fact
tables and OLAP cubes affected by this change are recomputed.
Example
-------
>>> df_source >>> df_target
customer credit_score customer current_credit_score
0 C00000001 730 0 C00000001 630
1 C00000001 480 1 C00000001 520
>>> scd_merge(df_source, df_target)
customer current_credit_score
0 C00000001 730
1 C00000001 480
Note: New values for credit score in df_source overwrite the value found in df_target. merge function assigns a tag to the column signifying SCD1 cahracter 'current_'
Type 2: Add new row
-------------------
Type 2 changes add a new row in the dimension with the updated attribute values.
This requires generalizing the primary key of the dimension beyond the natural or
durable key because there will potentially be multiple rows describing each member.
When a new row is created for a dimension member, a new primary surrogate key is
assigned and used as a foreign key in all fact tables from the moment of the update
until a subsequent change creates a new dimension key and updated dimension row.
A minimum of three additional columns should be added to the dimension row
with type 2 changes: 1) row effective date or date/time stamp; 2) row expiration
date or date/time stamp; and 3) current row indicator.
Example
-------
>>> df_source >>> df_target
customer credit_score customer effective_from effective_to is_current credit_score
0 C00000001 730 0 C00000001 2022-01-01 9999-12-31 1 630
1 C00000001 480 1 C00000001 2022-01-01 9999-12-31 1 520
>>> scd_merge(df_source, df_target) // date of merge in example is 2022-01-31
customer effective_from effective_to is_current credit_score
0 C00000001 2022-01-01 2022-01-30 0 630
1 C00000001 2022-01-31 9999-12-31 1 730
2 C00000001 2022-01-01 9999-12-31 0 520
3 C00000001 2022-01-01 9999-12-31 1 480
Type 3: Add New Attribute
-------------------------
Type 3 changes add a new attribute in the dimension to preserve the old attribute
value; the new value overwrites the main attribute as in a type 1 change. This kind of
type 3 change is sometimes called an alternate reality. A business user can group and
filter fact data by either the current value or alternate reality. This slowly changing
dimension technique is used relatively infrequently.
Type 4: Add Mini-Dimension
--------------------------
The type 4 technique is used when a group of attributes in a dimension rapidly
changes and is split off to a mini-dimension. This situation is sometimes called a
rapidly changing monster dimension. Frequently used attributes in multimillion-row
dimension tables are mini-dimension design candidates, even if they don't fre-
quently change. The type 4 mini-dimension requires its own unique primary key;
the primary keys of both the base dimension and mini-dimension are captured in
the associated fact tables.
Type 5: Add Mini-Dimention and Type 1 Outrigger
-----------------------------------------------
The type 5 technique is used to accurately preserve historical attribute values,
plus report historical facts according to current attribute values. Type 5 builds on
the type 4 mini-dimension by also embedding a current type 1 reference to the
mini-dimension in the base dimension. This enables the currently-assigned mini-
dimension attributes to be accessed along with the others in the base dimension
without linking through a fact table. Logically, you'd represent the base dimension
and mini-dimension outrigger as a single table in the presentation area. The ETL
team must overwrite this type 1 mini-dimension reference whenever the current
mini-dimension assignment changes.
Type 6: Add Type 1 attributes to Type 2 Dimension
-------------------------------------------------
Like type 5, type 6 also delivers both historical and current dimension attribute
values. Type 6 builds on the type 2 technique by also embedding current type
1 versions of the same attributes in the dimension row so that fact rows can be
filtered or grouped by either the type 2 attribute value in effect when the measure-
ment occurred or the attribute's current value. In this case, the type 1 attribute is
systematically overwritten on all rows associated with a particular durable key
whenever the attribute is updated.
Type 7: Dual Type 1 and Type 2 Dimensions
-----------------------------------------
Type 7 is the final hybrid technique used to support both as-was and as-is report-
ing. A fact table can be accessed through a dimension modeled both as a type 1
dimension showing only the most current attribute values, or as a type 2 dimen-
sion showing correct contemporary historical profiles. The same dimension table
enables both perspectives. Both the durable key and primary surrogate key of the
dimension are placed in the fact table. For the type 1 perspective, the current flag
in the dimension is constrained to be current, and the fact table is joined via the
durable key. For the type 2 perspective, the current flag is not constrained, and the
fact table is joined via the surrogate primary key. These two perspectives would be
deployed as separate views to the BI applications.
"""
def scd_merge(df_source, df_target, table_params: dict):
return df_merged

No comments:

Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...