✨ feat( DataExpert.io ): Day 2 lecture
glopez-dev committed Nov 22, 2024
1 parent c8ea1a7 commit 66a8949
Showing 14 changed files with 565 additions and 82 deletions.
Original file line number Diff line number Diff line change
@@ -5,9 +5,9 @@ author: Gabriel LOPEZ
title: (DataExpert.io) Bootcamp - Day 1 - Lecture
---

> Today lecture with deal with **complex data types** and **cumulation**
Today's lecture will deal with **complex data types** and **cumulation**

### What is a dimension ?
## What is a dimension ?

Dimensions are the attributes of an entity

@@ -0,0 +1,173 @@
---
date: '2024-11-17T19:21:32+01:00'
draft: false
title: (DataExpert.io) Bootcamp - Day 2 - Lecture
---
Today's lecture deals with **Slowly Changing Dimensions** and **Idempotency**.

> **Slowly changing dimensions** = An attribute that drifts over time

*Example:* Your favorite food

## Idempotency

You need to model slowly changing dimensions the right way because they impact idempotency.

> **Idempotent** = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.

> **Idempotent pipeline** = The ability of your data pipeline to produce the same results whether it's running in production or in backfill.

It is a very important property of pipelines for enforcing data quality.
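
To make the first definition concrete, here is a minimal sketch of an idempotent operation next to a non-idempotent one. The function names `normalize` and `append_suffix` are made up for illustration and do not come from the lecture.

```python
def normalize(text):
    """Idempotent: applying it a second time changes nothing."""
    return text.strip().lower()


def append_suffix(text):
    """Not idempotent: every application changes the value again."""
    return text + "!"


once = normalize("  Pizza ")
twice = normalize(once)

print(once == twice)                                               # True
print(append_suffix("a") == append_suffix(append_suffix("a")))     # False
```

An idempotent pipeline behaves like `normalize`: re-running it on its own output (or re-running it at all) leaves the result unchanged.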

If you can build **idempotent** pipelines you are going to be a much better Data Engineer.

**Your pipelines should produce the same result :**
- Regardless of the day you run them
- Regardless of how many times you run them
- Regardless of the hour you run them

> Data Engineers often get these three characteristics of a pipeline wrong.

**Why is it hard to troubleshoot a non-idempotent pipeline ?** It doesn't actually fail, it just silently produces incorrect data.

> Zach prefers to say "non-reproducible" data rather than "incorrect" data.

If your data isn't idempotent, downstream data won't be either (the problem propagates).

## What can make a pipeline not idempotent ?

**Using `INSERT INTO` without TRUNCATE**
- Creates duplication
- Your pipeline doesn't produce the same data regardless of how many times it's run
- As a Data Engineer, you should never use a bare `INSERT INTO`
- `MERGE` and `INSERT OVERWRITE` are better ways to go
- `MERGE` avoids duplication
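
The duplication problem can be reproduced in a few lines. This is a sketch using Python's built-in `sqlite3`, which has neither `TRUNCATE` nor `INSERT OVERWRITE`; a `DELETE` of the target partition followed by an `INSERT` in one transaction stands in for the overwrite. The table names and the `ds` partition column are assumptions made for the example.

```python
import sqlite3


def load_append(conn, ds, user_ids):
    """Non-idempotent: a plain INSERT INTO appends rows on every run."""
    conn.executemany(
        "INSERT INTO events_append (ds, user_id) VALUES (?, ?)",
        [(ds, user_id) for user_id in user_ids],
    )


def load_overwrite(conn, ds, user_ids):
    """Idempotent: clear the target partition, then insert, in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM events_overwrite WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO events_overwrite (ds, user_id) VALUES (?, ?)",
            [(ds, user_id) for user_id in user_ids],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_append (ds TEXT, user_id INTEGER)")
conn.execute("CREATE TABLE events_overwrite (ds TEXT, user_id INTEGER)")

# Run both loaders twice for the same day, as a retry or a backfill would.
for _ in range(2):
    load_append(conn, "2024-11-17", [1, 2, 3])
    load_overwrite(conn, "2024-11-17", [1, 2, 3])

append_count = conn.execute("SELECT COUNT(*) FROM events_append").fetchone()[0]
overwrite_count = conn.execute("SELECT COUNT(*) FROM events_overwrite").fetchone()[0]
print(append_count, overwrite_count)  # 6 3
```

The append table holds every row twice after the second run, while the overwrite table still holds exactly one copy of the day's data.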


**Using `start_date >` without a corresponding `end_date <`**
- As time passes, each new run of the pipeline pulls in data from a longer and longer time period.
- You need both a start date and an end date to process only a fixed "window"
- An unbounded range can also cause out-of-memory exceptions when you backfill
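
A small sketch of the difference, assuming an in-memory list of events with a hypothetical `event_date` field. The open-ended filter returns more rows when the job is re-run later, while the bounded window returns the same rows no matter when it runs.

```python
from datetime import date

EVENTS = [
    {"event_date": date(2024, 11, 15), "user_id": 1},
    {"event_date": date(2024, 11, 16), "user_id": 2},
    {"event_date": date(2024, 11, 17), "user_id": 3},
    {"event_date": date(2024, 11, 18), "user_id": 4},
]


def unbounded(events, start_date):
    """Open-ended filter: the result grows as later data arrives."""
    return [e for e in events if e["event_date"] > start_date]


def windowed(events, start_date, end_date):
    """Bounded window: start_date < event_date < end_date."""
    return [e for e in events if start_date < e["event_date"] < end_date]


start, end = date(2024, 11, 15), date(2024, 11, 18)

early = EVENTS[:3]  # data that existed when the job first ran
later = EVENTS      # data that exists when the job is re-run

unbounded_changed = unbounded(early, start) != unbounded(later, start)
windowed_stable = windowed(early, start, end) == windowed(later, start, end)
print(unbounded_changed, windowed_stable)  # True True
```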

**Not using a full set of partition sensors**
- Your pipeline might run with an incomplete set of inputs
- You have to check for the full set of inputs needed by the pipeline
- Avoid firing the pipeline too early in production
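
The idea behind a full set of sensors can be sketched without any orchestrator: before running, check that every required input has a partition for the day being processed. The table names in `REQUIRED_INPUTS` and the `available` mapping are hypothetical; a real sensor would query the metastore or the file system instead.

```python
from datetime import date

REQUIRED_INPUTS = ("events", "users", "devices")  # hypothetical upstream tables


def missing_partitions(available, ds, required=REQUIRED_INPUTS):
    """Return the required inputs whose partition for `ds` has not landed yet."""
    return [table for table in required if ds not in available.get(table, set())]


def run_if_ready(available, ds):
    """Only start the pipeline when every input partition is present."""
    missing = missing_partitions(available, ds)
    if missing:
        return f"waiting on {', '.join(missing)}"
    return "running"


ds = date(2024, 11, 17)
available = {"events": {ds}, "users": {ds}, "devices": set()}

status_before = run_if_ready(available, ds)  # "waiting on devices"
available["devices"].add(ds)                 # the late partition lands
status_after = run_if_ready(available, ds)   # "running"
```

Checking only `events` and `users` here would have fired the pipeline with one input missing, and the output would differ from a later backfill that sees all three.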

**Not using `depends_on_past` (Airflow term) for cumulative pipelines**
- Most pipelines can run in parallel
- Cumulative pipelines have to run sequentially
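
Why cumulative pipelines cannot run in parallel: each day's output is built from the previous day's output. The sketch below does not use Airflow; it only illustrates the ordering constraint that `depends_on_past` is meant to enforce. The `state` dictionary and the function name are assumptions for the example.

```python
def run_cumulative(state, ds, previous_ds, new_users):
    """Build the cumulative partition for `ds` from the previous day's partition."""
    if previous_ds is not None and previous_ds not in state:
        # Running out of order would silently produce an incomplete cumulation.
        raise RuntimeError(f"partition {previous_ds} must exist before {ds} runs")
    yesterday = state.get(previous_ds, set())
    state[ds] = yesterday | set(new_users)
    return state[ds]


state = {}
day1 = run_cumulative(state, "2024-11-16", None, ["alice"])
day2 = run_cumulative(state, "2024-11-17", "2024-11-16", ["bob"])
print(sorted(day2))  # ['alice', 'bob']
```

If the run for `2024-11-17` started before `2024-11-16` had finished, the guard raises instead of producing a partition that is missing yesterday's users.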

> The production and back-fill behavior of a pipeline has to be the same.

> **SCD** = Slowly Changing Dimensions

**Relying on the "latest" partition of an improperly modeled SCD table**
- Cumulative pipelines amplify this bug
- The only exception for using the "latest" partition is back-filling data
- The SCD table has to be properly modeled though

**Relying on the "latest" partition in general**

> You can't do back-filling without idempotency

> Unit tests don't guarantee pipeline idempotency.

## Should you model as Slowly Changing Dimensions ?

There is also a concept of **rapidly changing dimensions**.

> Example: The heart rate

**The creator of Airflow hates SCDs :**
- He defends the idea of *functional data engineering*
- The pipeline is a chain of pure functions
- Since storage is cheap, he prefers storing every version of the data, regardless of duplication.
- Daily dimension values over SCDs

### Three different ways to model your dimensions

#### Latest snapshot
You just store the current value.

**Weakness :** If you have an SCD and you only hold on to the latest value, the pipeline is not idempotent.

When you back-fill, old records get the latest version of the data, sometimes with a huge time gap.

That's where a daily snapshot is better for back-filling.
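
A sketch of the weakness, reusing the favorite food example. The snapshot dictionary, the user key, and the two `enrich_*` helpers are invented for illustration.

```python
from datetime import date

# Hypothetical daily snapshots of a dimension: {day: {user: favorite_food}}
DAILY_SNAPSHOTS = {
    date(2024, 1, 1): {"user_1": "pizza"},
    date(2024, 11, 17): {"user_1": "sushi"},
}
LATEST = DAILY_SNAPSHOTS[max(DAILY_SNAPSHOTS)]


def enrich_with_latest(event):
    """Back-fill using only the latest value: ignores when the event happened."""
    return {**event, "favorite_food": LATEST[event["user"]]}


def enrich_with_snapshot(event):
    """Back-fill using the snapshot of the event's own day."""
    return {**event, "favorite_food": DAILY_SNAPSHOTS[event["ds"]][event["user"]]}


old_event = {"user": "user_1", "ds": date(2024, 1, 1)}
print(enrich_with_latest(old_event)["favorite_food"])    # sushi
print(enrich_with_snapshot(old_event)["favorite_food"])  # pizza
```

The January event ends up tagged with a value that only became true in November, and the result of the back-fill depends on the day it is run.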

#### Daily / Monthly / Yearly snapshot

#### Slowly Changing Dimensions
A way to collapse daily snapshots based on whether or not the data changed day over day.

> Example : One row saying you had this age for this time frame (1 year) instead of 365 rows with the same age.

The slower the dimension changes, the better the compression.

> Saving 7 rows (a week) with compression is less interesting than saving 365 rows (a year).

Keep this in mind when choosing your modelling.
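
The collapse can be sketched as a single pass over the daily snapshots. This assumes the snapshots are sorted by day and uses tuples `(value, start_date, end_date)` as a stand-in for SCD rows; the function name is hypothetical.

```python
from datetime import date, timedelta


def collapse_snapshots(snapshots):
    """Collapse (day, value) pairs, sorted by day, into (value, start_date, end_date) rows."""
    rows = []
    for day, value in snapshots:
        last = rows[-1] if rows else None
        if last and last[0] == value and last[2] + timedelta(days=1) == day:
            # Same value on the next consecutive day: extend the current row.
            rows[-1] = (value, last[1], day)
        else:
            # Value changed (or there is a gap): open a new row.
            rows.append((value, day, day))
    return rows


start = date(2024, 1, 1)
# 366 daily snapshots (2024 is a leap year); the value changes once, after 200 days.
daily = [(start + timedelta(days=i), "pizza" if i < 200 else "sushi") for i in range(366)]

scd = collapse_snapshots(daily)
print(len(daily), len(scd))  # 366 2
```

With one change per year, 366 rows compress to 2. A dimension that changes every week would only compress about 7 to 1, which is the trade-off described above.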

## How can I model dimensions that change ?

There are three ways to model SCD :

1. Singular snapshot
2. Daily partitioned snapshots (Easy and intuitive way to go)
3. SCD types 1, 2, 3 (three variants)

> **Warning :** Never back-fill with only the latest snapshot

## The types of SCD
### Type 0
The dimension never changes.

### Type 1
- You only care about the latest value
- This breaks your pipeline's idempotency
- For OLTP it can be fine

### Type 2
You care about what the value was in a certain time frame.

> `start_date` and `end_date`

Current values usually have an `end_date` that is either `NULL` or far in the future.

**Benefits :**
- Only type of SCD that is purely idempotent.
- Full change history

**Drawbacks :**
- There are multiple rows for the same dimension
- You'll have to filter by date
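
Filtering by date on a Type 2 table looks roughly like this. The rows, the column names other than `start_date` and `end_date`, and the `value_as_of` helper are assumptions for the example; `None` plays the role of a `NULL` `end_date`.

```python
from datetime import date

# Hypothetical Type 2 rows: one row per validity range, end_date None = current value
SCD_ROWS = [
    {"user_id": 1, "favorite_food": "pizza",
     "start_date": date(2023, 1, 1), "end_date": date(2024, 6, 30)},
    {"user_id": 1, "favorite_food": "sushi",
     "start_date": date(2024, 7, 1), "end_date": None},
]


def value_as_of(rows, user_id, as_of):
    """Return the value that was valid for `user_id` on the day `as_of`."""
    for row in rows:
        if row["user_id"] != user_id:
            continue
        still_valid = row["end_date"] is None or as_of <= row["end_date"]
        if row["start_date"] <= as_of and still_valid:
            return row["favorite_food"]
    return None  # no row covers that date


print(value_as_of(SCD_ROWS, 1, date(2024, 1, 15)))   # pizza
print(value_as_of(SCD_ROWS, 1, date(2024, 11, 17)))  # sushi
```

Because the answer depends only on `as_of` and not on when the query runs, a back-fill for January returns the same value today as it would have in January.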


### Type 3
You only care about "original" and "current" values

**Benefits :**
- You only have 1 row per dimension
- No row filtering

**Drawbacks :**
- You lose the history in between "original" and "current".
- It may not store when the value changed unless you have a `current_change_date`
- It is not idempotent

> There are other types of SCD models that are less commonly used.

## Loading Type 2 SCDs

Two ways to load a table with SCDs :
1. Load the entire history in one query
   - Can be okay if the dataset is small.
2. Incrementally load the data after the previous SCD is generated
   - We want the production data to be incremental so we don't process the whole history each time it changes.
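
The incremental option can be sketched as a function that takes yesterday's SCD rows plus today's snapshot and returns today's SCD rows. This is one possible approach under stated assumptions, not the method shown in the lecture: rows are tuples `(key, value, start_date, end_date)`, an open row has `end_date` set to `None`, and the snapshot is a plain `{key: value}` dictionary.

```python
from datetime import date, timedelta


def apply_daily_snapshot(scd_rows, snapshot, ds):
    """Return the Type 2 rows after applying the snapshot {key: value} for day `ds`."""
    result = []
    still_open = set()
    for key, value, start, end in scd_rows:
        if end is not None:
            result.append((key, value, start, end))  # closed history: untouched
        elif snapshot.get(key) == value:
            result.append((key, value, start, None))  # unchanged: stays open
            still_open.add(key)
        else:
            # Changed or disappeared: close the row the day before `ds`.
            result.append((key, value, start, ds - timedelta(days=1)))
    for key, value in snapshot.items():
        if key not in still_open:
            result.append((key, value, ds, None))  # new key or new value: open a row
    return result


rows = [(1, "pizza", date(2024, 1, 1), None)]
day2 = apply_daily_snapshot(rows, {1: "pizza"}, date(2024, 1, 2))
day3 = apply_daily_snapshot(day2, {1: "sushi", 2: "ramen"}, date(2024, 1, 3))
rerun = apply_daily_snapshot(day3, {1: "sushi", 2: "ramen"}, date(2024, 1, 3))
print(rerun == day3)  # True
```

Only the open rows and today's snapshot are compared, so the full history is never reprocessed, and re-running the same day leaves the table unchanged.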

> Remember that not all of your pipelines have to be perfect

## Additional resources
> https://airbyte.com/data-engineering-resources/idempotency-in-data-pipelines
14 changes: 6 additions & 8 deletions hugo/public/categories/index.html
@@ -1,11 +1,11 @@
<!DOCTYPE html>
<html><head lang="en">
<html><head lang="en"><script src="/blog/livereload.js?mindelay=10&amp;v=2&amp;port=1313&amp;path=blog/livereload" data-no-instant defer></script>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Categories - Gabriel Study Blog</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="" />
<meta property="og:image" content=""/>
<link rel="alternate" type="application/rss+xml" href="https://glopez.github.io/categories/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="https://glopez.github.io/categories/">
<link rel="alternate" type="application/rss+xml" href="http://localhost:1313/blog/categories/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="http://localhost:1313/blog/categories/">
<meta property="og:site_name" content="Gabriel Study Blog">
<meta property="og:title" content="Categories">
<meta property="og:locale" content="en">
@@ -15,11 +15,11 @@
<meta name="twitter:title" content="Categories">


<link href="https://glopez.github.io/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
<link href="http://localhost:1313/blog/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">



<link rel="stylesheet" type="text/css" media="screen" href="https://glopez.github.io/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
<link rel="stylesheet" type="text/css" media="screen" href="http://localhost:1313/blog/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />



@@ -32,7 +32,7 @@
<body>
<div class="content"><header>
<div class="main">
<a href="https://glopez.github.io/">Gabriel Study Blog</a>
<a href="http://localhost:1313/">Gabriel Study Blog</a>
</div>
<nav>

@@ -61,8 +61,6 @@ <h1 class="page-title">All tags</h1>
href="https://github.com/athul/archie">Archie Theme</a> | Built with <a href="https://gohugo.io">Hugo</a>
</div>
</footer>


</div>
</body>
</html>
4 changes: 2 additions & 2 deletions hugo/public/categories/index.xml
@@ -2,10 +2,10 @@
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Categories on Gabriel Study Blog</title>
<link>https://glopez.github.io/categories/</link>
<link>http://localhost:1313/blog/categories/</link>
<description>Recent content in Categories on Gabriel Study Blog</description>
<generator>Hugo</generator>
<language>en</language>
<atom:link href="https://glopez.github.io/categories/index.xml" rel="self" type="application/rss+xml" />
<atom:link href="http://localhost:1313/blog/categories/index.xml" rel="self" type="application/rss+xml" />
</channel>
</rss>
46 changes: 32 additions & 14 deletions hugo/public/index.html
@@ -1,12 +1,12 @@
<!DOCTYPE html>
<html>
<head lang="en">
<head lang="en"><script src="/blog/livereload.js?mindelay=10&amp;v=2&amp;port=1313&amp;path=blog/livereload" data-no-instant defer></script>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Gabriel Study Blog | Home </title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="" />
<meta property="og:image" content=""/>
<link rel="alternate" type="application/rss+xml" href="https://glopez.github.io/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="https://glopez.github.io/">
<link rel="alternate" type="application/rss+xml" href="http://localhost:1313/blog/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="http://localhost:1313/blog/">
<meta property="og:site_name" content="Gabriel Study Blog">
<meta property="og:title" content="Gabriel Study Blog">
<meta property="og:locale" content="en">
@@ -16,11 +16,11 @@
<meta name="twitter:title" content="Gabriel Study Blog">


<link href="https://glopez.github.io/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
<link href="http://localhost:1313/blog/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">



<link rel="stylesheet" type="text/css" media="screen" href="https://glopez.github.io/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
<link rel="stylesheet" type="text/css" media="screen" href="http://localhost:1313/blog/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />



@@ -35,7 +35,7 @@
<div class="content">
<header>
<div class="main">
<a href="https://glopez.github.io/">Gabriel Study Blog</a>
<a href="http://localhost:1313/">Gabriel Study Blog</a>
</div>
<nav>

@@ -50,14 +50,34 @@


<section class="list-item">
<h1 class="title"><a href="/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">(DataExpert.io) Bootcamp - Day 1 - Lecture</a></h1>
<time>Nov 15, 2024</time>
<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/">(DataExpert.io) Bootcamp - Day 2 - Lecture</a></h1>
<time>Nov 17, 2024</time>
<br><div class="description">

<blockquote>
<p>Today lecture with deal with <strong>complex data types</strong> and <strong>cumulation</strong></p>
<p>Today&rsquo;s lecture deals with <strong>Slowly Changing Dimensions</strong> and <strong>Idempotency</strong>.</p>
<blockquote>
<p><strong>Slowly changing dimensions</strong> = An attribute that drifts over time</p>
</blockquote>
<p><em>Example:</em> Your favorite food</p>
<h2 id="idempotency">Idempotency</h2>
<p>You need to model slowly changing dimensions the right way because they impact idempotency.</p>
<blockquote>
<p><strong>Idempotent</strong> = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.</p>
</blockquote>
<h3 id="what-is-a-dimension-">What is a dimension ?</h3>
<blockquote>
<p><strong>Idempotent pipeline</strong> = The ability for your data pipeline to produce the same results whether it&rsquo;s running in production or in backfill.</p>&hellip;

</div>
<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/">Read more ⟶</a>
</section>

<section class="list-item">
<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">(DataExpert.io) Bootcamp - Day 1 - Lecture</a></h1>
<time>Nov 15, 2024</time>
<br><div class="description">

<p>Today lecture will deal with <strong>complex data types</strong> and <strong>cumulation</strong></p>
<h2 id="what-is-a-dimension-">What is a dimension ?</h2>
<p>Dimensions are the attributes of an entity</p>
<ul>
<li>Some dimensions are identifiers</li>
@@ -85,7 +105,7 @@ <h2 id="knowing-your-consumer">Knowing your consumer</h2>
<p>Who is going to consume the data ?</p>&hellip;

</div>
<a class="readmore" href="/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">Read more ⟶</a>
<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">Read more ⟶</a>
</section>


@@ -100,8 +120,6 @@ <h2 id="knowing-your-consumer">Knowing your consumer</h2>
</div>
</footer>



</div>

</body>
19 changes: 13 additions & 6 deletions hugo/public/index.xml
@@ -2,18 +2,25 @@
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Gabriel Study Blog</title>
<link>https://glopez.github.io/</link>
<link>http://localhost:1313/blog/</link>
<description>Recent content on Gabriel Study Blog</description>
<generator>Hugo</generator>
<language>en</language>
<lastBuildDate>Fri, 15 Nov 2024 17:46:24 +0100</lastBuildDate>
<atom:link href="https://glopez.github.io/index.xml" rel="self" type="application/rss+xml" />
<lastBuildDate>Sun, 17 Nov 2024 19:21:32 +0100</lastBuildDate>
<atom:link href="http://localhost:1313/blog/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>(DataExpert.io) Bootcamp - Day 2 - Lecture</title>
<link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/</link>
<pubDate>Sun, 17 Nov 2024 19:21:32 +0100</pubDate>
<guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/</guid>
<description>&lt;p&gt;Today&amp;rsquo;s lecture deals with &lt;strong&gt;Slowly Changing Dimensions&lt;/strong&gt; and &lt;strong&gt;Idempotency&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Slowly changing dimensions&lt;/strong&gt; = An attribute that drifts over time&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Your favorite food&lt;/p&gt;&#xA;&lt;h2 id=&#34;idempotency&#34;&gt;Idempotency&lt;/h2&gt;&#xA;&lt;p&gt;You need to model slowly dimensions the right way because they impact idempotency.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Idempotent&lt;/strong&gt; = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Idempotent pipeline&lt;/strong&gt; = The ability for your data pipeline to produce the same results whether it&amp;rsquo;s running in production or in backfill.&lt;/p&gt;</description>
</item>
<item>
<title>(DataExpert.io) Bootcamp - Day 1 - Lecture</title>
<link>https://glopez.github.io/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</link>
<link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</link>
<pubDate>Fri, 15 Nov 2024 17:46:24 +0100</pubDate>
<guid>https://glopez.github.io/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</guid>
<description>&lt;blockquote&gt;&#xA;&lt;p&gt;Today lecture with deal with &lt;strong&gt;complex data types&lt;/strong&gt; and &lt;strong&gt;cumulation&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;h3 id=&#34;what-is-a-dimension-&#34;&gt;What is a dimension ?&lt;/h3&gt;&#xA;&lt;p&gt;Dimensions are the attributes of an entity&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Some dimensions are identifiers&lt;/li&gt;&#xA;&lt;li&gt;Some dimensions are just attributes&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;Dimensions come in two flavors (generally) :&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Slowly changing (time dependent)&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Makes things harder to model&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;Fixed (doesn&amp;rsquo;t change over time)&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;topics-of-the-day-index&#34;&gt;Topics of the day (index)&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Knowing your data consumer&lt;/li&gt;&#xA;&lt;li&gt;OLTP vs OLAP modelling&lt;/li&gt;&#xA;&lt;li&gt;Cumulative table design&lt;/li&gt;&#xA;&lt;li&gt;The compactness vs usability tradeoff&lt;/li&gt;&#xA;&lt;li&gt;Temporal cardinality explosion&lt;/li&gt;&#xA;&lt;li&gt;Run-length encoding compression gotchas&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;knowing-your-consumer&#34;&gt;Knowing your consumer&lt;/h2&gt;&#xA;&lt;p&gt;Who is going to consume the data ?&lt;/p&gt;</description>
<guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</guid>
<description>&lt;p&gt;Today lecture will deal with &lt;strong&gt;complex data types&lt;/strong&gt; and &lt;strong&gt;cumulation&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;what-is-a-dimension-&#34;&gt;What is a dimension ?&lt;/h2&gt;&#xA;&lt;p&gt;Dimensions are the attributes of an entity&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Some dimensions are identifiers&lt;/li&gt;&#xA;&lt;li&gt;Some dimensions are just attributes&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;Dimensions come in two flavors (generally) :&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Slowly changing (time dependent)&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Makes things harder to model&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;Fixed (doesn&amp;rsquo;t change over time)&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;topics-of-the-day-index&#34;&gt;Topics of the day (index)&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Knowing your data consumer&lt;/li&gt;&#xA;&lt;li&gt;OLTP vs OLAP modelling&lt;/li&gt;&#xA;&lt;li&gt;Cumulative table design&lt;/li&gt;&#xA;&lt;li&gt;The compactness vs usability tradeoff&lt;/li&gt;&#xA;&lt;li&gt;Temporal cardinality explosion&lt;/li&gt;&#xA;&lt;li&gt;Run-length encoding compression gotchas&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;knowing-your-consumer&#34;&gt;Knowing your consumer&lt;/h2&gt;&#xA;&lt;p&gt;Who is going to consume the data ?&lt;/p&gt;</description>
</item>
</channel>
</rss>
6 changes: 3 additions & 3 deletions hugo/public/page/1/index.html
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
<!DOCTYPE html>
<html lang="en">
<head>
<title>https://glopez.github.io/</title>
<link rel="canonical" href="https://glopez.github.io/">
<title>http://localhost:1313/blog/</title>
<link rel="canonical" href="http://localhost:1313/blog/">
<meta name="robots" content="noindex">
<meta charset="utf-8">
<meta http-equiv="refresh" content="0; url=https://glopez.github.io/">
<meta http-equiv="refresh" content="0; url=http://localhost:1313/blog/">
</head>
</html>