✨ feat( DataExpert.io ): Day 2 lecture

glopez-dev · Nov 22, 2024 · 66a8949 · 66a8949
1 parent c8ea1a7
commit 66a8949
Showing 14 changed files with 565 additions and 82 deletions.
diff --git a/...engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture.md b/...engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture.md
@@ -5,9 +5,9 @@ author: Gabriel LOPEZ
 title: (DataExpert.io) Bootcamp - Day 1 - Lecture 
 ---
 
-> Today lecture with deal with **complex data types** and **cumulation**
+Today's lecture will deal with **complex data types** and **cumulation**
 
-### What is a dimension ?
+## What is a dimension ?
 
 Dimensions are the attributes of an entity
 

diff --git a/...engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture.md b/...engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture.md
@@ -0,0 +1,173 @@
+---
+date: '2024-11-17T19:21:32+01:00'
+draft: false 
+title: (DataExpert.io) Bootcamp - Day 2 - Lecture
+---
+Today's lecture deals with **Slowly Changing Dimensions** and **Idempotency**.
+
+> **Slowly changing dimensions** = An attribute that drifts over time
+
+*Example:* Your favorite food 
+
+## Idempotency
+
+You need to model slowly changing dimensions the right way because they impact idempotency.
+
+> **Idempotent** = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.
+
+> **Idempotent pipeline** = The ability for your data pipeline to produce the same results whether it's running in production or in backfill.
+
+It's a very important property of pipelines for enforcing data quality.
+
+If you can build **idempotent** pipelines you are going to be a much better Data Engineer.
+
+**Your pipelines should produce the same result :**
+- Regardless of the day you run it
+- Regardless of how many times you run it
+- Regardless of the hour you run it
+
+> Data Engineers often get these three characteristics of a pipeline wrong.
+
+**Why is it hard to troubleshoot a non-idempotent pipeline ?** It doesn't actually fail; it just silently produces incorrect data.
+
+> Zach prefers to say "non-reproducible" data rather than "incorrect" data.
+
+If your data isn't idempotent, downstream data won't be either (the problem propagates).
+
+## What can make a pipeline not idempotent ?
+
+**Using `INSERT INTO` without `TRUNCATE`**
+- Creates duplicates
+- Your pipeline doesn't produce the same data regardless of how many times it's run
+- As a Data Engineer, you should never use `INSERT INTO`
+-  `MERGE` and `INSERT OVERWRITE` are better ways to go
+	- `MERGE` avoids duplication
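+
+A minimal sketch of the difference, assuming a hypothetical `events` table partitioned by a `ds` date column. SQLite is used only because it ships with Python; it has no `INSERT OVERWRITE`, so the delete-then-insert in a single transaction below stands in for it:
+
+```python
+import sqlite3
+
+
+def load_append(conn, ds, user_ids):
+    """Non-idempotent: every rerun appends the same rows again."""
+    conn.executemany(
+        "INSERT INTO events (ds, user_id) VALUES (?, ?)",
+        [(ds, user_id) for user_id in user_ids],
+    )
+
+
+def load_overwrite(conn, ds, user_ids):
+    """Idempotent: clear the target partition, then insert."""
+    with conn:  # one transaction, committed on success, rolled back on error
+        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
+        conn.executemany(
+            "INSERT INTO events (ds, user_id) VALUES (?, ?)",
+            [(ds, user_id) for user_id in user_ids],
+        )
+
+
+def run_twice(loader):
+    """Run the same load twice and return the resulting row count."""
+    conn = sqlite3.connect(":memory:")
+    conn.execute("CREATE TABLE events (ds TEXT, user_id INTEGER)")
+    for _ in range(2):
+        loader(conn, "2024-11-17", [1, 2, 3])
+    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
+```
+
+Calling `run_twice(load_append)` leaves 6 rows, while `run_twice(load_overwrite)` leaves 3 however many times the load is repeated.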
+
+
+**Using `start_date >` without a corresponding `end_date <`**
+- As time passes, each new run of the pipeline pulls in data from a longer and longer period.
+- You need to use start and end dates to process only a "window"
+- This can also cause out of memory exceptions when you backfill
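+
+To illustrate, here is a small sketch that assumes events are plain dictionaries with a hypothetical `event_date` field and that the window is half-open (`start_date` included, `end_date` excluded):
+
+```python
+from datetime import date
+
+
+def select_window(events, start_date, end_date):
+    """Bounded: keep events with start_date <= event_date < end_date."""
+    return [e for e in events if start_date <= e["event_date"] < end_date]
+
+
+def select_unbounded(events, start_date):
+    """Unbounded: the result keeps growing as new events arrive."""
+    return [e for e in events if e["event_date"] >= start_date]
+
+
+events = [{"event_date": date(2024, 11, d)} for d in (15, 16, 17)]
+window = (date(2024, 11, 16), date(2024, 11, 17))
+
+before = select_window(events, *window)
+events.append({"event_date": date(2024, 11, 18)})  # data that arrives later
+after = select_window(events, *window)
+```
+
+The bounded selection returns the same single event before and after the late data arrives, whereas `select_unbounded` returns more rows on every later run.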
+
+**Not using a full set of partition sensors**
+- Your pipeline might run with an incomplete set of inputs
+- You have to check for the full set of inputs needed by the pipeline
+- Avoid firing the pipeline too early in production
+
+**Not using `depends_on_past` (Airflow term) for cumulative pipelines**
+- Most pipelines can run in parallel
+- Cumulative pipelines have to run sequentially
+
+> The production and back-fill behavior of a pipeline have to be the same.
+
+> **SCD** = Slowly Changing Dimensions
+
+**Relying on the "latest" partition of a not properly modeled SCD table**
+- Cumulative pipelines amplify this bug
+- The only exception for using the "latest" partition is back-filling data
+- The SCD table has to be properly modeled though
+
+**Relying on the "latest" partition in general**
+
+> You can't do back-filling without idempotency
+
+> Unit tests don't guarantee pipeline idempotency.
+
+## Should you model as Slowly Changing Dimensions ?
+
+There is also a concept of **rapidly changing dimensions**.
+
+> Example: The heart rate
+
+**The creator of Airflow hates SCDs :**
+- He defends the idea of *functional data engineering*
+	- The pipeline is a chain of pure functions
+- Since storage is cheap, he prefers storing every version of the data, duplication included.
+	- Daily dimension values over SCDs
+
+### Three different ways to model your dimensions
+
+#### Latest snapshot
+You just store the current value.
+
+**Weakness :** If you have an SCD and only hold on to the latest value, the pipeline is not idempotent.
+
+You end up back-filling old records with the latest version of the data, sometimes across a huge time gap.
+
+That's where a daily snapshot is better for back-filling.
+
+#### Daily / Monthly / Yearly snapshot
+
+#### Slowly Changing Dimensions
+A way to collapse daily snapshots based on whether or not the data changed day over day.
+
+> Example : One row saying you had this age for this time frame (1 year) instead of 365 rows with the same age.
+
+The slower the dimension changes, the better the compression.
+
+> Saving 7 rows (a week) with compression is less interesting than saving 365 rows (a year).
+
+Keep this in mind when choosing your modelling.
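+
+The collapse described above can be sketched as follows. This is only an illustration: it assumes the daily snapshots for one entity arrive as `(day, value)` tuples sorted by day, and the function name is hypothetical:
+
+```python
+from datetime import date, timedelta
+
+
+def collapse_snapshots(snapshots):
+    """Collapse sorted (day, value) snapshots into (value, start, end) ranges.
+
+    A range is extended only when the value is unchanged and the day
+    directly follows the previous one.
+    """
+    ranges = []
+    for day, value in snapshots:
+        if ranges:
+            last_value, start, end = ranges[-1]
+            if last_value == value and end == day - timedelta(days=1):
+                ranges[-1] = (value, start, day)
+                continue
+        ranges.append((value, day, day))
+    return ranges
+
+
+first_day = date(2024, 1, 1)
+one_year = [(first_day + timedelta(days=i), 30) for i in range(365)]
+collapsed = collapse_snapshots(one_year)
+```
+
+Here 365 daily rows with the same age collapse into a single range, which is the compression the slow rate of change buys you.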
+
+## How can I model dimensions that change ?
+
+There are three ways to model dimensions that change :
+
+1. Singular snapshot
+2. Daily partitioned snapshots (Easy and intuitive way to go)
+3. SCD types 1, 2, 3 (three sub-variants)
+
+> **Warning :** Never back-fill with only the latest snapshot
+
+## The types of SCD
+### Type 0
+The dimension never changes.
+
+### Type 1
+- You only care about the latest value
+- This breaks your pipeline's idempotency
+- For OLTP it can be fine
+
+### Type 2
+You care about what the value was in a certain time frame.
+
+> `start_date` and `end_date`
+
+Current values usually have an `end_date` that is either `NULL` or far in the future.
+
+**Benefits :**
+- Only type of SCD that is purely idempotent.
+- Full change history
+
+**Drawback :**
+- There are multiple rows for the same dimension
+	- You'll have to filter by date
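+
+Filtering by date could look like the sketch below. It assumes rows are dictionaries with hypothetical `key` and `value` fields, that `end_date` is inclusive, and that the current row has `end_date` set to `None` (the `NULL` convention mentioned above):
+
+```python
+from datetime import date
+
+
+def value_as_of(scd_rows, key, as_of):
+    """Return the value that was valid for `key` on `as_of`, or None."""
+    for row in scd_rows:
+        if row["key"] != key:
+            continue
+        end_date = row["end_date"]
+        # An open-ended row (end_date is None) is valid from start_date onward.
+        if row["start_date"] <= as_of and (end_date is None or as_of <= end_date):
+            return row["value"]
+    return None
+```
+
+Because the answer depends only on the `as_of` date and not on when the query runs, a back-fill for a past date sees the same value a production run saw on that date.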
+
+
+### Type 3 
+You only care about "original" and "current" values
+
+**Benefits :**
+- You only have 1 row per dimension
+- No row filtering
+
+**Drawbacks :**
+- You lose the history in between "original" and "current".
+- It may not store when the value changed unless you have a `current_change_date`
+- It is not idempotent
+
+> There are other types of SCD models that are less used.
+
+
+## Loading Type 2 SCDs
+
+Two ways to load a table with SCDs :
+1. Load the entire history in one query
+	- Can be okay if the dataset is small.
+2. Incrementally load the data after the previous SCD is generated
+	- We want the production load to be incremental so we don't reprocess the whole history each time it changes.
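+
+A rough sketch of the incremental option, not the lecture's own implementation. It assumes SCD rows are `(key, value, start_date, end_date)` tuples with `end_date` set to `None` for the current row, and that today's snapshot is a plain `key -> value` dictionary. Keys missing from the snapshot are left open here, which is an arbitrary choice:
+
+```python
+from datetime import date, timedelta
+
+
+def load_incremental(scd_rows, snapshot, today):
+    """Fold today's snapshot into the existing SCD rows and return the new rows."""
+    result = []
+    still_open = set()
+    for key, value, start, end in scd_rows:
+        if end is None and key in snapshot and snapshot[key] != value:
+            # Value changed: close the open row the day before the change.
+            result.append((key, value, start, today - timedelta(days=1)))
+        else:
+            result.append((key, value, start, end))
+            if end is None:
+                still_open.add(key)
+    for key, value in snapshot.items():
+        if key not in still_open:
+            # Brand-new key, or a key whose previous row was just closed.
+            result.append((key, value, today, None))
+    return result
+```
+
+Only the open rows and today's snapshot matter, so the history is never recomputed, and running the load again for the same day with the same snapshot returns the same rows.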
+
+> Remember that not all of your pipelines have to be perfect.
+
+## Additional resources
+> https://airbyte.com/data-engineering-resources/idempotency-in-data-pipelines 
diff --git a/hugo/public/categories/index.html b/hugo/public/categories/index.html
@@ -1,11 +1,11 @@
 <!DOCTYPE html>
-<html><head lang="en">
+<html><head lang="en"><script src="/blog/livereload.js?mindelay=10&amp;v=2&amp;port=1313&amp;path=blog/livereload" data-no-instant defer></script>
 	<meta charset="utf-8" />
 	<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Categories - Gabriel Study Blog</title><meta name="viewport" content="width=device-width, initial-scale=1">
 	<meta name="description" content="" />
 	<meta property="og:image" content=""/>
-	<link rel="alternate" type="application/rss+xml" href="https://glopez.github.io/categories/index.xml" title="Gabriel Study Blog" />
-	<meta property="og:url" content="https://glopez.github.io/categories/">
+	<link rel="alternate" type="application/rss+xml" href="http://localhost:1313/blog/categories/index.xml" title="Gabriel Study Blog" />
+	<meta property="og:url" content="http://localhost:1313/blog/categories/">
   <meta property="og:site_name" content="Gabriel Study Blog">
   <meta property="og:title" content="Categories">
   <meta property="og:locale" content="en">
@@ -15,11 +15,11 @@
   <meta name="twitter:title" content="Categories">
 
 
-        <link href="https://glopez.github.io/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
+        <link href="http://localhost:1313/blog/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
 
 
 
-	<link rel="stylesheet" type="text/css" media="screen" href="https://glopez.github.io/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
+	<link rel="stylesheet" type="text/css" media="screen" href="http://localhost:1313/blog/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
 
 
 
@@ -32,7 +32,7 @@
 <body>
         <div class="content"><header>
 	<div class="main">
-		<a href="https://glopez.github.io/">Gabriel Study Blog</a>
+		<a href="http://localhost:1313/">Gabriel Study Blog</a>
 	</div>
 	<nav>
 
@@ -61,8 +61,6 @@ <h1 class="page-title">All tags</h1>
       href="https://github.com/athul/archie">Archie Theme</a> | Built with <a href="https://gohugo.io">Hugo</a>
   </div>
 </footer>
-
-
 </div>
     </body>
 </html>
diff --git a/hugo/public/categories/index.xml b/hugo/public/categories/index.xml
@@ -2,10 +2,10 @@
 <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
   <channel>
     <title>Categories on Gabriel Study Blog</title>
-    <link>https://glopez.github.io/categories/</link>
+    <link>http://localhost:1313/blog/categories/</link>
     <description>Recent content in Categories on Gabriel Study Blog</description>
     <generator>Hugo</generator>
     <language>en</language>
-    <atom:link href="https://glopez.github.io/categories/index.xml" rel="self" type="application/rss+xml" />
+    <atom:link href="http://localhost:1313/blog/categories/index.xml" rel="self" type="application/rss+xml" />
   </channel>
 </rss>
diff --git a/hugo/public/index.html b/hugo/public/index.html
@@ -1,12 +1,12 @@
 <!DOCTYPE html>
 <html>
-	<head lang="en">
+	<head lang="en"><script src="/blog/livereload.js?mindelay=10&amp;v=2&amp;port=1313&amp;path=blog/livereload" data-no-instant defer></script>
 	<meta charset="utf-8" />
 	<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Gabriel Study Blog | Home </title><meta name="viewport" content="width=device-width, initial-scale=1">
 	<meta name="description" content="" />
 	<meta property="og:image" content=""/>
-	<link rel="alternate" type="application/rss+xml" href="https://glopez.github.io/index.xml" title="Gabriel Study Blog" />
-	<meta property="og:url" content="https://glopez.github.io/">
+	<link rel="alternate" type="application/rss+xml" href="http://localhost:1313/blog/index.xml" title="Gabriel Study Blog" />
+	<meta property="og:url" content="http://localhost:1313/blog/">
   <meta property="og:site_name" content="Gabriel Study Blog">
   <meta property="og:title" content="Gabriel Study Blog">
   <meta property="og:locale" content="en">
@@ -16,11 +16,11 @@
   <meta name="twitter:title" content="Gabriel Study Blog">
 
 
-        <link href="https://glopez.github.io/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
+        <link href="http://localhost:1313/blog/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
 
 
 
-	<link rel="stylesheet" type="text/css" media="screen" href="https://glopez.github.io/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
+	<link rel="stylesheet" type="text/css" media="screen" href="http://localhost:1313/blog/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
 
 
 
@@ -35,7 +35,7 @@
 		<div class="content">
 			<header>
 	<div class="main">
-		<a href="https://glopez.github.io/">Gabriel Study Blog</a>
+		<a href="http://localhost:1313/">Gabriel Study Blog</a>
 	</div>
 	<nav>
 
@@ -50,14 +50,34 @@
 
 
 				<section class="list-item">
-					<h1 class="title"><a href="/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">(DataExpert.io) Bootcamp - Day 1 - Lecture</a></h1>
-					<time>Nov 15, 2024</time>
+					<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/">(DataExpert.io) Bootcamp - Day 2 - Lecture</a></h1>
+					<time>Nov 17, 2024</time>
 					<br><div class="description">
 
-	<blockquote>
-<p>Today lecture with deal with <strong>complex data types</strong> and <strong>cumulation</strong></p>
+	<p>Today&rsquo;s lecture deals with <strong>Slowly Changing Dimensions</strong> and <strong>Idempotency</strong>.</p>
+<blockquote>
+<p><strong>Slowly changing dimensions</strong> = An attribute that drifts over time</p>
+</blockquote>
+<p><em>Example:</em> Your favorite food</p>
+<h2 id="idempotency">Idempotency</h2>
+<p>You need to model slowly dimensions the right way because they impact idempotency.</p>
+<blockquote>
+<p><strong>Idempotent</strong> = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.</p>
 </blockquote>
-<h3 id="what-is-a-dimension-">What is a dimension ?</h3>
+<blockquote>
+<p><strong>Idempotent pipeline</strong> = The ability for your data pipeline to produce the same results whether it&rsquo;s running in production or in backfill.</p>&hellip;
+
+</div>
+					<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/">Read more ⟶</a>
+				</section>
+
+				<section class="list-item">
+					<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">(DataExpert.io) Bootcamp - Day 1 - Lecture</a></h1>
+					<time>Nov 15, 2024</time>
+					<br><div class="description">
+
+	<p>Today lecture will deal with <strong>complex data types</strong> and <strong>cumulation</strong></p>
+<h2 id="what-is-a-dimension-">What is a dimension ?</h2>
 <p>Dimensions are the attributes of an entity</p>
 <ul>
 <li>Some dimensions are identifiers</li>
@@ -85,7 +105,7 @@ <h2 id="knowing-your-consumer">Knowing your consumer</h2>
 <p>Who is going to consume the data ?</p>&hellip;
 
 </div>
-					<a class="readmore" href="/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">Read more ⟶</a>
+					<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">Read more ⟶</a>
 				</section>
 
 
@@ -100,8 +120,6 @@ <h2 id="knowing-your-consumer">Knowing your consumer</h2>
   </div>
 </footer>
 
-
-
 		</div>
 
 	</body>

diff --git a/hugo/public/index.xml b/hugo/public/index.xml
@@ -2,18 +2,25 @@
 <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
   <channel>
     <title>Gabriel Study Blog</title>
-    <link>https://glopez.github.io/</link>
+    <link>http://localhost:1313/blog/</link>
     <description>Recent content on Gabriel Study Blog</description>
     <generator>Hugo</generator>
     <language>en</language>
-    <lastBuildDate>Fri, 15 Nov 2024 17:46:24 +0100</lastBuildDate>
-    <atom:link href="https://glopez.github.io/index.xml" rel="self" type="application/rss+xml" />
+    <lastBuildDate>Sun, 17 Nov 2024 19:21:32 +0100</lastBuildDate>
+    <atom:link href="http://localhost:1313/blog/index.xml" rel="self" type="application/rss+xml" />
+    <item>
+      <title>(DataExpert.io) Bootcamp - Day 2 - Lecture</title>
+      <link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/</link>
+      <pubDate>Sun, 17 Nov 2024 19:21:32 +0100</pubDate>
+      <guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/</guid>
+      <description>&lt;p&gt;Today&amp;rsquo;s lecture deals with &lt;strong&gt;Slowly Changing Dimensions&lt;/strong&gt; and &lt;strong&gt;Idempotency&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Slowly changing dimensions&lt;/strong&gt; = An attribute that drifts over time&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Your favorite food&lt;/p&gt;&#xA;&lt;h2 id=&#34;idempotency&#34;&gt;Idempotency&lt;/h2&gt;&#xA;&lt;p&gt;You need to model slowly dimensions the right way because they impact idempotency.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Idempotent&lt;/strong&gt; = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Idempotent pipeline&lt;/strong&gt; = The ability for your data pipeline to produce the same results whether it&amp;rsquo;s running in production or in backfill.&lt;/p&gt;</description>
+    </item>
     <item>
       <title>(DataExpert.io) Bootcamp - Day 1 - Lecture</title>
-      <link>https://glopez.github.io/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</link>
+      <link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</link>
       <pubDate>Fri, 15 Nov 2024 17:46:24 +0100</pubDate>
-      <guid>https://glopez.github.io/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</guid>
-      <description>&lt;blockquote&gt;&#xA;&lt;p&gt;Today lecture with deal with &lt;strong&gt;complex data types&lt;/strong&gt; and &lt;strong&gt;cumulation&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;h3 id=&#34;what-is-a-dimension-&#34;&gt;What is a dimension ?&lt;/h3&gt;&#xA;&lt;p&gt;Dimensions are the attributes of an entity&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Some dimensions are identifiers&lt;/li&gt;&#xA;&lt;li&gt;Some dimensions are just attributes&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;Dimensions come in two flavors (generally) :&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Slowly changing (time dependent)&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Makes things harder to model&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;Fixed (doesn&amp;rsquo;t change over time)&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;topics-of-the-day-index&#34;&gt;Topics of the day (index)&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Knowing your data consumer&lt;/li&gt;&#xA;&lt;li&gt;OLTP vs OLAP modelling&lt;/li&gt;&#xA;&lt;li&gt;Cumulative table design&lt;/li&gt;&#xA;&lt;li&gt;The compactness vs usability tradeoff&lt;/li&gt;&#xA;&lt;li&gt;Temporal cardinality explosion&lt;/li&gt;&#xA;&lt;li&gt;Run-length encoding compression gotchas&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;knowing-your-consumer&#34;&gt;Knowing your consumer&lt;/h2&gt;&#xA;&lt;p&gt;Who is going to consume the data ?&lt;/p&gt;</description>
+      <guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</guid>
+      <description>&lt;p&gt;Today lecture will deal with &lt;strong&gt;complex data types&lt;/strong&gt; and &lt;strong&gt;cumulation&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;what-is-a-dimension-&#34;&gt;What is a dimension ?&lt;/h2&gt;&#xA;&lt;p&gt;Dimensions are the attributes of an entity&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Some dimensions are identifiers&lt;/li&gt;&#xA;&lt;li&gt;Some dimensions are just attributes&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;Dimensions come in two flavors (generally) :&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Slowly changing (time dependent)&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Makes things harder to model&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;Fixed (doesn&amp;rsquo;t change over time)&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;topics-of-the-day-index&#34;&gt;Topics of the day (index)&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Knowing your data consumer&lt;/li&gt;&#xA;&lt;li&gt;OLTP vs OLAP modelling&lt;/li&gt;&#xA;&lt;li&gt;Cumulative table design&lt;/li&gt;&#xA;&lt;li&gt;The compactness vs usability tradeoff&lt;/li&gt;&#xA;&lt;li&gt;Temporal cardinality explosion&lt;/li&gt;&#xA;&lt;li&gt;Run-length encoding compression gotchas&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;knowing-your-consumer&#34;&gt;Knowing your consumer&lt;/h2&gt;&#xA;&lt;p&gt;Who is going to consume the data ?&lt;/p&gt;</description>
     </item>
   </channel>
 </rss>
diff --git a/hugo/public/page/1/index.html b/hugo/public/page/1/index.html
@@ -1,10 +1,10 @@
 <!DOCTYPE html>
 <html lang="en">
   <head>
-    <title>https://glopez.github.io/</title>
-    <link rel="canonical" href="https://glopez.github.io/">
+    <title>http://localhost:1313/blog/</title>
+    <link rel="canonical" href="http://localhost:1313/blog/">
     <meta name="robots" content="noindex">
     <meta charset="utf-8">
-    <meta http-equiv="refresh" content="0; url=https://glopez.github.io/">
+    <meta http-equiv="refresh" content="0; url=http://localhost:1313/blog/">
   </head>
 </html>