---
date: 2024-11-15T17:46:24+01:00
draft: false
author: Gabriel LOPEZ
title: (DataExpert.io) Bootcamp - Day 1 - Lecture
---

> Today's lecture deals with **complex data types** and **cumulation**

### What is a dimension?

Dimensions are the attributes of an entity.

- Some dimensions are identifiers
- Some dimensions are just attributes

Dimensions come in two flavors (generally):
- Slowly changing (time dependent)
  - Makes things harder to model
- Fixed (doesn't change over time)

## Topics of the day (index)
- Knowing your data consumer
- OLTP vs OLAP modelling
- Cumulative table design
- The compactness vs usability tradeoff
- Temporal cardinality explosion
- Run-length encoding compression gotchas

## Knowing your consumer
Who is going to consume the data?

**Data analyst / Data scientist:**
- Data should be easy to query
- Not many complex data types
- OLAP cube?

**Other data engineers:**
- Data should be compact
- Probably harder to query
- Nested types are okay
- Master dataset >> consumed by other D.E

**M.L models:**
- Most models use identifiers and primitive-type columns
- "Flat data"

**Customers:**
- Easy data interpretation
- Data visualization

## OLTP vs Master Data vs OLAP

**OLTP** >> On-Line Transaction Processing
- Optimizes for low-latency, low-volume queries
- Mostly outside of the Data Engineering realm
- It is how Software Engineers model their data to make their online systems run quickly
- 3NF, deduplication

**OLAP** >> On-Line Analytical Processing
- Optimizes for large-volume `GROUP BY` queries, minimizes `JOIN`s
- Most common data modelling for D.E
- Big JOINs are slow

OLAP usually looks at a big chunk of the dataset while OLTP looks at one record.

**Master Data**
- Middle ground between OLTP and OLAP
- Transactional layer > Master Data layer > Analytical layer

**Mismatching the needs leads to less business value**
- The biggest D.E problems occur when data is modelled for the wrong user

Symptoms:
- Analytical modelling used transactionally >> the online app will be slow
- Transactional modelling used for analytics >> lots of JOINs
- That's where the master data middle layer helps a lot

**OLTP and OLAP is a continuum**



- OLAP Cube
  - often referred to as "slice and dice"
  - flattened data
- Metrics
  - aggregates even more
  - an OLAP cube reduced to one value

*The continuum is like "distillation": you go from lots of production databases down to one metric.*

Understanding these patterns in data modelling will simplify the D.E's life.

## Cumulative Table Design
> https://github.com/DataExpert-io/cumulative-table-design

This design produces tables that can provide efficient analyses on arbitrarily large (up to thousands of days) timeframes.

We initially build our **daily** metrics table at the grain of whatever our entity is. This data is derived from whatever event sources we have upstream.

After we have our daily metrics, we `FULL OUTER JOIN` yesterday's *cumulative table* with today's daily data and build our metric arrays for each user. This allows us to bring the new history in without having to scan all of it **(a big performance boost)**.

These metric arrays allow us to easily answer queries about the history of all users, using things like `ARRAY_SUM` to calculate whatever metric we want on whatever time frame the array allows.

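As a rough sketch of what querying those metric arrays might look like, assuming a Presto-style engine, a hypothetical `users_cumulated` table whose `hits_array` holds one value per day (most recent first), and `array_sum` / `slice` helpers (engines without `array_sum` can use `reduce` instead):

```sql
-- Hypothetical schema: users_cumulated(user_id, snapshot_date, hits_array ARRAY<INT>)
SELECT
    user_id,
    array_sum(hits_array)              AS hits_all_time,  -- whole stored history
    array_sum(slice(hits_array, 1, 7)) AS hits_last_7d    -- first 7 elements = last 7 days
FROM users_cumulated
WHERE snapshot_date = DATE '2024-11-15';
```
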
> The longer the time frame of your analysis, the more critical this pattern becomes!!

**It answers common problems when you build master data:**
- All users may not always show up
- You still want a complete history
- Holding on to all the dimensions that ever existed (history)

**Example (sketched in SQL below):**
- Two Dataframes (yesterday and today)
- `FULL OUTER JOIN` the Dataframes together
- `COALESCE` values to keep everything around
- Hang onto all of history

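A minimal sketch of that daily cumulation step, assuming hypothetical `users_cumulated` (yesterday's snapshot) and `daily_activity` (today's events) tables; names and columns are illustrative:

```sql
WITH yesterday AS (
    SELECT user_id, hits_array
    FROM users_cumulated
    WHERE snapshot_date = DATE '2024-11-14'
),
today AS (
    SELECT user_id, COUNT(*) AS hits
    FROM daily_activity
    WHERE activity_date = DATE '2024-11-15'
    GROUP BY user_id
)
-- The result would be written back as the 2024-11-15 snapshot.
SELECT
    COALESCE(y.user_id, t.user_id) AS user_id,
    DATE '2024-11-15'              AS snapshot_date,
    CASE
        -- brand-new user: start the history with today's value
        WHEN y.user_id IS NULL THEN ARRAY[COALESCE(t.hits, 0)]
        -- existing user: prepend today's value (0 if inactive) onto yesterday's history
        ELSE ARRAY[COALESCE(t.hits, 0)] || y.hits_array
    END AS hits_array
FROM yesterday AS y
FULL OUTER JOIN today AS t
    ON y.user_id = t.user_id;
```
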
### Use cases
- Growth analytics
- State transition tracking
- Growth accounting (see the sketch below):
  - Active yesterday but not today (churned)
  - Inactive yesterday but active today (resurrected)
  - Not existing yesterday but exists today (new)

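A minimal sketch of those growth-accounting states, using the same hypothetical `yesterday` / `today` shape as above (telling "new" apart from "resurrected" needs the cumulative history, e.g. a first-active date, which is exactly what the cumulative table keeps):

```sql
SELECT
    COALESCE(y.user_id, t.user_id) AS user_id,
    CASE
        WHEN y.user_id IS NOT NULL AND t.user_id IS NULL     THEN 'churned'              -- active yesterday, not today
        WHEN y.user_id IS NULL     AND t.user_id IS NOT NULL THEN 'new or resurrected'   -- active today, absent yesterday
        ELSE 'retained'                                                                  -- active on both days
    END AS growth_state
FROM yesterday AS y
FULL OUTER JOIN today AS t
    ON y.user_id = t.user_id;
```
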
### Strengths
- Ability to do historical analysis
- No need for `GROUP BY`
- Scalable queries

### Drawbacks

#### Sequential Backfilling
*Backfilling* means populating historical data retroactively. In this design, you can't load historical data in any random order - you must process it sequentially (day by day).

#### Handling *Personally Identifiable Information* (PII)
If **PII** needs to be updated or deleted (e.g., for privacy regulations like GDPR):
- You can't just update the latest record
- You need to update/remove PII from the entire history in the arrays
- This could break the historical analysis capabilities

## Compactness vs Usability tradeoff

**The most usable tables usually:**
- Have no complex data types
- Can easily be manipulated with `WHERE` and `GROUP BY`
- More analytics focused

**The most compact tables:**
- ID + blob of bytes
- Use compression codecs
- Not human readable
- Ex: for transmission over the network
- More software engineering focused

**Middle-ground tables:**
- Use complex data types (ARRAY, MAP, STRUCT)
- Querying is trickier
- Data is more compact

**Where to use each type of table?**

Most compact:
- Online systems
- When latency and volume matter a lot
- Highly technical consumers

Middle-ground:
- Upstream staging / master data
- Mostly D.E consumers

Most usable:
- OLAP cubes
- Analytics consumers

### Struct vs Array vs Map

#### Struct
- Almost like a table inside a table
- Keys are rigidly defined
- Compression is good
- Values can be any type

#### Map
- Keys are loosely defined
- Compression is okay
- Values have to be the same type

#### Array
- Ordered values
- Values are all the same type
- Values can be another complex type (Struct, Map)

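A small sketch of how the three types might appear in a table definition, using Spark/Hive-style DDL (the table and columns are made up for illustration):

```sql
CREATE TABLE listings_master (
    listing_id     BIGINT,
    -- STRUCT: rigidly defined keys, each value can have its own type
    host           STRUCT<host_id: BIGINT, name: STRING, is_superhost: BOOLEAN>,
    -- MAP: loosely defined keys, all values share the same type
    amenities      MAP<STRING, BOOLEAN>,
    -- ARRAY: ordered values of one type; elements can themselves be complex (here a struct)
    nightly_prices ARRAY<STRUCT<night: DATE, price: DECIMAL(10, 2)>>
);
```
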
## Temporal Cardinality Explosions of Dimensions

> One of the most impactful things Zach worked on.

When you add a temporal aspect to your dimensions, the *cardinality* increases by at least one order of magnitude.

**Example (both options sketched below):**
- Airbnb has around 6 million listings
- We want to know the nightly pricing and availability of each night for the next year
- 365 days * 6 million listings ~ 2 billion listing-nights
- Should this dataset be:
  - Listing-level with an array of 365 nights (6 million rows)?
  - Listing-night level with 2 billion rows?
- If you do the sorting right, Parquet will keep these two datasets about the same size

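A rough sketch of the two options, again in Spark/Hive-style DDL with illustrative names:

```sql
-- Option 1: listing-level grain (~6 million rows), one array element per night
CREATE TABLE listing_availability (
    listing_id BIGINT,
    nights     ARRAY<STRUCT<night: DATE, price: DECIMAL(10, 2), is_available: BOOLEAN>>
);

-- Option 2: listing-night grain (~2 billion rows)
CREATE TABLE listing_night_availability (
    listing_id   BIGINT,
    night        DATE,
    price        DECIMAL(10, 2),
    is_available BOOLEAN
);
-- With option 2, sorting by listing_id (then night) is what lets run-length
-- encoding collapse the repeated listing-level values and keep the sizes comparable.
```
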
### Badness of denormalized temporal dimensions
- If we choose the listing-night level, Spark `JOIN` shuffling is going to mess with the sorting
- After that, run-length encoding is not going to compress the data down as much

### Run-Length Encoding (RLE) compression
Probably **the most important compression technique in Big Data** right now.

> It's why the Parquet file format has become so successful.

Shuffle can ruin *RLE* >> **Be careful!**

> Shuffle happens in distributed environments when you use `JOIN` and `GROUP BY`.

The big thing about *RLE* is that runs of duplicated values get compressed away.

After a join, Spark may mix up the ordering of the rows and ruin RLE compression.

For example, in this NBA player-season table, `player_name`, `height`, and `college` repeat for every season a player appears in; as long as a player's rows sit together, RLE can collapse those runs.

| season | player_name    | age | height | college        |
|--------|----------------|-----|--------|----------------|
| 1996   | A.C. Green     | 33  | 6-9    | Oregon State   |
| 1996   | Michael Jordan | 34  | 6-6    | North Carolina |
| 1997   | A.C. Green     | 34  | 6-9    | Oregon State   |
| 1997   | Michael Jordan | 35  | 6-6    | North Carolina |

#### Two ways to solve the problem:
- One way to go is to re-sort the dataset after a join
  - Zach is not about that; you should sort your data once
- Store everything in an array (sketched below)
  - A row with the player name and the seasons in an array
  - `JOIN` on `player_name`
  - Explode the seasons array after the join
  - It keeps the sorting because rows with the same player name are kept together after the explode

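A minimal sketch of the array approach in Spark SQL, assuming hypothetical `players` (one row per player, the seasons history in an ordered array of structs) and `player_draft` tables:

```sql
-- Join at the player grain first, then explode the seasons array.
-- Rows for the same player stay together after the explode, so the sorting
-- that run-length encoding relies on survives the join.
SELECT
    p.player_name,
    p.height,
    p.college,
    d.draft_year,
    s.season,
    s.age
FROM players AS p
JOIN player_draft AS d
    ON p.player_name = d.player_name
LATERAL VIEW explode(p.seasons) exploded AS s;
```
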
Leveraging this for master data can be powerful, especially for downstream data engineers:
- You model the data the right way so they can't make that mistake

**Spark shuffles are a big thing to watch out for**