✨ feat( DataExpert.io ):
- Day 1 lecture
glopez-dev committed Nov 22, 2024
1 parent 7f96162 commit e80fc02
Showing 22 changed files with 3,007 additions and 30 deletions.
@@ -0,0 +1,241 @@
---
date: 2024-11-15T17:46:24+01:00
draft: false
author: Gabriel LOPEZ
title: (DataExpert.io) Bootcamp - Day 1 - Lecture
---

> Today's lecture deals with **complex data types** and **cumulation**

### What is a dimension ?

Dimensions are the attributes of an entity

- Some dimensions are identifiers
- Some dimensions are just attributes

Dimensions come in two flavors (generally) :
- Slowly changing (time dependent)
  - Makes things harder to model
- Fixed (doesn't change over time)

## Topics of the day (index)
- Knowing your data consumer
- OLTP vs OLAP modelling
- Cumulative table design
- The compactness vs usability tradeoff
- Temporal cardinality explosion
- Run-length encoding compression gotchas

## Knowing your consumer
Who is going to consume the data ?

**Data analyst / Data scientist :**
- Data should be easy to query
- Not many complex data types
- OLAP cube ?

**Other data engineers :**
- Data should be compact
- Probably harder to query
- Nested types are okay
- Master dataset >> consumed by other D.E

**M.L models :**
- Most models use identifiers and primitive-type columns
- "Flat data"

**Customers :**
- Easy data interpretation
- Data visualization

## OLTP vs Master Data vs OLAP

**OLTP** >> On Line Transaction Processing
- Optimizes for low-latency, low volume queries
- Mostly outside of Data Engineering realm
- It is how Software Engineers model their data to make their online systems run quickly
- 3NF, deduplication

**OLAP** >> On Line Analytical Processing
- Optimizes for large-volume `GROUP BY` queries and minimizes `JOIN`s
- Most common data modelling for D.E
- Big JOINs are slow

OLAP usually looks at a big chunk of the dataset while OLTP looks at one record
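
As a rough illustration of that difference, the sketch below uses Python's built-in `sqlite3` module with a made-up `orders` table (the table, its columns, and the rows are assumptions for the example, not from the lecture). The first query is OLTP-style: it fetches one record by key. The second is OLAP-style: it scans the whole table and aggregates with `GROUP BY`.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 5.0)],
)

# OLTP-style access: look up a single record by its key.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style access: read every row and aggregate per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

Here `one_order` touches one row, while `totals` has to read all of them, which is why the two workloads are modelled differently.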

**Master Data**
- Middle ground between OLTP and OLAP
- Transactional layer > Master Data layer > Analytical layer

**Mismatching the needs leads to less business value**
- The biggest D.E problems occur when data is modelled for the wrong consumer

Symptoms :
- Analytical modelling used for transactions >> the online app will be slow
- Transactional modelling used for analytics >> lots of `JOIN`s
- That's where the middle role of master data helps a lot

**OLTP and OLAP form a continuum**

![oltp_olap_continuum](/images/oltp_olap_continuum.png)

- OLAP Cube
  - often referred to as "slice and dice"
  - flattened data
- Metrics
  - aggregates even more
  - OLAP cube to one value

*The continuum is like "distillation": you go from lots of production databases to one metric*

Understanding these patterns in data modelling will simplify a D.E's life

## Cumulative Table Design
> https://github.com/DataExpert-io/cumulative-table-design

This design produces tables that support efficient analyses over arbitrarily large (up to thousands of days) time frames.

We initially build our **daily** metrics table that is at the grain of whatever our entity is. This data is derived from whatever event sources we have upstream.

After we have our daily metrics, we `FULL OUTER JOIN` yesterday's *cumulative table* with today's daily data and build our metric arrays for each user. This allows us to bring the new history in without having to scan all of it. **(a big performance boost)**

These metric arrays allow us to easily answer queries about the history of all users using things like `ARRAY_SUM` to calculate whatever metric we want on whatever time frame the array allows.
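
A minimal Python sketch of what such a query boils down to, assuming each user's metric array stores one value per day with the most recent day first (that ordering, the user names, and the numbers are assumptions for illustration). Summing a slice of the array plays the role of `ARRAY_SUM` over a time frame.

```python
from typing import Dict, List

# Hypothetical cumulative rows: one metric array per user, most recent day first.
cumulative: Dict[str, List[int]] = {
    "alice": [3, 0, 5, 1, 0, 2, 4],
    "bob": [0, 0, 1, 0, 0, 0, 0],
}


def metric_over_last_n_days(metric_array: List[int], n: int) -> int:
    """Sum the n most recent daily values stored in a metric array."""
    return sum(metric_array[:n])


# One pass over one row per user answers the question; no scan of raw events.
last_3_days = {user: metric_over_last_n_days(arr, 3) for user, arr in cumulative.items()}
last_7_days = {user: metric_over_last_n_days(arr, 7) for user, arr in cumulative.items()}
```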

> The longer the time frame of your analysis, the more critical this pattern becomes!!

**It answers common problems when you build master data :**
- Not all users show up every day
- You still want a complete history
- Holding on to all the dimensions that ever existed (history)

**Example :**
- Two Dataframes (yesterday and today)
- `FULL OUTER JOIN` the Dataframes together
- `COALESCE` values to keep everything around
- Hang onto all of history
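
The steps above can be sketched in plain Python, with dictionaries standing in for the two DataFrames. The function name `cumulate`, the `max_days` cap, and the choice to record `0` for users who are absent today are assumptions made for this example.

```python
from typing import Dict, List


def cumulate(
    yesterday: Dict[str, List[int]],
    today: Dict[str, int],
    max_days: int = 30,
) -> Dict[str, List[int]]:
    """Merge yesterday's cumulative arrays with today's daily metrics."""
    result: Dict[str, List[int]] = {}
    # FULL OUTER JOIN: keep every user that appears on either side.
    for user in sorted(yesterday.keys() | today.keys()):
        history = yesterday.get(user)    # None when the user is new today
        todays_value = today.get(user)   # None when the user did not show up
        # COALESCE: fall back to an empty history or a zero metric.
        previous = history if history is not None else []
        value = todays_value if todays_value is not None else 0
        # Newest value first; trim so the array never exceeds max_days.
        result[user] = ([value] + previous)[:max_days]
    return result


yesterday = {"alice": [2, 1], "bob": [4]}
today = {"alice": 5, "carol": 7}
merged = cumulate(yesterday, today)
```

`bob` did not show up today but keeps his history, and `carol` starts a new array, so nothing is lost.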

### Use cases
- Growth analytics
- State transition tracking
- Growth accounting :
  - Active yesterday but not today (churned)
  - Inactive yesterday but active today (resurrected)
  - Not existing yesterday but exists today (new)
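
A small Python sketch of these state transitions, using sets of user ids. The labels `churned`, `resurrected`, and `new` come from the list above; `retained` and `stale` are extra labels assumed here so that every user gets a state.

```python
from typing import Dict, Set


def growth_accounting(
    known_before: Set[str],
    active_yesterday: Set[str],
    active_today: Set[str],
) -> Dict[str, str]:
    """Label each user with a growth-accounting state for today."""
    known = known_before | active_yesterday  # anyone active yesterday already existed
    states: Dict[str, str] = {}
    for user in sorted(known | active_today):
        was_active = user in active_yesterday
        is_active = user in active_today
        if user not in known:
            states[user] = "new"            # did not exist yesterday
        elif was_active and not is_active:
            states[user] = "churned"
        elif not was_active and is_active:
            states[user] = "resurrected"
        elif was_active and is_active:
            states[user] = "retained"       # assumed label, not in the list above
        else:
            states[user] = "stale"          # assumed label, not in the list above
    return states


states = growth_accounting(
    known_before={"a", "b", "c", "d"},
    active_yesterday={"a", "c"},
    active_today={"b", "c", "e"},
)
```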

### Strengths
- Ability to do historical analysis
- No need for `GROUP BY`
- Scalable queries

### Drawbacks

#### Sequential Backfilling
*Backfilling* means populating historical data retroactively. In this design, you can't load historical data in any random order - you must process it sequentially (day by day)
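
To make the sequential dependency concrete, here is a hedged Python sketch of a backfill loop. The `backfill` function and the shape of `daily` (a mapping from date to per-user metric) are assumptions for the example. Each iteration needs the cumulative state produced by the previous day, so the days cannot be processed out of order or in parallel.

```python
from datetime import date, timedelta
from typing import Dict, List


def backfill(
    daily: Dict[date, Dict[str, int]], start: date, end: date
) -> Dict[str, List[int]]:
    """Rebuild the cumulative table one day at a time, oldest day first."""
    cumulative: Dict[str, List[int]] = {}
    day = start
    while day <= end:
        todays = daily.get(day, {})
        # Today's output depends on the state left by the previous iteration.
        cumulative = {
            user: [todays.get(user, 0)] + cumulative.get(user, [])
            for user in cumulative.keys() | todays.keys()
        }
        day += timedelta(days=1)
    return cumulative


daily = {
    date(2024, 1, 1): {"alice": 1},
    date(2024, 1, 2): {"bob": 2},
    date(2024, 1, 3): {"alice": 3},
}
rebuilt = backfill(daily, date(2024, 1, 1), date(2024, 1, 3))
```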
#### Handling *Personally Identifiable Information* (PII)
If **PII** needs to be updated or deleted (e.g., for privacy regulations like GDPR):
- You can't just update the latest record
- You need to update/remove PII from the entire history in the arrays
- This could break the historical analysis capabilities

## Compactness vs Usability tradeoff

**The most usable tables usually :**
- Have no complex data types
- Can easily be manipulated with `WHERE` and `GROUP BY`
- More analytics focused

**The most compact tables :**
- ID + blob of bytes
- Use compression codecs
- Not human readable
- Ex : for transmission over the network
- More software engineering focused

**Middleground tables :**
- Use complex data types (ARRAY, MAP, STRUCT)
- Querying is trickier
- Data is more compact

**Where to use each type of table ?**

Most compact :
- Online systems
- When latency and volume matter a lot
- Highly technical consumers

Middle-ground :
- Upstream staging / master data
- Mostly D.E consumers

Most usable :
- OLAP cubes
- Analytics consumers

### Struct vs Array vs Map
#### Struct
- Almost like a table inside a table
- Keys are rigidly defined
- Compression is good
- Values can be any type
#### Map
- Keys are loosely defined
- Compression is okay
- Values have to be the same type
#### Array
- Ordered values
- Values are all the same type
- Values can be another complex type (Struct, Map)
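
As a loose analogy in Python (not how a query engine stores these types), a struct behaves like a dataclass with fixed fields, a map like a dictionary whose values share one type, and an array like an ordered list. The `SeasonStats` fields and the sample values below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SeasonStats:
    """Struct: keys are rigidly defined, each value can have its own type."""
    season: int
    points_per_game: float
    team: str


# Map: keys are loosely defined, but every value has the same type (str here).
player_attributes: Dict[str, str] = {"college": "Oregon State", "height": "6-9"}

# Array: ordered values of one type, which can itself be a complex type.
seasons: List[SeasonStats] = [
    SeasonStats(season=1996, points_per_game=7.2, team="DAL"),
    SeasonStats(season=1997, points_per_game=7.3, team="DAL"),
]

# Arrays keep their order, so positional access is meaningful.
first_season = seasons[0].season
# Maps accept new keys at any time without a schema change.
player_attributes["country"] = "USA"
```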

## Temporal Cardinality Explosions of Dimensions

> One of the most impactful things Zach worked on.

This happens when you add a temporal aspect to your dimensions and the *cardinality* increases by at least one order of magnitude.

**Example :**
- Airbnb has around 6 million listings
- We want to know the nightly pricing and availability of each night for the next year
- 365 days * 6 million listings ~ 2 billion nights
- Should this dataset be :
  - Listing-level with an array of 365 nights (6 million rows) ?
  - Listing-night level with 2 billion rows ?
- If you do the sorting right, Parquet will keep these two datasets about the same size
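
The arithmetic and the two layouts can be sketched in Python. The tiny dataset (2 listings, 3 nights) and the field names are assumptions for illustration; only the 6 million and 365 figures come from the example above.

```python
from typing import Dict, List

LISTINGS = 6_000_000
NIGHTS_PER_YEAR = 365

# Row counts for the two candidate layouts.
listing_level_rows = LISTINGS
listing_night_rows = LISTINGS * NIGHTS_PER_YEAR  # about 2.19 billion

# Listing-level layout: one row per listing, nights held in an array.
listing_level: List[Dict] = [
    {"listing_id": 1, "nights": [{"night": n, "price": 100 + n} for n in range(3)]},
    {"listing_id": 2, "nights": [{"night": n, "price": 80 + n} for n in range(3)]},
]


def to_listing_night_level(rows: List[Dict]) -> List[Dict]:
    """Flatten to one row per (listing, night); row count multiplies by nights."""
    return [
        {"listing_id": row["listing_id"], **night}
        for row in rows
        for night in row["nights"]
    ]


listing_night_level = to_listing_night_level(listing_level)
```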

### Badness of denormalized temporal dimensions
- If we choose the listing-night level, Spark `JOIN` shuffling is going to mess with the sorting
- Run-length encoding then won't compress the data down as much
### Run-Length Encoding (RLE) compression
Probably **the most important compression technique in Big Data** right now.

> It's why the Parquet file format has become so successful

Shuffle can ruin *RLE* >> **Be careful!**

> Shuffle happens in distributed environments when you use `JOIN` and `GROUP BY`.

The big thing about *RLE* is that a run of duplicated values is stored once with a count, so the repeats take almost no space

After a join, Spark may mix up the ordering of the rows and ruin RLE compression.

| season | player_name | age | height | college |
|--------|----------------|-----|--------|----------------|
| 1996 | A.C. Green | 33 | 6-9 | Oregon State |
| 1996 | Michael Jordan | 34 | 6-6 | North Carolina |
| 1997 | A.C. Green | 34 | 6-9 | Oregon State |
| 1997 | Michael Jordan | 35 | 6-6 | North Carolina |
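
A simplified Python sketch of the idea, using the `player_name` column from the table above. This is a toy encoder, not Parquet's actual implementation: it only shows that the same values compress to fewer runs when identical values sit next to each other.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Collapse each run of consecutive equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Rows sorted by player: equal names are adjacent, so runs are long.
sorted_by_player = ["A.C. Green", "A.C. Green", "Michael Jordan", "Michael Jordan"]

# Rows ordered by season (as in the table): names alternate, so every run has length 1.
ordered_by_season = ["A.C. Green", "Michael Jordan", "A.C. Green", "Michael Jordan"]

good_runs = run_length_encode(sorted_by_player)
bad_runs = run_length_encode(ordered_by_season)
```

`good_runs` holds two entries while `bad_runs` holds four, one per row, which is the effect a shuffle has on a previously sorted column.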


#### Two ways to solve the problem :
- One way to go is to re-sort the dataset after a join
  - Zach is not about that; you should sort your data once
- Store everything in an array
  - A row with player name and seasons in an array
  - `JOIN` on `player name`
  - Explode the seasons array after the join
  - It keeps the sorting because rows with the same player name are kept together after the explode
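
The array approach can be sketched in plain Python, with lists and a dictionary standing in for Spark DataFrames (the variable names and the `colleges` lookup side of the join are assumptions for this example). The join happens at one row per player, and the explode afterwards emits each player's seasons next to each other.

```python
from typing import Dict, List, Tuple

# One row per player, with all seasons held in an array.
players: List[Tuple[str, List[int]]] = [
    ("A.C. Green", [1996, 1997]),
    ("Michael Jordan", [1996, 1997]),
]

# Hypothetical table to join against, keyed by player name.
colleges: Dict[str, str] = {
    "A.C. Green": "Oregon State",
    "Michael Jordan": "North Carolina",
}

# JOIN on player name while the data is still one row per player.
joined = [(name, seasons, colleges[name]) for name, seasons in players]

# Explode the seasons array after the join: rows for a player stay contiguous.
exploded = [
    (name, season, college)
    for name, seasons, college in joined
    for season in seasons
]

player_column = [row[0] for row in exploded]
```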

Leveraging this for master data can be powerful, especially for downstream data engineers
- You model data the right way so they can't make that mistake

**Spark shuffles are a big thing to watch out for**


14 changes: 8 additions & 6 deletions hugo/public/categories/index.html
@@ -1,11 +1,11 @@
<!DOCTYPE html>
<html><head lang="en"><script src="/livereload.js?mindelay=10&amp;v=2&amp;port=1313&amp;path=livereload" data-no-instant defer></script>
<html><head lang="en">
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Categories - Gabriel Study Blog</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="" />
<meta property="og:image" content=""/>
<link rel="alternate" type="application/rss+xml" href="http://localhost:1313/categories/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="http://localhost:1313/categories/">
<link rel="alternate" type="application/rss+xml" href="https://glopez.github.io/categories/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="https://glopez.github.io/categories/">
<meta property="og:site_name" content="Gabriel Study Blog">
<meta property="og:title" content="Categories">
<meta property="og:locale" content="en">
Expand All @@ -15,11 +15,11 @@
<meta name="twitter:title" content="Categories">


<link href="http://localhost:1313/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
<link href="https://glopez.github.io/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">



<link rel="stylesheet" type="text/css" media="screen" href="http://localhost:1313/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://glopez.github.io/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />



Expand All @@ -32,7 +32,7 @@
<body>
<div class="content"><header>
<div class="main">
<a href="http://localhost:1313/">Gabriel Study Blog</a>
<a href="https://glopez.github.io/">Gabriel Study Blog</a>
</div>
<nav>

Expand Down Expand Up @@ -61,6 +61,8 @@ <h1 class="page-title">All tags</h1>
href="https://github.com/athul/archie">Archie Theme</a> | Built with <a href="https://gohugo.io">Hugo</a>
</div>
</footer>


</div>
</body>
</html>
4 changes: 2 additions & 2 deletions hugo/public/categories/index.xml
Expand Up @@ -2,10 +2,10 @@
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Categories on Gabriel Study Blog</title>
<link>http://localhost:1313/categories/</link>
<link>https://glopez.github.io/categories/</link>
<description>Recent content in Categories on Gabriel Study Blog</description>
<generator>Hugo</generator>
<language>en</language>
<atom:link href="http://localhost:1313/categories/index.xml" rel="self" type="application/rss+xml" />
<atom:link href="https://glopez.github.io/categories/index.xml" rel="self" type="application/rss+xml" />
</channel>
</rss>
Binary file added hugo/public/images/oltp_olap_continuum.png
53 changes: 47 additions & 6 deletions hugo/public/index.html
@@ -1,12 +1,12 @@
<!DOCTYPE html>
<html>
<head lang="en"><script src="/livereload.js?mindelay=10&amp;v=2&amp;port=1313&amp;path=livereload" data-no-instant defer></script>
<head lang="en">
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Gabriel Study Blog | Home </title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="" />
<meta property="og:image" content=""/>
<link rel="alternate" type="application/rss+xml" href="http://localhost:1313/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="http://localhost:1313/">
<link rel="alternate" type="application/rss+xml" href="https://glopez.github.io/index.xml" title="Gabriel Study Blog" />
<meta property="og:url" content="https://glopez.github.io/">
<meta property="og:site_name" content="Gabriel Study Blog">
<meta property="og:title" content="Gabriel Study Blog">
<meta property="og:locale" content="en">
Expand All @@ -16,11 +16,11 @@
<meta name="twitter:title" content="Gabriel Study Blog">


<link href="http://localhost:1313/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">
<link href="https://glopez.github.io/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" rel="stylesheet">



<link rel="stylesheet" type="text/css" media="screen" href="http://localhost:1313/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://glopez.github.io/css/main.5cebd7d4fb2b97856af8d32a6def16164fcf7d844e98e236fcb3559655020373.css" />



Expand All @@ -35,7 +35,7 @@
<div class="content">
<header>
<div class="main">
<a href="http://localhost:1313/">Gabriel Study Blog</a>
<a href="https://glopez.github.io/">Gabriel Study Blog</a>
</div>
<nav>

Expand All @@ -49,6 +49,45 @@



<section class="list-item">
<h1 class="title"><a href="/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">(DataExpert.io) Bootcamp - Day 1 - Lecture</a></h1>
<time>Nov 15, 2024</time>
<br><div class="description">

<blockquote>
<p>Today&rsquo;s lecture deals with <strong>complex data types</strong> and <strong>cumulation</strong></p>
</blockquote>
<h3 id="what-is-a-dimension-">What is a dimension ?</h3>
<p>Dimensions are the attributes of an entity</p>
<ul>
<li>Some dimensions are identifiers</li>
<li>Some dimensions are just attributes</li>
</ul>
<p>Dimensions come in two flavors (generally) :</p>
<ul>
<li>Slowly changing (time dependent)
<ul>
<li>Makes things harder to model</li>
</ul>
</li>
<li>Fixed (doesn&rsquo;t change over time)</li>
</ul>
<h2 id="topics-of-the-day-index">Topics of the day (index)</h2>
<ul>
<li>Knowing your data consumer</li>
<li>OLTP vs OLAP modelling</li>
<li>Cumulative table design</li>
<li>The compactness vs usability tradeoff</li>
<li>Temporal cardinality explosion</li>
<li>Run-length encoding compression gotchas</li>
</ul>
<h2 id="knowing-your-consumer">Knowing your consumer</h2>
<p>Who is going to consume the data ?</p>&hellip;

</div>
<a class="readmore" href="/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">Read more ⟶</a>
</section>




Expand All @@ -61,6 +100,8 @@
</div>
</footer>



</div>

</body>