From 39761d7afdff340361c3fe6ffcd1e63d1441aaab Mon Sep 17 00:00:00 2001
From: harvard-edge
Date: Sun, 2 Feb 2025 14:38:17 +0000
Subject: [PATCH] Push dev branch build

---
 docs/contents/core/training/training.html | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/docs/contents/core/training/training.html b/docs/contents/core/training/training.html
index 31e9d0f1..57d9b62b 100644
--- a/docs/contents/core/training/training.html
+++ b/docs/contents/core/training/training.html
@@ -822,10 +822,10 @@

8.2.1 Evolution of Systems

Computing system architectures have evolved through distinct generations, each era building on previous advances while introducing specialized optimizations for emerging application requirements (Figure 8.1). This progression shows how hardware adapts to application needs and, in doing so, shapes modern machine learning systems.

[Diff hunk: embedded TikZ source for Figure 8.1; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -1777,10 +1777,10 @@ 

8.5.1 Prefetching and Overlapping

Training machine learning models involves significant data movement between storage, memory, and computational units. The data pipeline consists of sequential transfers: from disk storage to CPU memory, CPU memory to GPU memory, and through the GPU processing units. In standard implementations, each transfer must complete before the next begins, as shown in Figure 8.7, resulting in computational inefficiencies.
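To make this concrete, the sketch below shows how prefetching is commonly expressed in PyTorch; this is an assumed framework choice, and the dataset is synthetic. Background worker processes load and preprocess upcoming batches while the accelerator consumes the current one, and pinned host memory with non-blocking copies shortens the disk-to-CPU-to-GPU path.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset.
dataset = TensorDataset(torch.randn(10_000, 128),
                        torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,       # background workers prefetch upcoming batches
    prefetch_factor=2,   # each worker keeps two batches staged in CPU memory
    pin_memory=True,     # page-locked buffers accelerate the CPU-to-GPU copy
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for inputs, targets in loader:
    # non_blocking=True lets the host-to-device copy proceed asynchronously
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward and backward passes on the current batch go here ...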

[Diff hunk: embedded TikZ source for Figure 8.7; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -1879,10 +1879,10 @@ 


Overlapping builds upon prefetching by coordinating multiple pipeline stages to execute concurrently. The system processes the current batch while simultaneously preparing future batches through data loading and preprocessing operations. This coordination establishes a continuous data flow through the training pipeline, as illustrated in Figure 8.8.
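As one concrete illustration (not taken from the chapter), TensorFlow's tf.data API expresses this kind of overlap declaratively: map() runs preprocessing in parallel with data loading, and prefetch() keeps future batches in flight while the accelerator works on the current one. The record file name and feature schema below are placeholders.

import tensorflow as tf

def parse_example(record):
    # Placeholder schema: a 128-dimensional float feature and an integer label.
    features = tf.io.parse_single_example(
        record,
        {"x": tf.io.FixedLenFeature([128], tf.float32),
         "y": tf.io.FixedLenFeature([], tf.int64)},
    )
    return features["x"], features["y"]

dataset = (
    tf.data.TFRecordDataset("train.tfrecord")                  # hypothetical shard
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)   # preprocessing overlaps loading
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)                                # loading overlaps training
)

# Training would now consume batches that were prepared while the previous
# step was still executing on the accelerator.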

[Diff hunk: embedded TikZ source for Figure 8.8; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -2127,10 +2127,10 @@ 

For a model with \(10^9\) parameters, this reduction cuts memory usage from 4 GB to 2 GB. The savings enable larger batch sizes and deeper architectures on the same hardware.

The numerical precision differences between these formats shape their use cases. FP32 represents numbers from approximately \(\pm1.18 \times 10^{-38}\) to \(\pm3.4 \times 10^{38}\) with about 7 decimal digits of precision. FP16 ranges from \(\pm6.10 \times 10^{-5}\) to \(\pm65,504\) with 3-4 decimal digits of precision. Bfloat16, developed by Google Brain, maintains the same dynamic range as FP32 (\(\pm1.18 \times 10^{-38}\) to \(\pm3.4 \times 10^{38}\)) but with reduced precision (roughly 2-3 decimal digits). This range preservation makes bfloat16 particularly suited for deep learning training, as it handles large and small gradients more effectively than FP16.
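These ranges can be read directly from a framework's dtype metadata; the snippet below (PyTorch assumed) prints the smallest positive normal value, the maximum, and the machine epsilon for each format, along with the per-parameter memory arithmetic for a hypothetical one-billion-parameter model.

import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} tiny={info.tiny:.2e} max={info.max:.2e} "
          f"eps={info.eps:.1e} bytes/param={info.bits // 8}")

n_params = 10**9  # hypothetical one-billion-parameter model
print(f"FP32 weights: {n_params * 4 / 1e9:.0f} GB")           # ~4 GB
print(f"FP16/bfloat16 weights: {n_params * 2 / 1e9:.0f} GB")  # ~2 GB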

The hybrid approach proceeds in three main phases, as illustrated in Figure 8.9. During the forward pass, input data is converted to reduced precision (FP16 or bfloat16), and matrix multiplications, including activation function computations, execute in this format. In the gradient computation phase, the backward pass calculates gradients in reduced precision; these gradients are then cast to FP32 for application to the FP32 master weights. Finally, during weight updates, the optimizer updates the master weights in FP32, and the updated weights are converted back to reduced precision for the next forward pass.
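A minimal sketch of these three phases using PyTorch's automatic mixed-precision API follows; this is an assumed implementation choice, not the only one. autocast handles the casts to FP16 around matrix multiplications, while the parameters the optimizer updates remain the FP32 master copy.

import torch

scaler = torch.cuda.amp.GradScaler()  # rescales FP16 gradients to avoid underflow

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    # Phase 1: forward pass. Matrix multiplications and activations run in
    # FP16 under autocast; the stored parameters stay in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    # Phase 2: backward pass. Gradients are computed in reduced precision;
    # the loss is scaled so small gradient values survive FP16's narrow range.
    scaler.scale(loss).backward()
    # Phase 3: weight update. Gradients are unscaled and applied to the FP32
    # master weights; the next forward pass re-casts them under autocast.
    scaler.step(optimizer)
    scaler.update()
    return loss.item()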

[Diff hunk: embedded TikZ source for Figure 8.9; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -2564,10 +2564,10 @@ 

Mathematical Formulation

Mechanics

The process of data parallelism can be broken into a series of distinct steps, each with its role in ensuring the system operates efficiently. These steps are illustrated in Figure 8.11.
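One common realization of these steps is PyTorch's DistributedDataParallel, sketched below; the framework and the worker function name are illustrative assumptions. Each process holds a full model replica, a distributed sampler hands it a distinct shard of each batch, gradients are averaged across replicas during the backward pass, and every replica then applies the same optimizer update.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def worker(rank, world_size, model, dataset):
    # One process per device joins the training job.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = DDP(model.to(rank), device_ids=[rank])   # full replica per device
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for inputs, targets in loader:
        inputs, targets = inputs.to(rank), targets.to(rank)
        loss = loss_fn(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()     # DDP all-reduces (averages) gradients across replicas
        optimizer.step()    # every replica applies the identical update
    dist.destroy_process_group()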

[Diff hunk: embedded TikZ source for Figure 8.11; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -2710,10 +2710,10 @@ 

Mechanics

Model parallelism divides neural networks across multiple computing devices, with each device computing a distinct portion of the model’s operations. This division allows training of models whose parameter counts exceed single-device memory capacity. The technique encompasses device coordination, data flow management, and gradient computation across distributed model segments. The mechanics of model parallelism are illustrated in Figure 8.12. These steps are described next:
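A minimal two-device sketch in PyTorch (an assumed framework, with illustrative layer sizes) shows the core idea: each stage of the network lives on its own device, activations cross the device boundary in the forward pass, and autograd routes the corresponding gradients back during the backward pass.

import torch
import torch.nn as nn

class TwoDeviceMLP(nn.Module):
    """Toy model split across two GPUs; layer sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))      # device 0 computes its segment
        return self.stage1(h.to("cuda:1"))   # activations transfer to device 1

model = TwoDeviceMLP()
logits = model(torch.randn(32, 1024))        # forward pass spans both devices
logits.sum().backward()                      # gradients flow back across the boundary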

[Diff hunk: embedded TikZ source for Figure 8.12; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -3102,10 +3102,10 @@ 

Figure 8.15 provides a general guideline for selecting parallelism strategies in distributed training systems. While the chart offers a structured decision process based on model size, dataset size, and scaling constraints, it is intentionally simplified. Real-world scenarios often involve additional complexities, such as hardware heterogeneity, communication bandwidth, and workload imbalance, that may influence the choice of parallelism technique. The chart is therefore best viewed as a starting point for understanding the trade-offs and decision points in parallelism strategy selection; practitioners should adapt it to the specific requirements and constraints of their systems to achieve optimal performance.
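To make the decision points concrete, the helper below encodes one simplified version of such a guideline in code; the conditions and strategy names are illustrative assumptions, not a transcription of Figure 8.15.

def choose_parallelism(model_fits_on_one_device: bool,
                       dataset_is_large: bool,
                       can_scale_out: bool) -> str:
    """Illustrative heuristic only; real systems also weigh bandwidth,
    hardware heterogeneity, and workload balance."""
    if model_fits_on_one_device:
        # Replicating the whole model is the simplest option when memory allows.
        if dataset_is_large and can_scale_out:
            return "data parallelism"
        return "single-device training"
    # The model itself must be partitioned across devices.
    if dataset_is_large and can_scale_out:
        return "hybrid parallelism (model/pipeline + data)"
    return "model or pipeline parallelism"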

[Diff hunk: embedded TikZ source for Figure 8.15; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]