From 39761d7afdff340361c3fe6ffcd1e63d1441aaab Mon Sep 17 00:00:00 2001
From: harvard-edge
Date: Sun, 2 Feb 2025 14:38:17 +0000
Subject: [PATCH] Push dev branch build

---
 docs/contents/core/training/training.html | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/docs/contents/core/training/training.html b/docs/contents/core/training/training.html
index 31e9d0f1..57d9b62b 100644
--- a/docs/contents/core/training/training.html
+++ b/docs/contents/core/training/training.html
@@ -822,10 +822,10 @@

8.2.1 Evolution of Systems

Computing system architectures have evolved through distinct generations, each era building on previous advances while introducing specialized optimizations for emerging application requirements (Figure 8.1). This progression shows how hardware adapts to application needs and, in doing so, shapes modern machine learning systems.

[Diff hunk: embedded TikZ source for Figure 8.1; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -1777,10 +1777,10 @@ 

8.5.1 Prefetching and Overlapping

Training machine learning models involves significant data movement between storage, memory, and computational units. The data pipeline consists of sequential transfers: from disk storage to CPU memory, CPU memory to GPU memory, and through the GPU processing units. In standard implementations, each transfer must complete before the next begins, as shown in Figure 8.7, resulting in computational inefficiencies.
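To make this concrete, the sketch below shows how prefetching is commonly expressed in PyTorch; this is an assumed framework choice, and the dataset is synthetic. Background worker processes load and preprocess upcoming batches while the accelerator consumes the current one, and pinned host memory with non-blocking copies shortens the disk-to-CPU-to-GPU path.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset.
dataset = TensorDataset(torch.randn(10_000, 128),
                        torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,       # background workers prefetch upcoming batches
    prefetch_factor=2,   # each worker keeps two batches staged in CPU memory
    pin_memory=True,     # page-locked buffers accelerate the CPU-to-GPU copy
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for inputs, targets in loader:
    # non_blocking=True lets the host-to-device copy proceed asynchronously
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward and backward passes on the current batch go here ...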

[Diff hunk: embedded TikZ source for Figure 8.7; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -1879,10 +1879,10 @@ 


Overlapping builds upon prefetching by coordinating multiple pipeline stages to execute concurrently. The system processes the current batch while simultaneously preparing future batches through data loading and preprocessing operations. This coordination establishes a continuous data flow through the training pipeline, as illustrated in Figure 8.8.
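As one concrete illustration (not taken from the chapter), TensorFlow's tf.data API expresses this kind of overlap declaratively: map() runs preprocessing in parallel with data loading, and prefetch() keeps future batches in flight while the accelerator works on the current one. The record file name and feature schema below are placeholders.

import tensorflow as tf

def parse_example(record):
    # Placeholder schema: a 128-dimensional float feature and an integer label.
    features = tf.io.parse_single_example(
        record,
        {"x": tf.io.FixedLenFeature([128], tf.float32),
         "y": tf.io.FixedLenFeature([], tf.int64)},
    )
    return features["x"], features["y"]

dataset = (
    tf.data.TFRecordDataset("train.tfrecord")                  # hypothetical shard
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)   # preprocessing overlaps loading
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)                                # loading overlaps training
)

# Training would now consume batches that were prepared while the previous
# step was still executing on the accelerator.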

[Diff hunk: embedded TikZ source for Figure 8.8; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -2127,10 +2127,10 @@ 

For a model with \(10^9\) parameters, this reduction cuts memory usage from 4 GB to 2 GB. The savings enable larger batch sizes and deeper architectures on the same hardware.

The numerical precision differences between these formats shape their use cases. FP32 represents numbers from approximately \(\pm1.18 \times 10^{-38}\) to \(\pm3.4 \times 10^{38}\) with about 7 decimal digits of precision. FP16 ranges from \(\pm6.10 \times 10^{-5}\) to \(\pm65,504\) with 3-4 decimal digits of precision. Bfloat16, developed by Google Brain, maintains the same dynamic range as FP32 (\(\pm1.18 \times 10^{-38}\) to \(\pm3.4 \times 10^{38}\)) but with reduced precision (roughly 2-3 decimal digits). This range preservation makes bfloat16 particularly suited for deep learning training, as it handles large and small gradients more effectively than FP16.
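These ranges can be read directly from a framework's dtype metadata; the snippet below (PyTorch assumed) prints the smallest positive normal value, the maximum, and the machine epsilon for each format, along with the per-parameter memory arithmetic for a hypothetical one-billion-parameter model.

import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} tiny={info.tiny:.2e} max={info.max:.2e} "
          f"eps={info.eps:.1e} bytes/param={info.bits // 8}")

n_params = 10**9  # hypothetical one-billion-parameter model
print(f"FP32 weights: {n_params * 4 / 1e9:.0f} GB")           # ~4 GB
print(f"FP16/bfloat16 weights: {n_params * 2 / 1e9:.0f} GB")  # ~2 GB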

The hybrid approach proceeds in three main phases, as illustrated in Figure 8.9. During the forward pass, input data is converted to reduced precision (FP16 or bfloat16), and matrix multiplications, including activation function computations, execute in this format. In the gradient computation phase, the backward pass calculates gradients in reduced precision; these gradients are then cast to FP32 for application to the FP32 master weights. Finally, during weight updates, the optimizer updates the master weights in FP32, and the updated weights are converted back to reduced precision for the next forward pass.
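A minimal sketch of these three phases using PyTorch's automatic mixed-precision API follows; this is an assumed implementation choice, not the only one. autocast handles the casts to FP16 around matrix multiplications, while the parameters the optimizer updates remain the FP32 master copy.

import torch

scaler = torch.cuda.amp.GradScaler()  # rescales FP16 gradients to avoid underflow

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    # Phase 1: forward pass. Matrix multiplications and activations run in
    # FP16 under autocast; the stored parameters stay in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    # Phase 2: backward pass. Gradients are computed in reduced precision;
    # the loss is scaled so small gradient values survive FP16's narrow range.
    scaler.scale(loss).backward()
    # Phase 3: weight update. Gradients are unscaled and applied to the FP32
    # master weights; the next forward pass re-casts them under autocast.
    scaler.step(optimizer)
    scaler.update()
    return loss.item()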

[Diff hunk: embedded TikZ source for Figure 8.9; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -2564,10 +2564,10 @@ 

Mathematical Formulation

Mechanics

The process of data parallelism can be broken into a series of distinct steps, each with its role in ensuring the system operates efficiently. These steps are illustrated in Figure 8.11.
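One common realization of these steps is PyTorch's DistributedDataParallel, sketched below; the framework and the worker function name are illustrative assumptions. Each process holds a full model replica, a distributed sampler hands it a distinct shard of each batch, gradients are averaged across replicas during the backward pass, and every replica then applies the same optimizer update.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def worker(rank, world_size, model, dataset):
    # One process per device joins the training job.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = DDP(model.to(rank), device_ids=[rank])   # full replica per device
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for inputs, targets in loader:
        inputs, targets = inputs.to(rank), targets.to(rank)
        loss = loss_fn(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()     # DDP all-reduces (averages) gradients across replicas
        optimizer.step()    # every replica applies the identical update
    dist.destroy_process_group()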

[Diff hunk: embedded TikZ source for Figure 8.11; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -2710,10 +2710,10 @@ 

Mechanics

Model parallelism divides neural networks across multiple computing devices, with each device computing a distinct portion of the model’s operations. This division allows training of models whose parameter counts exceed single-device memory capacity. The technique encompasses device coordination, data flow management, and gradient computation across distributed model segments. The mechanics of model parallelism are illustrated in Figure 8.12. These steps are described next:
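A minimal two-device sketch in PyTorch (an assumed framework, with illustrative layer sizes) shows the core idea: each stage of the network lives on its own device, activations cross the device boundary in the forward pass, and autograd routes the corresponding gradients back during the backward pass.

import torch
import torch.nn as nn

class TwoDeviceMLP(nn.Module):
    """Toy model split across two GPUs; layer sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))      # device 0 computes its segment
        return self.stage1(h.to("cuda:1"))   # activations transfer to device 1

model = TwoDeviceMLP()
logits = model(torch.randn(32, 1024))        # forward pass spans both devices
logits.sum().backward()                      # gradients flow back across the boundary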

[Diff hunk: embedded TikZ source for Figure 8.12; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]
@@ -3102,10 +3102,10 @@ 

Figure 8.15 provides a general guideline for selecting parallelism strategies in distributed training systems. While the chart offers a structured decision process based on model size, dataset size, and scaling constraints, it is intentionally simplified. Real-world scenarios often involve additional complexities, such as hardware heterogeneity, communication bandwidth, and workload imbalance, that may influence the choice of parallelism technique. The chart is therefore best viewed as a starting point for understanding the trade-offs and decision points in parallelism strategy selection; practitioners should adapt it to the specific requirements and constraints of their systems to achieve optimal performance.
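To make the decision points concrete, the helper below encodes one simplified version of such a guideline in code; the conditions and strategy names are illustrative assumptions, not a transcription of Figure 8.15.

def choose_parallelism(model_fits_on_one_device: bool,
                       dataset_is_large: bool,
                       can_scale_out: bool) -> str:
    """Illustrative heuristic only; real systems also weigh bandwidth,
    hardware heterogeneity, and workload balance."""
    if model_fits_on_one_device:
        # Replicating the whole model is the simplest option when memory allows.
        if dataset_is_large and can_scale_out:
            return "data parallelism"
        return "single-device training"
    # The model itself must be partitioned across devices.
    if dataset_is_large and can_scale_out:
        return "hybrid parallelism (model/pipeline + data)"
    return "model or pipeline parallelism"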

[Diff hunk: embedded TikZ source for Figure 8.15; the changed markup lines and surrounding TikZ code are not recoverable from this extraction.]