From f797af85c758a1dca64db8f75a32db037be33097 Mon Sep 17 00:00:00 2001
From: Edoardo Vacchi
Date: Mon, 12 Feb 2024 22:41:51 +0100
Subject: [PATCH 01/24] wip

Signed-off-by: Edoardo Vacchi
---
 .../docs/how_the_optimizing_compiler_works.md | 217 ++++++++++++++++++
 1 file changed, 217 insertions(+)
 create mode 100644 site/content/docs/how_the_optimizing_compiler_works.md

diff --git a/site/content/docs/how_the_optimizing_compiler_works.md b/site/content/docs/how_the_optimizing_compiler_works.md
new file mode 100644
index 0000000000..f7ad669a87
--- /dev/null
+++ b/site/content/docs/how_the_optimizing_compiler_works.md
@@ -0,0 +1,217 @@
What is a JIT compiler?
=======================

In general, when we talk about a Just-In-Time (JIT) compiler, we mean a compilation technique that spares cycles at build-time, trading them for run-time. In other words, when a language is JIT-compiled, we usually mean that compilation will happen during run-time. Furthermore, when we use the term JIT-compilation, we also often mean that, because compilation happens _during run-time_, we can use information that we have collected during execution to direct the compilation process: these types of JIT-compilers are often referred to as **tracing-JITs**.

Thus, if we wanted to be pedantic, **wazero** provides an **ahead-of-time**, **load-time** compiler. That is, a compiler that, indeed, performs compilation at run-time, but only when a WebAssembly module is loaded; it currently does not collect or leverage any information during the execution of the Wasm binary itself.

It is important to make such a distinction, because a Just-In-Time compiler may not be an optimizing compiler, and an optimizing compiler may not be a tracing JIT. In fact, the compiler that wazero shipped before the introduction of the new compiler architecture performed code generation at load-time, but did not perform any optimization.

# What is an Optimizing Compiler?
+ +Wazero supports an _optimizing_ compiler in the style of other optimizing compilers out there, such as LLVM's or V8's. Traditionally an optimizing compiler performs compilation in a number of steps. + +Compare this to the **old compiler**, where compilation happens in one step or two, depending on how you count: + + +```goat + Input +---------------+ +---------------+ + Wasm Binary ---->| DecodeModule |---->| CompileModule |----> wazero IR + +---------------+ +---------------+ +``` + +That is, the module is (1) validated then (2) translated to an Intermediate Representation (IR). +The wazero IR can then be executed directly (in the case of the interpreter) or it can be further processed and translated into native code by the compiler. This compiler performs a straightforward translation from the IR to native code, without any further passes. The wazero IR is not intended for further processing beyond immediate execution or straightforward translation. + +```goat + +---- wazero IR ----+ + | | + v v + +--------------+ +--------------+ + | Compiler | | Interpreter |- - - executable + +--------------+ +--------------+ + | + +----------+---------+ + | | + v v + +---------+ +---------+ + | ARM64 | | AMD64 | + | Backend | | Backend | - - - - - - - - - executable + +---------+ +---------+ +``` + + +Validation and translation to an IR in a compiler are usually called the **front-end** part of a compiler, while code-generation occurs in what we call the **back-end** of a compiler. The front-end is the part of a compiler that is closer to the input, and it generally indicates machine-independent processing, such as parsing and static validation. The back-end is the part of a compiler that is closer to the output, and it generally includes machine-specific procedures, such as code-generation. 

In the **optimizing** compiler, we still decode and translate Wasm binaries to an intermediate representation in the front-end, but we use a textbook representation called an **SSA** or "Static Single-Assignment Form", that is intended for further transformation.

The benefit of choosing an IR that is meant for transformation is that a lot of optimization passes can apply directly to the IR, and thus be machine-independent. Then the back-end can be relatively simpler, in that it will only have to deal with machine-specific concerns.

The wazero optimizing compiler implements the following compilation passes:

* Front-End:
  - Translation to SSA
  - Optimization

* Back-End:
  - Instruction Selection
  - Register Allocation
  - Finalization and Encoding

```goat
   Input +-------------------+      +-------------------+
   Wasm Binary --->|   DecodeModule    |----->|   CompileModule   |--+
                   +-------------------+      +-------------------+  |
  +------------------------------------------------------------------+
  |
  |  +---------------+             +---------------+
  +->|   Front-End   |------------>|    Back-End   |
     +---------------+             +---------------+
            |                             |
            v                             v
           SSA                 Instruction Selection
            |                             |
            v                             v
      Optimization              Register Allocation
                                          |
                                          v
                               Finalization/Encoding
```

## Front-End: Translation to SSA

We mentioned earlier that wazero uses an internal representation called an "SSA" form or "Static Single-Assignment" form,
but we never explained what that is.

In short terms, every program, or, in our case, every Wasm function, can be translated into a control-flow graph.
The control-flow graph is a directed graph where each node is a sequence of statements that do not contain a control flow instruction,
called a **basic block**. Instead, control-flow instructions are translated into edges.
+ +For instance, take the following implementation of the `abs` function: + +```wasm +(module + (func (;0;) (param i32) (result i32) + (if (result i32) (i32.lt_s (local.get 0) (i32.const 0)) + (then + (i32.sub (i32.const 0) (local.get 0))) + (else + (local.get 0)) + ) + ) + (export "f" (func 0)) +) +``` + +This is translated to the following block diagram: + +```goat + +---------------------------------------------+ + |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) | + | v3:i32 = Iconst_32 0x0 | + | v4:i32 = Icmp lt_s, v2, v3 | + | Brz v4, blk2 | + | Jump blk1 | + +---------------------------------------------+ + | + | + +---(v4 != 0)---+--(v4 == 0)----+ + | | + v v + +---------------------------+ +---------------------------+ + |blk1: () <-- (blk0) | |blk2: () <-- (blk0) | + | v6:i32 = Iconst_32 0x0 | | Jump blk3, v2 | + | v7:i32 = Isub v6, v2 | | | + | Jump blk3, v7 | | | + +---------------------------+ +---------------------------+ + | | + | | + +-{v5 := v7}----+---{v5 := v2}--+ + | + v + +------------------------------+ + |blk3: (v5:i32) <-- (blk1,blk2)| + | Jump blk_ret, v5 | + +------------------------------+ + | + {return v5} + | + v +``` + +We use the ["block argument" variant of SSA][ssa-blocks], which is also the same representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block takes a list of arguments. Each block ends with a jump instruction with an optional list of arguments; these arguments, are assigned to the target block's arguments like a function. + +Consider the first block `blk0`. + +``` +blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) + v3:i32 = Iconst_32 0x0 + v4:i32 = Icmp lt_s, v2, v3 + Brz v4, blk2 + Jump blk1 +``` + +You will notice that, compared to the original function, it takes two extra parameters (`exec_ctx` and `module_ctx`). It then takes one parameter `v2`, corresponding to the function parameter, and it defines two variables `v3`, `v4`. 
`v3` is the constant 0, `v4` is the result of comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches to `blk2` if `v4` is zero, otherwise it jumps to `blk1`.

You might also have noticed that the instructions do not correspond strictly to the original Wasm opcodes. This is because, similarly to the wazero IR used by the old compiler, this is a custom IR. You will also notice that, _on the right-hand side of the assignments_ of any statement, no name occurs _twice_: this is why this form is called **single-assignment**.

Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.

```
blk1: ()
    ...
    Jump blk3, v7

blk2: ()
    Jump blk3, v2

blk3: (v5:i32)
    ...
```

`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2` jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or `v2`, depending on the originating block. If you are familiar with the traditional representation of an SSA form, you will recognize that the role of block arguments is equivalent to the role of the *Phi (Φ) function*, a special function that returns a different value depending on the incoming edge; e.g., in this case: `v5 := Φ(v7, v2)`.


## Front-End: Optimization

The SSA form makes it easier to perform a number of optimizations. For instance, we can perform constant propagation, dead code elimination, and common subexpression elimination. These optimizations either act upon the instructions within a basic block, or they act upon the control-flow graph as a whole.

At a high level, consider the following basic block, derived from the previous example:

```
blk0: (exec_ctx:i64, module_ctx:i64)
    v2:i32 = Iconst_32 -5
    v3:i32 = Iconst_32 0
    v4:i32 = Icmp lt_s, v2, v3
    Brz v4, blk2
    Jump blk1
```

It is pretty easy to see that the comparison in `v4` can be replaced by a constant `1`, because the comparison is between two constant values (-5, 0). 
Therefore, the block can be rewritten as such:

```
blk0: (exec_ctx:i64, module_ctx:i64)
    v4:i32 = Iconst_32 1
    Brz v4, blk2
    Jump blk1
```

However, we can now also see that the branch to `blk2` is never taken, and that the block `blk2` is never executed, so even the branch instruction and the constant definition `v4` can be removed:

```
blk0: (exec_ctx:i64, module_ctx:i64)
    Jump blk1
```

This is a simple example of constant propagation and dead code elimination occurring within a basic block. However, now `blk2` is unreachable, because there is no other edge in the graph that points to it; thus it can be removed from the control-flow graph. This is an example of dead-code elimination that occurs at the control-flow graph level.

In practice, because WebAssembly is a compilation target, these simple optimizations are often unnecessary. The optimization passes implemented in wazero are also work-in-progress and, at the time of writing, further work is expected to implement more advanced optimizations.



## Back-End

...

[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments
[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes

From eb16e379dc3c0989d29aac971c354ca68b4e1130 Mon Sep 17 00:00:00 2001
From: Edoardo Vacchi
Date: Tue, 13 Feb 2024 09:43:42 +0100
Subject: [PATCH 02/24] reformat

Signed-off-by: Edoardo Vacchi
---
 .../docs/how_the_optimizing_compiler_works.md | 128 ++++++++++++++----
 1 file changed, 100 insertions(+), 28 deletions(-)

diff --git a/site/content/docs/how_the_optimizing_compiler_works.md b/site/content/docs/how_the_optimizing_compiler_works.md
index f7ad669a87..95a4e48afd 100644
--- a/site/content/docs/how_the_optimizing_compiler_works.md
+++ b/site/content/docs/how_the_optimizing_compiler_works.md
@@ -1,17 +1,35 @@
 What is a JIT compiler?
=======================

-In general, when we talk about a Just-In-Time (JIT) compiler, we mean a compilation technique that spares cycles at build-time, trading them for run-time. In other words, when a language is JIT-compiled, we usually mean that compilation will happen during run-time. Furthermore, when we use the term JIT-compilation, we also often mean that, because compilation happens _during run-time_, we can use information that we have collected during execution to direct the compilation process: these types of JIT-compilers are often referred to as **tracing-JITs**.
-
-Thus, if we wanted to be pedantic, **wazero** provides an **ahead-of-time**, **load-time** compiler. That is, a compiler that, indeed, performs compilation at run-time, but only when a WebAssembly module is loaded; it currently does not collect or leverage any information during the execution of the Wasm binary itself.
-
-It is important to make such a distinction, because a Just-In-Time compiler may not be an optimizing compiler, and an optimizing compiler may not be a tracing JIT. In fact, the compiler that wazero shipped before the introduction of the new compiler architecture performed code generation at load-time, but did not perform any optimization.
+In general, when we talk about a Just-In-Time (JIT) compiler, we mean a
+compilation technique that spares cycles at build-time, trading them for run-time.
+In other words, when a language is JIT-compiled, we usually mean that
+compilation will happen during run-time. Furthermore, when we use the term
+JIT-compilation, we also often mean that, because compilation happens _during
+run-time_, we can use information that we have collected during execution to
+direct the compilation process: these types of JIT-compilers are often referred
+to as **tracing-JITs**.
+
+Thus, if we wanted to be pedantic, **wazero** provides an **ahead-of-time**,
+**load-time** compiler. 
That is, a compiler that, indeed, performs compilation +at run-time, but only when a WebAssembly module is loaded; it currently does not +collect or leverage any information during the execution of the Wasm binary +itself. + +It is important to make such a distinction, because a Just-In-Time compiler may +not be an optimizing compiler, and an optimizing compiler may not be a tracing +JIT. In fact, the compiler that wazero shipped before the introduction of the +new compiler architecture performed code generation at load-time, but did not +perform any optimization. # What is an Optimizing Compiler? -Wazero supports an _optimizing_ compiler in the style of other optimizing compilers out there, such as LLVM's or V8's. Traditionally an optimizing compiler performs compilation in a number of steps. +Wazero supports an _optimizing_ compiler in the style of other optimizing +compilers out there, such as LLVM's or V8's. Traditionally an optimizing +compiler performs compilation in a number of steps. -Compare this to the **old compiler**, where compilation happens in one step or two, depending on how you count: +Compare this to the **old compiler**, where compilation happens in one step or +two, depending on how you count: ```goat @@ -20,8 +38,13 @@ Compare this to the **old compiler**, where compilation happens in one step or t +---------------+ +---------------+ ``` -That is, the module is (1) validated then (2) translated to an Intermediate Representation (IR). -The wazero IR can then be executed directly (in the case of the interpreter) or it can be further processed and translated into native code by the compiler. This compiler performs a straightforward translation from the IR to native code, without any further passes. The wazero IR is not intended for further processing beyond immediate execution or straightforward translation. +That is, the module is (1) validated then (2) translated to an Intermediate +Representation (IR). 
The wazero IR can then be executed directly (in the case +of the interpreter) or it can be further processed and translated into native +code by the compiler. This compiler performs a straightforward translation from +the IR to native code, without any further passes. The wazero IR is not intended +for further processing beyond immediate execution or straightforward +translation. ```goat +---- wazero IR ----+ @@ -29,7 +52,7 @@ The wazero IR can then be executed directly (in the case of the interpreter) or v v +--------------+ +--------------+ | Compiler | | Interpreter |- - - executable - +--------------+ +--------------+ + +--------------+ +--------------+ | +----------+---------+ | | @@ -41,11 +64,23 @@ The wazero IR can then be executed directly (in the case of the interpreter) or ``` -Validation and translation to an IR in a compiler are usually called the **front-end** part of a compiler, while code-generation occurs in what we call the **back-end** of a compiler. The front-end is the part of a compiler that is closer to the input, and it generally indicates machine-independent processing, such as parsing and static validation. The back-end is the part of a compiler that is closer to the output, and it generally includes machine-specific procedures, such as code-generation. +Validation and translation to an IR in a compiler are usually called the +**front-end** part of a compiler, while code-generation occurs in what we call +the **back-end** of a compiler. The front-end is the part of a compiler that is +closer to the input, and it generally indicates machine-independent processing, +such as parsing and static validation. The back-end is the part of a compiler +that is closer to the output, and it generally includes machine-specific +procedures, such as code-generation. 
-In the **optimizing** compiler, we still decode and translate Wasm binaries to an intermediate representation in the front-end, but we use a textbook representation called an **SSA** or "Static Single-Assignment Form", that is intended for further transformation. +In the **optimizing** compiler, we still decode and translate Wasm binaries to +an intermediate representation in the front-end, but we use a textbook +representation called an **SSA** or "Static Single-Assignment Form", that is +intended for further transformation. -The benefit of choosing an IR that is meant for transformation is that a lot of optimization passes can apply directly to the IR, and thus be machine-independent. Then the back-end can be relatively simpler, in that it will only have to deal with machine-specific concerns. +The benefit of choosing an IR that is meant for transformation is that a lot of +optimization passes can apply directly to the IR, and thus be +machine-independent. Then the back-end can be relatively simpler, in that it +will only have to deal with machine-specific concerns. The wazero optimizing compiler implements the following compilation passes: @@ -80,12 +115,15 @@ The wazero optimizing compiler implements the following compilation passes: ## Front-End: Translation to SSA -We mentioned earlier that wazero uses an internal representation called an "SSA" form or "Static Single-Assignment" form, -but we never explained what that is. +We mentioned earlier that wazero uses an internal representation called an "SSA" +form or "Static Single-Assignment" form, but we never explained what that is. -In short terms, every program, or, in our case, every Wasm function, can be translated in a control-flow graph. -The control-flow graph is a directed graph where each node is a sequence of statements that do not contain a control flow instruction, -called a **basic block**. Instead, control-flow instructions are translated into edges. 
+In short terms, every program, or, in our case, every Wasm function, can be +translated in a control-flow graph. +The control-flow graph is a directed graph where each node is a sequence of +statements that do not contain a control flow instruction, +called a **basic block**. Instead, control-flow instructions are translated into +edges. For instance, take the following implementation of the `abs` function: @@ -139,7 +177,11 @@ This is translated to the following block diagram: v ``` -We use the ["block argument" variant of SSA][ssa-blocks], which is also the same representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block takes a list of arguments. Each block ends with a jump instruction with an optional list of arguments; these arguments, are assigned to the target block's arguments like a function. +We use the ["block argument" variant of SSA][ssa-blocks], which is also the same +representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block +takes a list of arguments. Each block ends with a jump instruction with an +optional list of arguments; these arguments, are assigned to the target block's +arguments like a function. Consider the first block `blk0`. @@ -151,9 +193,18 @@ blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) Jump blk1 ``` -You will notice that, compared to the original function, it takes two extra parameters (`exec_ctx` and `module_ctx`). It then takes one parameter `v2`, corresponding to the function parameter, and it defines two variables `v3`, `v4`. `v3` is the constant 0, `v4` is the result of comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches to `blk2` if `v4` is zero, otherwise it jumps to `blk1`. +You will notice that, compared to the original function, it takes two extra +parameters (`exec_ctx` and `module_ctx`). It then takes one parameter `v2`, +corresponding to the function parameter, and it defines two variables `v3`, +`v4`. 
`v3` is the constant 0, `v4` is the result of comparing `v2` to `v3` using
+the `i32.lt_s` instruction. Then, it branches to `blk2` if `v4` is zero,
+otherwise it jumps to `blk1`.

-You might also have noticed that the instructions do not correspond strictly to the original Wasm opcodes. This is because, similarly to the wazero IR used by the old compiler, this is a custom IR. You will also notice that, _on the right-hand side of the assignments_ of any statement, no name occurs _twice_: this is why this form is called **single-assignment**.
+You might also have noticed that the instructions do not correspond strictly to
+the original Wasm opcodes. This is because, similarly to the wazero IR used by
+the old compiler, this is a custom IR. You will also notice that, _on the
+right-hand side of the assignments_ of any statement, no name occurs _twice_:
+this is why this form is called **single-assignment**.

Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.

@@ -169,14 +220,24 @@ blk3: (v5:i32)
...
```

-`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2` jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or `v2`, depending on the originating block. If you are familiar with the traditional representation of an SSA form, you will recognize that the role of block arguments is equivalent to the role of the *Phi (Φ) function*, a special function that returns a different value depending on the incoming edge; e.g., in this case: `v5 := Φ(v7, v2)`.
+`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2` jumps
+to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or `v2`,
+depending on the originating block. 
If you are familiar with the traditional
+representation of an SSA form, you will recognize that the role of block
+arguments is equivalent to the role of the *Phi (Φ) function*, a special
+function that returns a different value depending on the incoming edge; e.g., in
+this case: `v5 := Φ(v7, v2)`.


## Front-End: Optimization

-The SSA form makes it easier to perform a number of optimizations. For instance, we can perform constant propagation, dead code elimination, and common subexpression elimination. These optimizations either act upon the instructions within a basic block, or they act upon the control-flow graph as a whole.
+The SSA form makes it easier to perform a number of optimizations. For instance,
+we can perform constant propagation, dead code elimination, and common
+subexpression elimination. These optimizations either act upon the instructions
+within a basic block, or they act upon the control-flow graph as a whole.

-At a high level, consider the following basic block, derived from the previous example:
+At a high level, consider the following basic block, derived from the previous
+example:

```
blk0: (exec_ctx:i64, module_ctx:i64)
@@ -187,7 +248,9 @@ blk0: (exec_ctx:i64, module_ctx:i64)
Jump blk1
```

-It is pretty easy to see that the comparison in `v4` can be replaced by a constant `1`, because the comparison is between two constant values (-5, 0). Therefore, the block can be rewritten as such:
+It is pretty easy to see that the comparison in `v4` can be replaced by a
+constant `1`, because the comparison is between two constant values (-5, 0).
+Therefore, the block can be rewritten as such:

```
blk0: (exec_ctx:i64, module_ctx:i64)
@@ -196,16 +259,25 @@ blk0: (exec_ctx:i64, module_ctx:i64)
Jump blk1
```

-However, we can now also see that the branch to `blk2` is never taken, and that the block `blk2` is never executed, so even the branch instruction and the constant definition `v4` can be removed:
+However, we can now also see that the branch to `blk2` is never taken, and that
+the block `blk2` is never executed, so even the branch instruction and the
+constant definition `v4` can be removed:

```
blk0: (exec_ctx:i64, module_ctx:i64)
Jump blk1
```

-This is a simple example of constant propagation and dead code elimination occurring within a basic block. However, now `blk2` is unreachable, because there is no other edge in the graph that points to it; thus it can be removed from the control-flow graph. This is an example of dead-code elimination that occurs at the control-flow graph level.
+This is a simple example of constant propagation and dead code elimination
+occurring within a basic block. However, now `blk2` is unreachable, because
+there is no other edge in the graph that points to it; thus it can be removed
+from the control-flow graph. This is an example of dead-code elimination that
+occurs at the control-flow graph level.

-In practice, because WebAssembly is a compilation target, these simple optimizations are often unnecessary. The optimization passes implemented in wazero are also work-in-progress and, at the time of writing, further work is expected to implement more advanced optimizations.
+In practice, because WebAssembly is a compilation target, these simple
+optimizations are often unnecessary. The optimization passes implemented in
+wazero are also work-in-progress and, at the time of writing, further work is
+expected to implement more advanced optimizations.

From 2dd2b3be49bf105f4fa9866c9b513cb537c7fc0a Mon Sep 17 00:00:00 2001
From: Edoardo Vacchi
Date: Tue, 13 Feb 2024 10:23:18 +0100
Subject: [PATCH 03/24] wip

Signed-off-by: Edoardo Vacchi
---
 .../docs/how_the_optimizing_compiler_works.md | 73 +++++++++++++++----
 1 file changed, 57 insertions(+), 16 deletions(-)

diff --git a/site/content/docs/how_the_optimizing_compiler_works.md b/site/content/docs/how_the_optimizing_compiler_works.md
index 95a4e48afd..36ede72f36 100644
--- a/site/content/docs/how_the_optimizing_compiler_works.md
+++ b/site/content/docs/how_the_optimizing_compiler_works.md
@@ -113,17 +113,26 @@ The wazero optimizing compiler implements the following compilation passes:
Finalization/Encoding
```

+Like the other engines, the implementation can be found under `engine`, specifically
+in the `wazevo` sub-package. The entry-point is `internal/engine/wazevo/engine.go`,
+which contains the implementation of the interface `wasm.Engine`.
+
+All the passes can be dumped to the console for debugging, by enabling the build-time
+flags under `internal/engine/wazevo/wazevoapi/debug_options.go`. The flags are disabled
+by default and should only be enabled during debugging. These may also change in the future.
+
+In the following, we will assume all paths to be relative to `internal/engine/wazevo`,
+so we will omit the prefix.
+
## Front-End: Translation to SSA

We mentioned earlier that wazero uses an internal representation called an "SSA"
form or "Static Single-Assignment" form, but we never explained what that is.

In short terms, every program, or, in our case, every Wasm function, can be
translated into a control-flow graph. 
The control-flow graph is a directed graph where +each node is a sequence of statements that do not contain a control flow instruction, +called a **basic block**. Instead, control-flow instructions are translated into edges. For instance, take the following implementation of the `abs` function: @@ -179,9 +188,9 @@ This is translated to the following block diagram: We use the ["block argument" variant of SSA][ssa-blocks], which is also the same representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block -takes a list of arguments. Each block ends with a jump instruction with an -optional list of arguments; these arguments, are assigned to the target block's -arguments like a function. +takes a list of arguments. Each block ends with a branching instruction (Branch, Return, +Jump, etc...) with an optional list of arguments; these arguments are assigned +to the target block's arguments like a function. Consider the first block `blk0`. @@ -194,17 +203,24 @@ blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) ``` You will notice that, compared to the original function, it takes two extra -parameters (`exec_ctx` and `module_ctx`). It then takes one parameter `v2`, -corresponding to the function parameter, and it defines two variables `v3`, -`v4`. `v3` is the constant 0, `v4` is the result of comparing `v2` to `v3` using -the `i32.lt_s` instruction. Then, it branches to `blk2` if `v4` is zero, -otherwise it jumps to `blk1`. +parameters (`exec_ctx` and `module_ctx`): + +1. `exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit the execution + in the face of traps or for host function calls. +2. `module_ctx`: pointer to `wazevo.moduleContextOpaque`. This is used, among other things, + to access memory. It is also used during host function calls. + +It then takes one parameter `v2`, corresponding to the function parameter, and +it defines two variables `v3`, `v4`. 
`v3` is the constant 0, `v4` is the result of +comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches to +`blk2` if `v4` is zero, otherwise it jumps to `blk1`. You might also have noticed that the instructions do not correspond strictly to the original Wasm opcodes. This is because, similarly to the wazero IR used by -the old compiler, this is a custom IR. You will also notice that, _on the -right-hand side of the assignments_ of any statement, no name occurs _twice_: -this is why this form is called **single-assignment**. +the old compiler, this is a custom IR. + +You will also notice that, _on the right-hand side of the assignments_ of any statement, +no name occurs _twice_: this is why this form is called **single-assignment**. Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`. @@ -228,6 +244,20 @@ arguments is equivalent to the role of the *Phi (Φ) function*, a special function that returns a different value depending on the incoming edge; e.g., in this case: `v5 := Φ(v7, v2)`. +### Code + +The relevant APIs can be found under sub-package `ssa` and `frontend`. + +- Basic Blocks are represented by the type `ssa.Block` (`ssa/basic_block.go`). +- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is instantiated + in the context of `wasm.Engine.CompileModule()`, more specifically in the method + `frontend.Compiler.LowerToSSA()`. +- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`, + more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`. + +### Debug Flags + +To dump the SSA form to the console, you can set the flag `wazevoapi.PrintSSA`. ## Front-End: Optimization @@ -281,6 +311,17 @@ expected to implement more advanced optimizations. + +### Code + + ssaBuilder.RunPasses() + +### Debug Flags + + wazevoapi.PrintOptimizedSSA + + + ## Back-End ... 
From 7abb04367a76ea2ca71db974cd3bace54565675d Mon Sep 17 00:00:00 2001
From: Edoardo Vacchi
Date: Tue, 13 Feb 2024 10:33:06 +0100
Subject: [PATCH 04/24] wip

Signed-off-by: Edoardo Vacchi
---
 .../docs/how_the_optimizing_compiler_works.md | 20 ++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/site/content/docs/how_the_optimizing_compiler_works.md b/site/content/docs/how_the_optimizing_compiler_works.md
index 36ede72f36..e2a02707ee 100644
--- a/site/content/docs/how_the_optimizing_compiler_works.md
+++ b/site/content/docs/how_the_optimizing_compiler_works.md
@@ -257,7 +257,7 @@ The relevant APIs can be found under sub-package `ssa` and `frontend`.

### Debug Flags

-To dump the SSA form to the console, you can set the flag `wazevoapi.PrintSSA`.
+`wazevoapi.PrintSSA` dumps the SSA form to the console.

## Front-End: Optimization

@@ -314,11 +314,25 @@ expected to implement more advanced optimizations.

### Code

-    ssaBuilder.RunPasses()
+Optimization passes are implemented by `ssa.Builder.RunPasses()`. An optimization
+pass is just a function that takes an SSA builder as a parameter.
+
+Passes iterate over the basic blocks, and, for each basic block, they iterate
+over the instructions. Each pass may mutate the basic block by modifying the instructions
+it contains, or it might change the entire shape of the control-flow graph (e.g. by removing
+blocks).
+
+Currently, there are dead-code elimination passes:
+
+- `passDeadCodeEliminationOpt`, acting at the instruction level.
+- `passDeadBlockEliminationOpt`, acting at the block level.
+
+There are also simple constant folding passes such as `passNopInstElimination`, which
+folds and deletes instructions that are essentially no-ops (e.g. shifting by a 0 amount).

### Debug Flags

-    wazevoapi.PrintOptimizedSSA
+`wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after optimization.

From e0711af4dc6276e9e15d749e1a06225aaeb16591 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Tue, 13 Feb 2024 11:30:16 +0100 Subject: [PATCH 05/24] wip Signed-off-by: Edoardo Vacchi --- .../_index.md | 135 ++++++++++++ .../backend.md | 9 + .../frontend.md} | 207 +++++++----------- 3 files changed, 224 insertions(+), 127 deletions(-) create mode 100644 site/content/docs/how_the_optimizing_compiler_works/_index.md create mode 100644 site/content/docs/how_the_optimizing_compiler_works/backend.md rename site/content/docs/{how_the_optimizing_compiler_works.md => how_the_optimizing_compiler_works/frontend.md} (58%) diff --git a/site/content/docs/how_the_optimizing_compiler_works/_index.md b/site/content/docs/how_the_optimizing_compiler_works/_index.md new file mode 100644 index 0000000000..fdacf8150c --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/_index.md @@ -0,0 +1,135 @@ ++++ +title = "How the Optimizing Compiler Works" +layout = "single" ++++ + +What is a JIT compiler? +----------------------- + +In general, when we talk about a Just-In-Time (JIT) compiler, we mean a +compilation technique that spares cycles at build-time, trading them for run-time. +In other words, when a language is JIT-compiled, we usually mean that +compilation will happen during run-time. Furthermore, when we use the term +JIT-compilation, we also often mean that, because compilation happens _during +run-time_, we can use information that we have collected during execution to +direct the compilation process: these types of JIT-compilers are often referred +to as **tracing-JITs**. + +Thus, if we wanted to be pedantic, **wazero** provides an **ahead-of-time**, +**load-time** compiler. That is, a compiler that, indeed, performs compilation +at run-time, but only when a WebAssembly module is loaded; it currently does not +collect or leverage any information during the execution of the Wasm binary +itself.
+ +It is important to make such a distinction, because a Just-In-Time compiler may +not be an optimizing compiler, and an optimizing compiler may not be a tracing +JIT. In fact, the compiler that wazero shipped before the introduction of the +new compiler architecture performed code generation at load-time, but did not +perform any optimization. + +What is an Optimizing Compiler? +------------------------------- + +Wazero supports an _optimizing_ compiler in the style of other optimizing +compilers out there, such as LLVM's or V8's. Traditionally an optimizing +compiler performs compilation in a number of steps. + +Compare this to the **old compiler**, where compilation happens in one step or +two, depending on how you count: + + +```goat + Input +---------------+ +---------------+ + Wasm Binary ---->| DecodeModule |---->| CompileModule |----> wazero IR + +---------------+ +---------------+ +``` + +That is, the module is (1) validated then (2) translated to an Intermediate +Representation (IR). The wazero IR can then be executed directly (in the case +of the interpreter) or it can be further processed and translated into native +code by the compiler. This compiler performs a straightforward translation from +the IR to native code, without any further passes. The wazero IR is not intended +for further processing beyond immediate execution or straightforward +translation. + +```goat + +---- wazero IR ----+ + | | + v v + +--------------+ +--------------+ + | Compiler | | Interpreter |- - - executable + +--------------+ +--------------+ + | + +----------+---------+ + | | + v v ++---------+ +---------+ +| ARM64 | | AMD64 | +| Backend | | Backend | - - - - - - - - - executable ++---------+ +---------+ +``` + + +Validation and translation to an IR in a compiler are usually called the +**front-end** part of a compiler, while code-generation occurs in what we call +the **back-end** of a compiler. 
The front-end is the part of a compiler that is +closer to the input, and it generally indicates machine-independent processing, +such as parsing and static validation. The back-end is the part of a compiler +that is closer to the output, and it generally includes machine-specific +procedures, such as code-generation. + +In the **optimizing** compiler, we still decode and translate Wasm binaries to +an intermediate representation in the front-end, but we use a textbook +representation called an **SSA** or "Static Single-Assignment Form", which is +intended for further transformation. + +The benefit of choosing an IR that is meant for transformation is that a lot of +optimization passes can apply directly to the IR, and thus be +machine-independent. Then the back-end can be relatively simpler, in that it +will only have to deal with machine-specific concerns. + +The wazero optimizing compiler implements the following compilation passes: + +* Front-End: + - Translation to SSA + - Optimization + +* Back-End: + - Instruction Selection + - Register Allocation + - Finalization and Encoding + +```goat + Input +-------------------+ +-------------------+ + Wasm Binary --->| DecodeModule |----->| CompileModule |--+ + +-------------------+ +-------------------+ | + +----------------------------------------------------------+ + | + | +---------------+ +---------------+ + +->| Front-End |----------->| Back-End | + +---------------+ +---------------+ + | | + v v + SSA Instruction Selection + | | + v v + Optimization Register Allocation + | | + v v + Block Layout Finalization/Encoding +``` + +Like the other engines, the implementation can be found under `engine`, specifically +in the `wazevo` sub-package. The entry-point is found under `internal/engine/wazevo/engine.go`, +which contains the implementation of the interface `wasm.Engine`.
+ +All the passes can be dumped to the console for debugging, by enabling the build-time +flags under `internal/engine/wazevo/wazevoapi/debug_options.go`. The flags are disabled +by default and should only be enabled during debugging. These may also change in the future. + +In the following, we will assume all paths to be relative to `internal/engine/wazevo`, +so we will omit the prefix.
+ +* Next Section: [Front-End](frontend/) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md new file mode 100644 index 0000000000..e5213dafb6 --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -0,0 +1,9 @@ ++++ +title = "How the Optimizing Compiler Works: Front-End" +layout = "single" ++++ + +## Back-End + +... + diff --git a/site/content/docs/how_the_optimizing_compiler_works.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md similarity index 58% rename from site/content/docs/how_the_optimizing_compiler_works.md rename to site/content/docs/how_the_optimizing_compiler_works/frontend.md index e2a02707ee..0bc96364a1 100644 --- a/site/content/docs/how_the_optimizing_compiler_works.md +++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md @@ -1,128 +1,7 @@ -What is a JIT compiler? -======================= - -In general, when we talk about a Just-In-Time (JIT) compiler, we mean a -compilation technique that spares cycles at build-time, trading it for run-time. -In other words, when a language is JIT-compiled, we usually mean that -compilation will happen during run-time. Furthermore, when we use the term -JIT-compilation, we also often mean is that, because compilation happens _during -run-time_, we can use information that we have collected during execution to -direct the compilation process: these types of JIT-compilers are often referred -to as **tracing-JITs**. - -Thus, if we wanted to be pedantic, **wazero** provides an **ahead-of-time**, -**load-time** compiler. That is, a compiler that, indeed, performs compilation -at run-time, but only when a WebAssembly module is loaded; it currently does not -collect or leverage any information during the execution of the Wasm binary -itself. 
- -It is important to make such a distinction, because a Just-In-Time compiler may -not be an optimizing compiler, and an optimizing compiler may not be a tracing -JIT. In fact, the compiler that wazero shipped before the introduction of the -new compiler architecture performed code generation at load-time, but did not -perform any optimization. - -# What is an Optimizing Compiler? - -Wazero supports an _optimizing_ compiler in the style of other optimizing -compilers out there, such as LLVM's or V8's. Traditionally an optimizing -compiler performs compilation in a number of steps. - -Compare this to the **old compiler**, where compilation happens in one step or -two, depending on how you count: - - -```goat - Input +---------------+ +---------------+ - Wasm Binary ---->| DecodeModule |---->| CompileModule |----> wazero IR - +---------------+ +---------------+ -``` - -That is, the module is (1) validated then (2) translated to an Intermediate -Representation (IR). The wazero IR can then be executed directly (in the case -of the interpreter) or it can be further processed and translated into native -code by the compiler. This compiler performs a straightforward translation from -the IR to native code, without any further passes. The wazero IR is not intended -for further processing beyond immediate execution or straightforward -translation. - -```goat - +---- wazero IR ----+ - | | - v v - +--------------+ +--------------+ - | Compiler | | Interpreter |- - - executable - +--------------+ +--------------+ - | - +----------+---------+ - | | - v v - +---------+ +---------+ - | ARM64 | | AMD64 | - | Backend | | Backend | - - - - - - - - - executable - +---------+ +---------+ -``` - - -Validation and translation to an IR in a compiler are usually called the -**front-end** part of a compiler, while code-generation occurs in what we call -the **back-end** of a compiler. 
The front-end is the part of a compiler that is -closer to the input, and it generally indicates machine-independent processing, -such as parsing and static validation. The back-end is the part of a compiler -that is closer to the output, and it generally includes machine-specific -procedures, such as code-generation. - -In the **optimizing** compiler, we still decode and translate Wasm binaries to -an intermediate representation in the front-end, but we use a textbook -representation called an **SSA** or "Static Single-Assignment Form", that is -intended for further transformation. - -The benefit of choosing an IR that is meant for transformation is that a lot of -optimization passes can apply directly to the IR, and thus be -machine-independent. Then the back-end can be relatively simpler, in that it -will only have to deal with machine-specific concerns. - -The wazero optimizing compiler implements the following compilation passes: - -* Front-End: - - Translation to SSA - - Optimization - -* Back-End: - - Instruction Selection - - Registry Allocation - - Finalization and Encoding - -```goat - Input +-------------------+ +-------------------+ - Wasm Binary --->| DecodeModule |----->| CompileModule |--+ - +-------------------+ +-------------------+ | - +----------------------------------------------------------+ - | - | +---------------+ +---------------+ - +->| Front-End |----------->| Back-End | - +---------------+ +---------------+ - | | - v v - SSA Instruction Selection - | | - v v - Optimization Registry Allocation - | - v - Finalization/Encoding -``` - -Like the other engines, the implementation can be found under `engine`, specifically -in the `wazevo` sub-package. The entry-point is found under `internal/engine/wazevo/engine.go`, -where the implementation of the interface `wasm.Engine` is found. - -All the passes can be dumped to the console for debugging, by enabling, the build-time -flags under `internal/engine/wazevo/wazevoapi/debug_options.go`. 
The flags are disabled -by default and should only be enabled during debugging. These may also change in the future. - -In the following we will assume all paths to be relative to the `internal/engine/wazevo`, -so we will omit the prefix. ++++ +title = "How the Optimizing Compiler Works: Front-End" +layout = "single" ++++ ## Front-End: Translation to SSA @@ -335,10 +214,84 @@ folds and delete instructions that are essentially no-ops (e.g. shifting by a 0 `wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after optimization. +## Front-End: Block Layout + +As we have seen earlier, the SSA form instructions are contained within basic +blocks, and the basic blocks are connected by edges of the control-flow graph. +However, machine code is not laid out in a graph, but it is just a linear +sequence of instructions. + +Thus, the last step of the front-end is to lay out the basic blocks in a linear +sequence. Because each basic block, by design, ends with a control-flow +instruction, one of the goals of the block layout phase is to maximize the number of +**fall-through opportunities**. A fall-through opportunity occurs when a block ends +with a jump instruction whose target is exactly the next block in the +sequence. In order to maximize the number of fall-through opportunities, the +block layout phase might reorder the basic blocks in the control-flow graph, +and transform the control-flow instructions. For instance, it might _invert_ +some branching conditions. + +The end goal is to effectively minimize the number of jumps and branches in +the machine code that will be generated later. + + +### Critical Edges + +Special attention must be paid when a basic block has multiple predecessors, +i.e., when it has multiple incoming edges. In particular, an edge between two +basic blocks is called a **critical edge** when, at the same time: +- the predecessor has multiple successors **and** +- the successor has multiple predecessors.
+ +For instance, in the example below the edge between `BB0` and `BB3` +is a critical edge. + +```goat +┌───────┐ ┌───────┐ +│ BB0 │━┓ │ BB1 │ +└───────┘ ┃ └───────┘ + │ ┃ │ + ▼ ┃ ▼ +┌───────┐ ┃ ┌───────┐ +│ BB2 │ ┗━▶│ BB3 │ +└───────┘ └───────┘ +``` + +In these cases, the critical edge is split by introducing a new basic block, +called a **trampoline**, where the critical edge was. + +```goat +┌───────┐ ┌───────┐ +│ BB0 │──────┐ │ BB1 │ +└───────┘ ▼ └───────┘ + │ ┌──────────┐ │ + │ │trampoline│ │ + ▼ └──────────┘ ▼ +┌───────┐ │ ┌───────┐ +│ BB2 │ └────▶│ BB3 │ +└───────┘ └───────┘ +``` + +For more details on critical edges, see: + +- https://en.wikipedia.org/wiki/Control-flow_graph +- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/ + + +### Code + +`ssa.Builder.LayoutBlocks()` implements the block layout phase. + +### Debug Flags + +- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after block layout. +- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied during this phase, + such as inverting branching conditions or splitting critical edges.
-... +* Next Section: [Back-End](../backend/) +* Previous Section: [How the Optimizing Compiler Works](../) [ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments [llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes From 9b8d529ab79f3d3315e9d050b3badd9a693a7393 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Tue, 13 Feb 2024 14:11:23 +0100 Subject: [PATCH 06/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 47 ++++++++++++++++++- .../frontend.md | 4 +- 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index e5213dafb6..4d298e3735 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -1,9 +1,52 @@ +++ -title = "How the Optimizing Compiler Works: Front-End" +title = "How the Optimizing Compiler Works: Back-End" layout = "single" +++ -## Back-End +## Back-End: Instruction Selection + +- Mapping between higher-level SSA IR and machine instructions. +- Mention virtual registers. +- Note about "peephole" optimizations happening e.g. when mapping a Value +to a register (e.g. use address modes instead of loading a pointed value +into a register). + +### Code + +... + +### Debug Flags + +- `wazevoapi.PrintSSAToBackendIRLowering` + +## Back-End: Register Allocation + +Partially architecture independent. Explain the algorithm etc. + +### Code + +... + +### Debug Flags + +- `wazevoapi.RegAllocLoggingEnabled` +- `wazevoapi.PrintRegisterAllocated` + +## Back-End: Finalization and Encoding + +... + +### Code ... +### Debug Flags + +- `wazevoapi.PrintFinalizedMachineCode` +- `wazevoapi.PrintMachineCodeHexPerFunction` +- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` +- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` + +
+ +* Previous Section: [Front-End](../frontend/) diff --git a/site/content/docs/how_the_optimizing_compiler_works/frontend.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md index 0bc96364a1..94dc48dd7f 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/frontend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md @@ -136,7 +136,9 @@ The relevant APIs can be found under sub-package `ssa` and `frontend`. ### Debug Flags -`wazevoapi.PrintSSA` dumps the SSA form to the console. +- `wazevoapi.PrintSSA` dumps the SSA form to the console. +- `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between Wasm + opcodes and SSA instructions to the console. ## Front-End: Optimization From d219e75dbde5f54bacce721242909ae758f2871a Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Tue, 13 Feb 2024 18:13:06 +0100 Subject: [PATCH 07/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 96 ++++++++++++++++--- .../frontend.md | 59 ++++++++++-- 2 files changed, 136 insertions(+), 19 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 4d298e3735..033fa014e7 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -3,25 +3,99 @@ title = "How the Optimizing Compiler Works: Back-End" layout = "single" +++ -## Back-End: Instruction Selection +In this section we will discuss the phases in the back-end of the optimizing compiler: -- Mapping between higher-level SSA IR and machine instructions. -- Mention virtual registers. -- Note about "peephole" optimizations happening e.g. when mapping a Value -to a register (e.g. use address modes instead of loading a pointed value -into a register). 
+- [Instruction Selection](#instruction-selection) +- [Register Allocation](#register-allocation) +- [Finalization and Encoding](#finalization-and-encoding) + +Each section will include a brief explanation of the phase, references to the code that implements the phase, +and a description of the debug flags that can be used to inspect that phase. Please notice that, +since the implementation of the back-end is architecture-specific, the code might be different for each architecture. ### Code -... +The higher-level entry-point to the back-end is the `backend.Compiler.Compile(context.Context)` method. +This method executes, in turn, the following methods in the same type: + +- `backend.Compiler.Lower()` (instruction selection) +- `backend.Compiler.RegAlloc()` (register allocation) +- `backend.Compiler.Finalize(context.Context)` (finalization and encoding) + +## Instruction Selection + +The instruction selection phase is responsible for mapping the higher-level SSA instructions +to arch-specific instructions. Each SSA instruction is translated to one or more machine instructions. + +Each target architecture comes with a different number of registers, some of them are general purpose, +others might be specific to certain instructions. In general, we can expect to have a set of registers +for integer computations, another set for floating point computations, a set for vector (SIMD) computations, +and some specific special-purpose registers (e.g. stack pointers, program counters, status flags, etc.) + +In addition, some registers might be reserved by the Go runtime or the Operating System for specific purposes, +so they should be handled with special care. + +At this point in the compilation process we do not want to deal with all that. Instead, we assume +that we have a potentially infinite number of *virtual registers* of each type at our disposal. 
+The next phase, the register allocation phase, will map these virtual registers to the actual +registers of the target architecture. + +### Operands and Addressing Modes + +As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and then use that virtual register +as one of the arguments of the machine instruction that we will generate. However, usually instructions +are able to address more than just registers: an *operand* might be able to represent a memory address, +or an immediate value (i.e. a constant value that is encoded as part of the instruction itself). + +For these reasons, instead of mapping each `ssa.Value` to a virtual register (`regalloc.VReg`), +we map each `ssa.Value` to an architecture-specific `operand` type. + +During lowering of an `ssa.Instruction`, for each `ssa.Value` that is used as an argument of the instruction, +in the simplest case, the `operand` might be mapped to a virtual register; in other cases, the +`operand` might be mapped to a memory address, or an immediate value. Sometimes this makes it possible to +replace several SSA instructions with a single machine instruction, by folding the addressing mode into the +instruction itself. + +For instance, consider the following SSA instructions: + +``` + v4:i32 = Const 0x9 + v6:i32 = Load v5, 0x4 + v7:i32 = Iadd v6, v4 +``` + +In the `amd64` architecture, the `add` instruction adds the first operand to the second operand, +and assigns the result to the second operand. So assuming that `v4`, `v5`, `v6`, and `v7` are mapped +respectively to the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the `Iadd` +instruction on `amd64` might look like this: + +```asm + ;; AT&T syntax + add 4(%r5?), %r4? ;; add the value at memory address [`r5?` + 4] to `r4?` + mov %r4?, %r7? ;; move the result from `r4?` to `r7?` +``` + +Notice how the load from memory has been folded into an operand of the `add` instruction.
This transformation +is possible when the value produced by the instruction being folded is not referenced by other instructions +and the instructions belong to the same `InstructionGroupID` (see [Front-End: Optimization](../frontend/#optimization)). + +### Code + +`backend.Machine` is the interface to the backend. It has methods to translate (lower) the IR to machine code. +Again, as seen earlier in the front-end, the term *lowering* is used to indicate translation from a higher-level +representation to a lower-level representation. + +`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an SSA instruction to machine code. +Machine-specific implementations of this method can be found in package `backend/isa/<arch>`, +where `<arch>` is either `amd64` or `arm64`. ### Debug Flags -- `wazevoapi.PrintSSAToBackendIRLowering` +`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the lowered arch-specific instructions. -## Back-End: Register Allocation +## Register Allocation -Partially architecture independent. Explain the algorithm etc. +Partially architecture independent. Explain how it works etc. ### Code @@ -32,7 +106,7 @@ Partially architecture independent. Explain the algorithm etc. - `wazevoapi.RegAllocLoggingEnabled` - `wazevoapi.PrintRegisterAllocated` -## Back-End: Finalization and Encoding +## Finalization and Encoding ...
diff --git a/site/content/docs/how_the_optimizing_compiler_works/frontend.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md index 94dc48dd7f..616f166a23 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/frontend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md @@ -3,7 +3,17 @@ title = "How the Optimizing Compiler Works: Front-End" layout = "single" +++ -## Front-End: Translation to SSA +In this section we will discuss the phases in the front-end of the optimizing compiler: + +- [Translation to SSA](#translation-to-ssa) +- [Optimization](#optimization) +- [Block Layout](#block-layout) + +Every section includes an explanation of the phase; the subsection **Code** +will include high-level pointers to functions and packages; the subsection **Debug Flags** +indicates the flags that can be used to enable advanced logging of the phase. + +## Translation to SSA We mentioned earlier that wazero uses an internal representation called an "SSA" form or "Static Single-Assignment" form, but we never explained what that is. @@ -126,6 +136,8 @@ this case: `v5 := Φ(v7, v2)`. ### Code The relevant APIs can be found under sub-package `ssa` and `frontend`. +In the code, the terms *lower* or *lowering* are often used to indicate a mapping or a translation, +because such transformations usually correspond to targeting a lower abstraction level. - Basic Blocks are represented by the type `ssa.Block` (`ssa/basic_block.go`). - The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is instantiated @@ -134,13 +146,42 @@ The relevant APIs can be found under sub-package `ssa` and `frontend`. - The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`, more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`. +#### Instructions and Values + +An `ssa.Instruction` is a single instruction in the SSA form. 
Each instruction might +consume zero or more `ssa.Value`s, and it usually produces a single `ssa.Value`; some +instructions may not produce any value (for instance, a `Jump` instruction). +An `ssa.Value` is an abstraction that represents a typed name binding, and it is used +to represent the result of an instruction, or the input to an instruction. + +For instance: + +``` +blk1: () <-- (blk0) + v6:i32 = Iconst_32 0x0 + v7:i32 = Isub v6, v2 + Jump blk3, v7 +``` + +`Iconst_32` takes no input value and produces value `v6`; `Isub` takes two input values (`v6`, `v2`) +and produces value `v7`; `Jump` takes one input value (`v7`) and produces no value. All +such values have the `i32` type. The wazero SSA's type system (`ssa.Type`) allows the following types: + +- `i32`: 32-bit integer +- `i64`: 64-bit integer +- `f32`: 32-bit floating point +- `f64`: 64-bit floating point +- `v128`: 128-bit SIMD vector + +Values and instructions are both allocated from pools to minimize memory allocations. + ### Debug Flags - `wazevoapi.PrintSSA` dumps the SSA form to the console. - `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between Wasm opcodes and SSA instructions to the console. -## Front-End: Optimization +## Optimization The SSA form makes it easier to perform a number of optimizations. For instance, we can perform constant propagation, dead code elimination, and common @@ -190,9 +231,6 @@ optimizations are often unnecessary. The optimization passes implemented in wazero are also work-in-progress and, at the time of writing, further work is expected to implement more advanced optimizations. - - - ### Code Optimization passes are implemented by `ssa.Builder.RunPasses()`. An optimization pass is just a function that takes an SSA builder as a parameter. Passes iterate over the basic blocks, and, for each basic block, they iterate over the instructions. Each pass may mutate the basic block by modifying the instructions it contains, or it might change the entire shape of the control-flow graph (e.g. by removing blocks).
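The pass shape described above can be sketched with toy types; the real `ssa.Builder` and instruction types are much richer, and every name below is invented for illustration. The sketch implements a peephole in the spirit of the no-op-shift folding mentioned here: it drops shifts by a constant zero and rewrites later uses of their result.

```go
package main

import "fmt"

// Toy stand-ins for wazero's real ssa types, for illustration only.
type instruction struct {
	op   string // e.g. "Iconst", "Ishl", "Iadd"
	args []int  // IDs of the consumed values
	ret  int    // ID of the produced value
	imm  int    // immediate, used by "Iconst"
}

type builder struct {
	instrs []*instruction // a single basic block, for brevity
}

// A pass is just a function that takes the builder as a parameter.
type pass func(b *builder)

// nopShiftElimination drops shifts whose amount is the constant zero and
// rewrites later uses of the shift's result to use the shifted value directly.
func nopShiftElimination(b *builder) {
	consts := map[int]int{} // value ID -> constant, for "Iconst" results
	alias := map[int]int{}  // dropped value ID -> replacement value ID
	kept := b.instrs[:0]
	for _, ins := range b.instrs {
		// Rewrite arguments through aliases created by earlier removals.
		for i, a := range ins.args {
			if r, ok := alias[a]; ok {
				ins.args[i] = r
			}
		}
		if ins.op == "Iconst" {
			consts[ins.ret] = ins.imm
		}
		if ins.op == "Ishl" {
			if amt, isConst := consts[ins.args[1]]; isConst && amt == 0 {
				alias[ins.ret] = ins.args[0] // the result is just the first argument
				continue                     // drop the no-op shift
			}
		}
		kept = append(kept, ins)
	}
	b.instrs = kept
}

func main() {
	b := &builder{instrs: []*instruction{
		{op: "Iconst", ret: 1, imm: 0},          // v1 = 0
		{op: "Ishl", args: []int{0, 1}, ret: 2}, // v2 = v0 << v1 (a no-op)
		{op: "Iadd", args: []int{2, 0}, ret: 3}, // v3 = v2 + v0
	}}
	var p pass = nopShiftElimination
	p(b)
	for _, ins := range b.instrs {
		fmt.Println(ins.op, ins.args)
	}
	// The shift is gone and the add now reads v0 directly:
	// Iconst []
	// Iadd [0 0]
}
```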
-Currently, there are dead-code elimination passes: +Currently, there are two dead-code elimination passes: -- `passDeadCodeEliminationOpt` acting at instruction-level. - `passDeadBlockEliminationOpt` acting at the block-level. +- `passDeadCodeEliminationOpt` acting at instruction-level. + +Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID` to each +instruction. This is used to determine whether a sequence of instructions can be +replaced by a single machine instruction during the back-end phase. For more details, +see also the relevant documentation in `ssa/instructions.go` There are also simple constant folding passes such as `passNopInstElimination`, which folds and delete instructions that are essentially no-ops (e.g. shifting by a 0 amount). @@ -216,7 +259,7 @@ folds and delete instructions that are essentially no-ops (e.g. shifting by a 0 `wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after optimization. -## Front-End: Block Layout +## Block Layout As we have seen earlier, the SSA form instructions are contained within basic blocks, and the basic blocks are connected by edges of the control-flow graph. From 7bba47e6b06904f26537cc39e502b83c9e0ed230 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Tue, 13 Feb 2024 18:33:38 +0100 Subject: [PATCH 08/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 61 +++++++++++++++++-- 1 file changed, 56 insertions(+), 5 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 033fa014e7..83d20579b8 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -95,20 +95,71 @@ where `` is either `amd64` or `arm64`. ## Register Allocation -Partially architecture independent. Explain how it works etc. 
+**TODO: Not finished.** + +The register allocation phase is responsible for mapping the potentially infinite number of virtual registers +to the actual registers of the target architecture. Because the number of real registers is limited, +the register allocation phase might need to "spill" some of the virtual registers to memory; that is, it might +store their content, and then load them back into a register when they are needed. + +The register allocation procedure is implemented in sub-phases: + +- `livenessAnalysis(f)` collects the liveness information for each virtual register. The algorithm is described + in [Chapter 9.2 of The SSA Book](https://pfalcon.github.io/ssabook/latest/book-full.pdf). + +- `alloc(f)` allocates registers for the given function. The algorithm is derived from + [the Go compiler's allocator](https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go) + + - In short, this is just a linear scan register allocation procedure, where each block inherits the + register allocation state from one of its predecessors, and starts allocation from there. + + - If there's a discrepancy in the end states between predecessors, adjustments are made to ensure consistency after + allocation is done (which we call "fixing merge state"). + + - The spill instructions (store into the dedicated slots) are inserted after all the allocations and fixing + merge states. That is because, at that point, we already know where the reloads happen, and therefore we can + determine the best place to spill the values. More precisely, the spill happens in the block that is + the lowest common ancestor of all the blocks that reload the value. + + All of this logic closely follows the Go compiler's allocator, which includes a detailed description in its source file.
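For intuition only, the two sub-phases can be sketched over a straight-line sequence of instructions, with toy types and a deliberately naive spill choice. This ignores everything that makes the real implementation hard: control flow, block-to-block state inheritance, and fixing merge states.

```go
package main

import (
	"fmt"
	"sort"
)

// interval is the live range of one virtual register.
type interval struct{ vreg, start, end int }

// liveIntervals computes, for each virtual register, the range between its
// first and last appearance. uses[pos] lists the vregs referenced at pos.
func liveIntervals(uses [][]int) []interval {
	first, last := map[int]int{}, map[int]int{}
	for pos, vs := range uses {
		for _, v := range vs {
			if _, ok := first[v]; !ok {
				first[v] = pos
			}
			last[v] = pos
		}
	}
	out := make([]interval, 0, len(first))
	for v := range first {
		out = append(out, interval{v, first[v], last[v]})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].start < out[j].start })
	return out
}

// allocate walks the intervals in start order, freeing registers whose
// interval has ended and spilling when no register is available.
func allocate(intervals []interval, numRegs int) (assigned map[int]int, spilled []int) {
	assigned = map[int]int{}
	var free []int
	for r := 0; r < numRegs; r++ {
		free = append(free, r)
	}
	type active struct {
		iv  interval
		reg int
	}
	var act []active
	for _, iv := range intervals {
		// Expire intervals that ended before this one starts.
		keep := act[:0]
		for _, a := range act {
			if a.iv.end < iv.start {
				free = append(free, a.reg)
			} else {
				keep = append(keep, a)
			}
		}
		act = keep
		if len(free) == 0 {
			spilled = append(spilled, iv.vreg) // naive choice: spill the newcomer
			continue
		}
		r := free[len(free)-1]
		free = free[:len(free)-1]
		assigned[iv.vreg] = r
		act = append(act, active{iv, r})
	}
	return assigned, spilled
}

func main() {
	// Positions 0..3; each entry lists the vregs referenced there.
	uses := [][]int{{0}, {0, 1}, {1, 2}, {0, 2}}
	assigned, spilled := allocate(liveIntervals(uses), 2)
	fmt.Println(len(assigned), spilled) // prints: 2 [2]
}
```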
+ +#### References + +- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf +- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm +- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf +- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9, for liveness analysis. +- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go + ### Code -... +The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the interfaces in `regalloc/api.go`. ### Debug Flags -- `wazevoapi.RegAllocLoggingEnabled` -- `wazevoapi.PrintRegisterAllocated` +- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register allocation procedure. +- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register allocation result. ## Finalization and Encoding -... +**TODO: Not finished.** + +### PostRegAlloc: + +* sets up the prologue of the function +* inserts the epilogue of the function +* machine-specific custom logic (e.g.
post-regalloc lowering) + +### Encoding: + +* encodes the low-level instructions into bytes + +### Other + +- MMap code segment +- resolve relocations ### Code From 04376f819c3cbe1ba1149928e4bbed71570d7575 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Wed, 14 Feb 2024 15:22:00 +0100 Subject: [PATCH 09/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 231 ++++++++++++++---- .../frontend.md | 25 ++ 2 files changed, 206 insertions(+), 50 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 83d20579b8..c6298500ed 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -9,14 +9,17 @@ In this section we will discuss the phases in the back-end of the optimizing com - [Register Allocation](#register-allocation) - [Finalization and Encoding](#finalization-and-encoding) -Each section will include a brief explanation of the phase, references to the code that implements the phase, -and a description of the debug flags that can be used to inspect that phase. Please notice that, -since the implementation of the back-end is architecture-specific, the code might be different for each architecture. +Each section will include a brief explanation of the phase, references to the +code that implements the phase, and a description of the debug flags that can +be used to inspect that phase. Please notice that, since the implementation of +the back-end is architecture-specific, the code might be different for each +architecture. ### Code -The higher-level entry-point to the back-end is the `backend.Compiler.Compile(context.Context)` method. -This method executes, in turn, the following methods in the same type: +The higher-level entry-point to the back-end is the +`backend.Compiler.Compile(context.Context)` method. 
This method executes, in +turn, the following methods in the same type: - `backend.Compiler.Lower()` (instruction selection) - `backend.Compiler.RegAlloc()` (register allocation) @@ -24,37 +27,46 @@ This method executes, in turn, the following methods in the same type: ## Instruction Selection -The instruction selection phase is responsible for mapping the higher-level SSA instructions -to arch-specific instructions. Each SSA instruction is translated to one or more machine instructions. +The instruction selection phase is responsible for mapping the higher-level SSA +instructions to arch-specific instructions. Each SSA instruction is translated +to one or more machine instructions. -Each target architecture comes with a different number of registers, some of them are general purpose, -others might be specific to certain instructions. In general, we can expect to have a set of registers -for integer computations, another set for floating point computations, a set for vector (SIMD) computations, -and some specific special-purpose registers (e.g. stack pointers, program counters, status flags, etc.) +Each target architecture comes with a different number of registers, some of +them are general purpose, others might be specific to certain instructions. In +general, we can expect to have a set of registers for integer computations, +another set for floating point computations, a set for vector (SIMD) +computations, and some specific special-purpose registers (e.g. stack pointers, +program counters, status flags, etc.) -In addition, some registers might be reserved by the Go runtime or the Operating System for specific purposes, -so they should be handled with special care. +In addition, some registers might be reserved by the Go runtime or the +Operating System for specific purposes, so they should be handled with special +care. -At this point in the compilation process we do not want to deal with all that. 
Instead, we assume -that we have a potentially infinite number of *virtual registers* of each type at our disposal. -The next phase, the register allocation phase, will map these virtual registers to the actual -registers of the target architecture. +At this point in the compilation process we do not want to deal with all that. +Instead, we assume that we have a potentially infinite number of *virtual +registers* of each type at our disposal. The next phase, the register +allocation phase, will map these virtual registers to the actual registers of +the target architecture. ### Operands and Addressing Modes -As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and then use that virtual register -as one of the arguments of the machine instruction that we will generate. However, usually instructions -are able to address more than just registers: an *operand* might be able to represent a memory address, -or an immediate value (i.e. a constant value that is encoded as part of the instruction itself). +As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and +then use that virtual register as one of the arguments of the machine +instruction that we will generate. However, usually instructions are able to +address more than just registers: an *operand* might be able to represent a +memory address, or an immediate value (i.e. a constant value that is encoded as +part of the instruction itself). -For these reasons, instead of mapping each `ssa.Value` to a virtual register (`regalloc.VReg`), -we map each `ssa.Value` to an architecture-specific `operand` type. +For these reasons, instead of mapping each `ssa.Value` to a virtual register +(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific +`operand` type. 
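To make the idea concrete, an `operand` can be pictured as a small tagged union over registers, immediates, and memory addresses. This Go sketch is hypothetical — the kind names and fields are invented for illustration, and the real architecture-specific `operand` types in wazero are richer:

```go
package main

import "fmt"

// operandKind is a hypothetical enumeration for illustration only.
type operandKind int

const (
	operandReg operandKind = iota // a (virtual) register
	operandImm                    // an immediate encoded in the instruction itself
	operandMem                    // a memory reference: base register + offset
)

type operand struct {
	kind   operandKind
	reg    int   // register id (operandReg) or base register (operandMem)
	imm    int64 // immediate value (operandImm)
	offset int32 // displacement (operandMem)
}

// String renders the operand in an AT&T-like notation, with `?` marking
// virtual registers as in the examples of this document.
func (o operand) String() string {
	switch o.kind {
	case operandReg:
		return fmt.Sprintf("r%d?", o.reg)
	case operandImm:
		return fmt.Sprintf("$%d", o.imm)
	default:
		return fmt.Sprintf("%d(r%d?)", o.offset, o.reg)
	}
}

func main() {
	fmt.Println(operand{kind: operandMem, reg: 5, offset: 8}) // 8(r5?)
	fmt.Println(operand{kind: operandImm, imm: 42})           // $42
}
```

Because a single `operand` can stand for a register, an immediate, or an address, an instruction argument does not have to be forced through a virtual register first.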
-During lowering of an `ssa.Instruction`, for each `ssa.Value` that is used as an argument of the instruction,
-in the simplest case, the `operand` might be mapped to a virtual register, in other cases, the
-`operand` might be mapped to a memory address, or an immediate value. Sometimes this makes it possible to
-replace several SSA instructions with a single machine instruction, by folding the addressing mode into the
-instruction itself.
+During lowering of an `ssa.Instruction`, for each `ssa.Value` that is used as
+an argument of the instruction, in the simplest case, the `operand` might be
+mapped to a virtual register, in other cases, the `operand` might be mapped to
+a memory address, or an immediate value. Sometimes this makes it possible to
+replace several SSA instructions with a single machine instruction, by folding
+the addressing mode into the instruction itself.
 
 For instance, consider the following SSA instructions:
 
@@ -64,9 +76,10 @@ For instance, consider the following SSA instructions:
     v7:i32 = Iadd v6, v4
 ```
 
-In the `amd64` architecture, the `add` instruction adds the second operand to the first operand,
-and assigns the result to the second operand. So assuming that `r4`, `v5`, `v6`, and `v7` are mapped
-respectively to the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the `Iadd`
+In the `amd64` architecture, the `add` instruction adds the first operand to
+the second operand, and assigns the result to the second operand. So assuming
+that `v4`, `v5`, `v6`, and `v7` are mapped respectively to the virtual
+registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the `Iadd`
 instruction on `amd64` might look like this:
 
 ```asm
@@ -75,40 +88,118 @@ instruction on `amd64` might look like this:
     mov %r4?, %r7? ;; move the result from `r4?` to `r7?`
 ```
 
-Notice how the load from memory has been folded into an operand of the `add` instruction.
This transformation
-is possible when the value produced by the instruction being folded is not referenced by other instructions
-and the instructions belong to the same `InstructionGroupID` (see [Front-End: Optimization](../frontend/#optimization)).
+Notice how the load from memory has been folded into an operand of the `add`
+instruction. This transformation is possible when the value produced by the
+instruction being folded is not referenced by other instructions and the
+instructions belong to the same `InstructionGroupID` (see [Front-End:
+Optimization](../frontend/#optimization)).
+
+### Example
+
+At the end of the instruction selection phase, the basic blocks of our `abs`
+function will look as follows (for `arm64`):
+
+```asm
+L1 (SSA Block: blk0):
+    mov x130?, x2
+    subs wzr, w130?, #0x0
+    b.ge L2
+L3 (SSA Block: blk1):
+    mov x136?, xzr
+    sub w134?, w136?, w130?
+    mov x135?, x134?
+    b L4
+L2 (SSA Block: blk2):
+    mov x135?, x130?
+L4 (SSA Block: blk3):
+    mov x0, x135?
+    ret
+```
+
+Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
+These are labels that are used to mark the beginning of each basic block, and
+they are the target for branching instructions such as `b` and `b.ge`.
 
 ### Code
 
-`backend.Machine` is the interface to the backend. It has a methods to translate (lower) the IR to machine code.
-Again, as seen earlier in the front-end, the term *lowering* is used to indicate translation from a higher-level
-representation to a lower-level representation.
+`backend.Machine` is the interface to the backend. It has methods to
+translate (lower) the IR to machine code. Again, as seen earlier in the
+front-end, the term *lowering* is used to indicate translation from a
+higher-level representation to a lower-level representation.
 
-`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an SSA instruction to machine code.
-Machine-specific implementations of this method can be found in package `backend/isa/<arch>`
-where `<arch>` is either `amd64` or `arm64`.
+`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
+SSA instruction to machine code. Machine-specific implementations of this
+method can be found in package `backend/isa/<arch>` where `<arch>` is either
+`amd64` or `arm64`.
 
 ### Debug Flags
 
-`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the lowered arch-specific instructions.
+`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
+lowered arch-specific instructions.
 
 ## Register Allocation
 
 **TODO: Not finished.**
 
-The register allocation phase is responsible for mapping the potentially infinite number of virtual registers
-to the actual registers of the target architecture. Because the number of real registers is limited,
-the register allocation phase might need to "spill" some of the virtual registers to memory; that is, it might
-store their content, and then load them back into a register when they are needed.
+The register allocation phase is responsible for mapping the potentially
+infinite number of virtual registers to the real registers of the target
+architecture. Because the number of real registers is limited, the register
+allocation phase might need to "spill" some of the virtual registers to memory;
+that is, it might store their content, and then load them back into a register
+when they are needed.
 
 The register allocation procedure is implemented in sub-phases:
 
-- `livenessAnalysis(f)` collects the liveness information for each virtual register. The algorithm is described
-  in [Chapter 9.2 of The SSA Book](https://pfalcon.github.io/ssabook/latest/book-full.pdf).
-
-- `alloc(f)` allocates registers for the given function.
The algorithm is derived from
-  [the Go compiler's allocator](https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go)
+- `livenessAnalysis(f)` collects the "liveness" information for each virtual
+  register. The algorithm is described in [Chapter 9.2 of The SSA
+  Book](https://pfalcon.github.io/ssabook/latest/book-full.pdf).
+
+- `alloc(f)` allocates registers for the given function. The algorithm is
+  derived from [the Go compiler's
+  allocator](https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go)
+
+### Liveness Analysis
+
+Intuitively, a variable or name binding can be considered _live_
+at a certain point in a program, if its value will be used in the future.
+
+For instance:
+
+```
+1| int f(int x) {
+2|   int y = 2 + x;
+3|   int z = x + y;
+4|   return z;
+5| }
+```
+
+Variables `x` and `y` are both live at line 3, because they are used in the
+expression `x + y` on line 3; variable `z` is live at line 4,
+because it is used in the return statement.
+However, variables `x` and `y` can be considered _not_ live at line 4
+because they are not used anywhere after line 3.
+
+Statically, _liveness_ can be approximated by following paths backwards on the control-flow
+graph, connecting the uses of a given variable to its definitions
+(or its *unique* definition, assuming SSA form).
+
+In practice, while liveness is a property of each name binding at any point
+in the program, it is enough to keep track of liveness at the boundaries
+of basic blocks:
+
+- the _live-in_ set for a given basic block is the set of all bindings that
+  are live at the entry of that block.
+- the _live-out_ set for a given basic block is the set of all bindings that
+  are live at the exit of that block. A binding is live at the exit of a block
+  if it is live at the entry of a successor.
+ +Because the CFG is a connected graph, it is enough to keep track of either +live-in or live-out sets, and then propagate the liveness information +backwards or forwards, respectively. In our case, we keep track of live-ins. + +### Allocation + - In short, this is just a linear scan register allocation procedure, where each block inherits the register allocation state from one of its predecessors. Each block inherits the selected state and @@ -133,9 +224,49 @@ The register allocation procedure is implemented in sub-phases: - https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go +### Example + +At the end of the register allocation phase, the basic blocks of our `abs` +function look as follows (for `arm64`): + +```asm +L1 (SSA Block: blk0): + mov x2, x2 + subs wzr, w2, #0x0 + b.ge L2 +L3 (SSA Block: blk1): + mov x8, xzr + sub w8, w8, w2 + mov x8, x8 + b L4 +L2 (SSA Block: blk2): + mov x2, x2 + mov x8, x2 +L4 (SSA Block: blk3): + mov x0, x8 + ret +``` + +Notice how the virtual registers have been all replaced by real registers, i.e. +no register identifier is suffixed with `?`. + ### Code -The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the interfaces in `regalloc/api.go`. +The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the interfaces +in `regalloc/api.go`. + +Essentially: + +- each architecture exposes iteration over basic blocks of a function + (`regalloc.Function` interface) +- each arch-specific basic block exposes iteration over instructions + (`regalloc.Block` interface) +- each arch-specific instruction exposes the set of registers it defines and + uses (`regalloc.Instr` interface) + +By defining these interfaces, the register allocation algorithm can assign +real registers to virtual registers without dealing specifically with the +target architecture. 
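In the spirit of the bullets above, the contract between the allocator and the ISA-specific backends can be sketched as a few small interfaces. Everything below is a simplified, hypothetical sketch — the real interfaces in `regalloc/api.go` expose many more methods:

```go
package main

import "fmt"

// VReg identifies a virtual (or real) register; a simplified stand-in for
// the real regalloc.VReg type.
type VReg int

// Hypothetical, trimmed-down counterparts of the regalloc interfaces:
// the allocator only ever sees these, never the concrete ISA types.
type Function interface {
	Blocks() []Block // iterate over the basic blocks of the function
}

type Block interface {
	Instrs() []Instr // iterate over the instructions of the block
}

type Instr interface {
	Defs() []VReg // registers this instruction defines (writes)
	Uses() []VReg // registers this instruction uses (reads)
}

// usedRegs collects every register used anywhere in f, without knowing
// anything about the target ISA.
func usedRegs(f Function) map[VReg]bool {
	used := map[VReg]bool{}
	for _, b := range f.Blocks() {
		for _, i := range b.Instrs() {
			for _, v := range i.Uses() {
				used[v] = true
			}
		}
	}
	return used
}

// Toy implementations, standing in for an architecture-specific backend.
type fn struct{ blocks []Block }

func (f fn) Blocks() []Block { return f.blocks }

type bb struct{ instrs []Instr }

func (b bb) Instrs() []Instr { return b.instrs }

type instr struct{ defs, uses []VReg }

func (i instr) Defs() []VReg { return i.defs }
func (i instr) Uses() []VReg { return i.uses }

func main() {
	f := fn{blocks: []Block{bb{instrs: []Instr{
		instr{defs: []VReg{1}, uses: []VReg{2, 3}},
		instr{defs: []VReg{4}, uses: []VReg{1}},
	}}}}
	fmt.Println(len(usedRegs(f))) // 3
}
```

Any function like `usedRegs` written against these interfaces works unchanged for every architecture that implements them.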
### Debug Flags diff --git a/site/content/docs/how_the_optimizing_compiler_works/frontend.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md index 616f166a23..51a0c983cc 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/frontend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md @@ -145,6 +145,8 @@ because such transformations usually correspond to targeting a lower abstraction `frontend.Compiler.LowerToSSA()`. - The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`, more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`. +- Because they are semantically equivalent, in the code, basic block parameters + are sometimes referred to as "Phi values". #### Instructions and Values @@ -322,6 +324,29 @@ For more details on critical edges read more at - https://en.wikipedia.org/wiki/Control-flow_graph - https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/ +### Example + +At the end of the block layout phase, the laid out SSA for the `abs` function +looks as follows: + +``` +blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) + v3:i32 = Iconst_32 0x0 + v4:i32 = Icmp lt_s, v2, v3 + Brz v4, blk2 + Jump fallthrough + +blk1: () <-- (blk0) + v6:i32 = Iconst_32 0x0 + v7:i32 = Isub v6, v2 + Jump blk3, v7 + +blk2: () <-- (blk0) + Jump fallthrough, v2 + +blk3: (v5:i32) <-- (blk1,blk2) + Jump blk_ret, v5 +``` ### Code From 364a2c5409223ba39f9ea6962c68ed5d48b68e28 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Wed, 14 Feb 2024 15:53:48 +0100 Subject: [PATCH 10/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 88 ++++++++++++------- 1 file changed, 58 insertions(+), 30 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index c6298500ed..0849967af4 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ 
b/site/content/docs/how_the_optimizing_compiler_works/backend.md
@@ -3,7 +3,8 @@ title = "How the Optimizing Compiler Works: Back-End"
 layout = "single"
 +++
 
-In this section we will discuss the phases in the back-end of the optimizing compiler:
+In this section we will discuss the phases in the back-end of the optimizing
+compiler:
 
 - [Instruction Selection](#instruction-selection)
 - [Register Allocation](#register-allocation)
@@ -150,19 +151,18 @@ when they are needed.
 
 The register allocation procedure is implemented in sub-phases:
 
-
 - `livenessAnalysis(f)` collects the "liveness" information for each virtual
   register. The algorithm is described in [Chapter 9.2 of The SSA
-  Book](https://pfalcon.github.io/ssabook/latest/book-full.pdf).
+Book][ssa-book].
 
 - `alloc(f)` allocates registers for the given function. The algorithm is
   derived from [the Go compiler's
-  allocator](https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go)
+allocator][go-regalloc]
 
 ### Liveness Analysis
 
-Intuitively, a variable or name binding can be considered _live_
-at a certain point in a program, if its value will be used in the future.
+Intuitively, a variable or name binding can be considered _live_ at a certain
+point in a program, if its value will be used in the future.
 
 For instance:
 
@@ -175,45 +175,70 @@ For instance:
 ```
 
 Variables `x` and `y` are both live at line 3, because they are used in the
-expression `x + y` on line 3; variable `z` is live at line 4,
-because it is used in the return statement.
-However, variables `x` and `y` can be considered _not_ live at line 4
-because they are not used anywhere after line 3.
+expression `x + y` on line 3; variable `z` is live at line 4, because it is
+used in the return statement. However, variables `x` and `y` can be considered
+_not_ live at line 4 because they are not used anywhere after line 3.
-Statically, _liveness_ can be approximated by following paths backwards on the control-flow -graph, connecting the uses of a given variable to its definitions +Statically, _liveness_ can be approximated by following paths backwards on the +control-flow graph, connecting the uses of a given variable to its definitions (or its *unique* definition, assuming SSA form). -In practice, while liveness is a property of each name binding at any point -in the program, it is enough to keep track of liveness at the boundaries -of basic blocks: +In practice, while liveness is a property of each name binding at any point in +the program, it is enough to keep track of liveness at the boundaries of basic +blocks: -- the _live-in_ set for a given basic block is the set of all bindings that - are live at the entry of that block. +- the _live-in_ set for a given basic block is the set of all bindings that are + live at the entry of that block. - the _live-out_ set for a given basic block is the set of all bindings that are live at the exit of that block. A binding is live at the exit of a block - if it is live at the entry of a successor. +if it is live at the entry of a successor. Because the CFG is a connected graph, it is enough to keep track of either -live-in or live-out sets, and then propagate the liveness information -backwards or forwards, respectively. In our case, we keep track of live-ins. +live-in or live-out sets, and then propagate the liveness information backwards +or forwards, respectively. In our case, we keep track of live-ins. ### Allocation +We implemented a variant of the linear scan register allocation algorithm +described in [the Go compiler's allocator][go-regalloc]. + +Each basic block is allocated registers in a linear scan order, and the +allocation state is propagated from a given basic block to its successors. +Then, each block continues allocation from that initial state. 
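The backward live-in propagation described in the previous section can be sketched as a fixed-point iteration. The `bb` type below is a hypothetical simplification (block contents are summarized by their use/def sets), not wazero's actual representation:

```go
package main

import "fmt"

// bb is a hypothetical basic block whose contents are summarized by the
// sets of registers it uses and defines.
type bb struct {
	uses, defs map[int]bool
	succs      []*bb
	liveIn     map[int]bool
}

// computeLiveIns propagates liveness backwards until a fixed point:
// liveIn(b) = uses(b) ∪ (liveOut(b) \ defs(b)), where liveOut(b) is the
// union of the live-in sets of b's successors.
func computeLiveIns(blocks []*bb) {
	for _, b := range blocks {
		b.liveIn = map[int]bool{}
	}
	for changed := true; changed; {
		changed = false
		for _, b := range blocks {
			add := func(v int) {
				if !b.liveIn[v] {
					b.liveIn[v] = true
					changed = true
				}
			}
			for v := range b.uses {
				add(v)
			}
			for _, s := range b.succs {
				for v := range s.liveIn { // v is live-out of b
					if !b.defs[v] {
						add(v)
					}
				}
			}
		}
	}
}

func main() {
	// b0 defines register 1 and jumps to b1, which uses it:
	// register 1 is live-in at b1, but not at b0 (b0 defines it).
	b1 := &bb{uses: map[int]bool{1: true}, defs: map[int]bool{}}
	b0 := &bb{uses: map[int]bool{}, defs: map[int]bool{1: true}, succs: []*bb{b1}}
	computeLiveIns([]*bb{b0, b1})
	fmt.Println(b0.liveIn[1], b1.liveIn[1]) // false true
}
```

Visiting the blocks in post-order makes this converge faster in practice, but any order reaches the same fixed point.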
+ +#### Merge States + +Special care has to be taken when a block has multiple predecessors. We +call this *fixing merge states*: for instance, consider the following: - - In short, this is just a linear scan register allocation procedure, where each block inherits the - register allocation state from one of its predecessors. Each block inherits the selected state and - starts allocation from there. +```goat + .---. .---. +( BB0 ) ( BB1 ) + `---' `---' + | | + +----+----+ + v + .---. + ( BB2 ) + `---' +``` + +if the live-out set of a given block `BB0` is different from the live-out set +of a given block `BB1` and both are predecessors of a block `BB2`, then +we need to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice, +we ensure that the registers that `BB2` expects to be live-in are live-out in +`BB0` and `BB1`. - - If there's a discrepancy in the end states between predecessors, adjustments are made to ensure consistency after - allocation is done (which we call "fixing merge state"). +#### Spilling - - The spill instructions (store into the dedicated slots) are inserted after all the allocations and fixing - merge states. That is because at the point, we all know where the reloads happen, and therefore we can - know the best place to spill the values. More precisely, the spill happens in the block that is - the lowest common ancestor of all the blocks that reloads the value. +If the register allocator cannot find a register for a given virtual register, +it will "spill" it to memory, *i.e.,* stash the value temporarily to memory. + +The spill instructions (store into the dedicated slots) are inserted after all the allocations and fixing +merge states. That is because at the point, we all know where the reloads happen, and therefore we can +know the best place to spill the values. More precisely, the spill happens in the block that is +the lowest common ancestor of all the blocks that reloads the value. 
- All of these logics are almost the same as Go's compiler which has a dedicated description in the source file ^^. #### References @@ -306,3 +331,6 @@ target architecture.
* Previous Section: [Front-End](../frontend/)
+
+[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
+[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go

From 9675a18971ebe02274a21705fe336e6aeb1f3f71 Mon Sep 17 00:00:00 2001
From: Edoardo Vacchi
Date: Wed, 14 Feb 2024 16:05:48 +0100
Subject: [PATCH 11/24] wip

Signed-off-by: Edoardo Vacchi
---
 .../how_the_optimizing_compiler_works/backend.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md
index 0849967af4..cfe5259ab4 100644
--- a/site/content/docs/how_the_optimizing_compiler_works/backend.md
+++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md
@@ -213,14 +213,14 @@ call this *fixing merge states*: for instance, consider the following:
 
 ```goat
     .---.       .---.
-(  BB0  )    (  BB1  )
-   `---'       `---'
-     |           |
+|  BB0  |    |  BB1  |
+   '-+-'       '-+-'
      +----+----+
+          |
           v
         .---.
-     (  BB2  )
-       `---'
+     |  BB2  |
+       '---'
 ```
 
 if the live-out set of a given block `BB0` is different from the live-out set
@@ -235,7 +235,7 @@ if the live-out set of a given block `BB0` is different from the live-out set
 
 The spill instructions (store into the dedicated slots) are inserted after all the allocations and fixing
-merge states. That is because at the point, we all know where the reloads happen, and therefore we can
+merge states. That is because at the point, we know where all the reloads happen, and therefore we can
 know the best place to spill the values. More precisely, the spill happens in the block that is
 the lowest common ancestor of all the blocks that reloads the value.
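The adjustment described for merge states can be sketched as follows. This is a hypothetical helper, not wazero's code: it only computes which register-to-register moves a predecessor needs so that its end state agrees with what the successor expects; a real implementation must additionally schedule those moves as a parallel copy, breaking cycles with a scratch register.

```go
package main

import "fmt"

// fixMergeState returns the moves a predecessor must execute so that its
// end state matches the register assignment the successor expects on
// entry. Both maps associate a virtual register id with the real register
// currently (or expectedly) holding it.
func fixMergeState(predEnd, succBegin map[int]int) [][2]int {
	var moves [][2]int
	for v, want := range succBegin {
		if got, ok := predEnd[v]; ok && got != want {
			moves = append(moves, [2]int{got, want}) // emit: mov got -> want
		}
	}
	return moves
}

func main() {
	predEnd := map[int]int{100: 3, 101: 4}   // v100 in r3, v101 in r4 at the end of BB0
	succBegin := map[int]int{100: 5, 101: 4} // BB2 expects v100 in r5, v101 in r4
	fmt.Println(fixMergeState(predEnd, succBegin)) // [[3 5]]
}
```

Here only `v100` disagrees, so a single `mov r3 -> r5` at the end of the predecessor reconciles the two states.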
From 027e624db834aaca1a36c69f3debbead3a56d339 Mon Sep 17 00:00:00 2001
From: Edoardo Vacchi
Date: Wed, 14 Feb 2024 18:39:58 +0100
Subject: [PATCH 12/24] wip

Signed-off-by: Edoardo Vacchi
---
 .../backend.md | 28 +++++++++++++------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md
index cfe5259ab4..8fc3cb9799 100644
--- a/site/content/docs/how_the_optimizing_compiler_works/backend.md
+++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md
@@ -231,14 +231,21 @@ we ensure that the registers that `BB2` expects to be live-in are live-out in
 
 #### Spilling
 
-If the register allocator cannot find a register for a given virtual register,
-it will "spill" it to memory, *i.e.,* stash the value temporarily to memory.
+If the register allocator cannot find a free register for a given virtual (live)
+register, it will "spill" the value to memory, *i.e.,* stash it temporarily to memory.
+When that virtual register is recalled later, we will have to insert instructions to
+reload the value into a real register.
 
-The spill instructions (store into the dedicated slots) are inserted after all the allocations and fixing
-merge states. That is because at the point, we know where all the reloads happen, and therefore we can
-know the best place to spill the values. More precisely, the spill happens in the block that is
-the lowest common ancestor of all the blocks that reloads the value.
+While allocation proceeds, the procedure also records all
+the virtual registers that transition to the "spilled" state, and inserts
+the reload instructions when those registers are recalled later.
+
+The spill instructions are actually inserted at the end, after all the allocations and
+the merge states have been fixed.
At this point, all the other potential sources of
+instability have been resolved, and we know where all the reloads happen.
+
+We insert the spills in the block that is the lowest common ancestor of all the blocks
+that reload the value.
 
 #### References
 
@@ -248,7 +255,6 @@ the lowest common ancestor of all the blocks that reloads the value.
 - https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9. for liveness analysis.
 - https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
 
-
 ### Example
 
 At the end of the register allocation phase, the basic blocks of our `abs`
@@ -273,7 +279,8 @@ L4 (SSA Block: blk3):
 ```
 
 Notice how the virtual registers have been all replaced by real registers, i.e.
-no register identifier is suffixed with `?`.
+no register identifier is suffixed with `?`. This example is quite simple, and
+it does not require any spill.
 
 ### Code
 
@@ -293,6 +300,11 @@ By defining these interfaces, the register allocation algorithm can assign
 real registers to virtual registers without dealing specifically with the
 target architecture.
 
+In practice, each interface is usually implemented by instantiating a common generic
+struct that comes already with an implementation of all or most of the required methods.
+For instance, `regalloc.Function` is implemented by
+`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.
+
 ### Debug Flags
 
 - `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register allocation procedure.
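The generic-struct pattern mentioned above can be sketched with Go type parameters. The names below are hypothetical stand-ins — the real `backend.RegAllocFunction` carries far more state and methods:

```go
package main

import "fmt"

// instr is the constraint every architecture-specific instruction type
// must satisfy in this sketch; for illustration only.
type instr interface {
	String() string
}

// regAllocFunction is one generic struct that can serve any ISA by being
// instantiated with that ISA's instruction type.
type regAllocFunction[I instr] struct {
	instrs []I
}

func (f *regAllocFunction[I]) size() int { return len(f.instrs) }

// arm64Instr is a toy stand-in for an arm64 instruction type.
type arm64Instr struct{ mnemonic string }

func (i arm64Instr) String() string { return i.mnemonic }

func main() {
	f := &regAllocFunction[arm64Instr]{instrs: []arm64Instr{{"mov"}, {"ret"}}}
	fmt.Println(f.size()) // 2
}
```

Writing the shared bookkeeping once in the generic struct means each architecture only has to supply its own instruction type.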
From 23e3f67d8744b0eca08bffb01963435ae4e04453 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Wed, 14 Feb 2024 18:54:26 +0100 Subject: [PATCH 13/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 58 +++++++++++-------- 1 file changed, 33 insertions(+), 25 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 8fc3cb9799..93df1d9a04 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -208,8 +208,8 @@ Then, each block continues allocation from that initial state. #### Merge States -Special care has to be taken when a block has multiple predecessors. We -call this *fixing merge states*: for instance, consider the following: +Special care has to be taken when a block has multiple predecessors. We call +this *fixing merge states*: for instance, consider the following: ```goat .---. .---. @@ -224,28 +224,29 @@ call this *fixing merge states*: for instance, consider the following: ``` if the live-out set of a given block `BB0` is different from the live-out set -of a given block `BB1` and both are predecessors of a block `BB2`, then -we need to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice, -we ensure that the registers that `BB2` expects to be live-in are live-out in +of a given block `BB1` and both are predecessors of a block `BB2`, then we need +to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice, we +ensure that the registers that `BB2` expects to be live-in are live-out in `BB0` and `BB1`. #### Spilling -If the register allocator cannot find a free register for a given virtual (live) -register, it will "spill" the value to memory, *i.e.,* stash it temporarily to memory. -When that virtual register is recalled later, we will have to insert instructions to -reload the value into a real register. 
+If the register allocator cannot find a free register for a given virtual
+(live) register, it will "spill" the value to memory, *i.e.,* stash it
+temporarily to memory. When that virtual register is recalled later, we will
+have to insert instructions to reload the value into a real register.
 
 While allocation proceeds, the procedure also records all
-the virtual registers that transition to the "spilled" state, and inserts
-the reload instructions when those registers are recalled later.
+the virtual registers that transition to the "spilled" state, and inserts the
+reload instructions when those registers are recalled later.
 
-The spill instructions are actually inserted at the end, after all the allocations and
-the merge states have been fixed. At this point, all the other potential sources of
-instability have been resolved, and we know where all the reloads happen.
+The spill instructions are actually inserted at the end, after all the
+allocations and the merge states have been fixed. At this point, all the other
+potential sources of instability have been resolved, and we know where all the
+reloads happen.
 
-We insert the spills in the block that is the lowest common ancestor of all the blocks
-that reload the value.
+We insert the spills in the block that is the lowest common ancestor of all the
+blocks that reload the value.
 
 #### References
 
@@ -284,8 +285,8 @@ it does not require any spill.
 
 ### Code
 
-The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the interfaces
-in `regalloc/api.go`.
+The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
+interfaces in `regalloc/api.go`.
Essentially:
 
 - each architecture exposes iteration over basic blocks of a function
   (`regalloc.Function` interface)
 - each arch-specific basic block exposes iteration over instructions
   (`regalloc.Block` interface)
 - each arch-specific instruction exposes the set of registers it defines and
   uses (`regalloc.Instr` interface)
 
-By defining these interfaces, the register allocation algorithm can assign
-real registers to virtual registers without dealing specifically with the
-target architecture.
+By defining these interfaces, the register allocation algorithm can assign real
+registers to virtual registers without dealing specifically with the target
+architecture.
 
-In practice, each interface is usually implemented by instantiating a common generic
-struct that comes already with an implementation of all or most of the required methods.
-For instance, `regalloc.Function` is implemented by
+In practice, each interface is usually implemented by instantiating a common
+generic struct that comes already with an implementation of all or most of the
+required methods. For instance, `regalloc.Function` is implemented by
 `backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.
 
 ### Debug Flags
@@ -312,7 +313,14 @@ For instance, `regalloc.Function` is implemented by
 
 ## Finalization and Encoding
 
-**TODO: Not finished.**
+At the end of the register allocation phase, we have enough information to complete
+the generation of the machine code. Still missing are the prologue and
+epilogue of the function, and the encoding of the instructions into bytes.
+
+As usual, the prologue is executed before the main body of the function, and
+the epilogue is executed at the end. The prologue is responsible for setting up
+the stack frame, and the epilogue is responsible for cleaning up the stack
+frame and returning control to the caller.
### PostRegAlloc: From 783df458adbc4d28b1254ab920d9759e6eb57b59 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 10:16:19 +0100 Subject: [PATCH 14/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 122 +++++++++++++++++- .../frontend.md | 12 +- 2 files changed, 121 insertions(+), 13 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 93df1d9a04..0250e89ca1 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -140,8 +140,6 @@ lowered arch-specific instructions. ## Register Allocation -**TODO: Not finished.** - The register allocation phase is responsible for mapping the potentially infinite number of virtual registers to the real registers of the target architecture. Because the number of real registers is limited, the register @@ -149,7 +147,8 @@ allocation phase might need to "spill" some of the virtual registers to memory; that is, it might store their content, and then load them back into a register when they are needed. -The register allocation procedure is implemented in sub-phases: +For a given function `f` the register allocation procedure +`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases: - `livenessAnalysis(f)` collects the "liveness" information for each virtual register. The algorithm is described in [Chapter 9.2 of The SSA @@ -159,6 +158,14 @@ Book][ssa-book]. derived from [the Go compiler's allocator][go-regalloc] +At the end of the allocation procedure, we also record the set of registers +that are **clobbered** by the body of the function. A register is clobbered +if its value is overwritten by the function, and it is not saved by the +callee. This information is used in the finalization phase to determine which +registers need to be spilled in the prologue. 
This is not strictly related
+to register allocation in the textbook sense, but it is a necessary step
+for the finalization phase.
+
 ### Liveness Analysis
 
 Intuitively, a variable or name binding can be considered _live_ at a certain
@@ -211,7 +218,7 @@ Then, each block continues allocation from that initial state.
 Special care has to be taken when a block has multiple predecessors. We call
 this *fixing merge states*: for instance, consider the following:
 
-```goat
+```goat { width=300 }
  .---.     .---.
 | BB0 |   | BB1 |
  '-+-'     '-+-'
@@ -248,14 +255,29 @@ reloads happen.
 We insert the spills in the block that is the lowest common ancestor of all the
 blocks that reload the value.
 
+#### Clobbered Registers
+
+At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
+method iterates over the set of the allocated registers and compares them
+to an architecture-specific set, `CalleeSavedRegisters`. If a register
+has been allocated, and it is present in this set, the register is marked as
+"clobbered", i.e., we now know that the register allocator will overwrite
+that value. Thus, these values will have to be spilled in the prologue.
+
 #### References
 
+Register allocation is a complex problem, possibly the most complicated
+part of the backend. The following references were used to implement the
+algorithm:
+
 - https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
 - https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
 - https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
 - https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9, for
   liveness analysis.
 - https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
 
+We suggest referring to them to dive deeper into the topic.
+
 ### Example
 
 At the end of the register allocation phase, the basic blocks of our `abs`
@@ -306,6 +328,13 @@ generic struct that comes already with an implementation of all or most of the
 required methods.
For instance, `regalloc.Function` is implemented by
`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.
 
+`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
+contains the instantiation of the `regalloc.RegisterInfo` struct,
+which declares, among others:
+- the set of registers that are available for allocation, excluding, for instance, those that might
+  be reserved by the runtime or the OS (`AllocatableRegisters`)
+- the registers that might be saved by the callee to the stack (`CalleeSavedRegisters`)
+
 ### Debug Flags
 
 - `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register allocation procedure.
 
@@ -314,14 +343,93 @@ For instance,`regalloc.Function`is implemented by
 
 ## Finalization and Encoding
 
 At the end of the register allocation phase, we have enough information to complete
-the generation of the machine code. What is still missing are the prologue and
-epilogue of the function, and the encoding of the instructions into bytes.
+the generate machine code (_encoding_). What is still missing are the prologue and
+epilogue of the function.
+
+### Prologue
 
 As usual, the prologue is executed before the main body of the function, and
 the epilogue is executed at the end. The prologue is responsible for setting up
 the stack frame, and the epilogue is responsible for cleaning up the stack
 frame and returning control to the caller.
+Generally, this means +- saving the return address +- a base pointer to the stack; or, equivalently, +the height of the stack at the beginning of the function + +For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack pointer: + +```goat {width="600" height="250"} + (high address) (high address) + RBP ----> +-----------------+ +-----------------+ + | `...` | | `...` | + | ret Y | | ret Y | + | `...` | | `...` | + | ret 0 | | ret 0 | + | arg X | | arg X | + | `...` | ====> | `...` | + | arg 1 | | arg 1 | + | arg 0 | | arg 0 | + | Return Addr | | Return Addr | + RSP ----> +-----------------+ | Caller_RBP | + (low address) +-----------------+ <----- RSP, RBP +``` + +While, on `arm64`, there is only a stack pointer `SP`: + + +```goat {width="600" height="300"} + (high address) (high address) + SP ---> +-----------------+ +------------------+ <----+ + | `...` | | `...` | | + | ret Y | | ret Y | | + | `...` | | `...` | | + | ret 0 | | ret 0 | | + | arg X | | arg X | | size_of_arg_ret. + | `...` | ====> | `...` | | + | arg 1 | | arg 1 | | + | arg 0 | | arg 0 | <----+ + +-----------------+ | size_of_arg_ret | + | return address | + +------------------+ <---- SP + (low address) (low address) +``` + +The procedure happens at the end of the register allocation phase because +at this point we have collected enough information to know how much space +we need to reserve for clobbered registers and spilled values. + +Regardless of the architecture, after allocating this space, the stack +will look as follows: + +```goat {height="350"} + (high address) + +-----------------+ + | `...` | + | ret Y | + | `...` | + | ret 0 | + | arg X | + | `...` | + | arg 1 | + | arg 0 | + | (arch-specific) | + +-----------------+ + | clobbered M | + | ............ | + | clobbered 1 | + | clobbered 0 | + | spill slot N | + | ............ 
| + | spill slot 0 | + +-----------------+ + (low address) +``` + +For clarity, we make a distinction between the space reserved for the +clobbered registers and the space reserved for the spilled values. + ### PostRegAlloc: * setup prologue of the function @@ -339,7 +447,7 @@ frame and returning control to the caller. ### Code -... +- The prologue is set up as part of the `backend.Machine.PostRegAlloc` method. ### Debug Flags diff --git a/site/content/docs/how_the_optimizing_compiler_works/frontend.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md index 51a0c983cc..8bebb47fcd 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/frontend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md @@ -41,7 +41,7 @@ For instance, take the following implementation of the `abs` function: This is translated to the following block diagram: -```goat +```goat {width="100%" height="500"} +---------------------------------------------+ |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) | | v3:i32 = Iconst_32 0x0 | @@ -51,7 +51,7 @@ This is translated to the following block diagram: +---------------------------------------------+ | | - +---(v4 != 0)---+--(v4 == 0)----+ + +---`(v4 != 0)`-+-`(v4 == 0)`---+ | | v v +---------------------------+ +---------------------------+ @@ -62,7 +62,7 @@ This is translated to the following block diagram: +---------------------------+ +---------------------------+ | | | | - +-{v5 := v7}----+---{v5 := v2}--+ + +-`{v5 := v7}`--+--`{v5 := v2}`-+ | v +------------------------------+ @@ -293,7 +293,7 @@ basic blocks is called a **critical edge** when, at the same time: For instance, in the example below the edge between `BB0` and `BB3` is a critical edge. -```goat +```goat { width="300" } ┌───────┐ ┌───────┐ │ BB0 │━┓ │ BB1 │ └───────┘ ┃ └───────┘ @@ -307,7 +307,7 @@ is a critical edge. 
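The definition of a critical edge can be checked mechanically. As an illustration (hypothetical block type, not wazero's actual `ssa` package), an edge is critical when its source has multiple successors and its target has multiple predecessors:

```go
package main

import "fmt"

// block is a minimal stand-in for an SSA basic block.
type block struct {
	name  string
	succs []*block
	preds []*block
}

// addEdge records a CFG edge from a to b.
func addEdge(a, b *block) {
	a.succs = append(a.succs, b)
	b.preds = append(b.preds, a)
}

// isCriticalEdge reports whether the edge from -> to is critical: the
// source has multiple successors AND the target has multiple predecessors.
func isCriticalEdge(from, to *block) bool {
	return len(from.succs) > 1 && len(to.preds) > 1
}

func main() {
	bb0, bb1 := &block{name: "BB0"}, &block{name: "BB1"}
	bb2, bb3 := &block{name: "BB2"}, &block{name: "BB3"}
	addEdge(bb0, bb2) // BB0 branches to BB2...
	addEdge(bb0, bb3) // ...and to BB3: BB0 has two successors.
	addEdge(bb1, bb3) // BB3 also has BB1 as a predecessor.
	fmt.Println(isCriticalEdge(bb0, bb3)) // critical: would need a trampoline
	fmt.Println(isCriticalEdge(bb1, bb3)) // not critical: BB1 has one successor
}
```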
In these cases the critical edge is split by introducing a new basic block, called a **trampoline**, where the critical edge was. -```goat +```goat { width="300" } ┌───────┐ ┌───────┐ │ BB0 │──────┐ │ BB1 │ └───────┘ ▼ └───────┘ @@ -360,8 +360,8 @@ blk3: (v5:i32) <-- (blk1,blk2)
-* Next Section: [Back-End](../backend/) * Previous Section: [How the Optimizing Compiler Works](../) +* Next Section: [Back-End](../backend/) [ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments [llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes From a9f68bdd2e4b8be92c2b20ae546c89c970274313 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 10:38:08 +0100 Subject: [PATCH 15/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 61 ++++++++++++++----- 1 file changed, 45 insertions(+), 16 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 0250e89ca1..8147ea467f 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -218,7 +218,7 @@ Then, each block continues allocation from that initial state. Special care has to be taken when a block has multiple predecessors. We call this *fixing merge states*: for instance, consider the following: -```goat { width=300 } +```goat { width="30%" } .---. .---. | BB0 | | BB1 | '-+-' '-+-' @@ -343,24 +343,30 @@ which declares, among others ## Finalization and Encoding At the end of the register allocation phase, we have enough information to complete -the generate machine code (_encoding_). What is still missing are the prologue and -epilogue of the function. +the generate machine code (_encoding_). What is still missing are the *trampoline*, +the prologue and epilogue of the function. -### Prologue +### Trampoline -As usual, the prologue is executed before the main body of the function, and -the epilogue is executed at the end. The prologue is responsible for setting up +The trampoline is the most significant part of the finalization phase, with respect +to the Go runtime. 
The trampoline maps the Go calling convention of the current +architecture to the code that we are generating. + +### Prologue and Epilogue + +As usual, the **prologue** is executed before the main body of the function, and +the **epilogue** is executed at the end. The prologue is responsible for setting up the stack frame, and the epilogue is responsible for cleaning up the stack frame and returning control to the caller. -Generally, this means +Generally, this means, at the very least: - saving the return address - a base pointer to the stack; or, equivalently, the height of the stack at the beginning of the function For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack pointer: -```goat {width="600" height="250"} +```goat {width="100%" height="250"} (high address) (high address) RBP ----> +-----------------+ +-----------------+ | `...` | | `...` | @@ -379,7 +385,7 @@ For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack pointer: While, on `arm64`, there is only a stack pointer `SP`: -```goat {width="600" height="300"} +```goat {width="100%" height="300"} (high address) (high address) SP ---> +-----------------+ +------------------+ <----+ | `...` | | `...` | | @@ -396,9 +402,25 @@ While, on `arm64`, there is only a stack pointer `SP`: (low address) (low address) ``` +However, the prologue and epilogue might also be responsible for saving and +restoring the state of registers that might be overwritten by the function +("clobbered"); and, if spilling occurs, prologue and epilogue are also +responsible for reserving and releasing the space for the spilled values. + + +For clarity, we make a distinction between the space reserved for the +clobbered registers and the space reserved for the spilled values: + +- Spill slots are used to temporarily store the values that needs spilling + as determined by the register allocator. 
This section must have a fixed
+  height, but its contents will change over time, as registers are being
+  spilled and reloaded.
+- Clobbered registers are, similarly, determined by the register allocator,
+  but they are stashed in the prologue and then restored in the epilogue.
+
 The procedure happens at the end of the register allocation phase because
 at this point we have collected enough information to know how much space
-we need to reserve for clobbered registers and spilled values.
+we need to reserve.
 
 Regardless of the architecture, after allocating this space, the stack
 will look as follows:
@@ -427,14 +449,21 @@ will look as follows:
  (low address)
 ```
 
-For clarity, we make a distinction between the space reserved for the
-clobbered registers and the space reserved for the spilled values.
+The epilogue simply reverses the operation of the prologue.
+
+### Other Post-RegAlloc Logic
+
+The `backend.Machine.PostRegAlloc` method is invoked after the register
+allocation procedure; while its main role is to define the prologue and epilogue
+of the function, it also serves as a hook to perform other arch-specific
+duties that have to happen after the register allocation phase.
+
+For instance, on `amd64`, the constraints for some instructions are hard
+to express in a meaningful way for the register allocation procedure (e.g.,
+the `div` instruction implicitly uses the registers `rdx` and `rax`). Instead, they are lowered
+with ad-hoc logic as part of the implementation of the `backend.Machine.PostRegAlloc` method.
-
-### PostRegAlloc:
-
-* setup prologue of the function
-* inserts epilogue of the function
-* machine-specific custom logic (e.g.
post-regalloc lowering) ### Encoding: From 1165affa2b17d2b4d3e4425ef7845bcf9b119528 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 10:51:43 +0100 Subject: [PATCH 16/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 21 +++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 8147ea467f..28823c49f5 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -352,6 +352,27 @@ The trampoline is the most significant part of the finalization phase, with resp to the Go runtime. The trampoline maps the Go calling convention of the current architecture to the code that we are generating. +For instance, the Go calling convention differs in a few ways from the standard +calling convention of the `amd64` and `arm64` architecture. The Go calling convention + + +**TODO** + +#### Code + +The trampoline is generated by `backend.Machine.CompileGoFunctionTrampoline()` method. +You can find arch-specific implementations in `backend/isa//abi_go_call.go`. + +#### Further References + +- Go's [internal ABI documentation][abi-internal] is a good starting point to + understand the calling convention of the Go runtime. +- Raphael Poss's [The Go low-level calling convention on x86-64][go-call-conv-x86] + is also an excellent reference for `amd64`. 
+ +[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal +[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html + ### Prologue and Epilogue As usual, the **prologue** is executed before the main body of the function, and From af4fae8528f80a42d0a44e44d995fd20787c3495 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 12:36:05 +0100 Subject: [PATCH 17/24] wip Signed-off-by: Edoardo Vacchi --- .../backend.md | 48 +++++-------------- 1 file changed, 11 insertions(+), 37 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 28823c49f5..f8fc3d3fb4 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -342,36 +342,11 @@ which declares, among others ## Finalization and Encoding -At the end of the register allocation phase, we have enough information to complete -the generate machine code (_encoding_). What is still missing are the *trampoline*, -the prologue and epilogue of the function. - -### Trampoline - -The trampoline is the most significant part of the finalization phase, with respect -to the Go runtime. The trampoline maps the Go calling convention of the current -architecture to the code that we are generating. - -For instance, the Go calling convention differs in a few ways from the standard -calling convention of the `amd64` and `arm64` architecture. The Go calling convention - - -**TODO** - -#### Code - -The trampoline is generated by `backend.Machine.CompileGoFunctionTrampoline()` method. -You can find arch-specific implementations in `backend/isa//abi_go_call.go`. - -#### Further References +**TODO: not finished** -- Go's [internal ABI documentation][abi-internal] is a good starting point to - understand the calling convention of the Go runtime. 
-- Raphael Poss's [The Go low-level calling convention on x86-64][go-call-conv-x86] - is also an excellent reference for `amd64`. - -[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal -[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html +At the end of the register allocation phase, we have enough information to complete +the generate machine code (_encoding_). We are only missing the prologue and +epilogue of the function. ### Prologue and Epilogue @@ -484,16 +459,11 @@ to express in a meaningful way for the register allocation procedure (for instan the `div` instruction implicitly use registers `rdx`, `rax`). Instead, they are lowered with ad-hoc logic as part of the implementation `backend.Machine.PostRegAlloc` method. - - ### Encoding: -* encodes the low-level instructions into bytes - -### Other - -- MMap code segment -- resolve relocations +The final stage of the backend encodes the machine instructions into bytes +and writes them to the target buffer. Before proceeding with the encoding, +relative addresses in branching instructions or addressing modes are resolved. ### Code @@ -506,6 +476,10 @@ with ad-hoc logic as part of the implementation `backend.Machine.PostRegAlloc` m - `wazevoapi.printMachineCodeHexPerFunctionUnmodified` - `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` +## Appendix: Trampolines + + +
* Previous Section: [Front-End](../frontend/) From e03ae1ef715dda47debe9a17ae8359740ea78fe5 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 14:02:02 +0100 Subject: [PATCH 18/24] wip Signed-off-by: Edoardo Vacchi --- .../how_the_optimizing_compiler_works/backend.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index f8fc3d3fb4..3420f2098c 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -342,10 +342,8 @@ which declares, among others ## Finalization and Encoding -**TODO: not finished** - -At the end of the register allocation phase, we have enough information to complete -the generate machine code (_encoding_). We are only missing the prologue and +At the end of the register allocation phase, we have enough information to finally +generate machine code (_encoding_). We are only missing the prologue and epilogue of the function. ### Prologue and Epilogue @@ -403,7 +401,6 @@ restoring the state of registers that might be overwritten by the function ("clobbered"); and, if spilling occurs, prologue and epilogue are also responsible for reserving and releasing the space for the spilled values. - For clarity, we make a distinction between the space reserved for the clobbered registers and the space reserved for the spilled values: @@ -465,14 +462,18 @@ The final stage of the backend encodes the machine instructions into bytes and writes them to the target buffer. Before proceeding with the encoding, relative addresses in branching instructions or addressing modes are resolved. +The procedure encodes the instructions in the order they appear in the +function. + ### Code - The prologue is set up as part of the `backend.Machine.PostRegAlloc` method. 
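The relative-address resolution described above can be illustrated with a toy two-pass scheme (a hypothetical instruction representation, not wazero's actual encoder): the first pass records the byte offset of every label, and the second patches each branch with the displacement to its target.

```go
package main

import "fmt"

// instr is a toy fixed-width (4-byte) instruction: either a plain op or a
// branch to a label; a label may be attached to any instruction.
type instr struct {
	isBranch bool
	target   string // label name, for branches
	label    string // non-empty if a label is attached here
}

// resolveBranches returns, for each branch (keyed by its index), the
// relative displacement in bytes from the branch to its target.
func resolveBranches(prog []instr) map[int]int {
	const width = 4
	// Pass 1: record the byte offset of every label.
	labels := map[string]int{}
	for i, ins := range prog {
		if ins.label != "" {
			labels[ins.label] = i * width
		}
	}
	// Pass 2: patch each branch with (target offset - branch offset).
	disp := map[int]int{}
	for i, ins := range prog {
		if ins.isBranch {
			disp[i] = labels[ins.target] - i*width
		}
	}
	return disp
}

func main() {
	prog := []instr{
		{label: "entry"},                 // offset 0x0
		{isBranch: true, target: "exit"}, // offset 0x4: forward branch
		{},                               // offset 0x8
		{label: "exit"},                  // offset 0xc
	}
	fmt.Println(resolveBranches(prog)[1]) // 0xc - 0x4 = 8
}
```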
### Debug Flags -- `wazevoapi.PrintFinalizedMachineCode` -- `wazevoapi.PrintMachineCodeHexPerFunction` +- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the function + after the finalization phase. +- `wazevoapi.PrintMachineCodeHexPerFunction` prints a hex representation of the function - `wazevoapi.printMachineCodeHexPerFunctionUnmodified` - `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` From 18680c9ec3495a0c10748755a6e9b79b3b2c2a22 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 15:54:02 +0100 Subject: [PATCH 19/24] wip Signed-off-by: Edoardo Vacchi --- .../appendix.md | 82 +++++++++++++++++++ .../backend.md | 22 +++-- 2 files changed, 96 insertions(+), 8 deletions(-) create mode 100644 site/content/docs/how_the_optimizing_compiler_works/appendix.md diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md new file mode 100644 index 0000000000..86871f1aa4 --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md @@ -0,0 +1,82 @@ ++++ +title = "Appendix: Trampolines" +layout = "single" ++++ + +Trampolines are used to interface between the Go runtime and the generated +code, in two cases: + +- when we need to **enter the generated code** from the Go runtime. +- when we need to **leave the generated code** to invoke a host function (written in Go). + +In this section we want to complete the picture of how a Wasm function gets +translated from Wasm to executable code in the optimizing compiler, by +describing how to jump into the execution of the generated code at run-time. + +## Entering the Generated Code + +Before the compilation of the function starts, a **preamble** is generated. +This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`. 
+The procedure first instantiates a `backend.FunctionABI` struct with metadata +about the expected ABI for a function with a given signature, using the +algorithm outlined in [Go's documentation][abi-cc]. + + + + // First, we save executionContextPtrReg into a callee-saved register so that it can be used in epilogue as well. + // mov %executionContextPtrReg, %savedExecutionContextPtr + + // Next, save the current FP, SP and LR into the wazevo.executionContext: + // Next is to save the original RBP and RSP into the execution context. + + + // Then, move the Go-allocated stack pointer to SP: + // Now set the RSP to the Go-allocated stack pointer. + + // Allocate stack slots for the arguments and return values. + + +[abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture +[abi-amd64]: https://tip.golang.org/src/cmd/compile/abi-internal#amd64-architecture +[abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing + + +## Leaving the Generated Code + + +In "[How do compiler functions work?][how-do-compiler-functions-work]", +we already outlined how _leaving_ the generated code works with the help of +a function. We will complete here the picture by briefly describing +the code that is generated. + +While there are [ongoing efforts to change the status quo][proposal-register-cc], +Go's traditional calling convention differs in a few ways from the standard +calling convention of the `amd64` and `arm64` architecture. [Traditionally][abi-asm], +Go has followed [Plan 9's calling convention][proposal-register-cc], in which +arguments and results are passed **using the stack**. + + +## Code + +- The trampoline to enter the generated function is implemented by the `backend.Machine.CompileEntryPreamble()` method. +- The trampoline to return traps and invoke host functions is generated by `backend.Machine.CompileGoFunctionTrampoline()` method. 
+ +You can find arch-specific implementations in `backend/isa//abi_go_call.go`. + +## Further References + +- Go's [internal ASM documentation][abi-asm] is a good starting point to + understand the calling convention of the Go runtime. +- Raphael Poss's [The Go low-level calling convention on x86-64][go-call-conv-x86] + is also an excellent reference for `amd64`. +- Go's [internal ABI documentation][abi-internal] complements Go's ASM documentation + with details on the internal, unstable ABI, known as *ABIInternal*. Notice that, + however, the relevant bits to interface with ASM code are in the documentation for + *ABI0* stable interface, i.e., the aforementioned [internal ASM documentation][abi-asm] + +[abi-asm]: https://go.dev/doc/asm +[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal +[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html +[proposal-register-cc]: https://go.googlesource.com/proposal/+/master/design/40724-register-calling.md#background +[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/ + diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 3420f2098c..ac21eaf98c 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -442,6 +442,10 @@ will look as follows: (low address) ``` +Note: the prologue might also introduce a check of the stack bounds. If +there is no sufficient space to allocate the stack frame, the function will +exit the execution and will try to grow it from the Go runtime. + The epilogue simply reverses the operation of the prologue. ### Other Post-RegAlloc Logic @@ -456,7 +460,7 @@ to express in a meaningful way for the register allocation procedure (for instan the `div` instruction implicitly use registers `rdx`, `rax`). 
Instead, they are lowered
with ad-hoc logic as part of the implementation of the `backend.Machine.PostRegAlloc` method.
 
-### Encoding:
+### Encoding
 
 The final stage of the backend encodes the machine instructions into bytes
@@ -467,22 +471,24 @@ function.
 
 ### Code
 
-- The prologue is set up as part of the `backend.Machine.PostRegAlloc` method.
+- The prologue and epilogue are set up as part of the `backend.Machine.PostRegAlloc` method.
+- The encoding is done by the `backend.Machine.Encode` method.
 
 ### Debug Flags
 
 - `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the function
   after the finalization phase.
-- `wazevoapi.PrintMachineCodeHexPerFunction` prints a hex representation of the function
-- `wazevoapi.printMachineCodeHexPerFunctionUnmodified`
-- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable`
-
-## Appendix: Trampolines
-
-
+- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex representation of the function's generated code as it is.
+- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex representation of the function's generated code that can be disassembled.
 
+The reason for the distinction between the last two flags is that the generated
+code in some cases might not be disassemblable. The `PrintMachineCodeHexPerFunctionDisassemblable`
+flag prints a hex encoding of the generated code that can be disassembled,
+but cannot be executed.
+* Next Section: [Appendix: Trampolines](../appendix/) * Previous Section: [Front-End](../frontend/) [ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf From ea9b8001f966a003b33c30bf6c0cef9b6f2e19fa Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 17:35:48 +0100 Subject: [PATCH 20/24] wip Signed-off-by: Edoardo Vacchi --- .../appendix.md | 85 ++++++++++++++----- .../backend.md | 2 +- 2 files changed, 63 insertions(+), 24 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md index 86871f1aa4..b2b7a07708 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/appendix.md +++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md @@ -15,27 +15,73 @@ describing how to jump into the execution of the generated code at run-time. ## Entering the Generated Code -Before the compilation of the function starts, a **preamble** is generated. +At run-time, user space invokes a Wasm function through the public +`api.Function` interface, using methods `Call()` or `CallWithStack()`. +The implementation of this method, in turn, eventually invokes +an ASM **trampoline**. The signature of this trampoline in Go code is: + +```go +func entrypoint( + preambleExecutable, functionExecutable *byte, + executionContextPtr uintptr, moduleContextPtr *byte, + paramResultStackPtr *uint64, + goAllocatedStackSlicePtr uintptr) +``` + +- `preambleExecutable` is a pointer to the generated code for the preamble (see below) +- `functionExecutable` is a pointer to the generated code for the function (as described in the previous sections). +- `executionContextPtr` is a raw pointer to the `wazevo.executionContext` struct. This struct + is used to save the state of the Go runtime before entering or leaving the generated + code. 
It also holds shared state between the Go runtime and the generated code,
+  such as the exit code that is used to terminate execution on failure, or to suspend
+  it to invoke host functions.
+- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
+  Its contents are basically pointers to the module instance and to specific objects,
+  as well as functions. This is sometimes called "VMContext" in other Wasm runtimes.
+- `paramResultStackPtr` is a pointer to the slice where the arguments and results
+  of the function are passed.
+- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack for
+  holding values and call frames. For further details refer to
+  [/internal/engine/compiler/engine.go][wazero-engine-stack].
+
+The ASM trampoline is guaranteed to follow the stable calling convention described in
+[Go's ASM documentation][abi-asm] (sometimes referred to as [ABI0][proposal-register-cc]).
+The trampoline can be found in `backend/isa/<arch>/abi_entry_<arch>.s`.
+
+For each architecture, the trampoline:
+- moves the arguments to some conventional registers that are documented
+  to be free at the time of the call, and
+- jumps into the execution of the generated code for the preamble.
+
+The **preamble** is generated distinctly from the rest of the function, and before it.
+
 This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`.
 The procedure first instantiates a `backend.FunctionABI` struct with metadata
 about the expected ABI for a function with a given signature, using the
 algorithm outlined in [Go's documentation][abi-cc].
 
+The preamble sets the fields in the `wazevo.executionContext`.
+At the beginning of the preamble:
+
+- We set a register to point to the `*wazevo.executionContext` struct.
+- we save the stack pointers, frame pointers, return addresses, etc. to that struct. +- we update the stack pointer to point to `paramResultStackPtr`. - // Next, save the current FP, SP and LR into the wazevo.executionContext: - // Next is to save the original RBP and RSP into the execution context. +The generated code works in concert with the assumption that the preamble +has been entered through the aforementioned trampoline. Thus, it assumes +that the arguments can be found in some specific registers. +The preamble then assigns the arguments pointed at by `paramResultStackPtr` +to the registers that the generated code expects. - // Then, move the Go-allocated stack pointer to SP: - // Now set the RSP to the Go-allocated stack pointer. +Finally, it invokes the generated code for the function. - // Allocate stack slots for the arguments and return values. +The epilogue reverses the process. +The arch-specific code can be found in `backend/isa//abi_entry_preamble.go`. +[wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132 [abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture [abi-amd64]: https://tip.golang.org/src/cmd/compile/abi-internal#amd64-architecture [abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing @@ -43,36 +89,29 @@ algorithm outlined in [Go's documentation][abi-cc]. ## Leaving the Generated Code - In "[How do compiler functions work?][how-do-compiler-functions-work]", we already outlined how _leaving_ the generated code works with the help of a function. We will complete here the picture by briefly describing the code that is generated. -While there are [ongoing efforts to change the status quo][proposal-register-cc], -Go's traditional calling convention differs in a few ways from the standard -calling convention of the `amd64` and `arm64` architecture. 
[Traditionally][abi-asm], -Go has followed [Plan 9's calling convention][proposal-register-cc], in which -arguments and results are passed **using the stack**. - - ## Code - The trampoline to enter the generated function is implemented by the `backend.Machine.CompileEntryPreamble()` method. - The trampoline to return traps and invoke host functions is generated by `backend.Machine.CompileGoFunctionTrampoline()` method. -You can find arch-specific implementations in `backend/isa//abi_go_call.go`. +You can find arch-specific implementations in `backend/isa//abi_go_call.go`, +`backend/isa//abi_entry_preamble.go`, etc. The trampolines are found under +`backend/isa//abi_entry_.s`. ## Further References -- Go's [internal ASM documentation][abi-asm] is a good starting point to - understand the calling convention of the Go runtime. -- Raphael Poss's [The Go low-level calling convention on x86-64][go-call-conv-x86] - is also an excellent reference for `amd64`. - Go's [internal ABI documentation][abi-internal] complements Go's ASM documentation with details on the internal, unstable ABI, known as *ABIInternal*. Notice that, - however, the relevant bits to interface with ASM code are in the documentation for - *ABI0* stable interface, i.e., the aforementioned [internal ASM documentation][abi-asm] + however, the calling convention for ASM is different and described in the ASM documentation. +- Go's [internal ASM documentation][abi-asm] describes the stable, stack-based + calling convention for ASM (_ABI0_). +- Raphael Poss's [The Go low-level calling convention on x86-64][go-call-conv-x86] + is also an excellent reference for `amd64`. 
[abi-asm]: https://go.dev/doc/asm [abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index ac21eaf98c..339b968167 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -446,7 +446,7 @@ Note: the prologue might also introduce a check of the stack bounds. If there is no sufficient space to allocate the stack frame, the function will exit the execution and will try to grow it from the Go runtime. -The epilogue simply reverses the operation of the prologue. +The epilogue simply reverses the operations of the prologue. ### Other Post-RegAlloc Logic From 122d8d50e62fd1dd6838412b89ae2d256d19a262 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 17:39:11 +0100 Subject: [PATCH 21/24] wip Signed-off-by: Edoardo Vacchi --- .../appendix.md | 114 ++++++++++-------- .../backend.md | 111 +++++++++-------- 2 files changed, 123 insertions(+), 102 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md index b2b7a07708..f7efcf55df 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/appendix.md +++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md @@ -7,7 +7,8 @@ Trampolines are used to interface between the Go runtime and the generated code, in two cases: - when we need to **enter the generated code** from the Go runtime. -- when we need to **leave the generated code** to invoke a host function (written in Go). +- when we need to **leave the generated code** to invoke a host function + (written in Go). 
In this section we want to complete the picture of how a Wasm function gets translated from Wasm to executable code in the optimizing compiler, by @@ -16,9 +17,9 @@ describing how to jump into the execution of the generated code at run-time. ## Entering the Generated Code At run-time, user space invokes a Wasm function through the public -`api.Function` interface, using methods `Call()` or `CallWithStack()`. -The implementation of this method, in turn, eventually invokes -an ASM **trampoline**. The signature of this trampoline in Go code is: +`api.Function` interface, using methods `Call()` or `CallWithStack()`. The +implementation of this method, in turn, eventually invokes an ASM +**trampoline**. The signature of this trampoline in Go code is: ```go func entrypoint( @@ -28,58 +29,65 @@ func entrypoint( goAllocatedStackSlicePtr uintptr) ``` -- `preambleExecutable` is a pointer to the generated code for the preamble (see below) -- `functionExecutable` is a pointer to the generated code for the function (as described in the previous sections). -- `executionContextPtr` is a raw pointer to the `wazevo.executionContext` struct. This struct - is used to save the state of the Go runtime before entering or leaving the generated - code. It also holds shared state between the Go runtime and the generated code, - such as the exit code that is used to terminate execution on failure, or suspend - it to invoke host functions. -- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct. This struct - Its contents are basically the pointers to the module instance, specific objects - as well as functions. This is sometimes called "VMContext" in other Wasm runtimes. -- `paramResultStackPtr` is a pointer to the slice where the arguments and results - of the function are passed. -- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack for - holding values and call frames. 
For further details refer to
-  [/internal/engine/compiler/engine.go][wazero-engine-stack]
-
-The ASM trampoline is guaranteed to follow the stable calling convention described in
-[Go's ASM documentation][abi-asm] (sometimes referred to as [ABI0][proposal-register-cc])
-The trampoline can be found in `backend/isa//abi_entry_.s`.
+- `preambleExecutable` is a pointer to the generated code for the preamble (see
+  below)
+- `functionExecutable` is a pointer to the generated code for the function (as
+  described in the previous sections).
+- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
+  struct. This struct is used to save the state of the Go runtime before
+entering or leaving the generated code. It also holds shared state between the
+Go runtime and the generated code, such as the exit code that is used to
+terminate execution on failure, or suspend it to invoke host functions.
+- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
+  Its contents are basically pointers to the module instance and to
+module-specific objects and functions. This is sometimes called "VMContext" in
+other Wasm runtimes.
+- `paramResultStackPtr` is a pointer to the slice where the arguments and
+  results of the function are passed.
+- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
+  for holding values and call frames. For further details refer to
+[/internal/engine/compiler/engine.go][wazero-engine-stack].
+
+The ASM trampoline is guaranteed to follow the stable calling convention
+described in [Go's ASM documentation][abi-asm] (sometimes referred to as
+[ABI0][proposal-register-cc]). The trampoline can be found in
+`backend/isa//abi_entry_.s`.
For each given architecture, the trampoline: -- moves the arguments to some conventional registers that are documented - to be free at the time of the call, +- moves the arguments to some conventional registers that are documented to be + free at the time of the call, - finally, it jumps into the execution of the generated code for the preamble -The **preamble** is generated distinctly from the rest of the function, and before it. +The **preamble** is generated distinctly from the rest of the function, and +before it. -This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`. -The procedure first instantiates a `backend.FunctionABI` struct with metadata -about the expected ABI for a function with a given signature, using the -algorithm outlined in [Go's documentation][abi-cc]. +This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`. The +procedure first instantiates a `backend.FunctionABI` struct with metadata about +the expected ABI for a function with a given signature, using the algorithm +outlined in [Go's documentation][abi-cc]. The preamble sets the fields in the `wazevo.executionContext`. At the beginning of the preamble: - We set a register to point to the `*wazevo.executionContext` struct. -- we save the stack pointers, frame pointers, return addresses, etc. to that struct. +- we save the stack pointers, frame pointers, return addresses, etc. to that + struct. - we update the stack pointer to point to `paramResultStackPtr`. -The generated code works in concert with the assumption that the preamble -has been entered through the aforementioned trampoline. Thus, it assumes -that the arguments can be found in some specific registers. +The generated code works in concert with the assumption that the preamble has +been entered through the aforementioned trampoline. Thus, it assumes that the +arguments can be found in some specific registers. 
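The bookkeeping steps listed above can be sketched in plain Go (the `executionCtx` struct and `preamble` function here are illustrative stand-ins, not the actual `wazevo` types):

```go
package main

import "fmt"

// executionCtx is a cut-down, hypothetical analogue of
// wazevo.executionContext: the preamble saves the Go-side stack state here
// so the epilogue (and trap exits) can restore it later.
type executionCtx struct {
	savedSP, savedFP, savedLR uint64
	exitCode                  uint32
}

// preamble mimics the steps above: save SP/FP/LR into the context, then
// switch the stack pointer to the param/result area.
func preamble(ctx *executionCtx, sp, fp, lr, paramResultStackPtr uint64) (newSP uint64) {
	ctx.savedSP, ctx.savedFP, ctx.savedLR = sp, fp, lr
	return paramResultStackPtr
}

func main() {
	var ctx executionCtx
	newSP := preamble(&ctx, 0x7000, 0x7010, 0x4000, 0x9000)
	fmt.Println(ctx.savedSP, newSP) // 28672 36864
}
```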
-The preamble then assigns the arguments pointed at by `paramResultStackPtr` -to the registers that the generated code expects. +The preamble then assigns the arguments pointed at by `paramResultStackPtr` to +the registers that the generated code expects. Finally, it invokes the generated code for the function. The epilogue reverses the process. -The arch-specific code can be found in `backend/isa//abi_entry_preamble.go`. +The arch-specific code can be found in +`backend/isa//abi_entry_preamble.go`. [wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132 [abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture @@ -89,29 +97,33 @@ The arch-specific code can be found in `backend/isa//abi_entry_preamble.go ## Leaving the Generated Code -In "[How do compiler functions work?][how-do-compiler-functions-work]", -we already outlined how _leaving_ the generated code works with the help of -a function. We will complete here the picture by briefly describing -the code that is generated. +In "[How do compiler functions work?][how-do-compiler-functions-work]", we +already outlined how _leaving_ the generated code works with the help of a +function. We will complete here the picture by briefly describing the code that +is generated. ## Code -- The trampoline to enter the generated function is implemented by the `backend.Machine.CompileEntryPreamble()` method. -- The trampoline to return traps and invoke host functions is generated by `backend.Machine.CompileGoFunctionTrampoline()` method. +- The trampoline to enter the generated function is implemented by the + `backend.Machine.CompileEntryPreamble()` method. +- The trampoline to return traps and invoke host functions is generated by + `backend.Machine.CompileGoFunctionTrampoline()` method. 
-You can find arch-specific implementations in `backend/isa//abi_go_call.go`, -`backend/isa//abi_entry_preamble.go`, etc. The trampolines are found under -`backend/isa//abi_entry_.s`. +You can find arch-specific implementations in +`backend/isa//abi_go_call.go`, +`backend/isa//abi_entry_preamble.go`, etc. The trampolines are found +under `backend/isa//abi_entry_.s`. ## Further References -- Go's [internal ABI documentation][abi-internal] complements Go's ASM documentation - with details on the internal, unstable ABI, known as *ABIInternal*. Notice that, - however, the calling convention for ASM is different and described in the ASM documentation. +- Go's [internal ABI documentation][abi-internal] complements Go's ASM + documentation with details on the internal, unstable ABI, known as +*ABIInternal*. Notice that, however, the calling convention for ASM is +different and described in the ASM documentation. - Go's [internal ASM documentation][abi-asm] describes the stable, stack-based calling convention for ASM (_ABI0_). -- Raphael Poss's [The Go low-level calling convention on x86-64][go-call-conv-x86] - is also an excellent reference for `amd64`. +- Raphael Poss's [The Go low-level calling convention on + x86-64][go-call-conv-x86] is also an excellent reference for `amd64`. [abi-asm]: https://go.dev/doc/asm [abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 339b968167..7f6b10d5fd 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -329,36 +329,41 @@ required methods. For instance,`regalloc.Function`is implemented by `backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`. 
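As a rough illustration of this pattern — a machine-independent pass written once against a type-parameterized instruction type — consider this toy sketch (all names here are invented; the real `regalloc.Function` interface is far richer):

```go
package main

import "fmt"

// Instr is the minimal view of a machine instruction that a
// machine-independent pass needs (hypothetical interface, loosely modeled
// on the pattern described above).
type Instr interface {
	Uses() []int // virtual registers read
	Defs() []int // virtual registers written
}

// arm64Instr is a stand-in for a concrete backend instruction type.
type arm64Instr struct{ uses, defs []int }

func (i *arm64Instr) Uses() []int { return i.uses }
func (i *arm64Instr) Defs() []int { return i.defs }

// countVirtualRegs is a toy "allocator pass" written once, generically,
// against any backend's instruction type.
func countVirtualRegs[I Instr](instrs []I) int {
	seen := map[int]bool{}
	for _, ins := range instrs {
		for _, r := range ins.Uses() {
			seen[r] = true
		}
		for _, r := range ins.Defs() {
			seen[r] = true
		}
	}
	return len(seen)
}

func main() {
	prog := []*arm64Instr{
		{uses: []int{1, 2}, defs: []int{3}},
		{uses: []int{3}, defs: []int{1}},
	}
	fmt.Println(countVirtualRegs(prog)) // 3
}
```

This mirrors how `backend.RegAllocFunction[*arm64.instruction, *arm64.machine]` lets one allocator implementation serve both backends.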
`backend/isa//abi.go` (where `` is either `arm64` or `amd64`) -contains the instantiation of the `regalloc.RegisterInfo` struct, -which declares, among others -- the set of registers that are available for allocation, excluding, for instance, those that might - be reserved by the runtime or the OS (`AllocatableRegisters`) -- the registers that might be saved by the callee to the stack (`CalleeSavedRegisters`) +contains the instantiation of the `regalloc.RegisterInfo` struct, which +declares, among others +- the set of registers that are available for allocation, excluding, for + instance, those that might be reserved by the runtime or the OS +(`AllocatableRegisters`) +- the registers that might be saved by the callee to the stack + (`CalleeSavedRegisters`) ### Debug Flags -- `wazevoapi.RegAllocLoggingEnabled` logs detailed logging of the register allocation procedure. -- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register allocation result. +- `wazevoapi.RegAllocLoggingEnabled` logs detailed logging of the register + allocation procedure. +- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register + allocation result. ## Finalization and Encoding -At the end of the register allocation phase, we have enough information to finally -generate machine code (_encoding_). We are only missing the prologue and -epilogue of the function. +At the end of the register allocation phase, we have enough information to +finally generate machine code (_encoding_). We are only missing the prologue +and epilogue of the function. ### Prologue and Epilogue -As usual, the **prologue** is executed before the main body of the function, and -the **epilogue** is executed at the end. The prologue is responsible for setting up -the stack frame, and the epilogue is responsible for cleaning up the stack -frame and returning control to the caller. 
+As usual, the **prologue** is executed before the main body of the function, +and the **epilogue** is executed at the end. The prologue is responsible for +setting up the stack frame, and the epilogue is responsible for cleaning up the +stack frame and returning control to the caller. Generally, this means, at the very least: - saving the return address -- a base pointer to the stack; or, equivalently, -the height of the stack at the beginning of the function +- a base pointer to the stack; or, equivalently, the height of the stack at the + beginning of the function -For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack pointer: +For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack +pointer: ```goat {width="100%" height="250"} (high address) (high address) @@ -401,22 +406,22 @@ restoring the state of registers that might be overwritten by the function ("clobbered"); and, if spilling occurs, prologue and epilogue are also responsible for reserving and releasing the space for the spilled values. -For clarity, we make a distinction between the space reserved for the -clobbered registers and the space reserved for the spilled values: +For clarity, we make a distinction between the space reserved for the clobbered +registers and the space reserved for the spilled values: -- Spill slots are used to temporarily store the values that needs spilling - as determined by the register allocator. This section must have a fix - height, but its contents will change over time, as registers are being - spilled and reloaded. -- Clobbered registers are, similarly, determined by the register allocator, - but they are stashed in the prologue and then restored in the epilogue. +- Spill slots are used to temporarily store the values that needs spilling as + determined by the register allocator. This section must have a fix height, +but its contents will change over time, as registers are being spilled and +reloaded. 
+- Clobbered registers are, similarly, determined by the register allocator, but + they are stashed in the prologue and then restored in the epilogue. -The procedure happens at the end of the register allocation phase because -at this point we have collected enough information to know how much space -we need to reserve. +The procedure happens at the end of the register allocation phase because at +this point we have collected enough information to know how much space we need +to reserve. -Regardless of the architecture, after allocating this space, the stack -will look as follows: +Regardless of the architecture, after allocating this space, the stack will +look as follows: ```goat {height="350"} (high address) @@ -442,49 +447,53 @@ will look as follows: (low address) ``` -Note: the prologue might also introduce a check of the stack bounds. If -there is no sufficient space to allocate the stack frame, the function will -exit the execution and will try to grow it from the Go runtime. +Note: the prologue might also introduce a check of the stack bounds. If there +is no sufficient space to allocate the stack frame, the function will exit the +execution and will try to grow it from the Go runtime. The epilogue simply reverses the operations of the prologue. ### Other Post-RegAlloc Logic The `backend.Machine.PostRegAlloc` method is invoked after the register -allocation procedure; while its main role is to define the prologue and epilogue -of the function, it also serves as a hook to perform other, arch-specific -duty, that has to happen after the register allocation phase. +allocation procedure; while its main role is to define the prologue and +epilogue of the function, it also serves as a hook to perform other, +arch-specific duty, that has to happen after the register allocation phase. 
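A back-of-the-envelope sketch of the frame-size computation this implies (slot sizes and alignment are illustrative; the real calculation is arch-specific):

```go
package main

import "fmt"

// frameSize sketches how the total stack-frame size could be computed once
// register allocation has determined the clobbered registers and spill
// slots, as described above.
func frameSize(clobberedRegs, spillSlots int) int {
	const slot = 8 // one 64-bit slot per saved register or spilled value
	size := clobberedRegs*slot + spillSlots*slot
	// Round up to 16 bytes, the usual stack alignment on arm64/amd64.
	return (size + 15) &^ 15
}

func main() {
	fmt.Println(frameSize(3, 2)) // 5 slots = 40 bytes, aligned up to 48
}
```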
-For instance, on `amd64`, the constraints for some instructions are hard
-to express in a meaningful way for the register allocation procedure (for instance,
-the `div` instruction implicitly use registers `rdx`, `rax`). Instead, they are lowered
-with ad-hoc logic as part of the implementation `backend.Machine.PostRegAlloc` method.
+For instance, on `amd64`, the constraints for some instructions are hard to
+express in a meaningful way for the register allocation procedure (for
+instance, the `div` instruction implicitly uses registers `rdx` and `rax`).
+Instead, they are lowered with ad-hoc logic as part of the implementation of
+the `backend.Machine.PostRegAlloc` method.

### Encoding

-The final stage of the backend encodes the machine instructions into bytes
-and writes them to the target buffer. Before proceeding with the encoding,
-relative addresses in branching instructions or addressing modes are resolved.
+The final stage of the backend encodes the machine instructions into bytes and
+writes them to the target buffer. Before proceeding with the encoding, relative
+addresses in branching instructions or addressing modes are resolved.

The procedure encodes the instructions in the order they appear in the
function.

### Code

-- The prologue and epilogue are set up as part of the `backend.Machine.PostRegAlloc` method.
+- The prologue and epilogue are set up as part of the
+  `backend.Machine.PostRegAlloc` method.
- The encoding is done by the `backend.Machine.Encode` method.

### Debug Flags

-- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the function
-  after the finalization phase.
-- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex representation of the function generated code as it is.
-- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex representation of the function generated code that can be disassembled.
+- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the + function after the finalization phase. +- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex + representation of the function generated code as it is. +- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex + representation of the function generated code that can be disassembled. The reason for the distinction between the last two flags is that the generated -code in some cases might not be disassemblable. `PrintMachineCodeHexPerFunctionDisassemblable` -flag prints a hex encoding of the generated code that can be disassembled, -but cannot be executed. +code in some cases might not be disassemblable. +`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of +the generated code that can be disassembled, but cannot be executed.
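The address-resolution step mentioned under "Encoding" can be illustrated with a toy two-pass scheme (the `toyInstr` type and its fields are invented for illustration; real branch encodings are arch-specific):

```go
package main

import "fmt"

// A toy instruction stream: each instruction has a fixed size, and a branch
// names a label whose byte offset is only known once every instruction has
// been laid out.
type toyInstr struct {
	size   int
	branch string // non-empty if this instruction branches to a label
	label  string // non-empty if a label is attached here
	rel    int    // filled in: displacement relative to the branch site
}

func resolve(prog []*toyInstr) {
	// Pass 1: compute the byte offset of every label.
	offsets := map[string]int{}
	off := 0
	for _, ins := range prog {
		if ins.label != "" {
			offsets[ins.label] = off
		}
		off += ins.size
	}
	// Pass 2: patch branches with displacements relative to the branch site.
	off = 0
	for _, ins := range prog {
		if ins.branch != "" {
			ins.rel = offsets[ins.branch] - off
		}
		off += ins.size
	}
}

func main() {
	prog := []*toyInstr{
		{size: 4, label: "loop"},
		{size: 4},
		{size: 4, branch: "loop"},
	}
	resolve(prog)
	fmt.Println(prog[2].rel) // branch at offset 8 back to offset 0: -8
}
```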
From 56019db018b117c6156430e77c450270a52d2363 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 17:52:24 +0100 Subject: [PATCH 22/24] wip Signed-off-by: Edoardo Vacchi --- .../appendix.md | 18 +++++++++++++++++- .../backend.md | 2 +- 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md index f7efcf55df..062314af66 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/appendix.md +++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md @@ -84,7 +84,8 @@ the registers that the generated code expects. Finally, it invokes the generated code for the function. -The epilogue reverses the process. +The epilogue reverses the process, finally returning control to the caller +of the `entrypoint()` function, and the Go runtime. The arch-specific code can be found in `backend/isa//abi_entry_preamble.go`. @@ -102,6 +103,21 @@ already outlined how _leaving_ the generated code works with the help of a function. We will complete here the picture by briefly describing the code that is generated. +When the generated code needs to return control to the Go runtime, +it inserts a meta-instruction that is called `exitSequence` in both `amd64` and `arm64` backends. +This meta-instruction sets the `exitCode` in the `wazevo.executionContext` struct, +restore the stack pointers and then returns control to the caller of the +`entrypoint()` function described above. + +As described in "[How do compiler functions work?][how-do-compiler-functions-work]", +the mechanism is essentially the same when invoking a host function or raising +an error. However, when a function is invoked the `exitCode` also indicates +the identifier of the host function to be invoked. + +// goCallStackView is a function to get a view of the stack before a Go call, which +// is the view of the stack allocated in CompileGoFunctionTrampoline. 
+ + ## Code - The trampoline to enter the generated function is implemented by the diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md index 7f6b10d5fd..0ea92f9d03 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/backend.md +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -497,8 +497,8 @@ the generated code that can be disassembled, but cannot be executed.
-* Next Section: [Appendix: Trampolines](../appendix/) * Previous Section: [Front-End](../frontend/) +* Next Section: [Appendix: Trampolines](../appendix/) [ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf [go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go From 51184156cccb504eb825f0d2cd507d9818fd0b66 Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Thu, 15 Feb 2024 18:55:50 +0100 Subject: [PATCH 23/24] wip Signed-off-by: Edoardo Vacchi --- .../appendix.md | 55 +++++++++++++++++-- 1 file changed, 51 insertions(+), 4 deletions(-) diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md index 062314af66..a8f313b8aa 100644 --- a/site/content/docs/how_the_optimizing_compiler_works/appendix.md +++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md @@ -84,8 +84,11 @@ the registers that the generated code expects. Finally, it invokes the generated code for the function. -The epilogue reverses the process, finally returning control to the caller -of the `entrypoint()` function, and the Go runtime. +The epilogue reverses part of the process, finally returning control to the caller +of the `entrypoint()` function, and the Go runtime. The caller of `entrypoint()` +is also responsible for completing the cleaning up procedure by invoking +`afterGoFunctionCallEntrypoint()` (again, implemented in backend-specific ASM). +which will restore the stack pointers and return control to the caller of the function. The arch-specific code can be found in `backend/isa//abi_entry_preamble.go`. @@ -114,9 +117,53 @@ the mechanism is essentially the same when invoking a host function or raising an error. However, when a function is invoked the `exitCode` also indicates the identifier of the host function to be invoked. 
-// goCallStackView is a function to get a view of the stack before a Go call, which -// is the view of the stack allocated in CompileGoFunctionTrampoline. +The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()` method. +This method is actually invoked when host modules are being instantiated. +It generates a trampoline that is used to invoke such functions from the generated code. +This trampoline implements essentially the same prologue as the `entrypoint()`, +but it also reserves space for the arguments and results of the function to be +invoked. + +A host function has the signature: + +```go +func(ctx context.Context, stack []uint64) +``` + +the function arguments in the `stack` parameter are copied over to the +reserved slots of the real stack. For instance, on `arm64` the stack layout +would look as follows (on `amd64` it would be similar): + +```goat + (high address) + SP ------> +-----------------+ <----+ + | ....... | | + | ret Y | | + | ....... | | + | ret 0 | | + | arg X | | size_of_arg_ret + | ....... | | + | arg 1 | | + | arg 0 | <----+ <-------- originalArg0Reg + | size_of_arg_ret | + | ReturnAddress | + +-----------------+ <----+ + | xxxx | | ;; might be padded to make it 16-byte aligned. + +--->| arg[N]/ret[M] | | + sliceSize| | ............ | | goCallStackSize + | | arg[1]/ret[1] | | + +--->| arg[0]/ret[0] | <----+ <-------- arg0ret0AddrReg + | sliceSize | + | frame_size | + +-----------------+ + (low address) +``` + +Finally, the trampoline jumps into the execution of the host function +using the `exitSequence` meta-instruction. + +Upon return, the process is reversed. 
## Code

- The trampoline to enter the generated function is implemented by the

From a71a6bfece4cf4c8117672c5dcb6ebb396325c7c Mon Sep 17 00:00:00 2001
From: Edoardo Vacchi
Date: Thu, 15 Feb 2024 18:57:07 +0100
Subject: [PATCH 24/24] wip

Signed-off-by: Edoardo Vacchi
---
 .../appendix.md | 51 ++++++++++---------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
index a8f313b8aa..bcd42a621f 100644
--- a/site/content/docs/how_the_optimizing_compiler_works/appendix.md
+++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
@@ -84,11 +84,12 @@ the registers that the generated code expects.

 Finally, it invokes the generated code for the function.

-The epilogue reverses part of the process, finally returning control to the caller
-of the `entrypoint()` function, and the Go runtime. The caller of `entrypoint()`
-is also responsible for completing the cleaning up procedure by invoking
-`afterGoFunctionCallEntrypoint()` (again, implemented in backend-specific ASM).
-which will restore the stack pointers and return control to the caller of the function.
+The epilogue reverses part of the process, finally returning control to the
+caller of the `entrypoint()` function, and the Go runtime. The caller of
+`entrypoint()` is also responsible for completing the cleanup procedure by
+invoking `afterGoFunctionCallEntrypoint()` (again, implemented in
+backend-specific ASM), which will restore the stack pointers and return
+control to the caller of the function.

 The arch-specific code can be found in
 `backend/isa//abi_entry_preamble.go`.
@@ -106,20 +107,22 @@ already outlined how _leaving_ the generated code works with the help of a
 function. We will complete here the picture by briefly describing the code that
 is generated.

-When the generated code needs to return control to the Go runtime,
-it inserts a meta-instruction that is called `exitSequence` in both `amd64` and `arm64` backends.
-This meta-instruction sets the `exitCode` in the `wazevo.executionContext` struct,
-restore the stack pointers and then returns control to the caller of the
-`entrypoint()` function described above.
+When the generated code needs to return control to the Go runtime, it inserts a
+meta-instruction that is called `exitSequence` in both `amd64` and `arm64`
+backends. This meta-instruction sets the `exitCode` in the
+`wazevo.executionContext` struct, restores the stack pointers and then returns
+control to the caller of the `entrypoint()` function described above.

-As described in "[How do compiler functions work?][how-do-compiler-functions-work]",
-the mechanism is essentially the same when invoking a host function or raising
-an error. However, when a function is invoked the `exitCode` also indicates
-the identifier of the host function to be invoked.
+As described in "[How do compiler functions
+work?][how-do-compiler-functions-work]", the mechanism is essentially the same
+when invoking a host function or raising an error. However, when a function is
+invoked, the `exitCode` also indicates the identifier of the host function to
+be invoked.

-The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()` method.
-This method is actually invoked when host modules are being instantiated.
-It generates a trampoline that is used to invoke such functions from the generated code.
+The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()`
+method. This method is actually invoked when host modules are being
+instantiated. It generates a trampoline that is used to invoke such functions
+from the generated code.

This trampoline implements essentially the same prologue as the `entrypoint()`,
but it also reserves space for the arguments and results of the function to be
invoked.
A host function has the signature:

```go
func(ctx context.Context, stack []uint64)
```

-the function arguments in the `stack` parameter are copied over to the
-reserved slots of the real stack. For instance, on `arm64` the stack layout
-would look as follows (on `amd64` it would be similar):
+The function arguments in the `stack` parameter are copied over to the reserved
+slots of the real stack. For instance, on `arm64` the stack layout would look
+as follows (on `amd64` it would be similar):

```goat
                    (high address)
    SP ------> +-----------------+ <----+
               |     .......     |      |
               |      ret Y      |      |
               |     .......     |      |
               |      ret 0      |      |
               |      arg X      |      | size_of_arg_ret
               |     .......     |      |
               |      arg 1      |      |
               |      arg 0      | <----+ <-------- originalArg0Reg
               | size_of_arg_ret |
               |  ReturnAddress  |
               +-----------------+ <----+
               |      xxxx       |      |  ;; might be padded to make it 16-byte aligned.
          +--->|  arg[N]/ret[M]  |      |
sliceSize |    |   ............  |      | goCallStackSize
          |    |  arg[1]/ret[1]  |      |
          +--->|  arg[0]/ret[0]  | <----+ <-------- arg0ret0AddrReg
               |    sliceSize    |
               |   frame_size    |
               +-----------------+
                    (low address)
```

-Finally, the trampoline jumps into the execution of the host function
-using the `exitSequence` meta-instruction.
+Finally, the trampoline jumps into the execution of the host function using the
+`exitSequence` meta-instruction.

Upon return, the process is reversed.