diff --git a/site/content/docs/how_the_optimizing_compiler_works/_index.md b/site/content/docs/how_the_optimizing_compiler_works/_index.md
new file mode 100644
index 0000000000..fdacf8150c
--- /dev/null
+++ b/site/content/docs/how_the_optimizing_compiler_works/_index.md
@@ -0,0 +1,135 @@
++++
+title = "How the Optimizing Compiler Works"
+layout = "single"
++++
+
+What is a JIT compiler?
+-----------------------
+
+In general, when we talk about a Just-In-Time (JIT) compiler, we mean a
+compilation technique that spares cycles at build-time, trading them for
+run-time. In other words, when a language is JIT-compiled, we usually mean that
+compilation happens during run-time. Furthermore, when we use the term
+JIT-compilation, we often also mean that, because compilation happens _during
+run-time_, we can use information that we have collected during execution to
+direct the compilation process: such JIT compilers are often referred to
+as **tracing JITs**.
+
+Thus, if we wanted to be pedantic, **wazero** provides an **ahead-of-time**,
+**load-time** compiler. That is, a compiler that, indeed, performs compilation
+at run-time, but only when a WebAssembly module is loaded; it currently does not
+collect or leverage any information during the execution of the Wasm binary
+itself.
+
+It is important to make such a distinction, because a Just-In-Time compiler may
+not be an optimizing compiler, and an optimizing compiler may not be a tracing
+JIT. In fact, the compiler that wazero shipped before the introduction of the
+new compiler architecture performed code generation at load-time, but did not
+perform any optimization.
+
+What is an Optimizing Compiler?
+-------------------------------
+
+Wazero supports an _optimizing_ compiler in the style of other optimizing
+compilers out there, such as LLVM's or V8's. Traditionally, an optimizing
+compiler performs compilation in a number of steps.
+
+Compare this to the **old compiler**, where compilation happens in one or two
+steps, depending on how you count:
+
+
+```goat
+ Input +---------------+ +---------------+
+ Wasm Binary ---->| DecodeModule |---->| CompileModule |----> wazero IR
+ +---------------+ +---------------+
+```
+
+That is, the module is (1) validated and then (2) translated to an Intermediate
+Representation (IR). The wazero IR can then be executed directly (in the case
+of the interpreter) or it can be further processed and translated into native
+code by the compiler. This compiler performs a straightforward translation from
+the IR to native code, without any further passes. The wazero IR is not intended
+for further processing beyond immediate execution or straightforward
+translation.
+
+```goat
+ +---- wazero IR ----+
+ | |
+ v v
+ +--------------+ +--------------+
+ | Compiler | | Interpreter |- - - executable
+ +--------------+ +--------------+
+ |
+ +----------+---------+
+ | |
+ v v
++---------+ +---------+
+| ARM64 | | AMD64 |
+| Backend | | Backend | - - - - - - - - - executable
++---------+ +---------+
+```
+
+
+Validation and translation to an IR are usually called the **front-end** part
+of a compiler, while code generation occurs in what we call the **back-end**.
+The front-end is the part of a compiler that is closer to the input, and it
+generally indicates machine-independent processing, such as parsing and static
+validation. The back-end is the part of a compiler that is closer to the
+output, and it generally includes machine-specific procedures, such as
+code generation.
+
+In the **optimizing** compiler, we still decode and translate Wasm binaries to
+an intermediate representation in the front-end, but we use a textbook
+representation called **SSA**, or "Static Single-Assignment" form, which is
+intended for further transformation.
+
+The benefit of choosing an IR that is meant for transformation is that a lot of
+optimization passes can apply directly to the IR, and thus be
+machine-independent. Then the back-end can be relatively simpler, in that it
+will only have to deal with machine-specific concerns.
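
The idea can be illustrated with a toy example (this is not wazero's actual IR or API, just a sketch of why an SSA-style IR makes machine-independent passes easy): because every value is defined exactly once, a pass such as constant folding only has to inspect the definitions of an instruction's arguments.

```go
package main

import "fmt"

// A toy SSA instruction: each value in the body is defined exactly once,
// and is referred to by its index in the body slice.
type instr struct {
	op   string // "const" or "iadd"
	args []int  // indices of the argument values
	k    int64  // constant payload, used when op == "const"
}

// foldConstants rewrites every "iadd" whose arguments are both constants
// into a single "const". Because the IR is machine-independent, the same
// pass works unchanged for any target architecture.
func foldConstants(body []instr) {
	for i, ins := range body {
		if ins.op != "iadd" {
			continue
		}
		a, b := body[ins.args[0]], body[ins.args[1]]
		if a.op == "const" && b.op == "const" {
			body[i] = instr{op: "const", k: a.k + b.k}
		}
	}
}

func main() {
	// v0 = const 2; v1 = const 3; v2 = iadd v0, v1
	body := []instr{
		{op: "const", k: 2},
		{op: "const", k: 3},
		{op: "iadd", args: []int{0, 1}},
	}
	foldConstants(body)
	fmt.Println(body[2].op, body[2].k) // const 5
}
```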
+
+The wazero optimizing compiler implements the following compilation passes:
+
+* Front-End:
+ - Translation to SSA
+ - Optimization
+
+* Back-End:
+ - Instruction Selection
+ - Register Allocation
+ - Finalization and Encoding
+
+```goat
+ Input +-------------------+ +-------------------+
+ Wasm Binary --->| DecodeModule |----->| CompileModule |--+
+ +-------------------+ +-------------------+ |
+ +----------------------------------------------------------+
+ |
+ | +---------------+ +---------------+
+ +->| Front-End |----------->| Back-End |
+ +---------------+ +---------------+
+ | |
+ v v
+ SSA Instruction Selection
+ | |
+ v v
+ Optimization Register Allocation
+ | |
+ v v
+ Block Layout Finalization/Encoding
+```
+
+Like the other engines, the implementation can be found under `internal/engine`,
+specifically in the `wazevo` sub-package. The entry-point is
+`internal/engine/wazevo/engine.go`, which contains the implementation of the
+`wasm.Engine` interface.
+
+All the passes can be dumped to the console for debugging by enabling the
+build-time flags under `internal/engine/wazevo/wazevoapi/debug_options.go`.
+The flags are disabled by default and should only be enabled during debugging.
+They may also change in the future.
+
+In the following, we will assume all paths to be relative to
+`internal/engine/wazevo`, so we will omit the prefix.
+
+
+
+* Next Section: [Front-End](frontend/)
diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
new file mode 100644
index 0000000000..bcd42a621f
--- /dev/null
+++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
@@ -0,0 +1,199 @@
++++
+title = "Appendix: Trampolines"
+layout = "single"
++++
+
+Trampolines are used to interface between the Go runtime and the generated
+code, in two cases:
+
+- when we need to **enter the generated code** from the Go runtime.
+- when we need to **leave the generated code** to invoke a host function
+ (written in Go).
+
+In this section we want to complete the picture of how a Wasm function gets
+translated to executable code in the optimizing compiler, by describing how we
+jump into the execution of the generated code at run-time.
+
+## Entering the Generated Code
+
+At run-time, user space invokes a Wasm function through the public
+`api.Function` interface, using methods `Call()` or `CallWithStack()`. The
+implementation of these methods, in turn, eventually invokes an ASM
+**trampoline**. The signature of this trampoline in Go code is:
+
+```go
+func entrypoint(
+ preambleExecutable, functionExecutable *byte,
+ executionContextPtr uintptr, moduleContextPtr *byte,
+ paramResultStackPtr *uint64,
+ goAllocatedStackSlicePtr uintptr)
+```
+
+- `preambleExecutable` is a pointer to the generated code for the preamble (see
+  below).
+- `functionExecutable` is a pointer to the generated code for the function (as
+ described in the previous sections).
+- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
+ struct. This struct is used to save the state of the Go runtime before
+entering or leaving the generated code. It also holds shared state between the
+Go runtime and the generated code, such as the exit code that is used to
+terminate execution on failure, or suspend it to invoke host functions.
+- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
+  Its contents are basically pointers to the module instance-specific objects
+as well as functions. This is sometimes called "VMContext" in other Wasm
+runtimes.
+- `paramResultStackPtr` is a pointer to the slice where the arguments and
+ results of the function are passed.
+- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
+ for holding values and call frames. For further details refer to
+[/internal/engine/compiler/engine.go][wazero-engine-stack].
+
+The ASM trampoline is guaranteed to follow the stable calling convention
+described in [Go's ASM documentation][abi-asm] (sometimes referred to as
+[ABI0][proposal-register-cc]). The trampoline can be found in
+`backend/isa/<arch>/abi_entry_<arch>.s`.
+
+For each given architecture, the trampoline:
+- moves the arguments to some conventional registers that are documented to be
+  free at the time of the call, and
+- jumps into the execution of the generated code for the preamble.
+
+The **preamble** is generated distinctly from the rest of the function, and
+before it.
+
+This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`. The
+procedure first instantiates a `backend.FunctionABI` struct with metadata about
+the expected ABI for a function with a given signature, using the algorithm
+outlined in [Go's documentation][abi-cc].
+
+The preamble sets the fields in the `wazevo.executionContext`.
+
+At the beginning of the preamble:
+
+- we set a register to point to the `*wazevo.executionContext` struct;
+- we save the stack pointers, frame pointers, return addresses, etc. to that
+  struct;
+- we update the stack pointer to point to `paramResultStackPtr`.
+
+The generated code works under the assumption that the preamble has been
+entered through the aforementioned trampoline. Thus, it assumes that the
+arguments can be found in some specific registers.
+
+The preamble then assigns the arguments pointed at by `paramResultStackPtr` to
+the registers that the generated code expects.
+
+Finally, it invokes the generated code for the function.
+
+The epilogue reverses part of the process, finally returning control to the
+caller of the `entrypoint()` function, and thus to the Go runtime. The caller
+of `entrypoint()` is also responsible for completing the clean-up procedure by
+invoking `afterGoFunctionCallEntrypoint()` (again, implemented in
+backend-specific ASM), which will restore the stack pointers and return
+control to the caller of the function.
+
+The arch-specific code can be found in
+`backend/isa/<arch>/abi_entry_preamble.go`.
+
+[wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132
+[abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing
+
+
+## Leaving the Generated Code
+
+In "[How do compiler functions work?][how-do-compiler-functions-work]", we
+already outlined how _leaving_ the generated code works with the help of a
+function. Here we complete the picture by briefly describing the code that is
+generated.
+
+When the generated code needs to return control to the Go runtime, the
+compiler inserts a meta-instruction called `exitSequence` in both the `amd64`
+and `arm64` backends. This meta-instruction sets the `exitCode` in the
+`wazevo.executionContext` struct, restores the stack pointers, and then returns
+control to the caller of the `entrypoint()` function described above.
+
+As described in "[How do compiler functions
+work?][how-do-compiler-functions-work]", the mechanism is essentially the same
+when invoking a host function or raising an error. However, when a host
+function is invoked, the `exitCode` also indicates the identifier of the host
+function to be invoked.
+
+The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()`
+method. This method is actually invoked when host modules are being
+instantiated. It generates a trampoline that is used to invoke such functions
+from the generated code.
+
+This trampoline implements essentially the same prologue as the `entrypoint()`,
+but it also reserves space for the arguments and results of the function to be
+invoked.
+
+A host function has the signature:
+
+```go
+func(ctx context.Context, stack []uint64)
+```
+
+The function arguments in the `stack` parameter are copied over to the reserved
+slots of the real stack. For instance, on `arm64` the stack layout would look
+as follows (on `amd64` it would be similar):
+
+```goat
+ (high address)
+ SP ------> +-----------------+ <----+
+ | ....... | |
+ | ret Y | |
+ | ....... | |
+ | ret 0 | |
+ | arg X | | size_of_arg_ret
+ | ....... | |
+ | arg 1 | |
+ | arg 0 | <----+ <-------- originalArg0Reg
+ | size_of_arg_ret |
+ | ReturnAddress |
+ +-----------------+ <----+
+ | xxxx | | ;; might be padded to make it 16-byte aligned.
+ +--->| arg[N]/ret[M] | |
+ sliceSize| | ............ | | goCallStackSize
+ | | arg[1]/ret[1] | |
+ +--->| arg[0]/ret[0] | <----+ <-------- arg0ret0AddrReg
+ | sliceSize |
+ | frame_size |
+ +-----------------+
+ (low address)
+```
+
+Finally, the trampoline jumps into the execution of the host function using the
+`exitSequence` meta-instruction.
+
+Upon return, the process is reversed.
+
+## Code
+
+- The trampoline to enter the generated function is implemented by the
+ `backend.Machine.CompileEntryPreamble()` method.
+- The trampoline to return traps and invoke host functions is generated by
+  the `backend.Machine.CompileGoFunctionTrampoline()` method.
+
+You can find arch-specific implementations in
+`backend/isa/<arch>/abi_go_call.go`,
+`backend/isa/<arch>/abi_entry_preamble.go`, etc. The trampolines are found
+under `backend/isa/<arch>/abi_entry_<arch>.s`.
+
+## Further References
+
+- Go's [internal ABI documentation][abi-internal] complements Go's ASM
+ documentation with details on the internal, unstable ABI, known as
+*ABIInternal*. Note, however, that the calling convention for ASM is
+different and is described in the ASM documentation.
+- Go's [internal ASM documentation][abi-asm] describes the stable, stack-based
+ calling convention for ASM (_ABI0_).
+- Raphael Poss's [The Go low-level calling convention on
+ x86-64][go-call-conv-x86] is also an excellent reference for `amd64`.
+
+[abi-asm]: https://go.dev/doc/asm
+[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal
+[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html
+[proposal-register-cc]: https://go.googlesource.com/proposal/+/master/design/40724-register-calling.md#background
+[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/
+
diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md
new file mode 100644
index 0000000000..0ea92f9d03
--- /dev/null
+++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md
@@ -0,0 +1,504 @@
++++
+title = "How the Optimizing Compiler Works: Back-End"
+layout = "single"
++++
+
+In this section we will discuss the phases in the back-end of the optimizing
+compiler:
+
+- [Instruction Selection](#instruction-selection)
+- [Register Allocation](#register-allocation)
+- [Finalization and Encoding](#finalization-and-encoding)
+
+Each section will include a brief explanation of the phase, references to the
+code that implements the phase, and a description of the debug flags that can
+be used to inspect that phase. Note that, since the implementation of the
+back-end is architecture-specific, the code differs for each
+architecture.
+
+### Code
+
+The higher-level entry-point to the back-end is the
+`backend.Compiler.Compile(context.Context)` method. This method executes, in
+turn, the following methods in the same type:
+
+- `backend.Compiler.Lower()` (instruction selection)
+- `backend.Compiler.RegAlloc()` (register allocation)
+- `backend.Compiler.Finalize(context.Context)` (finalization and encoding)
+
+## Instruction Selection
+
+The instruction selection phase is responsible for mapping the higher-level SSA
+instructions to arch-specific instructions. Each SSA instruction is translated
+to one or more machine instructions.
+
+Each target architecture comes with a different set of registers: some of
+them are general-purpose, others might be specific to certain instructions. In
+general, we can expect to have a set of registers for integer computations,
+another set for floating-point computations, a set for vector (SIMD)
+computations, and some special-purpose registers (e.g. stack pointers,
+program counters, status flags, etc.).
+
+In addition, some registers might be reserved by the Go runtime or the
+Operating System for specific purposes, so they should be handled with special
+care.
+
+At this point in the compilation process we do not want to deal with all that.
+Instead, we assume that we have a potentially infinite number of *virtual
+registers* of each type at our disposal. The next phase, the register
+allocation phase, will map these virtual registers to the actual registers of
+the target architecture.
+
+### Operands and Addressing Modes
+
+As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
+then use that virtual register as one of the arguments of the machine
+instruction that we will generate. However, instructions can usually address
+more than just registers: an *operand* might be able to represent a
+memory address, or an immediate value (i.e. a constant value that is encoded as
+part of the instruction itself).
+
+For these reasons, instead of mapping each `ssa.Value` to a virtual register
+(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
+`operand` type.
+
+During lowering of an `ssa.Instruction`, each `ssa.Value` that is used as an
+argument of the instruction is mapped to an `operand`: in the simplest case,
+the `operand` is a virtual register; in other cases, it is a memory address or
+an immediate value. Sometimes this makes it possible to replace several SSA
+instructions with a single machine instruction, by folding the addressing mode
+into the instruction itself.
+
+For instance, consider the following SSA instructions:
+
+```
+ v4:i32 = Const 0x9
+ v6:i32 = Load v5, 0x4
+ v7:i32 = Iadd v6, v4
+```
+
+In the `amd64` architecture, the `add` instruction (in AT&T syntax) adds the
+first operand to the second, and assigns the result to the second operand. So
+assuming that `v4`, `v5`, `v6`, and `v7` are mapped respectively to the
+virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the `Iadd`
+instruction on `amd64` might look like this:
+
+```asm
+ ;; AT&T syntax
+ add 4(%r5?), %r4? ;; add the value at memory address [`r5?` + 4] to `r4?`
+ mov %r4?, %r7? ;; move the result from `r4?` to `r7?`
+```
+
+Notice how the load from memory has been folded into an operand of the `add`
+instruction. This transformation is possible when the value produced by the
+instruction being folded is not referenced by other instructions and the
+instructions belong to the same `InstructionGroupID` (see [Front-End:
+Optimization](../frontend/#optimization)).
+
+### Example
+
+At the end of the instruction selection phase, the basic blocks of our `abs`
+function will look as follows (for `arm64`):
+
+```asm
+L1 (SSA Block: blk0):
+ mov x130?, x2
+ subs wzr, w130?, #0x0
+ b.ge L2
+L3 (SSA Block: blk1):
+ mov x136?, xzr
+ sub w134?, w136?, w130?
+ mov x135?, x134?
+ b L4
+L2 (SSA Block: blk2):
+ mov x135?, x130?
+L4 (SSA Block: blk3):
+ mov x0, x135?
+ ret
+```
+
+Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
+These are labels that are used to mark the beginning of each basic block, and
+they are the target for branching instructions such as `b` and `b.ge`.
+
+### Code
+
+`backend.Machine` is the interface to the backend. It has methods to
+translate (lower) the IR to machine code. Again, as seen earlier in the
+front-end, the term *lowering* is used to indicate translation from a
+higher-level representation to a lower-level representation.
+
+`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
+SSA instruction to machine code. Machine-specific implementations of this
+method can be found in package `backend/isa/<arch>`, where `<arch>` is either
+`amd64` or `arm64`.
+
+### Debug Flags
+
+`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
+lowered arch-specific instructions.
+
+## Register Allocation
+
+The register allocation phase is responsible for mapping the potentially
+infinite number of virtual registers to the real registers of the target
+architecture. Because the number of real registers is limited, the register
+allocation phase might need to "spill" some of the virtual registers to memory;
+that is, it might store their content, and then load them back into a register
+when they are needed.
+
+For a given function `f` the register allocation procedure
+`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:
+
+- `livenessAnalysis(f)` collects the "liveness" information for each virtual
+ register. The algorithm is described in [Chapter 9.2 of The SSA
+Book][ssa-book].
+
+- `alloc(f)` allocates registers for the given function. The algorithm is
+  derived from [the Go compiler's allocator][go-regalloc].
+
+At the end of the allocation procedure, we also record the set of registers
+that are **clobbered** by the body of the function. A register is clobbered
+if its value is overwritten by the function and the calling convention
+expects the callee to preserve it. This information is used in the
+finalization phase to determine which registers need to be saved in the
+prologue and restored in the epilogue. This is not strictly related to
+register allocation in the textbook sense, but it is a necessary step
+for the finalization phase.
+
+### Liveness Analysis
+
+Intuitively, a variable or name binding can be considered _live_ at a certain
+point in a program, if its value will be used in the future.
+
+For instance:
+
+```
+1| int f(int x) {
+2| int y = 2 + x;
+3| int z = x + y;
+4| return z;
+5| }
+```
+
+Variables `x` and `y` are both live at line 3, because they are used in the
+expression `x + y` on that line; variable `z` is live at line 4, because it is
+used in the return statement. However, variables `x` and `y` can be considered
+_not_ live at line 4, because they are not used anywhere after line 3.
+
+Statically, _liveness_ can be approximated by following paths backwards on the
+control-flow graph, connecting the uses of a given variable to its definitions
+(or its *unique* definition, assuming SSA form).
+
+In practice, while liveness is a property of each name binding at any point in
+the program, it is enough to keep track of liveness at the boundaries of basic
+blocks:
+
+- the _live-in_ set for a given basic block is the set of all bindings that are
+ live at the entry of that block.
+- the _live-out_ set for a given basic block is the set of all bindings that
+ are live at the exit of that block. A binding is live at the exit of a block
+if it is live at the entry of a successor.
+
+Because the CFG is a connected graph, it is enough to keep track of either
+live-in or live-out sets, and then propagate the liveness information backwards
+or forwards, respectively. In our case, we keep track of live-ins.
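
The backward propagation of live-in sets is a classic dataflow fixpoint. A simplified illustration (not wazero's implementation, which works on SSA values and machine-specific blocks):

```go
package main

import "fmt"

// A simplified basic block: the names it uses and defines, plus the
// indices of its successors in the control-flow graph. This sketch
// assumes uses happen before defs within a block.
type block struct {
	uses, defs []string
	succs      []int
}

// liveIn iterates the standard backward dataflow equations until a
// fixed point:
//
//	liveOut(b) = union of liveIn(s) over each successor s of b
//	liveIn(b)  = uses(b) ∪ (liveOut(b) \ defs(b))
func liveIn(blocks []block) []map[string]bool {
	in := make([]map[string]bool, len(blocks))
	for i := range in {
		in[i] = map[string]bool{}
	}
	for changed := true; changed; {
		changed = false
		for i := len(blocks) - 1; i >= 0; i-- {
			// liveOut = union of successors' live-in sets.
			out := map[string]bool{}
			for _, s := range blocks[i].succs {
				for v := range in[s] {
					out[v] = true
				}
			}
			for _, v := range blocks[i].defs {
				delete(out, v)
			}
			for _, v := range blocks[i].uses {
				out[v] = true
			}
			for v := range out {
				if !in[i][v] {
					in[i][v] = true
					changed = true
				}
			}
		}
	}
	return in
}

func main() {
	// blk0 uses x and defines y; blk1 uses both x and y.
	blocks := []block{
		{uses: []string{"x"}, defs: []string{"y"}, succs: []int{1}},
		{uses: []string{"x", "y"}},
	}
	in := liveIn(blocks)
	fmt.Println(in[0]["x"], in[0]["y"]) // true false
}
```

Note how `y` is not live-in at `blk0`, because `blk0` defines it before `blk1` uses it.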
+
+### Allocation
+
+We implemented a variant of the linear scan register allocation algorithm
+described in [the Go compiler's allocator][go-regalloc].
+
+Each basic block is allocated registers in a linear scan order, and the
+allocation state is propagated from a given basic block to its successors.
+Then, each block continues allocation from that initial state.
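
As a highly simplified illustration of the linear-scan idea (not the actual algorithm in `regalloc.go`, which works across blocks, propagates allocation state to successors, and chooses spill candidates more carefully), consider allocating `k` real registers to live intervals sorted by start point:

```go
package main

import (
	"fmt"
	"sort"
)

// An interval is the live range of a virtual register, [start, end).
type interval struct {
	vreg       string
	start, end int
}

// linearScan assigns each interval one of k real registers; when no
// register is free it simply spills the current interval (real
// allocators pick a better spill candidate).
func linearScan(intervals []interval, k int) (assign map[string]int, spills []string) {
	sort.Slice(intervals, func(i, j int) bool { return intervals[i].start < intervals[j].start })
	assign = map[string]int{}
	var active []interval // intervals currently holding a register
	var free []int
	for r := 0; r < k; r++ {
		free = append(free, r)
	}
	for _, cur := range intervals {
		// Expire intervals that ended before cur starts, freeing registers.
		live := active[:0]
		for _, a := range active {
			if a.end <= cur.start {
				free = append(free, assign[a.vreg])
			} else {
				live = append(live, a)
			}
		}
		active = live
		if len(free) == 0 {
			spills = append(spills, cur.vreg)
			continue
		}
		assign[cur.vreg] = free[len(free)-1]
		free = free[:len(free)-1]
		active = append(active, cur)
	}
	return assign, spills
}

func main() {
	// v0 is live across everything; with a single register, v1 and v2 spill.
	intervals := []interval{{"v0", 0, 10}, {"v1", 1, 3}, {"v2", 4, 6}}
	assign, spills := linearScan(intervals, 1)
	fmt.Println(assign["v0"], spills) // 0 [v1 v2]
}
```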
+
+#### Merge States
+
+Special care has to be taken when a block has multiple predecessors. We call
+this *fixing merge states*: for instance, consider the following:
+
+```goat { width="30%" }
+ .---. .---.
+| BB0 | | BB1 |
+ '-+-' '-+-'
+ +----+----+
+ |
+ v
+ .---.
+ | BB2 |
+ '---'
+```
+
+If the live-out set of a given block `BB0` is different from the live-out set
+of a given block `BB1` and both are predecessors of a block `BB2`, then we need
+to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice, we
+ensure that the registers that `BB2` expects to be live-in are live-out in
+`BB0` and `BB1`.
+
+#### Spilling
+
+If the register allocator cannot find a free register for a given virtual
+(live) register, it will "spill" the value, *i.e.*, stash it temporarily in
+memory. When that virtual register is recalled later, we will have to insert
+instructions to reload the value into a real register.
+
+As allocation proceeds, the procedure also records all the virtual registers
+that transition to the "spilled" state, and inserts the reload instructions
+when those registers are recalled later.
+
+The spill instructions are actually inserted at the end, after all the
+allocations and the merge states have been fixed. At this point, all the other
+potential sources of instability have been resolved, and we know where all the
+reloads happen.
+
+We insert the spills in the block that is the lowest common ancestor of all the
+blocks that reload the value.
+
+#### Clobbered Registers
+
+At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
+method iterates over the set of allocated registers and compares them against
+the architecture-specific `CalleeSavedRegisters` set. If a register has been
+allocated and is present in this set, the register is marked as "clobbered",
+i.e., we now know that the register allocator will overwrite that value. Thus,
+these registers will have to be saved in the prologue and restored in the
+epilogue.
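
The idea boils down to a set intersection. A sketch (the function and register names here are illustrative; the real method works on wazero's own register types):

```go
package main

import "fmt"

// clobbered returns the registers that were assigned by the register
// allocator and that the calling convention requires the callee to
// preserve: these must be saved in the prologue and restored in the
// epilogue.
func clobbered(allocated, calleeSaved []string) []string {
	saved := map[string]bool{}
	for _, r := range calleeSaved {
		saved[r] = true
	}
	var out []string
	for _, r := range allocated {
		if saved[r] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Hypothetical register names, for illustration only.
	allocated := []string{"x0", "x8", "x19", "x20"}
	calleeSaved := []string{"x19", "x20", "x21"}
	fmt.Println(clobbered(allocated, calleeSaved)) // [x19 x20]
}
```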
+
+#### References
+
+Register allocation is a complex problem, possibly the most complicated
+part of the backend. The following references were used to implement the
+algorithm:
+
+- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
+- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
+- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
+- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9. for liveness analysis.
+- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
+
+We suggest referring to them to dive deeper into the topic.
+
+### Example
+
+At the end of the register allocation phase, the basic blocks of our `abs`
+function look as follows (for `arm64`):
+
+```asm
+L1 (SSA Block: blk0):
+ mov x2, x2
+ subs wzr, w2, #0x0
+ b.ge L2
+L3 (SSA Block: blk1):
+ mov x8, xzr
+ sub w8, w8, w2
+ mov x8, x8
+ b L4
+L2 (SSA Block: blk2):
+ mov x2, x2
+ mov x8, x2
+L4 (SSA Block: blk3):
+ mov x0, x8
+ ret
+```
+
+Notice how the virtual registers have all been replaced by real registers,
+i.e., no register identifier is suffixed with `?`. This example is quite
+simple, and it does not require any spills.
+
+### Code
+
+The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
+interfaces in `regalloc/api.go`.
+
+Essentially:
+
+- each architecture exposes iteration over basic blocks of a function
+ (`regalloc.Function` interface)
+- each arch-specific basic block exposes iteration over instructions
+ (`regalloc.Block` interface)
+- each arch-specific instruction exposes the set of registers it defines and
+ uses (`regalloc.Instr` interface)
+
+By defining these interfaces, the register allocation algorithm can assign real
+registers to virtual registers without dealing specifically with the target
+architecture.
+
+In practice, each interface is usually implemented by instantiating a common
+generic struct that comes already with an implementation of all or most of the
+required methods. For instance, `regalloc.Function` is implemented by
+`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.
+
+`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
+contains the instantiation of the `regalloc.RegisterInfo` struct, which
+declares, among other things:
+- the set of registers that are available for allocation, excluding, for
+ instance, those that might be reserved by the runtime or the OS
+(`AllocatableRegisters`)
+- the registers that might be saved by the callee to the stack
+ (`CalleeSavedRegisters`)
+
+### Debug Flags
+
+- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register
+  allocation procedure.
+- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register
+ allocation result.
+
+## Finalization and Encoding
+
+At the end of the register allocation phase, we have enough information to
+finally generate machine code (_encoding_). We are only missing the prologue
+and epilogue of the function.
+
+### Prologue and Epilogue
+
+As usual, the **prologue** is executed before the main body of the function,
+and the **epilogue** is executed at the end. The prologue is responsible for
+setting up the stack frame, and the epilogue is responsible for cleaning up the
+stack frame and returning control to the caller.
+
+Generally, this means, at the very least:
+- saving the return address
+- saving a base pointer to the stack or, equivalently, the height of the
+  stack at the beginning of the function
+
+For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack
+pointer:
+
+```goat {width="100%" height="250"}
+ (high address) (high address)
+ RBP ----> +-----------------+ +-----------------+
+ | `...` | | `...` |
+ | ret Y | | ret Y |
+ | `...` | | `...` |
+ | ret 0 | | ret 0 |
+ | arg X | | arg X |
+ | `...` | ====> | `...` |
+ | arg 1 | | arg 1 |
+ | arg 0 | | arg 0 |
+ | Return Addr | | Return Addr |
+ RSP ----> +-----------------+ | Caller_RBP |
+ (low address) +-----------------+ <----- RSP, RBP
+```
+
+While, on `arm64`, there is only a stack pointer `SP`:
+
+
+```goat {width="100%" height="300"}
+ (high address) (high address)
+ SP ---> +-----------------+ +------------------+ <----+
+ | `...` | | `...` | |
+ | ret Y | | ret Y | |
+ | `...` | | `...` | |
+ | ret 0 | | ret 0 | |
+ | arg X | | arg X | | size_of_arg_ret.
+ | `...` | ====> | `...` | |
+ | arg 1 | | arg 1 | |
+ | arg 0 | | arg 0 | <----+
+ +-----------------+ | size_of_arg_ret |
+ | return address |
+ +------------------+ <---- SP
+ (low address) (low address)
+```
+
+However, the prologue and epilogue might also be responsible for saving and
+restoring the state of registers that might be overwritten by the function
+("clobbered"); and, if spilling occurs, prologue and epilogue are also
+responsible for reserving and releasing the space for the spilled values.
+
+For clarity, we make a distinction between the space reserved for the clobbered
+registers and the space reserved for the spilled values:
+
+- Spill slots are used to temporarily store the values that need spilling as
+  determined by the register allocator. This section must have a fixed height,
+but its contents will change over time, as registers are spilled and
+reloaded.
+- Clobbered registers are, similarly, determined by the register allocator, but
+ they are stashed in the prologue and then restored in the epilogue.
+
+The procedure happens at the end of the register allocation phase because at
+this point we have collected enough information to know how much space we need
+to reserve.
+
+Regardless of the architecture, after allocating this space, the stack will
+look as follows:
+
+```goat {height="350"}
+ (high address)
+ +-----------------+
+ | `...` |
+ | ret Y |
+ | `...` |
+ | ret 0 |
+ | arg X |
+ | `...` |
+ | arg 1 |
+ | arg 0 |
+ | (arch-specific) |
+ +-----------------+
+ | clobbered M |
+ | ............ |
+ | clobbered 1 |
+ | clobbered 0 |
+ | spill slot N |
+ | ............ |
+ | spill slot 0 |
+ +-----------------+
+ (low address)
+```
+
+Note: the prologue might also introduce a check of the stack bounds. If there
+is not enough space to allocate the stack frame, execution exits the generated
+code and the Go runtime grows the stack before retrying.
+
+The epilogue simply reverses the operations of the prologue.
+
+### Other Post-RegAlloc Logic
+
+The `backend.Machine.PostRegAlloc` method is invoked after the register
+allocation procedure; while its main role is to define the prologue and
+epilogue of the function, it also serves as a hook to perform other,
+arch-specific duties that have to happen after the register allocation phase.
+
+For instance, on `amd64`, the constraints for some instructions are hard to
+express in a meaningful way for the register allocation procedure (for
+instance, the `div` instruction implicitly uses the registers `rdx` and
+`rax`). Instead, such instructions are lowered with ad-hoc logic as part of
+the implementation of the `backend.Machine.PostRegAlloc` method.
+
+### Encoding
+
+The final stage of the backend encodes the machine instructions into bytes and
+writes them to the target buffer. Before proceeding with the encoding, relative
+addresses in branching instructions or addressing modes are resolved.
+
+The procedure encodes the instructions in the order they appear in the
+function.
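+
+Conceptually, the encoding stage behaves like the following two-pass sketch:
+a first pass assigns a byte offset to every instruction, and a second pass
+resolves branch targets to relative displacements. The types and sizes below
+are illustrative assumptions, not the real encoder's representation.
+
+```go
+package main
+
+import "fmt"
+
+// instr is a toy instruction: either a plain op or a branch to a label.
+type instr struct {
+	size   int    // encoded size in bytes
+	target string // non-empty for branches
+	label  string // non-empty if a label is attached here
+}
+
+// encode lays out instructions in order, first assigning byte offsets,
+// then resolving branch targets to relative displacements.
+func encode(prog []instr) []int {
+	offsets := make([]int, len(prog))
+	labels := map[string]int{}
+	off := 0
+	for i, ins := range prog {
+		offsets[i] = off
+		if ins.label != "" {
+			labels[ins.label] = off
+		}
+		off += ins.size
+	}
+	var rels []int
+	for i, ins := range prog {
+		if ins.target != "" {
+			// displacement relative to the end of the branch instruction
+			rels = append(rels, labels[ins.target]-(offsets[i]+ins.size))
+		}
+	}
+	return rels
+}
+
+func main() {
+	prog := []instr{
+		{size: 4, target: "L1"}, // branch forward to L1
+		{size: 4},
+		{size: 4, label: "L1"},
+	}
+	fmt.Println(encode(prog)) // [4]
+}
+```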
+
+### Code
+
+- The prologue and epilogue are set up as part of the
+ `backend.Machine.PostRegAlloc` method.
+- The encoding is done by the `backend.Machine.Encode` method.
+
+### Debug Flags
+
+- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the
+ function after the finalization phase.
+- `wazevoapi.PrintMachineCodeHexPerFunctionUnmodified` prints a hex
+  representation of the function's generated code as-is.
+- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
+ representation of the function generated code that can be disassembled.
+
+The reason for the distinction between the last two flags is that, in some
+cases, the generated code might not be disassemblable. The
+`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
+the generated code that can be disassembled, but cannot be executed.
+
+
+
+* Previous Section: [Front-End](../frontend/)
+* Next Section: [Appendix: Trampolines](../appendix/)
+
+[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
+[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
diff --git a/site/content/docs/how_the_optimizing_compiler_works/frontend.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md
new file mode 100644
index 0000000000..8bebb47fcd
--- /dev/null
+++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md
@@ -0,0 +1,367 @@
++++
+title = "How the Optimizing Compiler Works: Front-End"
+layout = "single"
++++
+
+In this section we will discuss the phases in the front-end of the optimizing compiler:
+
+- [Translation to SSA](#translation-to-ssa)
+- [Optimization](#optimization)
+- [Block Layout](#block-layout)
+
+Every section includes an explanation of the phase; the subsection **Code**
+includes high-level pointers to functions and packages; the subsection
+**Debug Flags** indicates the flags that can be used to enable advanced
+logging of the phase.
+
+## Translation to SSA
+
+We mentioned earlier that wazero uses an internal representation called an "SSA"
+form or "Static Single-Assignment" form, but we never explained what that is.
+
+In short, every program, or, in our case, every Wasm function, can be
+translated into a control-flow graph. The control-flow graph is a directed
+graph where each node is a sequence of statements that do not contain any
+control-flow instruction, called a **basic block**. Control-flow instructions
+are instead translated into edges.
+
+For instance, take the following implementation of the `abs` function:
+
+```wasm
+(module
+ (func (;0;) (param i32) (result i32)
+ (if (result i32) (i32.lt_s (local.get 0) (i32.const 0))
+ (then
+ (i32.sub (i32.const 0) (local.get 0)))
+ (else
+ (local.get 0))
+ )
+ )
+ (export "f" (func 0))
+)
+```
+
+This is translated to the following block diagram:
+
+```goat {width="100%" height="500"}
+ +---------------------------------------------+
+ |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) |
+ | v3:i32 = Iconst_32 0x0 |
+ | v4:i32 = Icmp lt_s, v2, v3 |
+ | Brz v4, blk2 |
+ | Jump blk1 |
+ +---------------------------------------------+
+ |
+ |
+ +---`(v4 != 0)`-+-`(v4 == 0)`---+
+ | |
+ v v
+ +---------------------------+ +---------------------------+
+ |blk1: () <-- (blk0) | |blk2: () <-- (blk0) |
+ | v6:i32 = Iconst_32 0x0 | | Jump blk3, v2 |
+ | v7:i32 = Isub v6, v2 | | |
+ | Jump blk3, v7 | | |
+ +---------------------------+ +---------------------------+
+ | |
+ | |
+ +-`{v5 := v7}`--+--`{v5 := v2}`-+
+ |
+ v
+ +------------------------------+
+ |blk3: (v5:i32) <-- (blk1,blk2)|
+ | Jump blk_ret, v5 |
+ +------------------------------+
+ |
+ {return v5}
+ |
+ v
+```
+
+We use the ["block argument" variant of SSA][ssa-blocks], which is also the same
+representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block
+takes a list of arguments. Each block ends with a branching instruction (Branch, Return,
+Jump, etc.) with an optional list of arguments; these arguments are assigned
+to the target block's arguments like a function.
+
+Consider the first block `blk0`.
+
+```
+blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
+ v3:i32 = Iconst_32 0x0
+ v4:i32 = Icmp lt_s, v2, v3
+ Brz v4, blk2
+ Jump blk1
+```
+
+You will notice that, compared to the original function, it takes two extra
+parameters (`exec_ctx` and `module_ctx`):
+
+1. `exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit the execution
+ in the face of traps or for host function calls.
+2. `module_ctx` is a pointer to `wazevo.moduleContextOpaque`. This is used, among other things,
+ to access memory. It is also used during host function calls.
+
+It then takes one parameter `v2`, corresponding to the function parameter, and
+it defines two variables `v3`, `v4`. `v3` is the constant 0, `v4` is the result of
+comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches to
+`blk2` if `v4` is zero, otherwise it jumps to `blk1`.
+
+You might also have noticed that the instructions do not correspond strictly to
+the original Wasm opcodes. This is because, similarly to the wazero IR used by
+the old compiler, this is a custom IR.
+
+You will also notice that, _on the left-hand side of the assignments_ of any
+statement, no name occurs _twice_: this is why this form is called
+**single-assignment**.
+
+Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.
+
+```
+blk1: ()
+ ...
+ Jump blk3, v7
+
+blk2: ()
+ Jump blk3, v2
+
+blk3: (v5:i32)
+ ...
+```
+
+`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2`
+jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or `v2`,
+depending on the originating block. If you are familiar with the traditional
+representation of an SSA form, you will recognize that the role of block
+arguments is equivalent to the role of the *Phi (Φ) function*, a special
+function that returns a different value depending on the incoming edge; e.g., in
+this case: `v5 := Φ(v7, v2)`.
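+
+In Go terms, the block argument `v5` plays the role of a local variable that
+receives a different definition on each incoming path. This is only a loose
+analogy to make the Φ intuition concrete, not the compiler's representation:
+
+```go
+package main
+
+import "fmt"
+
+// absPhi mirrors the control-flow graph of the `abs` example: v5 stands
+// for blk3's block argument, defined once per incoming edge.
+func absPhi(v2 int32) int32 {
+	var v5 int32
+	if v2 < 0 {
+		v7 := 0 - v2
+		v5 = v7 // edge blk1 -> blk3 carries v7
+	} else {
+		v5 = v2 // edge blk2 -> blk3 carries v2
+	}
+	return v5 // blk3: Jump blk_ret, v5
+}
+
+func main() {
+	fmt.Println(absPhi(-3), absPhi(9)) // 3 9
+}
+```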
+
+### Code
+
+The relevant APIs can be found under the sub-packages `ssa` and `frontend`.
+In the code, the terms *lower* or *lowering* are often used to indicate a mapping or a translation,
+because such transformations usually correspond to targeting a lower abstraction level.
+
+- Basic Blocks are represented by the type `ssa.Block` (`ssa/basic_block.go`).
+- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is instantiated
+ in the context of `wasm.Engine.CompileModule()`, more specifically in the method
+ `frontend.Compiler.LowerToSSA()`.
+- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`,
+ more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`.
+- Because they are semantically equivalent, in the code, basic block parameters
+ are sometimes referred to as "Phi values".
+
+#### Instructions and Values
+
+An `ssa.Instruction` is a single instruction in the SSA form. Each instruction might
+consume zero or more `ssa.Value`s, and it usually produces a single `ssa.Value`; some
+instructions may not produce any value (for instance, a `Jump` instruction).
+An `ssa.Value` is an abstraction that represents a typed name binding, and it is used
+to represent the result of an instruction, or the input to an instruction.
+
+For instance:
+
+```
+blk1: () <-- (blk0)
+ v6:i32 = Iconst_32 0x0
+ v7:i32 = Isub v6, v2
+ Jump blk3, v7
+```
+
+`Iconst_32` takes no input value and produces value `v6`; `Isub` takes two input values (`v6`, `v2`)
+and produces value `v7`; `Jump` takes one input value (`v7`) and produces no value. All
+such values have the `i32` type. The wazero SSA's type system (`ssa.Type`) allows the following types:
+
+- `i32`: 32-bit integer
+- `i64`: 64-bit integer
+- `f32`: 32-bit floating point
+- `f64`: 64-bit floating point
+- `v128`: 128-bit SIMD vector
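+
+Schematically, instructions and values relate as in the following toy model.
+The type names and fields here are invented for illustration and do not match
+the real `ssa` package:
+
+```go
+package main
+
+import "fmt"
+
+type valueID int
+
+// instruction consumes zero or more values and defines at most one:
+// returns is nil for instructions, like Jump, that produce nothing.
+type instruction struct {
+	op      string
+	args    []valueID
+	returns *valueID
+}
+
+func main() {
+	v6, v7 := valueID(6), valueID(7)
+	// The body of blk1 from the example above.
+	blk1 := []instruction{
+		{op: "Iconst_32", returns: &v6},
+		{op: "Isub", args: []valueID{6, 2}, returns: &v7},
+		{op: "Jump", args: []valueID{7}},
+	}
+	for _, in := range blk1 {
+		fmt.Println(in.op, len(in.args), in.returns != nil)
+	}
+}
+```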
+
+Values and instructions are both allocated from pools to minimize memory allocations.
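+
+A minimal sketch of the pooling idea is shown below; wazero's actual pools
+live in the compiler internals and differ in detail, so treat this purely as
+an illustration of the technique:
+
+```go
+package main
+
+import "fmt"
+
+// pool hands out items from a pre-grown slice and can be reset wholesale,
+// so per-compilation allocations are amortized across modules.
+type pool[T any] struct {
+	items []T
+	next  int
+}
+
+func (p *pool[T]) alloc() *T {
+	if p.next == len(p.items) {
+		p.items = append(p.items, *new(T))
+	}
+	p.next++
+	return &p.items[p.next-1]
+}
+
+// reset makes every item reusable without freeing memory.
+func (p *pool[T]) reset() { p.next = 0 }
+
+type value struct{ id int }
+
+func main() {
+	var p pool[value]
+	a := p.alloc()
+	a.id = 1
+	p.reset()
+	b := p.alloc() // reuses the same backing storage; contents are stale
+	fmt.Println(b.id, len(p.items))
+}
+```
+
+Note that, in this sketch, reused items are not zeroed on `reset`; a real
+pool has to re-initialize state explicitly.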
+
+### Debug Flags
+
+- `wazevoapi.PrintSSA` dumps the SSA form to the console.
+- `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between Wasm
+ opcodes and SSA instructions to the console.
+
+## Optimization
+
+The SSA form makes it easier to perform a number of optimizations. For instance,
+we can perform constant propagation, dead code elimination, and common
+subexpression elimination. These optimizations either act upon the instructions
+within a basic block, or they act upon the control-flow graph as a whole.
+
+At a high level, consider the following basic block, derived from the previous
+example:
+
+```
+blk0: (exec_ctx:i64, module_ctx:i64)
+ v2:i32 = Iconst_32 -5
+ v3:i32 = Iconst_32 0
+ v4:i32 = Icmp lt_s, v2, v3
+ Brz v4, blk2
+ Jump blk1
+```
+
+It is pretty easy to see that the comparison in `v4` can be replaced by a
+constant `1`, because the comparison is between two constant values (-5, 0).
+Therefore, the block can be rewritten as follows:
+
+```
+blk0: (exec_ctx:i64, module_ctx:i64)
+ v4:i32 = Iconst_32 1
+ Brz v4, blk2
+ Jump blk1
+```
+
+However, we can now also see that the branch to `blk2` is never taken, because
+`v4` is never zero; thus the block `blk2` is never executed, and even the
+branch instruction and the constant definition `v4` can be removed:
+
+```
+blk0: (exec_ctx:i64, module_ctx:i64)
+ Jump blk1
+```
+
+This is a simple example of constant propagation and dead code elimination
+occurring within a basic block. However, now `blk2` is unreachable, because
+no other edge in the graph points to it; thus, it can be removed
+from the control-flow graph. This is an example of dead-code elimination that
+occurs at the control-flow graph level.
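+
+The constant-propagation step shown above can be sketched as a rewrite over
+the instructions of a block. The instruction representation here is
+hypothetical; wazero's real passes operate on `ssa.Instruction` values:
+
+```go
+package main
+
+import "fmt"
+
+type inst struct {
+	op   string // "iconst", "icmp_lt_s", ...
+	args []int  // value ids of the operands
+	k    int32  // constant payload for "iconst"
+}
+
+// foldIcmp replaces a signed less-than comparison with a constant when
+// both of its operands are already known constants, as in the blk0 example.
+func foldIcmp(insts map[int]*inst, id int) {
+	in := insts[id]
+	if in.op != "icmp_lt_s" {
+		return
+	}
+	a, b := insts[in.args[0]], insts[in.args[1]]
+	if a.op == "iconst" && b.op == "iconst" {
+		var r int32
+		if a.k < b.k {
+			r = 1
+		}
+		*in = inst{op: "iconst", k: r}
+	}
+}
+
+func main() {
+	// v2 = -5; v3 = 0; v4 = icmp lt_s v2, v3
+	insts := map[int]*inst{
+		2: {op: "iconst", k: -5},
+		3: {op: "iconst", k: 0},
+		4: {op: "icmp_lt_s", args: []int{2, 3}},
+	}
+	foldIcmp(insts, 4)
+	fmt.Println(insts[4].op, insts[4].k) // iconst 1
+}
+```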
+
+In practice, because WebAssembly is a compilation target, these simple
+optimizations are often unnecessary. The optimization passes implemented in
+wazero are also a work in progress and, at the time of writing, further work is
+expected to implement more advanced optimizations.
+
+### Code
+
+Optimization passes are implemented by `ssa.Builder.RunPasses()`. An
+optimization pass is just a function that takes an SSA builder as a parameter.
+
+Passes iterate over the basic blocks, and, for each basic block, they iterate
+over the instructions. Each pass may mutate the basic block by modifying the instructions
+it contains, or it might change the entire shape of the control-flow graph (e.g. by removing
+blocks).
+
+Currently, there are two dead-code elimination passes:
+
+- `passDeadBlockEliminationOpt` acting at the block-level.
+- `passDeadCodeEliminationOpt` acting at instruction-level.
+
+Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID` to each
+instruction. This is used to determine whether a sequence of instructions can be
+replaced by a single machine instruction during the back-end phase. For more details,
+see also the relevant documentation in `ssa/instructions.go`.
+
+There are also simple constant-folding passes such as `passNopInstElimination`,
+which folds and deletes instructions that are essentially no-ops (e.g. shifting
+by a zero amount).
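+
+The idea behind such a pass can be sketched as follows (hypothetical helper,
+not the actual `passNopInstElimination` code): if the shift amount is a
+constant zero, the shift's result is simply its input value.
+
+```go
+package main
+
+import "fmt"
+
+// simplifyShift returns the value id that should replace the shift's
+// result: when the amount is the constant zero, the shift is a no-op and
+// its input can be used directly.
+func simplifyShift(inputID, amount int, amountIsConst bool) (int, bool) {
+	if amountIsConst && amount == 0 {
+		return inputID, true // x << 0 == x
+	}
+	return -1, false
+}
+
+func main() {
+	if r, ok := simplifyShift(7, 0, true); ok {
+		fmt.Println("replace shift result with value", r)
+	}
+}
+```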
+
+### Debug Flags
+
+`wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after optimization.
+
+
+## Block Layout
+
+As we have seen earlier, the SSA form instructions are contained within basic
+blocks, and the basic blocks are connected by edges of the control-flow graph.
+However, machine code is not laid out as a graph: it is a linear sequence of
+instructions.
+
+Thus, the last step of the front-end is to lay out the basic blocks in a linear
+sequence. Because each basic block, by design, ends with a control-flow
+instruction, one of the goals of the block layout phase is to maximize the number of
+**fall-through opportunities**. A fall-through opportunity occurs when a block ends
+with a jump instruction whose target is exactly the next block in the
+sequence. In order to maximize the number of fall-through opportunities, the
+block layout phase might reorder the basic blocks in the control-flow graph,
+and transform the control-flow instructions. For instance, it might _invert_
+some branching conditions.
+
+The end goal is to effectively minimize the number of jumps and branches in
+the machine code that will be generated later.
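+
+As an illustration, one can count the fall-through opportunities of a given
+layout with a toy model like the following; the real pass works on the SSA
+blocks directly and also rewrites branch conditions:
+
+```go
+package main
+
+import "fmt"
+
+// fallThroughs counts how many blocks end with an unconditional jump to
+// exactly the next block in the chosen order; those jumps can be elided.
+func fallThroughs(order []int, jumpTarget map[int]int) int {
+	n := 0
+	for i := 0; i+1 < len(order); i++ {
+		if t, ok := jumpTarget[order[i]]; ok && t == order[i+1] {
+			n++
+		}
+	}
+	return n
+}
+
+func main() {
+	// blk0 jumps to blk1, blk1 jumps to blk3, blk2 jumps to blk3.
+	jumps := map[int]int{0: 1, 1: 3, 2: 3}
+	fmt.Println(fallThroughs([]int{0, 1, 2, 3}, jumps)) // 2
+	fmt.Println(fallThroughs([]int{0, 2, 1, 3}, jumps)) // 1
+}
+```
+
+The layout `0, 1, 2, 3` lets two jumps become fall-throughs, while
+`0, 2, 1, 3` allows only one; maximizing this count is essentially the goal
+described above.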
+
+
+### Critical Edges
+
+Special attention must be paid when a basic block has multiple predecessors,
+i.e., when it has multiple incoming edges. In particular, an edge between two
+basic blocks is called a **critical edge** when, at the same time:
+- the predecessor has multiple successors **and**
+- the successor has multiple predecessors.
+
+For instance, in the example below the edge between `BB0` and `BB3`
+is a critical edge.
+
+```goat { width="300" }
+┌───────┐ ┌───────┐
+│ BB0 │━┓ │ BB1 │
+└───────┘ ┃ └───────┘
+ │ ┃ │
+ ▼ ┃ ▼
+┌───────┐ ┃ ┌───────┐
+│ BB2 │ ┗━▶│ BB3 │
+└───────┘ └───────┘
+```
+
+In these cases the critical edge is split by introducing a new basic block,
+called a **trampoline**, where the critical edge was.
+
+```goat { width="300" }
+┌───────┐ ┌───────┐
+│ BB0 │──────┐ │ BB1 │
+└───────┘ ▼ └───────┘
+ │ ┌──────────┐ │
+ │ │trampoline│ │
+ ▼ └──────────┘ ▼
+┌───────┐ │ ┌───────┐
+│ BB2 │ └────▶│ BB3 │
+└───────┘ └───────┘
+```
+
+For more details on critical edges, see:
+
+- https://en.wikipedia.org/wiki/Control-flow_graph
+- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/
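+
+Detecting a critical edge only requires predecessor and successor counts.
+The following sketch uses a hypothetical adjacency-map representation of the
+CFG from the diagram above:
+
+```go
+package main
+
+import "fmt"
+
+// isCriticalEdge reports whether the edge from -> to is critical: the
+// source has multiple successors and the destination has multiple
+// predecessors, so edge-specific code needs a trampoline block.
+func isCriticalEdge(succs, preds map[string][]string, from, to string) bool {
+	return len(succs[from]) > 1 && len(preds[to]) > 1
+}
+
+func main() {
+	// BB0 -> {BB2, BB3}, BB1 -> BB3, as in the diagram.
+	succs := map[string][]string{"BB0": {"BB2", "BB3"}, "BB1": {"BB3"}}
+	preds := map[string][]string{"BB2": {"BB0"}, "BB3": {"BB0", "BB1"}}
+	fmt.Println(isCriticalEdge(succs, preds, "BB0", "BB3")) // true
+	fmt.Println(isCriticalEdge(succs, preds, "BB0", "BB2")) // false
+}
+```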
+
+### Example
+
+At the end of the block layout phase, the laid out SSA for the `abs` function
+looks as follows:
+
+```
+blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
+ v3:i32 = Iconst_32 0x0
+ v4:i32 = Icmp lt_s, v2, v3
+ Brz v4, blk2
+ Jump fallthrough
+
+blk1: () <-- (blk0)
+ v6:i32 = Iconst_32 0x0
+ v7:i32 = Isub v6, v2
+ Jump blk3, v7
+
+blk2: () <-- (blk0)
+ Jump fallthrough, v2
+
+blk3: (v5:i32) <-- (blk1,blk2)
+ Jump blk_ret, v5
+```
+
+### Code
+
+`ssa.Builder.LayoutBlocks()` implements the block layout phase.
+
+### Debug Flags
+
+- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after block layout.
+- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied during this phase,
+ such as inverting branching conditions or splitting critical edges.
+
+
+
+* Previous Section: [How the Optimizing Compiler Works](../)
+* Next Section: [Back-End](../backend/)
+
+[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments
+[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes