Add book-level documentation for Pulley #10095

Merged
merged 3 commits into from
Jan 23, 2025
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -41,6 +41,7 @@
- [Cross-platform Profiling](./examples-profiling-guest.md)
- [Checking Guests' Memory Accesses](./wmemcheck.md)
- [Building a minimal embedding](./examples-minimal.md)
- [Using Pulley](./examples-pulley.md)
- [Stability](stability.md)
- [Release Process](./stability-release.md)
- [Tiers of support](./stability-tiers.md)
232 changes: 232 additions & 0 deletions docs/examples-pulley.md
@@ -0,0 +1,232 @@
# Using Pulley

On architectures such as x86\_64 or aarch64 Wasmtime will by default use the
Cranelift compiler to translate WebAssembly to native machine code and execute
it. Cranelift does not support all architectures, however. For example, i686
(32-bit Intel machines) is not supported at this time. To help execute
WebAssembly on these architectures Wasmtime comes with an interpreter called
Pulley.

Pulley is a bytecode interpreter, originally proposed [in an RFC][rfc], which is
intended primarily to be portable. Pulley is a loose backronym for "Portable,
Universal, Low-Level Execution strategY", but the name is mostly a play on the
machines/tools theme (Cranelift, Winch, Pulley, ...). Pulley is a distinct
target and execution environment for Wasmtime.

## Enabling Pulley

The Pulley interpreter is enabled via one of two means:

1. On architectures which have Cranelift support, Pulley must be enabled via the
`pulley` crate feature of the `wasmtime` crate. This feature is otherwise
off-by-default.

2. On architectures which do NOT have Cranelift support, Pulley is already
enabled by default. This means that Wasmtime can execute WebAssembly by
default on any platform, although it will be faster on Cranelift-supported
platforms.

For platforms in category (2) there is no opt-in necessary to execute Pulley as
that's already the default target. Platforms in category (1), such as
`x86_64-unknown-linux-gnu`, may still want to execute Pulley to run tests,
evaluate the implementation, benchmark, etc.

To force execution of Pulley on any platform the `pulley` crate feature of
the `wasmtime` crate must be enabled in addition to configuring a target.
Specifying a target is done with the `--target` CLI option to the `wasmtime`
executable, the [`Config::target`] method in Rust, or the
[`wasmtime_config_target_set`] C API. The target string for Pulley must be one
of:

[`Config::target`]: https://docs.rs/wasmtime/latest/wasmtime/struct.Config.html#method.target
[`wasmtime_config_target_set`]: https://docs.wasmtime.dev/c-api/config_8h.html#ae68a2737ba1680e75cddb6ede08d682a

* `pulley32` - for 32-bit little-endian hosts
* `pulley32be` - for 32-bit big-endian hosts
* `pulley64` - for 64-bit little-endian hosts
* `pulley64be` - for 64-bit big-endian hosts

The Pulley target string must match the pointer width and endianness of the
environment that the Pulley bytecode will be executing in. Some examples of
Pulley targets are:

| Host target | Pulley target |
|----------------------------|---------------|
| `x86_64-unknown-linux-gnu` | `pulley64` |
| `i686-unknown-linux-gnu` | `pulley32` |
| `s390x-unknown-linux-gnu` | `pulley64be` |

Wasmtime will return an error when trying to load bytecode compiled for the
wrong Pulley target. When Pulley is the default target for a particular host
then the correct Pulley target will be selected automatically. Specifying the
Pulley target may still be necessary when cross-compiling from one platform to
another, however.
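
As a minimal sketch of the naming scheme above, the following Rust snippet
derives a Pulley target string from a pointer width and an endianness. The
helper names `pulley_target` and `host_pulley_target` are hypothetical and are
not part of Wasmtime's API. The resulting string is what would then be handed
to `--target` or `Config::target`.

```rust
/// Hypothetical helper (not a Wasmtime API): maps a pointer width in bits and
/// an endianness to one of the four Pulley target strings listed above.
fn pulley_target(pointer_width: u32, big_endian: bool) -> Option<&'static str> {
    match (pointer_width, big_endian) {
        (32, false) => Some("pulley32"),
        (32, true) => Some("pulley32be"),
        (64, false) => Some("pulley64"),
        (64, true) => Some("pulley64be"),
        // Pulley targets only exist for 32-bit and 64-bit hosts.
        _ => None,
    }
}

/// Hypothetical helper: the Pulley target matching the host that this program
/// was compiled for.
fn host_pulley_target() -> Option<&'static str> {
    pulley_target(usize::BITS, cfg!(target_endian = "big"))
}

fn main() {
    // The three rows of the table above.
    println!("x86_64-unknown-linux-gnu -> {:?}", pulley_target(64, false));
    println!("i686-unknown-linux-gnu   -> {:?}", pulley_target(32, false));
    println!("s390x-unknown-linux-gnu  -> {:?}", pulley_target(64, true));
    println!("this host                -> {:?}", host_pulley_target());
}
```

When cross-compiling, the width and endianness of the machine that will run
the bytecode are the ones that matter, not those of the machine doing the
compiling.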

## Using Pulley

Using Pulley in Wasmtime requires no further configuration beyond specifying the
target for Pulley. Once that is done all of the Wasmtime crate's Rust APIs or C
API work as usual. For example, specifying `wasmtime run --target pulley64` on
the CLI will execute all WebAssembly in the interpreter rather than via
Cranelift.

Pulley at this time has feature parity with Cranelift for WebAssembly. This
means that all WebAssembly proposals and features supported by Wasmtime are
supported by Pulley.

If you notice anything awry, however, please feel free to file an issue.

## Impact of using Pulley

Pulley is an interpreter for its own bytecode format. While the design of Pulley
is optimized for speed, you should still expect roughly an order-of-magnitude
(~10x) slowdown relative to native code or Cranelift. This means that Pulley is
likely not suitable for compute-intensive tasks that must run in as little time
as possible.

The primary goal of Pulley is to enable using and embedding Wasmtime across a
variety of platforms simultaneously. The same API/interface is used to interact
with the runtime and to load WebAssembly modules regardless of the host
architecture.

Pulley bytecode is produced by the Cranelift compiler today in a similar manner
to native platforms. Because Cranelift is an optimizing compiler, Pulley is not
designed for quickly loading WebAssembly modules: compiling WebAssembly to
Pulley bytecode should be expected to take about the same time as compiling to
native platforms.

## High-level Design of Pulley

This section is not necessary for users of Pulley but for those interested this
is a description of the high-level design of Pulley. The Pulley virtual machine
consists of:

* 32 "X" integer registers, each 64 bits wide (`XReg`).
* 32 "F" float registers, each 64 bits wide (`FReg`).
* 32 "V" vector registers, each 128 bits wide (`VReg`).
* A dynamically allocated "stack" on the host's heap.
* A frame pointer register.
* A link register to store the return address for the current function.

This state lives in [`MachineState`] which is in turn stored in a [`Vm`].
Pulley's source code lives in `pulley/` in the Wasmtime repository.
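
To make the list above concrete, here is a rough sketch of how such state
could be laid out in Rust. The type `ToyMachineState`, its field names, and
the choice to model the frame pointer and link register as plain offsets are
all assumptions made for illustration. This is not the definition of the real
`MachineState`.

```rust
/// Hypothetical layout mirroring the list above. Field names and types are
/// illustrative and do not match the real `MachineState`.
struct ToyMachineState {
    x_regs: [u64; 32],  // "X" integer registers, 64 bits each
    f_regs: [u64; 32],  // "F" float registers, kept as raw 64-bit patterns
    v_regs: [u128; 32], // "V" vector registers, 128 bits each
    stack: Vec<u8>,     // interpreter stack, allocated on the host's heap
    fp: usize,          // frame pointer, modeled here as an offset into `stack`
    lr: usize,          // link register: return address of the current function
}

impl ToyMachineState {
    /// Creates zeroed registers and a stack of `bytes` bytes. The frame
    /// pointer starts at the top of the stack (assumed to grow downwards).
    fn with_stack_size(bytes: usize) -> Self {
        ToyMachineState {
            x_regs: [0; 32],
            f_regs: [0; 32],
            v_regs: [0; 32],
            stack: vec![0; bytes],
            fp: bytes,
            lr: 0,
        }
    }
}

fn main() {
    let mut state = ToyMachineState::with_stack_size(1 << 16);
    state.x_regs[0] = 42;
    state.f_regs[0] = 1.5f64.to_bits();
    state.v_regs[0] = u128::MAX;
    println!("x0 = {}", state.x_regs[0]);
    println!("f0 = {}", f64::from_bits(state.f_regs[0]));
    println!("v0 = {:#x}", state.v_regs[0]);
    println!("stack = {} bytes, fp = {}, lr = {}", state.stack.len(), state.fp, state.lr);
}
```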

Pulley's bytecode is defined in `pulley/src/lib.rs` with a combination of the
`for_each_op!` and `for_each_extended_op!` macros. Opcode numbers and opcode
layout are defined by the structure of these macros. The macros are used to
"derive" encoding/decoding/traits/etc used throughout the `pulley_interpreter`
crate.

Pulley opcodes are a single discriminator byte followed by any immediates.
Immediates are not aligned and require unaligned loads/stores to work with them.
Pulley has more than 256 opcodes, however, which is where "extended" opcodes
come into play. The final Pulley opcode is reserved to indicate that an extended
opcode is being used. Extended opcodes follow this initial discriminator with a
16-bit integer which further indicates which extended opcode is being used. This
design is intended to allow common operations to be encoded more compactly while
less common operations can still be packed in effectively without limit.
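
The encoding described above can be sketched with a small decoder. This is an
illustration only: the marker value `EXTENDED_MARKER`, the `Opcode` enum, and
the `decode_opcode` function are hypothetical, and the little-endian byte
order of the 16-bit extended opcode is an assumption. The real opcode numbers
come from the `for_each_op!` macros.

```rust
/// Hypothetical value for the reserved "extended" discriminator byte. The real
/// number is assigned by the `for_each_op!` macro.
const EXTENDED_MARKER: u8 = 0xff;

#[derive(Debug, PartialEq)]
enum Opcode {
    /// A common operation identified by a single discriminator byte.
    Primary(u8),
    /// A less common operation identified by a 16-bit number after the marker.
    Extended(u16),
}

/// Decodes the opcode at the start of `bytes`, returning it together with the
/// number of bytes consumed. Any immediates would follow those bytes.
/// Returns `None` if `bytes` is too short.
fn decode_opcode(bytes: &[u8]) -> Option<(Opcode, usize)> {
    let (&first, rest) = bytes.split_first()?;
    if first != EXTENDED_MARKER {
        return Some((Opcode::Primary(first), 1));
    }
    // Assumption: the extended opcode is stored little-endian. Building it from
    // individual bytes sidesteps any alignment requirement.
    let ext = u16::from_le_bytes([*rest.first()?, *rest.get(1)?]);
    Some((Opcode::Extended(ext), 3))
}

fn main() {
    // One byte of opcode followed by two bytes of immediates.
    println!("{:?}", decode_opcode(&[0x40, 0x13, 0x00]));
    // Marker byte followed by a 16-bit extended opcode.
    println!("{:?}", decode_opcode(&[EXTENDED_MARKER, 0x02, 0x01]));
}
```

The trade-off is visible in the byte counts: a common opcode costs one byte
before its immediates, while an extended opcode costs three.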

Pulley opcode assignment happens through the order of the `for_each_op!` macro
which means that it's not portable across multiple versions of Wasmtime.

The interpreter is an implementation of the [`OpVisitor`] and
[`ExtendedOpVisitor`] traits, located at `pulley/src/interp.rs`. Notably this
means that the interpreter is implemented as one method per opcode.

The interpreter loop itself is implemented in one of two ways:

1. A "match loop" which is a Rust `loop { ... }` which internally uses the
[`Decode`] trait on each opcode. This is not literally written as, but
compiles down to, something that looks like `loop { match .. { ... } }`. This
interpreter loop is located at `pulley/src/interp/match_loop.rs`.

2. A "tail loop" where each opcode handler is a Rust function. Control flow
between opcodes continues with tail-calls and exiting the interpreter is done
by returning from the function. Tail calls are not available in stable Rust
so this interpreter loop is not used by default. It can be enabled, though,
with `RUSTFLAGS=--cfg=pulley_assume_llvm_makes_tail_calls` to rely on LLVM's
tail-call-optimization pass to implement the loop.

The "match loop" is the default interpreter loop as it's portable and works on
stable Rust. The "tail loop" is thought to perform better than the "match
loop", but it is neither available on stable Rust (`become` in Rust is an
unfinished nightly feature at this time) nor portable (tail-call optimization
doesn't happen the same way in LLVM on all architectures).
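
For readers unfamiliar with the pattern, below is a toy "match loop" with one
handler method per opcode. Everything in it is invented for illustration: the
three opcodes, the `ToyVisitor` trait, and `ToyVm` are hypothetical and far
simpler than Pulley's real `OpVisitor` machinery, but the overall shape
(decode a discriminator, dispatch to a method, repeat) is the same idea.

```rust
// Toy opcodes invented for this sketch. These are not Pulley's opcodes.
const OP_CONST: u8 = 0; // OP_CONST dst, imm8
const OP_ADD: u8 = 1; // OP_ADD dst, a, b
const OP_RET: u8 = 2; // OP_RET

/// One method per opcode, loosely analogous to a visitor trait.
trait ToyVisitor {
    fn konst(&mut self, dst: u8, imm: u8);
    fn add(&mut self, dst: u8, a: u8, b: u8);
}

struct ToyVm {
    regs: [u64; 32],
}

/// Maps a register operand onto the 32 available registers.
fn idx(reg: u8) -> usize {
    usize::from(reg) % 32
}

impl ToyVisitor for ToyVm {
    fn konst(&mut self, dst: u8, imm: u8) {
        self.regs[idx(dst)] = u64::from(imm);
    }

    fn add(&mut self, dst: u8, a: u8, b: u8) {
        let sum = self.regs[idx(a)].wrapping_add(self.regs[idx(b)]);
        self.regs[idx(dst)] = sum;
    }
}

/// The "match loop": read a discriminator byte, dispatch to the matching
/// handler, advance past the operands, repeat. Returns `None` if the bytecode
/// is truncated or contains an unknown opcode.
fn run(vm: &mut ToyVm, code: &[u8]) -> Option<()> {
    let mut pc = 0;
    loop {
        match *code.get(pc)? {
            OP_CONST => {
                vm.konst(*code.get(pc + 1)?, *code.get(pc + 2)?);
                pc += 3;
            }
            OP_ADD => {
                vm.add(*code.get(pc + 1)?, *code.get(pc + 2)?, *code.get(pc + 3)?);
                pc += 4;
            }
            OP_RET => return Some(()),
            _ => return None,
        }
    }
}

fn main() {
    let mut vm = ToyVm { regs: [0; 32] };
    // r0 = 2; r1 = 40; r2 = r0 + r1; return
    let code = [OP_CONST, 0, 2, OP_CONST, 1, 40, OP_ADD, 2, 0, 1, OP_RET];
    let result = run(&mut vm, &code);
    println!("result = {:?}, r2 = {}", result, vm.regs[2]);
}
```

In a "tail loop" each handler would instead end by calling the next handler
directly, which is why that variant depends on guaranteed tail calls.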

### Inspecting Pulley Bytecode

When compiling to native the `*.cwasm` produced by `wasmtime compile` can be
inspected with `objdump -S`, but this doesn't work with Pulley. A small example
in the `pulley_interpreter` crate suffices for doing this though. You can
inspect compiled Pulley bytecode from the Wasmtime repository with:

```
$ cargo run compile --target pulley64 foo.wat
$ cargo run -p pulley-interpreter --all-features --example objdump foo.cwasm
0x000000: <wasm[0]::function[20]>:
0: 9f 10 00 08 00 push_frame_save 16, x19
5: 40 13 00 xmov x19, x0
8: 03 13 13 3f cb 89 00 call2 x19, x19, 0x89cb3f // target = 0x89cb47
f: 03 13 13 8c ab 84 00 call2 x19, x19, 0x84ab8c // target = 0x84ab9b
16: 03 13 13 5b 12 00 00 call2 x19, x19, 0x125b // target = 0x1271
1d: 03 13 13 9f 12 00 00 call2 x19, x19, 0x129f // target = 0x12bc
24: 03 13 13 e0 45 00 00 call2 x19, x19, 0x45e0 // target = 0x4604
...
```

The output is intended to look somewhat similar to `objdump` but otherwise
mainly provides the ability to inspect opcode selection, see the encoded bytes,
etc.

### Profiling Pulley

Profiling the Pulley interpreter can be done with a native profiler such as
`perf`, but this has a few downsides:

* When profiling the "match loop" it's not clear what machine code corresponds
to which Pulley opcode. Most of the time all the samples are just in the one
big "run" function.

* When profiling with the "tail loop" you can see hot opcodes much more clearly,
but it can be difficult to understand why a particular opcode was chosen.

It can sometimes be more beneficial to see time spent per Pulley opcode in the
context of all Pulley opcodes. In the same way that `perf` offers
instruction-level profiling, it can be useful to look at opcode-level profiling
of Pulley.

Pulley has limited support for opcode-level profiling. This is off-by-default as
it has a performance hit for the interpreter. To collect a profile with the
`wasmtime` CLI you'll have to build from source and enable the `profile-pulley`
feature:

```
$ cargo run --features profile-pulley --release run --profile pulley --target pulley64 foo.wat
```

This will compile an optimized `wasmtime` executable with the `profile-pulley`
Cargo feature enabled. The `--profile pulley` flag can then be passed to the
`wasmtime` CLI to enable the profiler at runtime.

The command will emit a `pulley-$pid.data` file which contains raw data about
Pulley opcodes and samples taken. To view this file you can use:

```
$ cargo run -p pulley-interpreter --example profiler-html --all-features ./pulley-$pid.data
```

This will load the `pulley-*.data` file, parse it, collate the results, and
display the hottest functions. The hottest function is emitted last and
instructions are annotated with the `%` of samples taken that were executing at
that instruction.

Some more information can be found in [the PR that implemented Pulley profiling
support][profile-pr].

[`OpVisitor`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/decode/trait.OpVisitor.html
[`MachineState`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/interp/struct.MachineState.html
[`Vm`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/interp/struct.Vm.html
[rfc]: https://github.com/bytecodealliance/rfcs/blob/main/accepted/pulley.md
[`ExtendedOpVisitor`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/decode/trait.ExtendedOpVisitor.html
[`Decode`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/decode/trait.Decode.html
[profile-pr]: https://github.com/bytecodealliance/wasmtime/pull/10034
12 changes: 6 additions & 6 deletions docs/stability-platform-support.md
@@ -38,12 +38,12 @@ Cranelift.

## Interpreter support

The `wasmtime` crate provides an implementation of a WebAssembly interpreter
named "Pulley" which is a portable implementation of executing WebAssembly
code. Pulley uses a custom bytecode which is created from input WebAssembly
similarly to how native architectures are supported. Pulley's bytecode is
created via a Cranelift backend for Pulley, so compile times for the interpreter
are expected to be similar to natively compiled code.
The `wasmtime` crate provides an implementation of a [WebAssembly interpreter
named "Pulley"](./examples-pulley.md) which is a portable implementation of
executing WebAssembly code. Pulley uses a custom bytecode which is created from
input WebAssembly similarly to how native architectures are supported. Pulley's
bytecode is created via a Cranelift backend for Pulley, so compile times for
the interpreter are expected to be similar to natively compiled code.

The main advantage of Pulley is that the bytecode can be executed on any
platform with the same pointer-width and endianness. For example to execute