Adding new API that accepts reusable context. #196
Conversation
Force-pushed from 3a059ce to 7b89e02
Force-pushed from 34f4b5d to 5d2e0ac
Force-pushed from fcbc26b to d736593
I profiled the allocator a bit and got rid of some hashmaps in favor of using the key as an index, which improved things a little.
@jakubDoka could you summarize with a perf measurement (Sightglass or hyperfine of ...)?
I had to run the benchmarks separately because whatever I run last performs worse (if I run two identical programs, the first run always wins; my CPU throttles).
Let me know if full `perf stat` output could be useful.
Thanks for all this work! I'm happy to see the performance improvement. After reading through, I have a few high-level thoughts:
- I'd like to see some of the changes here pulled out into separate PRs. For example, you replace concatenate-and-sort steps with an in-place merge when we know the liverange lists are already sorted; that's a good optimization, but let's have it as a separate PR, so we can reason about this PR more easily and also test and revert those changes separately. Basically, I want this PR to be as close as possible to a mechanical replacement of data structures with arena-allocated or reused data structures, with no semantics changes. Some of the changes to the core allocation loop around the allocation-map iterator also make me very nervous.
- The replacement of the actual `BTreeMap` per physical reg with a sorted `Vec` makes me very nervous. It probably looks better on most benchmarks, but it has a quadratic worst case (inserting in the middle), where a true btree does not; and we consider quadratic worst cases bugs that we need to avoid. Is there an argument we can make about why this won't be seen in practice, or...?
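For context on the first point, the merge-instead-of-sort optimization might look like this sketch. This is not regalloc2's actual code; the function name and element type are illustrative, and real liverange lists would carry richer data.

```rust
/// Merge two already-sorted vectors into one sorted vector in O(n + m),
/// avoiding the O((n + m) log(n + m)) cost of concatenate-and-sort.
/// Illustrative sketch only; names and types are hypothetical.
fn merge_sorted(a: Vec<u32>, b: Vec<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    // Repeatedly take the smaller head element from either input.
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    // At most one of these has anything left.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

The win is not only asymptotic: a single linear pass over two sorted runs is also branch-predictor and cache friendly compared with a general-purpose sort.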
I did not realize Wasmtime needs to deal with malicious code; good point. All of the changes here are guided by a simple principle: make the code do less. Using a `Vec` instead of a btree has many properties that CPUs like. Now that I think about it, quadratic behavior is a problem only if the function we compile is maliciously large, and that is not the common case, so what about a hybrid/adaptive approach? We could switch to `BTreeMap` dynamically if we detect the `Vec` costing too much (it exceeds a certain size). What do you think @cfallin?
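The hybrid/adaptive idea proposed above could be sketched as follows. Everything here is hypothetical: the threshold, the key/value types, and the API are stand-ins, not regalloc2's actual per-preg allocation map.

```rust
use std::collections::BTreeMap;

/// Hypothetical threshold at which sorted-Vec insertion is assumed to
/// cost more than btree insertion; a real value would be measured.
const THRESHOLD: usize = 64;

/// Sketch of an adaptive map: a sorted `Vec` while small (cache friendly,
/// fast in the common case), migrating to a `BTreeMap` once it grows,
/// which caps the worst case for maliciously large inputs at O(n log n).
enum Hybrid {
    Small(Vec<(u32, u32)>), // kept sorted by key
    Large(BTreeMap<u32, u32>),
}

impl Hybrid {
    fn new() -> Self {
        Hybrid::Small(Vec::new())
    }

    fn insert(&mut self, key: u32, val: u32) {
        match self {
            Hybrid::Small(v) => {
                match v.binary_search_by_key(&key, |&(k, _)| k) {
                    Ok(i) => v[i].1 = val,          // overwrite existing key
                    Err(i) => v.insert(i, (key, val)), // keep Vec sorted
                }
                if v.len() > THRESHOLD {
                    // One-time migration; later inserts are O(log n).
                    let map: BTreeMap<u32, u32> =
                        std::mem::take(v).into_iter().collect();
                    *self = Hybrid::Large(map);
                }
            }
            Hybrid::Large(m) => {
                m.insert(key, val);
            }
        }
    }

    fn get(&self, key: u32) -> Option<u32> {
        match self {
            Hybrid::Small(v) => v
                .binary_search_by_key(&key, |&(k, _)| k)
                .ok()
                .map(|i| v[i].1),
            Hybrid::Large(m) => m.get(&key).copied(),
        }
    }
}
```

One design question such a sketch leaves open is whether the migration cost shows up in profiles; since it happens at most once per map, it amortizes away over large inputs.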
After reverting to BTree I see a regression of ~500M cycles.
After reverting this I see a regression of another ~250M cycles.
Another, perhaps a little more out-there, idea is to make the whole algorithm generic over this container type and make two instantiations: one with `Vec` and one with `BTreeMap`. The failure checks and propagation overhead might dwarf any speedup, though.
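The generic-over-container idea could be sketched with a small trait exposing just the operations the core loop needs, implemented for both containers. The trait name, method set, and key/value types below are all hypothetical, not regalloc2's API.

```rust
use std::collections::BTreeMap;

/// Hypothetical trait abstracting the per-preg allocation map; the real
/// core loop would need more operations (range queries, removal, ...).
trait RangeMap {
    fn insert(&mut self, key: u32, val: u32);
    fn get(&self, key: u32) -> Option<u32>;
}

/// Sorted-Vec instantiation: fast in the common case, quadratic worst case.
impl RangeMap for Vec<(u32, u32)> {
    fn insert(&mut self, key: u32, val: u32) {
        match self.binary_search_by_key(&key, |&(k, _)| k) {
            Ok(i) => self[i].1 = val,
            Err(i) => Vec::insert(self, i, (key, val)),
        }
    }
    fn get(&self, key: u32) -> Option<u32> {
        self.binary_search_by_key(&key, |&(k, _)| k)
            .ok()
            .map(|i| self[i].1)
    }
}

/// BTreeMap instantiation: O(log n) insertion in all cases.
impl RangeMap for BTreeMap<u32, u32> {
    fn insert(&mut self, key: u32, val: u32) {
        BTreeMap::insert(self, key, val);
    }
    fn get(&self, key: u32) -> Option<u32> {
        BTreeMap::get(self, &key).copied()
    }
}

/// Stand-in for the core loop, instantiated once per container type.
fn run_core<M: RangeMap + Default>(pairs: &[(u32, u32)]) -> M {
    let mut m = M::default();
    for &(k, v) in pairs {
        m.insert(k, v);
    }
    m
}
```

One cost of this approach, as noted above, is code-size: monomorphizing the allocator's core loop twice duplicates a lot of machine code, which may matter more than the per-call dispatch it avoids.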
@cfallin are there any extra modifications needed? (Considering I will add the other changes in separate PRs.)
I realized there are some things I missed, so nvm |
Yes, please make sure all of my earlier comments were addressed; also one new one below from the start of my second pass.
Okay, @cfallin, I double-checked everything; hopefully nothing was missed (I found one thing in the diff).
Thanks! I have just a few more comments, but this is getting close.
Thanks for your updates! A few more comments below; hopefully these are the last ones!
deny.toml (Outdated)
@@ -1,7 +1,7 @@
 targets = [
     { triple = "x86_64-unknown-linux-gnu" },
     { triple = "x86_64-apple-darwin" },
-    { triple = "x86_64-pc-windows-msvc" },
+    { triple = "x86_64-pc-windows-msempty_vec" },
I think this was an errant find-replace problem? pc-windows-msempty_vec sounds like an interesting platform, but not one that Rust supports...
Yes, and it's outdated; I noticed it and fixed it.
src/lib.rs (Outdated)
@@ -1671,6 +1671,8 @@ impl<T> VecExt<T> for Vec<T> {
 }

+#[derive(Debug, Clone, Default)]
+/// Bump is a wrapper around `bumpalo::Bump` that can be cloned and also
+/// implements `Allocator`. Using this avoids lifetime pollution of `Ctx`.
From the PoV of API docs consumers, there should probably be something here about what this is used for... actually, it seems it's not exposed at all in the run_with_ctx signature; can we make this pub(crate) then?
The problem with that is that Rust will complain in many places, but I can change those to pub(crate) too.
src/lib.rs (Outdated)
    env: &MachineEnv,
    options: &RegallocOptions,
    ctx: &mut Ctx,
) -> Result<(), RegAllocError> {
Could we return the Output here as with run(), by mem::take'ing it from the Ctx? I'd much prefer that to an implicit result in ctx.output -- OK to be imperative and implicit inside RA2, but a functional API is best.
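The mem::take suggestion could look like this sketch. The `Output` and `Ctx` types here are simplified stand-ins for regalloc2's real types, and the body of `run_with_ctx` is a placeholder.

```rust
/// Simplified stand-ins for regalloc2's Output and Ctx types.
#[derive(Default)]
struct Output {
    allocs: Vec<u32>,
}

#[derive(Default)]
struct Ctx {
    output: Output,
    // ... other reusable scratch buffers ...
}

/// Sketch of the suggested API shape: fill `ctx.output` internally
/// (reusing its capacity), then hand ownership to the caller instead of
/// leaving the result implicitly inside the context.
fn run_with_ctx(ctx: &mut Ctx) -> Output {
    // Placeholder for the allocator's actual work.
    ctx.output.allocs.clear();
    ctx.output.allocs.push(42);
    // Move the result out; a fresh Default is left in its place.
    std::mem::take(&mut ctx.output)
}
```

The tradeoff this surfaces is the one debated below: `mem::take` leaves an empty `Output` in the `Ctx`, so to keep reusing the output's allocations across runs the caller would have to move it back in before the next call.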
It is implicit because the allocations inside the output are reused. Of course, if we assume the user will return the output to the Ctx, this is fine, but IMO I'd rather hint in the docs where the output is located; that way you don't need to write more code to get optimal performance.
Actually, returning a reference could be a good middle ground.
OK, I think this is finally good to go. Thanks so much!
If you'd like, feel free to make a PR over in bytecodealliance/wasmtime to make use of run_with_ctx in Cranelift as well...
This includes two major updates:
- The new single-pass fast allocator (bytecodealliance#181);
- An ability to reuse allocations across runs (bytecodealliance#196).

Sadly, due to how the code was structured, I needed to change the `Env` fields, so basically everything that used them was changed as well. I did not benchmark anything yet (work in progress).
Context -> https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/Using.20context.20for.20.60TargetIsa.3A.3Acompile_function.60