Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding new API that accepts resusable context. #196

Merged
merged 33 commits into from
Oct 3, 2024

Conversation

jakubDoka
Copy link
Contributor

Sadly due to how the code was structured, I needed to change the `Env' fields so basically everything that was used was changed as well. I did not benchmark anything yet (work in progress).

Context -> https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/Using.20context.20for.20.60TargetIsa.3A.3Acompile_function.60

@jakubDoka
Copy link
Contributor Author

I profiled the allocator a bit and I got rid of some hashmaps in favor of using the key as an index which improved things a little bit.

@cfallin
Copy link
Member

cfallin commented Sep 24, 2024

@jakubDoka could you summarize with a perf measurement (Sightglass or hyperfine of wasmtime compile is OK) the overall compile-time effect of this?

@jakubDoka
Copy link
Contributor Author

I had to run the bench separately because whatever I run last performs worse (if I run two identical programs, the first run always wins) (my CPU throttles)

[I] ::<~/p/r/wasmtime> hyperfine --runs 100 "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     136.7 ms ±   4.5 ms    [User: 128.8 ms, System: 7.1 ms]
  Range (min … max):   132.5 ms … 157.8 ms    100 runs

[I] ::<~/p/r/wasmtime> hyperfine --runs 100 "./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     147.1 ms ±   4.6 ms    [User: 138.2 ms, System: 7.6 ms]
  Range (min … max):   140.1 ms … 163.2 ms    100 runs

Let me know If full perf stat could be useful

(spidermonkey)
15,471,712,064      cpu_core/cycles/u // wasmtime.old
14,390,147,910      cpu_core/cycles/u // wasmtime.new
(pulldown-cmark)
631,885,447      cpu_core/cycles/u // wasmtime.old
599,844,924      cpu_core/cycles/u // wasmtime.new

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all this work! I'm happy to see the performance improvement. After reading through, I have a few high-level thoughts:

  • I'd like to see some of the changes here pulled out into separate PRs. For example, you replace concatenate-and-sort steps with in-place merge when we know the liverange lists are already sorted; good optimization; but let's have that as a separate PR, so we can reason about this PR more easily and also so that we can test and revert those changes separately. Basically, I want this PR to be as close as possible to a mechanical replacement of data structures with arena-allocated or reused data structures, with no semantics changes. Some of the changes to the core allocation loop around the allocation-map iterator also make me very nervous.

  • The replacement of the actual BTreeMap per physical reg with a sorted Vec makes me very nervous. It probably looks better on most benchmarks, but it has a quadratic worst-case (inserting in the middle), where a true btree does not; and we consider quadratic worst-cases bugs that we need to avoid. Is there an argument we can make about why this won't be seen in practice or...?

fuzz/fuzz_targets/ion.rs Outdated Show resolved Hide resolved
fuzz/fuzz_targets/ion_checker.rs Outdated Show resolved Hide resolved
fuzz/fuzz_targets/ssagen.rs Outdated Show resolved Hide resolved
src/cfg.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/ion/moves.rs Outdated Show resolved Hide resolved
src/ion/moves.rs Outdated Show resolved Hide resolved
src/ion/process.rs Outdated Show resolved Hide resolved
src/ion/data_structures.rs Outdated Show resolved Hide resolved
src/ion/data_structures.rs Outdated Show resolved Hide resolved
@jakubDoka
Copy link
Contributor Author

The replacement of the actual BTreeMap per physical reg with a sorted Vec makes me very nervous. It probably looks better on most benchmarks, but it has a quadratic worst-case (inserting in the middle), where a true btree does not; and we consider quadratic worst-cases bugs that we need to avoid. Is there an argument we can make about why this won't be seen in practice or...?

I did not realize Wasmtime needs to deal with malicious code, good point, all of the changes here are guided by a simple principle: How to make the code do less. Using vec instead of BTree has many properties that CPUs like. I suspect that in try_to_insert_bundle_to_reg, replacing BTree with vec improves drastically because we mostly traverse the entries linearly for which Vec is more cache-friendly than BTree.

Now that I think about this, quadratic behavior is a problem if the function we compile is maliciously large, this is not the common case though, so what about doing a hybrid/adaptative approach? We could switch to BTreeMap dynamically if we detect Vec costing too much (it exceeds a certain size). What do you think @cfallin?

@jakubDoka
Copy link
Contributor Author

After reverting to BTree I see a regression of ~500m cycles.

@jakubDoka
Copy link
Contributor Author

After reverting this I see a regression of another ~250m cycles.

@jakubDoka jakubDoka requested a review from cfallin September 26, 2024 17:42
@fitzgen
Copy link
Member

fitzgen commented Sep 29, 2024

what about doing a hybrid/adaptative approach? We could switch to BTreeMap dynamically if we detect Vec costing too much (it exceeds a certain size).

Another, perhaps a little more out-there, idea is to make the whole algorithm generic over this container type, and make two instantiations: one with Vec and one with BTreeSet. Then, we try running the Vec-instantiated version first, but bail out if the length gets too long, at which point we fall back to the BTreeSet. If the length is a fixed size, we could even use an array instead of a Vec.

The failure checks and propagation overheads might dwarf any speed up though.

@jakubDoka
Copy link
Contributor Author

@cfallin are there any extra modifications needed? (considering I will add the other changes in a separate pr(s))

@jakubDoka
Copy link
Contributor Author

I realized there are some things I missed, so nvm

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please make sure all of my earlier comments were addressed; also one new one below from the start of my second pass.

fuzz/Cargo.toml Outdated Show resolved Hide resolved
@jakubDoka
Copy link
Contributor Author

Okay, @cfallin, I double-checked everything, hopefully nothing was missed (I found one thing in the diff).

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I have just a few more comments, but this is getting close.

src/ion/liveranges.rs Outdated Show resolved Hide resolved
src/ion/merge.rs Outdated Show resolved Hide resolved
src/ion/mod.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
@jakubDoka jakubDoka requested a review from cfallin October 3, 2024 06:52
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your updates! A few more comments below; hopefully these are the last ones!

deny.toml Outdated
@@ -1,7 +1,7 @@
targets = [
{ triple = "x86_64-unknown-linux-gnu" },
{ triple = "x86_64-apple-darwin" },
{ triple = "x86_64-pc-windows-msvc" },
{ triple = "x86_64-pc-windows-msempty_vec" },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was an errant find-replace problem? pc-windows-msempty_vec sounds like an interesting platform, but not one that Rust supports...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes and its outdated, I noticed it and fixed it

src/lib.rs Outdated
@@ -1671,6 +1671,8 @@ impl<T> VecExt<T> for Vec<T> {
}

#[derive(Debug, Clone, Default)]
/// Bump is a wrapper around `bumpalo::Bump` that can be cloned and also
/// implements `Allocator`. Using this avoids lifetime polution of `Ctx`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the PoV of API docs consumers, there should probably be something here about what this is used for...

... actually, it seems it's not exposed at all in the run_with_ctx signature; can we make this pub(crate) then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem with that is that rust will complain in many places but I can change these to pub(crate) too

src/lib.rs Outdated
env: &MachineEnv,
options: &RegallocOptions,
ctx: &mut Ctx,
) -> Result<(), RegAllocError> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we return the Output here as with run(), by mem::take'ing it from the Ctx? I'd much prefer that to an implicit result in ctx.output -- OK to be imperative and implicit inside RA2 but a functional API is best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is implicit because the allocations inside the output are reused, of course, If we assume the user will return the output to the Ctx this is fine but IMO I'd rather hint in the doc where the output is located, that way you don't need to write more code to get optimal performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually returning a reference could be a good middle ground

@jakubDoka jakubDoka requested a review from cfallin October 3, 2024 17:10
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think this is finally good to go. Thanks so much!

If you'd like, feel free to make a PR over in bytecodealliance/wasmtime to make use of run_with_ctx in Cranelift as well...

@cfallin cfallin merged commit f2b9533 into bytecodealliance:main Oct 3, 2024
6 checks passed
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Nov 15, 2024
This includes two major updates:

- The new single-pass fast allocator (bytecodealliance#181);
- An ability to reuse allocations across runs (bytecodealliance#196).
@cfallin cfallin mentioned this pull request Nov 15, 2024
cfallin added a commit that referenced this pull request Nov 15, 2024
This includes two major updates:

- The new single-pass fast allocator (#181);
- An ability to reuse allocations across runs (#196).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants