
A new tail-calling interpreter for significantly better interpreter performance #128563

Open
Fidget-Spinner opened this issue Jan 6, 2025 · 30 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement

Comments

@Fidget-Spinner
Member

Fidget-Spinner commented Jan 6, 2025

Feature or enhancement

Proposal

Experimental branch: main...Fidget-Spinner:cpython:tail-call

Prior discussion at: faster-cpython/ideas#642

I propose adding a tail-calling interpreter to CPython for significantly better performance on compilers that support it.

This idea is not new, and has been implemented by:

  1. Protobuf https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
  2. Lua (Deegen) https://sillycross.github.io/2022/11/22/2022-11-22/

CPython currently has a few interpreters:

  1. A switch-case interpreter (MSVC)
  2. A computed goto interpreter (Clang, GCC)
  3. A uop interpreter (all compilers)

The tail-calling interpreter will be the 4th that coexists with the rest. This means no compatibility concerns.
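To make the dispatch difference concrete, here is a toy sketch of the first two styles (not CPython's generated code; made-up opcode names, and computed goto relies on a GCC/Clang extension):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy two-opcode bytecode: 0 = INC (add one to the accumulator), 1 = HALT. */

    static int64_t run_switch(const uint8_t *ip)
    {
        int64_t acc = 0;
        for (;;) {
            switch (*ip++) {       /* one shared dispatch point for every opcode */
            case 0: acc += 1; break;
            case 1: return acc;
            }
        }
    }

    static int64_t run_computed_goto(const uint8_t *ip)
    {
        int64_t acc = 0;
        static void *labels[] = { &&op_inc, &&op_halt };  /* GCC/Clang extension */
        goto *labels[*ip++];
    op_inc:
        acc += 1;
        goto *labels[*ip++];       /* each handler has its own dispatch point */
    op_halt:
        return acc;
    }

    int main(void)
    {
        const uint8_t code[] = { 0, 0, 0, 1 };   /* INC, INC, INC, HALT */
        printf("%lld %lld\n", (long long)run_switch(code),
                              (long long)run_computed_goto(code));
        return 0;
    }

In the tail-calling interpreter, each handler instead becomes its own small function that ends in a guaranteed tail call to the next handler, which is what the musttail/preserve_none attributes discussed below are for.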

Performance

Preliminary benchmarks suggest excellent performance improvements: a 10% geometric mean speedup in pyperformance, with up to 40% speedups in Python-heavy benchmarks: https://gist.github.com/Fidget-Spinner/497c664eef389622d146d632990b0d21. These benchmarks were run with clang-19 on both main and my branch, with ThinLTO and PGO, on AMD64 Ubuntu 22.04. PGO seems especially crucial for the speedups based on my testing. For those outside of CPython development: a 10% speedup is roughly equal to two minor CPython releases' worth of improvements. For example, CPython 3.12 sped up by roughly 5%.

The speedup is so significant that if accepted, the new interpreter will be faster than the current JIT compiler.

Drawbacks

  1. Maintainability (this will introduce more code)
  2. Portability

I will address maintainability by using the interpreter generator that was introduced as part of CPython 3.12. The generator will let us automatically produce most of the infrastructure needed for this change. Preliminary estimates suggest the new generator will be only about 200 lines of Python code, most of it shared with, or conceptually the same as, the other generators.

Portability should resolve itself over time (see the next section).

Portability and Precedent

At the moment, this is only supported by clang-19 on AArch64 and AMD64, with partial support in clang-18 and gcc-next, though performance on those is likely bad. The reason is that we need both the __attribute__((musttail)) and __attribute__((preserve_none)) attributes for good performance. GCC only has gnu::musttail but not preserve_none.
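For illustration, detecting the two attributes could look roughly like this (a sketch with made-up macro names, not the actual patch, which would instead simply fall back to the existing interpreters when they are missing):

    /* Hypothetical feature-detection sketch: only enable the tail-calling
       handlers when the compiler has both attributes. */
    #ifndef __has_attribute
    #  define __has_attribute(x) 0
    #endif

    #if __has_attribute(musttail) && __has_attribute(preserve_none)
    #  define HANDLER_MUSTTAIL      __attribute__((musttail))
    #  define HANDLER_PRESERVE_NONE __attribute__((preserve_none))
    #else
    #  define HANDLER_MUSTTAIL      /* tail call not guaranteed; a real build
                                       would use another interpreter instead */
    #  define HANDLER_PRESERVE_NONE /* default calling convention */
    #endif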

There is prior precedent for adding compiler-specific optimizations to CPython. See for example the original computed goto issue by Antoine Pitrou: https://bugs.python.org/issue4753. At the time, it was a new feature only on GCC and not on Clang, but we added it anyway. A few years later, Clang also introduced the feature. The key point is that GCC will likely catch up eventually and add these attributes.

EDIT: Clarified that poor performance on GCC is only a likely outcome. Per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328, I have not tested on GCC trunk, so the claim that performance is bad there is pure speculation. I can try GCC after the PR lands and we can measure from there. However, testing clang with just musttail and no preserve_none, the performance was quite bad.

Implementation plan

  1. Parse error labels in _PyEval_EvalFrameDefault.
  2. Implement the rest.
  3. Add likely/unlikely attributes to DEOPT_IF/EXIT_IF (see the sketch after this list).
  4. Support GCC 15.0 if we determine the performance is good.
  5. Add option to Windows build script.
  6. Mention in What's New. (Note: we NEED PGO, otherwise the performance is not very good on clang.)
  7. Open new issue as deferred blocker to use clang-19 and the tail-calling interpreter for macOS (and possibly Windows) binaries, so we use it for the final release in 3.14.
  8. Open new issue on improving the code quality, by moving some parameters into other places to free up registers.
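For point 3, the idea is to tell the compiler that deopt exits are cold so the hot path stays straight-line. A rough, self-contained sketch (made-up names, not the actual DEOPT_IF macro):

    #include <stdio.h>

    /* Made-up illustration of point 3: __builtin_expect marks the deopt guard
       as unlikely, so the compiler keeps the specialized path straight-line and
       moves the slow path out of the way. Not the actual CPython macro. */
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)
    #define DEOPT_IF(cond) do { if (UNLIKELY(cond)) goto deopt; } while (0)

    static long add_specialized(long a, long b)
    {
        DEOPT_IF(a > 1000000);    /* guard that should almost never fire */
        return a + b;             /* hot, specialized path */
    deopt:
        printf("deopt: take the generic path instead\n");
        return a + b;
    }

    int main(void)
    {
        printf("%ld\n", add_specialized(2, 3));
        return 0;
    }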

Worries about new bugs

Computed goto is well-tested, so worrying about the new interpreter being buggy is fair.

I doubt logic bugs will be the primary concern, as we are using the interpreter generator. This means the base interpreter and the new one share common code. If the new one has logic bugs, the base interpreter likely has them too.

The other concern is compiler bugs. However, to allay such fears, I point out that the GHC calling convention (the thing behind preserve_none) has been around for 5 years [1], and musttail has been around for almost 4 years [2].

cc @pitrou as the original implementer of computed gotos, and @markshannon

Future Use

Kumar Aditya pointed out this could be used in regex and pickle as well. Likewise, Neil Schemenauer pointed out marshal and pickle might benefit from this for faster Python startup.

Has this already been discussed elsewhere?

https://discuss.python.org/t/a-new-tail-calling-interpreter-for-significantly-better-interpreter-performance/76315

Links to previous discussion of this feature:

No response

Linked PRs

@Fidget-Spinner Fidget-Spinner added type-feature A feature request or enhancement performance Performance or resource usage labels Jan 6, 2025
@Fidget-Spinner Fidget-Spinner self-assigned this Jan 6, 2025
@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 6, 2025
@pitrou
Member

pitrou commented Jan 7, 2025

Can you show what a typical tail-calling sequence looks like? Does it combine tail-calling with a computed goto as in this example from the protobuf blog post?

    MUSTTAIL return op_table[op](ARGS);

@Fidget-Spinner
Member Author

Mark gives a pretty good example here faster-cpython/ideas#642 (comment)
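For a quick self-contained illustration (made-up names, not CPython's generated code; needs clang-19 or newer for musttail + preserve_none), a toy handler chain in that style looks roughly like this:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy version of the pattern: every opcode handler is a small function with
       the preserve_none convention, and dispatch is a guaranteed tail call
       through a table, as in the protobuf post. Names are made up. */

    struct vm { int64_t acc; };

    /* The pointer type carries the convention so the indirect tail call matches. */
    typedef void (__attribute__((preserve_none)) *handler_t)(struct vm *vm,
                                                              const uint8_t *ip);

    static __attribute__((preserve_none)) void op_inc(struct vm *vm, const uint8_t *ip);
    static __attribute__((preserve_none)) void op_halt(struct vm *vm, const uint8_t *ip);

    static const handler_t table[] = { op_inc, op_halt };   /* 0 = INC, 1 = HALT */

    /* Advance past the current opcode, then tail-call the next handler. */
    #define DISPATCH() __attribute__((musttail)) return table[*ip](vm, ip)

    static __attribute__((preserve_none)) void op_inc(struct vm *vm, const uint8_t *ip)
    {
        vm->acc += 1;
        ip += 1;
        DISPATCH();
    }

    static __attribute__((preserve_none)) void op_halt(struct vm *vm, const uint8_t *ip)
    {
        (void)ip;
        printf("acc = %lld\n", (long long)vm->acc);
    }

    int main(void)
    {
        const uint8_t code[] = { 0, 0, 0, 1 };   /* INC, INC, INC, HALT */
        struct vm vm = { .acc = 0 };
        table[code[0]](&vm, code);               /* enter the handler chain */
        return 0;
    }

Built with clang-19 at -O2, each DISPATCH() should lower to an indirect jump rather than a call, and because every handler uses preserve_none, vm and ip can stay in argument registers across the whole chain.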

@pitrou
Member

pitrou commented Jan 7, 2025

Neat, thank you!

@diegorusso
Contributor

How much performance is attributed to __attribute__((preserve_none)) alone?

@Fidget-Spinner
Member Author

@diegorusso I did some experiments on WASI and Emscripten (which do not support preserve_none). No speedup at all on those platforms (in fact, up to a 2x slowdown). So without preserve_none, this is a net negative.

@WolframAlph
Contributor

FYI from Clang docs

preserve_none’s ABI is still unstable, and may be changed in the future.

@Fidget-Spinner
Member Author

Fidget-Spinner commented Jan 7, 2025

@WolframAlph I think that doesn't matter because we're only using this on _PyEval_EvalFrameDefault's opcode handlers which are not external-facing. This is an internal implementation detail.

@diegorusso
Contributor

Anyway, I pinged the GCC team at Arm, and a ticket has been created to implement preserve_none in GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

@diegorusso
Contributor

diegorusso commented Jan 7, 2025

@diegorusso I did some experiments on WASI and Emscripten (which do not support preserve_none). No speedup at all on those platforms (in fact, up to a 2x slowdown). So without preserve_none, this is a net negative.

This is what I was expecting. The tail call by itself is not enough (actually I was expecting similar performance to computed goto) and you need preserve_none (like for the JIT) to make the difference.

Have you tested it on AArch64 as well?

@Fidget-Spinner
Member Author

Have you tested it on AArch64 as well?

Not yet. I want to test it on the Faster CPython build bot for macOS (that has clang-19, so it's a fair comparison), but I do not have access to it. If you could run some benchmarks for this I would be really grateful! If you want a quick-and-dirty check that it's working, try just running the pystones benchmark. I got a 25% speedup with tail calls and LTO and PGO on (make sure you enable those, because they contribute around half the perf win for some reason). https://gist.github.com/Fidget-Spinner/e7bf204bf605680b0fc1540fe3777acf

And pass CC=clang-19 and all that.

@WolframAlph
Contributor

@Fidget-Spinner If I understand correctly, the whole trick is:

  1. musttail guarantees a jump to the next subroutine (if their signatures match), getting rid of call overhead (prologue, epilogue, stack space reservation, etc.).
  2. preserve_none puts the burden of preserving registers on the caller. But every subsequent tail call goes to a function with the preserve_none attribute, so none of the opcode handlers spill registers (because we told the compiler not to, and because they won't be needed after the tail call) and the function arguments are simply forwarded in registers through the whole chain of calls.

Do I get it right?

@Fidget-Spinner
Member Author

@WolframAlph

Do I get it right?

Yeah. Also, I suspect most of the speedup is there because the current interpreter loop is too big to optimize properly, so all the pre-existing compilers perform not so well on it.

For example, PGO gives this roughly another 10% speedup over just -O3, and LTO roughly another 10% over PGO and -O3. Normally, if we were comparing equally, PGO and LTO should optimize both the old interpreter and the new one similarly, but I guess the new interpreter is easier to optimize, so it produces better-quality code.

@WolframAlph
Contributor

Also, I suspect most of the speedup is there because the current interpreter loop is too big to optimize properly, so all the pre-existing compilers perform not so well on it.

Makes sense. By splitting the cases into separate functions, the compiler can optimize each of them individually rather than one giant chunk, I assume. The same was mentioned in the protobuf article you linked.

@corona10
Member

corona10 commented Jan 7, 2025

Here are the results of cross-checking bm_pystones with @Fidget-Spinner on macOS AArch64.
The only difference between the tail-calling build and the baseline is the dispatch technique.
I pinned the compiler version and the optimization settings.

Baseline: 2228e92


Configure

CC=clang-19 ./configure --enable-optimizations --with-lto=thin

Result

➜  cpython git:(2228e92da31) ✗ ./python.exe bm_pystones.py
Pystone(1.1) time for 50000 passes = 0.0763571
This machine benchmarks at 654818 pystones/second

Tail-calling https://github.com/Fidget-Spinner/cpython/tree/tail-call


Configure

CC=clang-19 ./configure --enable-optimizations --with-lto=thin --enable-tail-call-interp

Result

➜  cpython git:(tail-call) ✗ ./python.exe bm_pystones.py
Pystone(1.1) time for 50000 passes = 0.0548847
This machine benchmarks at 911001 pystones/second

cc @diegorusso

@Fidget-Spinner
Member Author

Fidget-Spinner commented Jan 7, 2025

According to Donghee's comment above, Pystones is nearly 50% faster on macOS AArch64 vs 25% faster on my Ubuntu AMD64 machine. I suspect it might be because AArch64 has more registers. However, who knows at this point :)?

@Fidget-Spinner
Member Author

Fidget-Spinner commented Jan 7, 2025

Reading https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328, I have not tested on GCC trunk. This is pure speculation on my part that perf is bad on there. I can try with GCC eventually after the PR lands and we can test it from there. However, testing with clang with just musttail and no preserve_none, the performance was quite bad.

@mdboom
Contributor

mdboom commented Jan 8, 2025

EDIT: Here's a new link that has macOS/arm64 and Linux x86_64. 14.5% faster on arm, 8% faster on x86_64. Linux on ARM64 results should show up at the same URL in about 3 hours: https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250107-3.14.0a3+-f1d3190-CLANG

On the Faster CPython infrastructure, on macOS with Clang 19.1, this PR makes things about 14.5% faster over the whole pyperformance suite: https://github.com/faster-cpython/benchmarking-public/tree/use-clang-19/results/bm-20250107-3.14.0a3%2B-f1d3190

@Fidget-Spinner
Member Author

Fidget-Spinner commented Jan 9, 2025

I'm confident we can boost x86-64 perf even further than the 7.5%, by freeing up more call registers.

However, the first PR will be just a straightforward port, to minimize bugs.

@davfsa

davfsa commented Jan 9, 2025

Hey @Fidget-Spinner !

Been following your work on the interpreter improvement, and want to say thank you for giving it a shot (you and all the previous work that multiple people did before you)!

I wanted to bring to your attention a small discrepancy that I noticed in the benchmark results in the faster cpython project public repo.

The bench_mp_pool benchmark seems to have received a massive slowdown on all architectures (especially on arminc AArch64, which is the one that most caught my attention):


Arminc AArch64: bench_mp_pool 14.5 ms → 4.51 sec (310.61x slower)

Darwin ARM64: bench_mp_pool 37.4 ms → 60.1 ms (1.61x slower)

Linux x86_64: bench_mp_pool 24.0 ms → 80.7 ms (3.36x slower)


There are also some other slowdowns in some individual benchmarks, most of them being GC or python startup.

I am sorry if these were already on your radar; I just wanted to note them down, especially that 310.61x-slower result, which I kind of hope is just a fluke.

Again, thank you so much for your work and the entire CPython team!

@Fidget-Spinner
Member Author

@davfsa thanks for bringing those up. bench_mp_pool is a known problem outside of our control. It has (likely) been narrowed down to an unrelated commit: faster-cpython/benchmarking-public#43 (comment).

The other (rare) benchmarks that slowed down are in the 1-2% range, which usually means it's noise. So I'm ignoring those.

The same techniques can potentially be applied to marshal, pickle, and regex to speed up CPython startup and regex matching, so I'm not worried.

@mdboom
Contributor

mdboom commented Jan 9, 2025

Thanks for bringing this up @davfsa! I hope not to seem argumentative by showing my own results -- I genuinely would like to understand how you are arriving at yours so we can reproduce and get to the bottom of it if there is something going on here.

Are you comparing @Fidget-Spinner's branch to its merge base (where it branched off from main) or to some other older version of Python? With my own measurements, when comparing to the merge base, pyperf doesn't find any statistically significant difference (see for example).

One thing we know about bench_mp_pool specifically is that it has a lot of variation from run to run, so much so that you generally can't learn much from it, and pyperf specifically detects this case and says "hidden because not significant" most of the time. You can validate that yourself by looking at the individual values within the same run. For example, when I ran on Linux aarch64, I got these samples:

            1.0530815310776234,
            2.0802172124385834,
            3.6146112345159054,
            8.247541021555662,
            4.341393057256937,
            11.962250161916018,
            1.467464279383421,
            0.9287587739527225,
            3.3638121373951435

As you can see, there is so much variation (on the order of 12x between min and max!) that it's pretty hard to draw any conclusions from this benchmark. It does make me wonder why pyperf isn't rejecting this one for you as well. Maybe because you are comparing to a much older version of Python, and there has been some real regression in that time, unrelated to the changes here?

@davfsa

davfsa commented Jan 9, 2025

Hey @mdboom!

I was just looking at the results from the link you sent in your comment above (https://github.com/faster-cpython/benchmarking-public/tree/main/results%2Fbm-20250107-3.14.0a3%2B-f1d3190-CLANG) and was scrolling through the .md for each of the architectures and the specific gains and losses of the individual benchmarks.

I haven't run the benchmarks myself, I was just looking at the results 🙃

@mdboom
Contributor

mdboom commented Jan 9, 2025

I haven't run the benchmarks myself, I was just looking at the results 🙃

Got it. Yeah, I don't routinely look at the ones against older baselines since they don't really show the effect of the specific change at hand. It looks like we have faster-cpython/benchmarking-public#43 (comment) to track the (unrelated) issue with bench_mp_pool so that should hopefully get looked at.

@mdboom
Contributor

mdboom commented Jan 10, 2025

There are now Windows results available as well. Note this is comparing clang-19 with and without tail calls, not comparing clang-19 to the default MSVC (that comparison is also important, but coming soon). I'm seeing a healthy speedup of 10.8% on x86_64, but a slowdown of 14.4% on x86 (32-bit). I suspect this is not surprising.

Also note these Windows/Clang builds don't use PGO since the Windows build scripts don't currently support that configuration.

https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250107-3.14.0a3+-f1d3190-CLANG

@Fidget-Spinner
Member Author

Fidget-Spinner commented Jan 10, 2025

Btw, I've found the tail-call interpreter to be a much better debugging experience than the default one. There are two QOL improvements:

  1. The variables don't get <optimized out> anymore.
  2. If you want a trace of the instructions executed, you can actually turn off the tail calls, and you get a call stack corresponding to instructions executed.

@Fidget-Spinner
Member Author

PR up at #128718

@brandtbucher
Member

I'm seeing a healthy speedup of 10.8% on x86_64, but a slowdown of 14.4% on x86 (32-bit). I suspect this is not surprising.

This is likely because Clang doesn't support preserve_none on x86, so it's just ignored there (which highlights the importance of the calling convention, especially when registers are scarce).

Btw, I've found the tail-call interpreter to be a much better debugging experience than the default one.

Another cool benefit is that perf will now tell you how much time we spend in each individual bytecode instruction, probably.

@Fidget-Spinner
Member Author

Fidget-Spinner commented Jan 12, 2025

FWIW, musttail in GCC currently seems to behave differently than in clang. I suspected a bug, so I filed an issue in the GCC tracker: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118430.

Edit: not a bug, intended behavior.

@Fidget-Spinner
Member Author

I benched this on GCC and saw a 10% improvement in pystones at -O3. The improvement might be bigger if we enable PGO, but that's currently broken with musttail on GCC 15. I filed a reproducer for that in the GCC tracker: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118442.

@markshannon
Member

markshannon commented Jan 20, 2025

Before we implement the tail-calling interpreter, there are a few preparatory code changes I think we should make.
Nothing big, just changes to keep the PR for the tail-calling interpreter focused on dispatching.

  • Convert labels in ceval.c to LABEL(...) in bytecodes.c for the code generator to layout
  • Implement checks for entry frame so that they can work in the tail-calling interpreter and the JIT
  • Factor out unwinding code, so we don't have the throwflag variable to pass around
  • Get the code generator to directly generate the code for GOTO_INSTRUCTION and jumps to label, not relying on C macros
