Use "prefetch" CPU instructions during the marking phase of the GC #129201
Labels:
- interpreter-core: (Objects, Python, Grammar, and Parser dirs)
- performance: Performance or resource usage
- topic-free-threading
- type-feature: A feature request or enhancement
Feature or enhancement
Proposal:
This change is partially inspired by a similar change made to the OCaml GC. Sam wrote a prototype implementation of the idea, and it showed promise. Now that the free-threaded GC has a "mark alive" phase, adding a prefetch buffer is easier. Limiting prefetching to the marking phase should provide most of the benefit for a minimal increase in code complexity.
Using "prefetch" is expected to provide a benefit only when the working set of objects exceeds the size of the CPU cache. When the working set fits in the cache, the prefetch logic should not hurt performance much. Traversing the object graph would become slightly more complex, because the code has to choose between the prefetch buffer and the stack. On small object graphs, however, the time spent in the GC is also small, so any overhead there matters little.
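To make the idea concrete, here is a minimal sketch of marking through a prefetch buffer. It is not the CPython implementation: `toy_obj`, `mark_state`, `toy_mark_alive`, and the buffer size of 16 are all hypothetical names and values chosen for illustration, and `__builtin_prefetch` is the GCC/Clang builtin (other compilers fall back to a no-op here). The point it tries to show is that a newly discovered object is prefetched when it enters a small FIFO and is only dereferenced when it leaves it, so the memory load has time to complete while other objects are processed. Objects that do not fit in the FIFO go to an ordinary overflow stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#  define PREFETCH(p) ((void)(p))   /* no-op fallback (assumption) */
#endif

/* Hypothetical toy object; this is NOT CPython's PyObject layout. */
typedef struct toy_obj {
    int marked;
    size_t n_refs;
    struct toy_obj **refs;
} toy_obj;

#define BUF_SIZE 16                 /* illustrative; must be a power of two */
#define BUF_MASK (BUF_SIZE - 1)

typedef struct {
    toy_obj *ring[BUF_SIZE];        /* FIFO of already-prefetched objects */
    size_t head, tail;              /* monotonically increasing indices */
    toy_obj **stack;                /* overflow stack, not yet prefetched */
    size_t stack_len, stack_cap;
} mark_state;

static void stack_push(mark_state *st, toy_obj *obj)
{
    if (st->stack_len == st->stack_cap) {
        size_t cap = st->stack_cap ? st->stack_cap * 2 : 64;
        toy_obj **p = realloc(st->stack, cap * sizeof(*p));
        if (p == NULL) {
            abort();                /* keep the sketch simple */
        }
        st->stack = p;
        st->stack_cap = cap;
    }
    st->stack[st->stack_len++] = obj;
}

/* Issue the prefetch and queue the object; the caller checks for room. */
static void ring_put(mark_state *st, toy_obj *obj)
{
    assert(st->tail - st->head < BUF_SIZE);
    PREFETCH(obj);
    st->ring[st->tail++ & BUF_MASK] = obj;
}

/* Queue a reference WITHOUT dereferencing it: reading obj->marked here
   would cause the very cache miss the buffer is meant to hide. */
static void enqueue(mark_state *st, toy_obj *obj)
{
    if (obj == NULL) {
        return;
    }
    if (st->tail - st->head < BUF_SIZE) {
        ring_put(st, obj);
    }
    else {
        stack_push(st, obj);
    }
}

/* Mark everything reachable from the roots; returns the number marked. */
size_t toy_mark_alive(toy_obj **roots, size_t n_roots)
{
    mark_state st = {0};
    size_t n_marked = 0;

    for (size_t i = 0; i < n_roots; i++) {
        enqueue(&st, roots[i]);
    }
    for (;;) {
        /* Top up the FIFO from the stack so prefetches stay in flight. */
        while (st.stack_len > 0 && st.tail - st.head < BUF_SIZE) {
            ring_put(&st, st.stack[--st.stack_len]);
        }
        if (st.head == st.tail) {
            break;                  /* FIFO and stack are both empty */
        }
        toy_obj *obj = st.ring[st.head++ & BUF_MASK];
        if (obj->marked) {
            continue;               /* reached earlier via another path */
        }
        obj->marked = 1;
        n_marked++;
        for (size_t i = 0; i < obj->n_refs; i++) {
            enqueue(&st, obj->refs[i]);
        }
    }
    free(st.stack);
    return n_marked;
}
```

A real implementation would presumably tune the buffer size and might skip the buffer entirely for small graphs, where, as noted above, prefetching is not expected to help.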
Note that this change is proposed only for the free-threaded version of the cyclic GC. It might be possible to use prefetching in the default build GC, but the design would need to be fairly different because of the next/prev GC linked lists. A separate issue should be created if someone wants to try implementing that optimization.
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs