gh-129201: Use prefetch in GC mark alive phase. #129203

Draft
wants to merge 10 commits into base: main from gh-129201-gc-mark-prefetch
Conversation

@nascheme (Member) commented Jan 22, 2025

This PR implements prefetching as suggested in the issue. The benchmarks that are part of pyperformance generally don't show much difference since they don't use enough memory to make prefetch effective.

Source code for the two "big" benchmarks. These create a fairly large object graph and then call gc.collect() to time it.

gc big tree
gc big

Updated benchmarks for commit 041b2e4. Run on a MacBook Pro M3, clang 19, PGO/LTO enabled. Both "base" and "prefetch" are configured with ./configure --enable-optimizations --disable-gil --with-lto.

| benchmark | build | prefetch | time [ms] | speedup |
|---|---|---|---|---|
| bm_gc_collect | base | - | 1113 | 1.00 |
| bm_gc_collect | prefetch | on | 1280 | 0.87 |
| bm_gc_collect | prefetch | off | 1273 | 0.87 |
| bm_gc_traversal | base | - | 1366 | 1.00 |
| bm_gc_traversal | prefetch | on | 1336 | 1.02 |
| bm_gc_traversal | prefetch | off | 1324 | 1.03 |
| gc_big | base | - | 874 | 1.00 |
| gc_big | prefetch | on | 465 | 1.88 |
| gc_big | prefetch | off | 544 | 1.61 |
| gc_big_tree | base | - | 565 | 1.00 |
| gc_big_tree | prefetch | on | 262 | 2.16 |
| gc_big_tree | prefetch | off | 304 | 1.86 |

pyperformance results compared to the merge base.

Some notes about the implementation:

  • Exactly which prefetch instruction to use is a bit unclear. On my MacBook M3, PREFETCH_L1 seems slightly faster, whereas on my AMD Ryzen 5 desktop PREFETCH_L2 is a bit faster; the difference is small either way. There are also prefetch hints that indicate you are going to write to the word (the second parameter of __builtin_prefetch()). If we are visiting a GC object for the first time, we do write to it, to set the "alive" bit in the flags; but if it's not a GC object or we have already visited it, we only read. So reads would seem quite a bit more common. (A sketch of the prefetch macro and its hint parameters appears after this list.)
  • The prefetch preprocessor gunk was mostly lifted from the zstd source code. I haven't tested the Windows or __aarch64__ branches of that block.
  • The BUFFER_SIZE, BUFFER_HI, and BUFFER_LO constants have been moderately tuned based on benchmarking on my Ryzen 5 machine. I'm somewhat surprised that BUFFER_HI is so small, but that's what works best. There could be additional tuning of when to put objects on the stack vs. into the buffer, and when to push a new "span" vs. adding part of it to the buffer. I did what seemed logical but didn't benchmark every approach. (A sketch of the watermark/span idea follows this list.)
  • I ran some tests that recorded a trace of each prefetch and the later access (following the pointer) and then computed statistics from the trace log. Based on this, the prefetch buffer seems to be working fairly well (the median delay between prefetch and access was 14 on one benchmark run; ASCII histogram below).
Prefetch window stats
mean 52.07285263157895
median 14.0
max 256
   25.60 |65,304  | ******************************
   51.20 |5,590   | **
   76.80 |3,562   | *
  102.40 |2,683   | *
  128.00 |2,278   | *
  153.60 |2,285   | *
  179.20 |2,377   | *
  204.80 |2,238   | *
  230.40 |2,753   | *
  256.00 |5,930   | **
-------- |------- | -------
      N= |95,000 
  • Having the specialized gc_mark_traverse_list and gc_mark_traverse_tuple functions seems like a pretty clear win (a sketch of the idea follows this list). I didn't consider adding more of them. If the program doesn't have many list or tuple objects, it may not help much.
  • The _PyTuple_MaybeUntrack() call is a bit expensive and it gets done on every collection. I considered adding a _PyGC_BITS_NEW bit and setting it on newly created tuples; then we would only need to check a tuple once rather than once per collection. I tried disabling the tuple untracking entirely but things got slower.
  • As currently implemented, all the object pointers revealed by tp_traverse go through the mark buffer/stack. That seems wasteful given that a good fraction of them will be non-GC objects and some of them are known things like None or True. I tried adding a check for BSS allocated objects (using __bss_start and end) but that doesn't seem like a win. Maybe with some more refinement this would work. We could allocate a bunch of known non-GC objects into a certain region of memory and then gc_propagate_alive() could just quickly skip them based on the pointer address.
  • I considered adding a new number field to the type object so that gc_propagate_alive() could dispatch to the correct traverse function with a switch statement, something like switch (Py_TYPE(ob)->tp_gc_ordinal) where the tp_gc_ordinal value on known types is set at startup. That way the switch can compile to an efficient jump table or similar. However, with only list and tuple specialized at this time, it doesn't seem worth it.
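For reference, here is a minimal sketch of a zstd-style portable prefetch macro of the kind described in the first two bullets. The macro names, compiler conditions, and hint values are illustrative assumptions, not the PR's actual code:

```c
/* Illustrative sketch only -- not the PR's actual macro definitions. */
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#  include <xmmintrin.h>
#  define PREFETCH_L1(ptr)  _mm_prefetch((const char *)(ptr), _MM_HINT_T0)
#  define PREFETCH_L2(ptr)  _mm_prefetch((const char *)(ptr), _MM_HINT_T1)
#elif defined(__GNUC__) || defined(__clang__)
   /* __builtin_prefetch(addr, rw, locality):
    *   rw       -- 0 = prefetch for a read, 1 = prefetch for a write
    *   locality -- 3 keeps the line close (L1), 2 in L2, 0 = no reuse expected */
#  define PREFETCH_L1(ptr)  __builtin_prefetch((ptr), 0, 3)
#  define PREFETCH_L2(ptr)  __builtin_prefetch((ptr), 0, 2)
#else
#  define PREFETCH_L1(ptr)  do { (void)(ptr); } while (0)  /* no-op fallback */
#  define PREFETCH_L2(ptr)  do { (void)(ptr); } while (0)
#endif
```

Passing 1 for the rw argument would request a prefetch for writing; since most visited pointers turn out to be read-only (non-GC objects or already-marked objects), prefetching for reads seems like the safer default.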
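And a hypothetical sketch of the buffer/span bookkeeping described in the third bullet, reusing the PREFETCH_L2 macro from the previous sketch. All names and constant values here are made up for illustration; the PR's tuned values and helper names may differ:

```c
#include "Python.h"

#define BUFFER_SIZE 256   /* illustrative values only */
#define BUFFER_HI    16   /* top the buffer up to this many pending entries */
#define BUFFER_LO     8   /* refill whenever it drains below this */

/* A "span" remembers a contiguous run of object pointers (e.g. the items
 * of a large list or tuple) so that only a chunk at a time needs to be
 * copied into the prefetch buffer, keeping the buffer near its sweet spot. */
typedef struct {
    PyObject **next;
    PyObject **end;
} gc_span;

/* Move at most `limit` pointers from the span into `buffer`, issuing a
 * prefetch as each pointer is enqueued.  Returns the number moved. */
static Py_ssize_t
span_refill(gc_span *span, PyObject **buffer, Py_ssize_t limit)
{
    Py_ssize_t n = 0;
    while (n < limit && span->next < span->end) {
        PyObject *op = *span->next++;
        if (op == NULL) {
            continue;   /* lists and tuples can contain NULL slots */
        }
        PREFETCH_L2(op);
        buffer[n++] = op;
    }
    return n;
}
```

The intent is that by the time a pointer reaches the front of the buffer and is dereferenced, its cache line has had time to arrive, which is consistent with the median prefetch-to-access distance of 14 in the trace statistics above.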
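Finally, a sketch of what a specialized list traversal like gc_mark_traverse_list might look like; gc_mark_enqueue() is a stand-in name for whatever routine pushes a pointer into the buffer or object stack, and the real signature in the PR may differ:

```c
/* Assumed helper: push one object pointer into the mark buffer/stack. */
static void gc_mark_enqueue(PyObject *op);

static void
gc_mark_traverse_list(PyObject *op)
{
    PyListObject *list = (PyListObject *)op;
    Py_ssize_t n = Py_SIZE(list);
    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *item = list->ob_item[i];
        if (item != NULL) {   /* list/tuple items may be NULL */
            gc_mark_enqueue(item);
        }
    }
}
```

Compared with calling Py_TYPE(op)->tp_traverse with a visit callback, this avoids an indirect call per contained item, which is presumably where the win comes from.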

When traversing a list or tuple, use a "span" if the buffer can't hold
all the items from the collection.  This reduces the size of the object
stack needed if large collections are encountered.  It also helps
keep the buffer size optimal for prefetching.
@nascheme force-pushed the gh-129201-gc-mark-prefetch branch from fe9898a to 7f51104 on January 24, 2025 07:49
It's possible for lists or tuples to have a NULL item.  Handle that
in the case where all items fit into the buffer.
@nascheme (Member, Author) commented:
Benchmarks for an earlier version of the PR, without the "span" handling for lists/tuples.

The benchmark results below come from my own workstation (AMD Ryzen 5 7600X, Linux, GCC 12.2.0). I'm using --enable-optimizations but LTO is turned off. "prefetch off" means the prefetch CPU instruction is omitted; otherwise the code is the same.

There is something bad happening with the default build GC if gc.disable() is not used while creating a big data structure. With the GC enabled during construction, the "gc big" benchmark takes 5x as long. My guess is that the generation 0 collections are shuffling the objects in the GC next/prev linked list, but that's only a guess.

The "bm_gc_collect" benchmark was taken from pyperformance and the constants adjusted: CYCLES = 100_000 and LINKS = 40.

The "prefetch (7f756eb0)" code branch is essentially the same as this PR (1b4e8c3). I just rebased it on the current main and removed some dead code.

| code branch | build | prefetch on | benchmark | time [ms] |
|---|---|---|---|---|
| prefetch (7f756eb0) | nogil | yes | bm_async_tree_io_tg | 6,752 |
| prefetch (7f756eb0) | nogil | no | bm_async_tree_io_tg | 6,964 |
| main (3829104) | nogil | - | bm_async_tree_io_tg | 6,960 |
| main (3829104) | default | - | bm_async_tree_io_tg | 7,323 |
| prefetch (7f756eb0) | nogil | yes | bm_gc_collect | 1,418 |
| prefetch (7f756eb0) | nogil | no | bm_gc_collect | 1,517 |
| main (3829104) | nogil | - | bm_gc_collect | 1,473 |
| main (3829104) | default | - | bm_gc_collect | 20,966 |
| prefetch (7f756eb0) | nogil | yes | gc big | 587 |
| prefetch (7f756eb0) | nogil | no | gc big | 1,064 |
| main (3829104) | nogil | - | gc big | 963 |
| main (3829104) | default | - | gc big | 5,785 |
| prefetch (7f756eb0) | nogil | yes | gc big tree | 363 |
| prefetch (7f756eb0) | nogil | no | gc big tree | 519 |
| main (3829104) | nogil | - | gc big tree | 723 |
| main (3829104) | default | - | gc big tree | 3,443 |
| main (3829104) | default | - | gc big (gc disabled) | 641 |

@nascheme marked this pull request as ready for review on January 27, 2025 20:49
@corona10 (Member) commented Jan 29, 2025

Just saying this in passing, I recommend using the word "freethreaded" instead of "nogil". :)

@nascheme requested a review from mpage on January 30, 2025 21:12
Need to clear the "alive" bit.
@nascheme marked this pull request as draft on January 31, 2025 08:58
@nascheme (Member, Author) commented:
Merging with 'main' has significantly regressed the gc_traversal benchmark. I'll do some investigation on that; something to do with 5ff2fbc would be my guess.

Using the prefetch buffer only helps if there are enough objects.  Use
the long-lived count to decide if it's worth enabling.  If not, fall back
to the previous method of marking objects alive (dereferencing object
pointers as we encounter them).

Improve the code by adding some helper functions, adding comments,
and general tidying.  The buffer logic has been changed to use a size
mask rather than the % operator.  Some other small optimizations
that only help a little.
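As a small illustration of the mask-based indexing mentioned in that last commit message (names and values are illustrative; this only works when the buffer size is a power of two):

```c
#include <stddef.h>

#define BUFFER_SIZE 256                    /* must be a power of two */
#define BUFFER_MASK (BUFFER_SIZE - 1)

/* Wrap-around with a bitwise AND instead of the slower % operator:
 *     index = (index + 1) % BUFFER_SIZE;   -- before
 *     index = (index + 1) & BUFFER_MASK;   -- after                 */
static inline size_t
buffer_next(size_t index)
{
    return (index + 1) & BUFFER_MASK;
}
```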
Labels: performance (Performance or resource usage), topic-free-threading, type-feature (A feature request or enhancement)