Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEX: Implements new sampling based stats #4291

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Sonicadvance1
Copy link
Member

The ftrace/gpuviz path has an issue with some usability. It can't be put in high-frequency locations like the SIGBUS handler, and it is hard to correlate between stutters in the game with a gpuviz trace after the fact.

With the combination of these two problems I have written this little SHM implementation of statistics that are /virtually/ free on the FEX side and requires the reader to accumulate the data correctly.

Some design decisions

  • All atomics are weak-memory ordered, requiring the use of memory_barriers and mutexes in appropriate locations
    • This improves performance since it's avoid atomic overheads entirely (Which FEX already has issues with)
    • This means all stats are per-thread and the accumulation has to happen on the reader side
  • The stats SHM region can never shrink, which would result in problems.
  • The stats are in a singly linked-list because the expectation is low thread counts
  • The stats are in a single allocation region which allows us to expand the stats if the number of threads grow

This is implemented for Linux and WINE, so we can use it everywhere.

This gets us some nice stat graphics like what follows:
Screenshot 2025-01-21 at 19 28 37

We only have four stats currently, but there can definitely be more stats as time progresses.
The Mangohud patches are forthcoming while I clean them up.

Not wired up, just the definitions so it lives in the
InternalThreadState.

We want this accessible from both FEXCore and the frontends so it needs
to live there.

Two types of events supported. Scoped cyclecounts and instant
increments.

This gives us JIT time and Signal handling time, plus events for number
of SIGBUS and number of SMC events.

All useful statistics for seeing stutter live.
For the four things we care about
Not wired up to anything. Requires the frontends to allocate shared
memory in the expected way.
This is fairly straightforward. It creates the shared memory region in
/dev/shm/fex-<pid>-stats so that Mangohud can sample it.
This is a little trickier, we actually open the
`/dev/shm/fex-<pid>-stats` file directly using Windows APIs that way
Mangohud (which is going to be on the Linux side, or potentially even
embedded in to Gamescope) can safely pick up the stats.

A little quirky plus doesn't support expanding its size since WINE
doesn't support NtExtendSection, but that's fine.
@bylaws
Copy link
Collaborator

bylaws commented Jan 22, 2025

I'll make a fixup to replace the wine support with something simpler in a bit

Copy link
Member

@neobrain neobrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a first look - heavy stuff, but gathering and visualizing these metrics indeed is useful functionality to have.

Will mangohud need to maintain a fixed list of FEX-specific metrics (and update them if we add any) or will it automatically pick up new metrics added in FEX?

Is this designed to support accumulation of metrics across processes as well or is it solely for the process running mangohud?

The current implementation leaves me uncomfortable, since there's only so much faith one can have in reliability when interleaving weakly-ordered atomics, shared memory, and a custom linked list implementation. Well-placed and tested abstractions and more explaining docs would help support that faith.

More generally, can I ask what's the benefit of a MangoHud-specific approach compared to an established nanosecond-resolution live-profiler like Tracy? Obviously MangoHud is great for general stats like FPS and CPU load, but FEX's metrics are very specific and I'm not sure how valuable having a short window of these is. Most metrics could probably even be reported via ftrace despite its overhead (i.e. less rapidly updated metrics, or those that relate to costly SIGBUS/SMC events anyway), which could give us free multi-process awareness without bespoke accumulation logic in external projects.

Source/Common/Profiler.cpp Outdated Show resolved Hide resolved
Source/Common/Profiler.h Outdated Show resolved Hide resolved
Source/Common/Profiler.h Show resolved Hide resolved
Source/Common/Profiler.h Outdated Show resolved Hide resolved
#endif

namespace FEX::Profiler {
class StatAllocBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some heavy logic in here, so this should explain what it's doing.

Comment on lines +49 to +51
FEXCore::Profiler::ThreadStatsHeader* Head {};
FEXCore::Profiler::ThreadStats* Stats;
FEXCore::Profiler::ThreadStats* StatTail {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you preferring a custom-implemented linked list over flat memory? This seems neither more convenient for the implementation nor like it would improve performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sparsity in the allocation buffer is a concern, so this is how I'm keeping the implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate what exactly the problem is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reader side would need to walk the entire allocated buffer space to ensure it doesn't miss any slots, instead of the average case of a hundred or so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What scenario exactly would trigger this? I don't know what your reader side looks like, but it sounds like a bitmask and/or active slot count would easily resolve this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just going to leave it as a linked list. We can rev the version number and change it to a bitmask in the future if someone is feeling up to writing that code in the future.

Copy link
Member

@neobrain neobrain Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the choice between linked lists and flat memory doesn't seem to be so important, why is the focus on weakly-ordered atomics considered critical?

(I don't agree using a custom data structure here is the way forward since it makes up a good chunk of the complexity in this PR, but on top of that the performance targets don't seem consistent here as-is.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the atomics I don't want the writer side to be affected by cacheline sharing problems at all. Semantically these are two independent things. The containing data which is FEXCore::Profiler::ThreadStats and the way to walk the list of data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand they are different things, but it seems arbitrary to want this (with no explicit motivation) while using a data structure with poor performance characteristics in the same module.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, which is why we leave some of the "poor performance" problems on the reader side which has something like 500ms to sample the data, even though this is going to be hundreds of nanoseconds regardless.

}

// Find a free slot
memory_barrier();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to the cost of the profiled events (SIGBUS/SMC), is it really worth dealing with the added complexity here instead of just using stronger atomics? Also, isn't thread creation/destruction in particular rare enough that we could use strong memory ordering on slot management variables while leaving the stats themselves weakly ordered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False dependency sharing in the hot path is a significant concern, so I'm keeping them as weak atomics.

Copy link
Member

@neobrain neobrain Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's the hot path? The currently traced events are orders of magnitude more expensive to process than potential cache behavior effects.

This applies even more to thread creation/destruction, which is a rare event on top of that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have some leniency on thread creation and teardown, which is already slow and causes global locks to be taken, but this is the way I have decided to implement it.

@Sonicadvance1
Copy link
Member Author

https://github.com/Sonicadvance1/MangoHud/tree/fex
For the WIP FEX support in mangohud.

Source/Windows/WOW64/Module.cpp Outdated Show resolved Hide resolved
Source/Windows/ARM64EC/Module.cpp Outdated Show resolved Hide resolved
Source/Windows/WOW64/Module.cpp Outdated Show resolved Hide resolved
Fallback to the previous path if it doesn't exist.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 23, 2025
When running on a system with a 48-bit VA, if FEX does any allocations
between us reserving the upper 128TB and the application running, then
/technically/ we are intersecting with the application's memory region
in the lower 47-bits.

This didn't typically result in any problems due to how ASLR works, but
if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region)
then these typically get pushed higher in the VA space.

Again not usually a problem, but if you happen to be running an
application that is using MAP_FIXED with hardcoded addresses then this
can stomp over FEX-Emu memory causing problems.

This is what happens with Wine, it reserves the upper-32MB of its 47-bit
VA space, which is /highly/ likely to stomp on FEX memory. In-fact it
likely occurs all the time, we just got lucky with whatever it was
clobbering wasn't used at the time.

On 39-bit VA systems this isn't a problem because the mmap fails
outright with a warning message from WINE.

Because we are already reserving the upper 128TB of VA space, instead
just always enable our allocator and use the regions that were reserved.
We need to be a little bit careful to ensure we don't accidentally
allocate more memory post-reservation but that just requires a small
adjustment to our unique_ptr and constructor for the 64BitAllocator.

This means /all/ FEX-Emu allocations will be in the upper 128TB VA space
when running 64-bit applications on a 48-bit VA system. Which is kind of
nice.

Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per
process.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 23, 2025
When running on a system with a 48-bit VA, if FEX does any allocations
between us reserving the upper 128TB and the application running, then
/technically/ we are intersecting with the application's memory region
in the lower 47-bits.

This didn't typically result in any problems due to how ASLR works, but
if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region)
then these typically get pushed higher in the VA space.

Again not usually a problem, but if you happen to be running an
application that is using MAP_FIXED with hardcoded addresses then this
can stomp over FEX-Emu memory causing problems.

This is what happens with Wine, it reserves the upper-32MB of its 47-bit
VA space, which is /highly/ likely to stomp on FEX memory. In-fact it
likely occurs all the time, we just got lucky with whatever it was
clobbering wasn't used at the time.

On 39-bit VA systems this isn't a problem because the mmap fails
outright with a warning message from WINE.

Because we are already reserving the upper 128TB of VA space, instead
just always enable our allocator and use the regions that were reserved.
We need to be a little bit careful to ensure we don't accidentally
allocate more memory post-reservation but that just requires a small
adjustment to our unique_ptr and constructor for the 64BitAllocator.

This means /all/ FEX-Emu allocations will be in the upper 128TB VA space
when running 64-bit applications on a 48-bit VA system. Which is kind of
nice.

Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per
process.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 24, 2025
When running on a system with a 48-bit VA, if FEX does any allocations
between us reserving the upper 128TB and the application running, then
/technically/ we are intersecting with the application's memory region
in the lower 47-bits.

This didn't typically result in any problems due to how ASLR works, but
if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region)
then these typically get pushed higher in the VA space.

Again not usually a problem, but if you happen to be running an
application that is using MAP_FIXED with hardcoded addresses then this
can stomp over FEX-Emu memory causing problems.

This is what happens with Wine, it reserves the upper-32MB of its 47-bit
VA space, which is /highly/ likely to stomp on FEX memory. In-fact it
likely occurs all the time, we just got lucky with whatever it was
clobbering wasn't used at the time.

On 39-bit VA systems this isn't a problem because the mmap fails
outright with a warning message from WINE.

Because we are already reserving the upper 128TB of VA space, instead
just always enable our allocator and use the regions that were reserved.
We need to be a little bit careful to ensure we don't accidentally
allocate more memory post-reservation but that just requires a small
adjustment to our unique_ptr and constructor for the 64BitAllocator.

This means /all/ FEX-Emu allocations will be in the upper 128TB VA space
when running 64-bit applications on a 48-bit VA system. Which is kind of
nice.

Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per
process.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 24, 2025
When running on a system with a 48-bit VA, if FEX does any allocations
between us reserving the upper 128TB and the application running, then
/technically/ we are intersecting with the application's memory region
in the lower 47-bits.

This didn't typically result in any problems due to how ASLR works, but
if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region)
then these typically get pushed higher in the VA space.

Again not usually a problem, but if you happen to be running an
application that is using MAP_FIXED with hardcoded addresses then this
can stomp over FEX-Emu memory causing problems.

This is what happens with Wine, it reserves the upper-32MB of its 47-bit
VA space, which is /highly/ likely to stomp on FEX memory. In-fact it
likely occurs all the time, we just got lucky with whatever it was
clobbering wasn't used at the time.

On 39-bit VA systems this isn't a problem because the mmap fails
outright with a warning message from WINE.

Because we are already reserving the upper 128TB of VA space, instead
just always enable our allocator and use the regions that were reserved.
We need to be a little bit careful to ensure we don't accidentally
allocate more memory post-reservation but that just requires a small
adjustment to our unique_ptr and constructor for the 64BitAllocator.

This means /all/ FEX-Emu allocations will be in the upper 128TB VA space
when running 64-bit applications on a 48-bit VA system. Which is kind of
nice.

Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per
process.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants