-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEX: Implements new sampling based stats #4291
base: main
Are you sure you want to change the base?
Conversation
Not wired up, just the definitions so it lives in the InternalThreadState. We want this accessible from both FEXCore and the frontends so it needs to live there. Two types of events supported. Scoped cyclecounts and instant increments. This gives us JIT time and Signal handling time, plus events for number of SIGBUS and number of SMC events. All useful statistics for seeing stutter live.
For the four things we care about
Not wired up to anything. Requires the frontends to allocate shared memory in the expected way.
This is fairly straightforward. It creates the shared memory region in /dev/shm/fex-<pid>-stats so that Mangohud can sample it.
This is a little trickier, we actually open the `/dev/shm/fex-<pid>-stats` file directly using Windows APIs that way Mangohud (which is going to be on the Linux side, or potentially even embedded in to Gamescope) can safely pick up the stats. A little quirky plus doesn't support expanding its size since WINE doesn't support NtExtendSection, but that's fine.
235e9cd
to
46dca8a
Compare
I'll make a fixup to replace the wine support with something simpler in a bit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave this a first look - heavy stuff, but gathering and visualizing these metrics indeed is useful functionality to have.
Will mangohud need to maintain a fixed list of FEX-specific metrics (and update them if we add any) or will it automatically pick up new metrics added in FEX?
Is this designed to support accumulation of metrics across processes as well or is it solely for the process running mangohud?
The current implementation leaves me uncomfortable, since there's only so much faith one can have in reliability when interleaving weakly-ordered atomics, shared memory, and a custom linked list implementation. Well-placed and tested abstractions and more explaining docs would help support that faith.
More generally, can I ask what's the benefit of a MangoHud-specific approach compared to an established nanosecond-resolution live-profiler like Tracy? Obviously MangoHud is great for general stats like FPS and CPU load, but FEX's metrics are very specific and I'm not sure how valuable having a short window of these is. Most metrics could probably even be reported via ftrace despite its overhead (i.e. less rapidly updated metrics, or those that relate to costly SIGBUS/SMC events anyway), which could give us free multi-process awareness without bespoke accumulation logic in external projects.
#endif | ||
|
||
namespace FEX::Profiler { | ||
class StatAllocBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some heavy logic in here, so this should explain what it's doing.
FEXCore::Profiler::ThreadStatsHeader* Head {}; | ||
FEXCore::Profiler::ThreadStats* Stats; | ||
FEXCore::Profiler::ThreadStats* StatTail {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you preferring a custom-implemented linked list over flat memory? This seems neither more convenient for the implementation nor like it would improve performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sparsity in the allocation buffer is a concern, so this is how I'm keeping the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate what exactly the problem is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reader side would need to walk the entire allocated buffer space to ensure it doesn't miss any slots, instead of the average case of a hundred or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What scenario exactly would trigger this? I don't know what your reader side looks like, but it sounds like a bitmask and/or active slot count would easily resolve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just going to leave it as a linked list. We can rev the version number and change it to a bitmask in the future if someone is feeling up to writing that code in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the choice between linked lists and flat memory doesn't seem to be so important, why is the focus on weakly-ordered atomics considered critical?
(I don't agree using a custom data structure here is the way forward since it makes up a good chunk of the complexity in this PR, but on top of that the performance targets don't seem consistent here as-is.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the atomics I don't want the writer side to be affected by cacheline sharing problems at all. Semantically these are two independent things. The containing data which is FEXCore::Profiler::ThreadStats
and the way to walk the list of data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand they are different things, but it seems arbitrary to want this (with no explicit motivation) while using a data structure with poor performance characteristics in the same module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, which is why we leave some of the "poor performance" problems on the reader side which has something like 500ms to sample the data, even though this is going to be hundreds of nanoseconds regardless.
Source/Common/Profiler.cpp
Outdated
} | ||
|
||
// Find a free slot | ||
memory_barrier(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compared to the cost of the profiled events (SIGBUS/SMC), is it really worth dealing with the added complexity here instead of just using stronger atomics? Also, isn't thread creation/destruction in particular rare enough that we could use strong memory ordering on slot management variables while leaving the stats themselves weakly ordered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
False dependency sharing in the hot path is a significant concern, so I'm keeping them as weak atomics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's the hot path? The currently traced events are orders of magnitude more expensive to process than potential cache behavior effects.
This applies even more to thread creation/destruction, which is a rare event on top of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we have some leniency on thread creation and teardown, which is already slow and causes global locks to be taken, but this is the way I have decided to implement it.
2e5457b
to
9bbe6a2
Compare
https://github.com/Sonicadvance1/MangoHud/tree/fex |
9bbe6a2
to
38dbfe2
Compare
38dbfe2
to
832f9c2
Compare
Fallback to the previous path if it doesn't exist.
When running on a system with a 48-bit VA, if FEX does any allocations between us reserving the upper 128TB and the application running, then /technically/ we are intersecting with the application's memory region in the lower 47-bits. This didn't typically result in any problems due to how ASLR works, but if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region) then these typically get pushed higher in the VA space. Again not usually a problem, but if you happen to be running an application that is using MAP_FIXED with hardcoded addresses then this can stomp over FEX-Emu memory causing problems. This is what happens with Wine, it reserves the upper-32MB of its 47-bit VA space, which is /highly/ likely to stomp on FEX memory. In-fact it likely occurs all the time, we just got lucky with whatever it was clobbering wasn't used at the time. On 39-bit VA systems this isn't a problem because the mmap fails outright with a warning message from WINE. Because we are already reserving the upper 128TB of VA space, instead just always enable our allocator and use the regions that were reserved. We need to be a little bit careful to ensure we don't accidentally allocate more memory post-reservation but that just requires a small adjustment to our unique_ptr and constructor for the 64BitAllocator. This means /all/ FEX-Emu allocations will be in the upper 128TB VA space when running 64-bit applications on a 48-bit VA system. Which is kind of nice. Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per process.
When running on a system with a 48-bit VA, if FEX does any allocations between us reserving the upper 128TB and the application running, then /technically/ we are intersecting with the application's memory region in the lower 47-bits. This didn't typically result in any problems due to how ASLR works, but if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region) then these typically get pushed higher in the VA space. Again not usually a problem, but if you happen to be running an application that is using MAP_FIXED with hardcoded addresses then this can stomp over FEX-Emu memory causing problems. This is what happens with Wine, it reserves the upper-32MB of its 47-bit VA space, which is /highly/ likely to stomp on FEX memory. In-fact it likely occurs all the time, we just got lucky with whatever it was clobbering wasn't used at the time. On 39-bit VA systems this isn't a problem because the mmap fails outright with a warning message from WINE. Because we are already reserving the upper 128TB of VA space, instead just always enable our allocator and use the regions that were reserved. We need to be a little bit careful to ensure we don't accidentally allocate more memory post-reservation but that just requires a small adjustment to our unique_ptr and constructor for the 64BitAllocator. This means /all/ FEX-Emu allocations will be in the upper 128TB VA space when running 64-bit applications on a 48-bit VA system. Which is kind of nice. Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per process.
When running on a system with a 48-bit VA, if FEX does any allocations between us reserving the upper 128TB and the application running, then /technically/ we are intersecting with the application's memory region in the lower 47-bits. This didn't typically result in any problems due to how ASLR works, but if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region) then these typically get pushed higher in the VA space. Again not usually a problem, but if you happen to be running an application that is using MAP_FIXED with hardcoded addresses then this can stomp over FEX-Emu memory causing problems. This is what happens with Wine, it reserves the upper-32MB of its 47-bit VA space, which is /highly/ likely to stomp on FEX memory. In-fact it likely occurs all the time, we just got lucky with whatever it was clobbering wasn't used at the time. On 39-bit VA systems this isn't a problem because the mmap fails outright with a warning message from WINE. Because we are already reserving the upper 128TB of VA space, instead just always enable our allocator and use the regions that were reserved. We need to be a little bit careful to ensure we don't accidentally allocate more memory post-reservation but that just requires a small adjustment to our unique_ptr and constructor for the 64BitAllocator. This means /all/ FEX-Emu allocations will be in the upper 128TB VA space when running 64-bit applications on a 48-bit VA system. Which is kind of nice. Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per process.
When running on a system with a 48-bit VA, if FEX does any allocations between us reserving the upper 128TB and the application running, then /technically/ we are intersecting with the application's memory region in the lower 47-bits. This didn't typically result in any problems due to how ASLR works, but if we did any large allocations (like FEX-Emu#4291 wants with 128MB VMA region) then these typically get pushed higher in the VA space. Again not usually a problem, but if you happen to be running an application that is using MAP_FIXED with hardcoded addresses then this can stomp over FEX-Emu memory causing problems. This is what happens with Wine, it reserves the upper-32MB of its 47-bit VA space, which is /highly/ likely to stomp on FEX memory. In-fact it likely occurs all the time, we just got lucky with whatever it was clobbering wasn't used at the time. On 39-bit VA systems this isn't a problem because the mmap fails outright with a warning message from WINE. Because we are already reserving the upper 128TB of VA space, instead just always enable our allocator and use the regions that were reserved. We need to be a little bit careful to ensure we don't accidentally allocate more memory post-reservation but that just requires a small adjustment to our unique_ptr and constructor for the 64BitAllocator. This means /all/ FEX-Emu allocations will be in the upper 128TB VA space when running 64-bit applications on a 48-bit VA system. Which is kind of nice. Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per process.
The ftrace/gpuviz path has an issue with some usability. It can't be put in high-frequency locations like the SIGBUS handler, and it is hard to correlate between stutters in the game with a gpuviz trace after the fact.
With the combination of these two problems I have written this little SHM implementation of statistics that are /virtually/ free on the FEX side and requires the reader to accumulate the data correctly.
Some design decisions
This is implemented for Linux and WINE, so we can use it everywhere.
This gets us some nice stat graphics like what follows:
We only have four stats currently, but there can definitely be more stats as time progresses.
The Mangohud patches are forthcoming while I clean them up.