Call a function on each hardware thread #52
Comments
While it's not obvious or documented, you can use qt_loop. Is that good enough for your purposes?
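For concreteness, here is a minimal sketch of that suggestion (my reading of the intended usage, not code from the thread): qt_loop with one iteration per worker, printing which worker ran each iteration.

```c
/* Sketch only: assumes the qt_loop interface from qthread/qloop.h.
 * Runs one loop iteration per worker and reports which worker picked it up;
 * note that nothing here guarantees a distinct worker per iteration. */
#include <stdio.h>
#include <qthread/qthread.h>
#include <qthread/qloop.h>

static void report(size_t start, size_t stop, void *arg) {
    (void)arg;
    for (size_t i = start; i < stop; ++i) {
        printf("iteration %zu ran on worker %u\n",
               i, (unsigned)qthread_worker(NULL));
    }
}

int main(void) {
    qthread_initialize();
    qt_loop(0, qthread_num_workers(), report, NULL);
    return 0;
}
```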
Thank you,
Are you sure that qt_loop spreads out the work across all workers? I obtained the output quoted below. The number after T is the hardware thread, as reported by qthread_worker(0); several iterations ran on the same hardware thread 4, while none ran e.g. on hardware thread 1.
Ahhh, I see what you mean. I'm guessing you're using the Sherwood scheduler and aren't forcing a shepherd per core. Qthreads only binds thread (task) mobility to the shepherd, so if your shepherd isn't limited to hardware threads (Sherwood defaults it to a L2 cache domain), then you're absolutely right. Hmmm. I guess having a tool to explicitly make a per-hw-thread callback would be handy in some cases. Until one exists, use the environment variables to limit the shepherd boundaries. I forget the exact variable name, but it's something like QT_SHEPHERD_BOUNDARY=pu
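One quick way to check whether the shepherd boundary took effect is to compare the shepherd and worker counts at runtime; with a per-PU boundary the two should be equal. A minimal sketch using qthread_num_shepherds and qthread_num_workers:

```c
/* Sketch: report how shepherds and workers were laid out. When each shepherd
 * is limited to a single PU, the shepherd count equals the worker count. */
#include <stdio.h>
#include <qthread/qthread.h>

int main(void) {
    qthread_initialize();
    printf("%u shepherds, %u workers total\n",
           (unsigned)qthread_num_shepherds(),
           (unsigned)qthread_num_workers());
    return 0;
}
```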
On Mar 28, 2017, at 12:51 PM, Erik Schnetter wrote:
Are you sure that qt_loop spreads out the work across all workers? I obtained this output:
$ env FUNHPC_NUM_NODES=1 FUNHPC_NUM_PROCS=1 FUNHPC_NUM_THREADS=8 ./hello
FunHPC: Using 1 nodes, 1 processes per node, 8 threads per process
FunHPC[0]: N0 L0 P0 (S0) T5 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T6 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T7 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T0 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
The number after T is the hardware thread, as reported by qthread_worker(0). As you see, several iterations ran on the same hardware thread 4, while none ran e.g. on hardware thread 1.
What performance implications does it have to use fewer shepherds? If tasks don't move, how do the shepherds pick up work?
The performance implications of fiddling with the shepherd/worker balance are somewhat app-specific. Generally, reducing the shepherd boundary to a PU (and thus increasing the shepherd count) turns the scheduler into a pure work-stealing model; how that affects your performance depends on things like cache affinity between adjacent tasks. On the other hand, increasing the shepherd boundary (e.g. to a socket, thus decreasing the shepherd count) lets inter-task cache affinity get closer to what you would see in a serial execution. (This is a fairly deep question, and I can point you to an academic paper if you really want to dig into it.)
On Mar 28, 2017, at 9:51 PM, Erik Schnetter wrote:
What performance implications does it have to use fewer shepherds? If tasks don't move, how do the shepherds pick up work?
@eschnett was this answer sufficient?
@eschnett This actually dovetails with some work I'm doing here. I'll see if I can fix the problem. Can you give me sample code, along with examples of the expected and actual behavior?
The issue with qt_loop is that it does not guarantee that every hardware thread runs the function exactly once; as the output above shows, some hardware threads ran several iterations while others ran none. As example code, I would call hwloc and output the hardware core id for each thread.
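Roughly the kind of sample code meant here (my sketch, not Erik's actual code): hwloc's hwloc_get_last_cpu_location reports which PU each iteration is running on, and qt_loop fans the calls out over the workers.

```c
/* Sketch: for each loop iteration, print the Qthreads worker id and the
 * hardware PU the underlying OS thread is currently running on (via hwloc).
 * Error handling is omitted for brevity. */
#include <stdio.h>
#include <hwloc.h>
#include <qthread/qthread.h>
#include <qthread/qloop.h>

static hwloc_topology_t topo;

static void where_am_i(size_t start, size_t stop, void *arg) {
    (void)arg;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    for (size_t i = start; i < stop; ++i) {
        hwloc_get_last_cpu_location(topo, set, HWLOC_CPUBIND_THREAD);
        printf("iteration %zu: worker %u on PU %d\n",
               i, (unsigned)qthread_worker(NULL), hwloc_bitmap_first(set));
    }
    hwloc_bitmap_free(set);
}

int main(void) {
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    qthread_initialize();
    qt_loop(0, qthread_num_workers(), where_am_i, NULL);
    hwloc_topology_destroy(topo);
    return 0;
}
```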
Have you looked at the CPU binding options at all?
Yes, I've looked at Qthreads' CPU binding support. The issue is that I might run multiple MPI processes per node, which means that different processes need to use different sets of cores. Setting environment variables to different values for different MPI processes is difficult. An ideal solution would be if Qthreads had a way to pass in the node-local MPI rank and size.
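For what it's worth, the node-local rank and size that would need to be passed in can be computed portably with MPI-3's MPI_Comm_split_type. A minimal sketch; the binding policy in the comment is just one possible choice:

```c
/* Sketch: compute the node-local MPI rank and size, i.e. the information
 * Qthreads would need in order to carve up the node's cores between the
 * processes sharing that node. Requires MPI-3. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int local_rank, local_size;
    MPI_Comm_rank(nodecomm, &local_rank);
    MPI_Comm_size(nodecomm, &local_size);
    printf("node-local rank %d of %d\n", local_rank, local_size);
    /* e.g. bind this process to cores
       [local_rank * cores_per_node / local_size,
        (local_rank + 1) * cores_per_node / local_size)
       before initializing Qthreads. */

    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
```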
What if MPI used Qthreads?
@npe9 In what sense would/could MPI use Qthreads?
Imagine if MPI's underlying threading runtime (for progress and computation threads) were actually Qthreads. Then if you're using MPI and Qthreads together, they just "work". This space has been mined before (cf. http://dl.acm.org/citation.cfm?id=2712388). I can help you get Mpiq up if you want to play with it.
We'd also be interested in a mechanism to call something on each worker thread in order to modify some thread state. For ARM-based Macs, we are interested in setting quality-of-service flags to limit which cores the threads can run on. For more traditional configurations, we are interested in dynamically unpinning/pinning the threads to avoid interfering with other parallel runtimes (most commonly, a user wants to call out to an OpenMP-optimized library and we want a way to get our threads out of the way).
For certain low-level tasks, it is necessary to call a function exactly once on each hardware thread, sometimes even concurrently. For example, I might want to check that the hardware threads' CPU bindings are correctly set up by calling the respective hwloc function, or I might want to initialize PAPI on each thread.
(Why do I suspect problems with CPU bindings? Because I used both OpenMP and Qthreads in an application, and I didn't realize that both set the CPU binding for the main thread, but they do so differently, leading to conflicts and a 50% performance loss even when OpenMP is not used and all the OpenMP threads are sleeping in the OS. These kinds of issues are much easier to debug if one has access to certain low-level primitives in Qthreads.)
I currently have a workaround: I start many threads that busy-loop for a certain amount of time, and this often succeeds in touching every hardware thread. However, a direct implementation behind an official API would be more convenient.
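A rough illustration of that busy-loop workaround (my sketch under the plain qthread_fork/qthread_readFF API, not the actual FunHPC code): fork many tasks that each spin for a short while and record which worker they ran on, in the hope that every worker picks up at least one of them.

```c
/* Sketch of the busy-loop workaround described above: fork many tasks that
 * spin briefly and record which worker they ran on. With enough tasks
 * spinning simultaneously, every worker is likely (but not guaranteed)
 * to run at least one of them. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <qthread/qthread.h>

#define NTASKS 64

static int *seen;  /* one flag per worker */

static aligned_t spin_and_record(void *arg) {
    (void)arg;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {  /* busy-loop for roughly 10 ms so the tasks overlap in time */
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec) < 1e7);
    seen[qthread_worker(NULL)] = 1;
    return 0;
}

int main(void) {
    qthread_initialize();
    int nworkers = (int)qthread_num_workers();
    seen = calloc((size_t)nworkers, sizeof *seen);

    aligned_t rets[NTASKS];
    for (int i = 0; i < NTASKS; ++i)
        qthread_fork(spin_and_record, NULL, &rets[i]);
    for (int i = 0; i < NTASKS; ++i)
        qthread_readFF(NULL, &rets[i]);  /* wait for each task to finish */

    for (int i = 0; i < nworkers; ++i)
        printf("worker %d: %s\n", i, seen[i] ? "visited" : "missed");
    free(seen);
    return 0;
}
```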