Call a function on each hardware thread #52
Comments
While it's not obvious or documented, you can use qt_loop. Is that good enough for your purposes?
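For concreteness, here is a minimal sketch of that suggestion (my reading of the intended usage, not code from the thread): qt_loop with one iteration per worker, printing which worker ran each iteration.

```c
/* Sketch only: assumes the qt_loop interface from qthread/qloop.h.
 * Runs one loop iteration per worker and reports which worker picked it up;
 * note that nothing here guarantees a distinct worker per iteration. */
#include <stdio.h>
#include <qthread/qthread.h>
#include <qthread/qloop.h>

static void report(size_t start, size_t stop, void *arg) {
    (void)arg;
    for (size_t i = start; i < stop; ++i) {
        printf("iteration %zu ran on worker %u\n",
               i, (unsigned)qthread_worker(NULL));
    }
}

int main(void) {
    qthread_initialize();
    qt_loop(0, qthread_num_workers(), report, NULL);
    return 0;
}
```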
Thank you,
Are you sure that qt_loop spreads out the work across all workers? I obtained the output quoted below. The number after T is the hardware thread, as reported by qthread_worker(0); several iterations ran on the same hardware thread 4, while none ran e.g. on hardware thread 1.
Ahhh, I see what you mean. I'm guessing you're using the Sherwood scheduler and aren't forcing a shepherd per core. Qthreads only binds thread (task) mobility to the shepherd, so if your shepherd isn't limited to hardware threads (Sherwood defaults it to a L2 cache domain), then you're absolutely right. Hmmm. I guess having a tool to explicitly make a per-hw-thread callback would be handy in some cases. Until one exists, use the environment variables to limit the shepherd boundaries. I forget the exact variable name, but it's something like QT_SHEPHERD_BOUNDARY=pu
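One quick way to check whether the shepherd boundary took effect is to compare the shepherd and worker counts at runtime; with a per-PU boundary the two should be equal. A minimal sketch using qthread_num_shepherds and qthread_num_workers:

```c
/* Sketch: report how shepherds and workers were laid out. When each shepherd
 * is limited to a single PU, the shepherd count equals the worker count. */
#include <stdio.h>
#include <qthread/qthread.h>

int main(void) {
    qthread_initialize();
    printf("%u shepherds, %u workers total\n",
           (unsigned)qthread_num_shepherds(),
           (unsigned)qthread_num_workers());
    return 0;
}
```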
On Mar 28, 2017, at 12:51 PM, Erik Schnetter wrote:
Are you sure that qt_loop spreads out the work across all workers? I obtained this output:
$ env FUNHPC_NUM_NODES=1 FUNHPC_NUM_PROCS=1 FUNHPC_NUM_THREADS=8 ./hello
FunHPC: Using 1 nodes, 1 processes per node, 8 threads per process
FunHPC[0]: N0 L0 P0 (S0) T5 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T6 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T7 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T0 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
The number after T is the hardware thread, as reported by qthread_worker(0). As you see, several iterations ran on the same hardware thread 4, while none ran e.g. on hardware thread 1.
What performance implications does it have to use fewer shepherds? If tasks don't move, how do the shepherds pick up work?
The performance implications of fiddling with the shepherd/worker balance are somewhat app-specific. Generally, reducing the shepherd boundary to a PU (and thus increasing the shepherd count) turns the scheduler into a pure work-stealing model; how that affects your performance depends on things like cache affinity between adjacent tasks. On the other hand, increasing the shepherd boundary (e.g. to a socket, thus decreasing the shepherd count) lets inter-task cache affinity get closer to what you would see in a serial execution. (This is a fairly deep question, and I can point you to an academic paper if you really want to dig into it.)
On Mar 28, 2017, at 9:51 PM, Erik Schnetter wrote:
What performance implications does it have to use fewer shepherds? If tasks don't move, how do the shepherds pick up work?
@eschnett was this answer sufficient?
@eschnett This actually dovetails with some work I'm doing here. I'll see if I can fix the problem. Can you give me sample code, along with examples of the expected and actual behavior?
The issue with qt_loop is that it does not guarantee that every hardware thread runs the function exactly once; as the output above shows, some hardware threads ran several iterations while others ran none. As example code, I would call hwloc and output the hardware core id for each thread.
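Roughly the kind of sample code meant here (my sketch, not Erik's actual code): hwloc's hwloc_get_last_cpu_location reports which PU each iteration is running on, and qt_loop fans the calls out over the workers.

```c
/* Sketch: for each loop iteration, print the Qthreads worker id and the
 * hardware PU the underlying OS thread is currently running on (via hwloc).
 * Error handling is omitted for brevity. */
#include <stdio.h>
#include <hwloc.h>
#include <qthread/qthread.h>
#include <qthread/qloop.h>

static hwloc_topology_t topo;

static void where_am_i(size_t start, size_t stop, void *arg) {
    (void)arg;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    for (size_t i = start; i < stop; ++i) {
        hwloc_get_last_cpu_location(topo, set, HWLOC_CPUBIND_THREAD);
        printf("iteration %zu: worker %u on PU %d\n",
               i, (unsigned)qthread_worker(NULL), hwloc_bitmap_first(set));
    }
    hwloc_bitmap_free(set);
}

int main(void) {
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    qthread_initialize();
    qt_loop(0, qthread_num_workers(), where_am_i, NULL);
    hwloc_topology_destroy(topo);
    return 0;
}
```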
Have you looked at the CPU binding options at all?
Yes, I've looked at Qthreads' CPU binding support. The issue is that I might run multiple MPI processes per node, which means that different processes need to use different sets of cores. Setting environment variables to different values for different MPI processes is difficult. An ideal solution would be if Qthreads had a way to pass in the node-local MPI rank and size.
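For what it's worth, the node-local rank and size that would need to be passed in can be computed portably with MPI-3's MPI_Comm_split_type. A minimal sketch; the binding policy in the comment is just one possible choice:

```c
/* Sketch: compute the node-local MPI rank and size, i.e. the information
 * Qthreads would need in order to carve up the node's cores between the
 * processes sharing that node. Requires MPI-3. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int local_rank, local_size;
    MPI_Comm_rank(nodecomm, &local_rank);
    MPI_Comm_size(nodecomm, &local_size);
    printf("node-local rank %d of %d\n", local_rank, local_size);
    /* e.g. bind this process to cores
       [local_rank * cores_per_node / local_size,
        (local_rank + 1) * cores_per_node / local_size)
       before initializing Qthreads. */

    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
```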
What if MPI used Qthreads?
@npe9 In what sense would/could MPI use Qthreads?
Imagine if MPI's underlying threading runtime (for progress and computation threads) were actually Qthreads. Then if you're using MPI and Qthreads together, they just "work". This space has been mined before (cf. http://dl.acm.org/citation.cfm?id=2712388). I can help you get Mpiq up if you want to play with it.
We'd also be interested in a mechanism to call something on each worker thread in order to modify some thread state. For ARM-based Macs, we are interested in setting quality-of-service flags to limit which cores the threads can run on. For more traditional configurations, we are interested in dynamically unpinning/pinning the threads to avoid interfering with other parallel runtimes (most commonly, a user wants to call out to an OpenMP-optimized library and we want a way to get our threads out of the way).
For certain low-level tasks, it is necessary to call a function exactly once on each hardware thread, sometimes even concurrently. For example, I might want to check that the hardware threads' CPU bindings are correctly set up by calling the respective hwloc function, or I might want to initialize PAPI on each thread.
(Why do I suspect problems with CPU bindings? Because I used both OpenMP and Qthreads in an application, and I didn't realize that both set the CPU binding for the main thread, but they do so differently, leading to conflicts and a 50% performance loss even when OpenMP is not used and all the OpenMP threads are sleeping in the OS. These kinds of issues are much easier to debug if one has access to certain low-level primitives in Qthreads.)
I currently have a workaround: I start many threads that busy-loop for a certain amount of time, and this often succeeds in touching every hardware thread. However, a direct implementation behind an official API would be more convenient.
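A rough illustration of that busy-loop workaround (my sketch under the plain qthread_fork/qthread_readFF API, not the actual FunHPC code): fork many tasks that each spin for a short while and record which worker they ran on, in the hope that every worker picks up at least one of them.

```c
/* Sketch of the busy-loop workaround described above: fork many tasks that
 * spin briefly and record which worker they ran on. With enough tasks
 * spinning simultaneously, every worker is likely (but not guaranteed)
 * to run at least one of them. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <qthread/qthread.h>

#define NTASKS 64

static int *seen;  /* one flag per worker */

static aligned_t spin_and_record(void *arg) {
    (void)arg;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {  /* busy-loop for roughly 10 ms so the tasks overlap in time */
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec) < 1e7);
    seen[qthread_worker(NULL)] = 1;
    return 0;
}

int main(void) {
    qthread_initialize();
    int nworkers = (int)qthread_num_workers();
    seen = calloc((size_t)nworkers, sizeof *seen);

    aligned_t rets[NTASKS];
    for (int i = 0; i < NTASKS; ++i)
        qthread_fork(spin_and_record, NULL, &rets[i]);
    for (int i = 0; i < NTASKS; ++i)
        qthread_readFF(NULL, &rets[i]);  /* wait for each task to finish */

    for (int i = 0; i < nworkers; ++i)
        printf("worker %d: %s\n", i, seen[i] ? "visited" : "missed");
    free(seen);
    return 0;
}
```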