-
Notifications
You must be signed in to change notification settings - Fork 36
/
Copy pathNEWS
598 lines (500 loc) · 20.8 KB
/
NEWS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
--- 1.21 ---
Improvements:
- Support and testing for clang 19 and gcc 14
- Partial support for thread sanitizer
- Dramatically simplified build system
- Removed numerous unused build configs
- Fixed a minor performance regression from the last release
- Removed hand-imlemented atomics in favor of C11 atomics and existing intrinsics
- Removed unmaintained support for eurekas
- Removed unmaintained support for distributed Qthreads
- Removed various unmaintained debugging and profiling code paths
- Dropped support for numerous compilers that no are no longer available
- Dropped support for various unmaintained architectures like Tilera, Itanium, Alpha, MIPS, and Sparc
- Dropped support for various OS configs that we don't test like Solaris, HPUX, and AIX
- Removed several unmaintained/untested threadqueue and topology detection options
--- 1.20 ---
Improvements:
- Improve and expand CI testing
- Fix compilation and testing for an expanded range of platforms, compilers, and operating systems
- Fix numerous sanitizer errors
- Remove some unused/outdated code
--- 1.19 ---
New Features:
- Add a mechanism to reset the default task spawn order
- Enable the use of HWLOC_GET_TOPOLOGY_FUNCTION
Improvements:
- Fix fast context swap on ARM64
- Mature support of ARM64 for Apple MacOS and Linux
- Improvements to testing
- Improvements to configure scripts
- Remove unused files
- Make version numbering consistent
Notes:
- Software repository migrated to new locaton
--- 1.18 ---
New Features:
- Add support for Apple Mac Mx hardware
- Add spinlock-based implementation of locking API
- Add support for recursive locks
Improvements:
- Fix synchronization primitives by adding correct fencing
- Use stdatomic fence acquire and release for >=C11
- Increase default stack size from 4kB to 16kB
- Fix hang in qt_blocking_subsystem_internal_stopwork on qthreads_finalize
- Enable QTHREAD_ARMV8_A64 context swap on Apple Mx hardware
- Add distributable package gen script
--- 1.17 ---
Improvements:
- Add GitHub actions for CI testing
- Add missing include files in the binders topology layer
- Handle SIGSEGV during configure testing to avoid user exposure
- Add --with-hwloc-symbol-prefix configure option
- Add errno preservation from system calls
- Fix compilation errors on M1 MAC computers
- Remove Loxley experimental schedulers
--- 1.16 ---
Improvements:
- Add support for Spack testing (spack test run qthreads)
- Add support for hwloc2.x (minimal required version is hwloc1.5)
Documentation:
- User guide for basic Qthreads usage (see userguide subdirectory for LaTex and included code source files)
--- 1.15 ---
New Features:
- Add experimental support for Thread Local Storage and new Condition variable initializers
- Non-blocking FEB functions are now exposed (e.g. qthread_readFF_nb), other runtimes (e.g. MPI) require access.
Improvements:
- Remove the deprecated register keyword
- Fix automake machinery so that "make dist" works as expected
--- 1.14 ---
New Features:
- Incorporate minor features requested by Chapel team
- Add experimental support for the MPI finepoints interface in test/benchmarks/finepoints
- Added a new thread feature flag for "network tasks", allowing schedulers to treat communication tasks differently
- Added a RUNTIME_DATA_SIZE flag to qthread_readstate (Courtesy of the Chapel team)
Improvements:
- Read the "no" topology (fix proposed by the Chapel team)
- Updated SINC implementations to use modern Qthreads APIs
- Removed deprecated ROSE support
- Removed deprecated RCRtool support
--- 1.13 ---
New Features:
- Performance Monitoring: use --enable-performance-monitoring
--- 1.12 ---
New Features:
- Pluggable allocators: use --with-alloc={base,chapel} to choose
Improvements:
- Do condwait/backoff in nemesis when getting threads to improve task spawning performance.
Bugfixes:
- Change sleep() and nanosleep() implementation to qt_sleep() and qt_nanosleep() to avoid confusing PMI.
--- 1.11 ---
New Features:
- Distib scheduler: lower contention numa-aware scheduling
- Binders affinity layer: more control over affinity
- Updated Portals4 runtime support
- Added missing XMT FEB operations
--- 1.10 ---
New Features:
- Task queue subsystem
- Task-level barrier for SPR
- Environment option to toggle use of guard pages
Improvements:
- Fixed loop constructs to use all workers
- Unified (and simplified) internal handling of multi-threaded vs single-threaded scheduling regimes
- Removed defunct SST support, to avoid confusing people
- Removed futurelib
- Cleans up affinity subsytem interface and hwloc implementation
- Added support for Chapel launcher.
- Added configuration option to enable eureka support; marked as _experimental_
- Added support for improved compatibility with DUMA tool
Bugfixes:
- Fix various memory leaks associated with spawncache, sincs, preconds, timers.
- Fix SPR put/get handle bug.
--- 1.9 ---
New Features:
- Eurekas within a team
- Manual spawn cache flushing (qthread_flushsc())
- Preliminary Chapel SPR support
- Unified task-barrier APIs
Improvements:
- Direct context swapping in "direct-yield", used in qt_loop()
- Better Chapel integration
- Cleaned up Chapel multinode compilation
- Detect and work around Apple compiler bugs
- Qtimer OMP compatibility
- Can reset a sinc with outstanding submits
- Compatibility with older libtool versions
- Reorganized code to make it easier for new developers to understand
- Better TLS handling
- Add the ability to query more runtime state (e.g. WORKER_OCCUPATION); fleshes out Chapel support
- Add HPCCG benchmarks (both pure and via ROSE)
- Add LULESH benchmark (via ROSE)
- Better handling of MPI runtime (for multinode)
- Support for MPICC and MPICXX environment variables (for multinode)
- Better topology handling
Bugfixes:
- Fix 32-bit fastcontext (broken in 1.8.1)
- Fix handling of QT_WORKER_UNIT and QT_SHEPHERD_BOUNDARY that are caches (hwloc-only)
- Properly initialize worker information when QT_NUM_SHEPHERDS has been specified (hwloc-only)
- OMP barrier nesting improved
- Correct detection of CMPXCHG16B
- A bunch of minor bugfixes
--- 1.8.1 ---
Improvements:
- Improved integration with Chapel build system
- Support for Chapel's blockreport debugging
- Cheaper context swap on x86(_64)
Bugfixes:
- Support for libnumaV2 fixed
- Build system avoids obsolete macros
- Tilera affinity support compiles again (was missing a #include)
--- 1.8 ---
New Features:
- Concurrent lock-free hash table (dictionary); three alternate designs
- New "simple" tasks: reduced context swap overhead, no stack size limit, but cannot block or yield
- qthread_spawn() function exposed to simplify complex task creation
- Arbitrary blocking functions now public (BE CAREFUL)
- Optional (experimental) lock-free hash table implementation of FEBs (very fast)
- ARM support
- Tasks can use a sinc as a return value location
- spr_init() can replace MPI_Init()
- spr_unify() converts SPMD to single-flow-of-control
- C++ version of qt_loopaccum_balance and SPR interface
- QT_HWPAR environment variable simplifies concurrency management
Improvements:
- LOTS of performance improvements
- Massive improvement to memory pooling
- Handle buggy versions of hwloc without crashing
- Terse multinode documentation
- Public memory pool now front-end for internal memory pool
- Scribbling in memory pool (debugging assist)
- Reduced memory footprint
- Additional (micro-)benchmarks, including Cilk, TBB, and OpenMP implementations
- Better checking of compile requirements
- Faster precondition handling
- Better Tilera support
- Configure-time define default stack size
- Task teams are now reliable
- Better PPC support (detect all 3 ABIs)
- Use compiler ("native") TLS when practical
- Better behavior when topology information is unavailable
- Use PMI runtime API for multinode
- Hazardptr implementation works under high load
- Matt Baker's Chapel syncvar idea
- Add configure-time "oversubscription" mode; avoids spinlocks, uses sched_yield() when necessary
- More man pages
Bugfixes:
- Fix distance support with hwloc
- Fix alignment error in C++ futurelib (old bug)
- Fix rare hang in Sherwood scheduler
- Fix pwrite() define (Issue #11)
- Improve floating-point tests
- Fix fincr/dincr synchronization bug (some compilers created two reads)
- Fix BOTS benchmark string handling
- Fix reinitialization race condition
- Many minor fixes
--- 1.7.1 ---
New Features:
- New experimental scheduler: loxley
- Experimental task team support
- Provide QTHREAD_VERSION in qthread.h
Improvements:
- Significant performance improvements in the scheduler
- Task-spawn caching
- More benchmarks in the tests tree
- Improved memory pool (still not perfect, but now more reliable)
- Lots of OpenMP support improvements
- Removed 'volatile' in favor of explicit memory fences (for speed)
- TilePro/TileGX support
- Experimental signal-based task termination
- Updates task ID support with distinct NULL and "not really a qthread" identifiers
Bugfixes:
- Fixed cross-compilation behavior
- Lots of minor fixes
--- 1.7 ---
New Features:
- Experimental multi-node support with Portals4
- Task-local data
- Preconditioned task spawning
- More control with hwloc: QTHREAD_WORKER_UNIT
- Several new schedulers: mutexfifo, mtsfifo, nemesis, nottingham
- Experimental sinc support (counting synchronization, proto-thread-teams)
Improvements:
- Default scheduler: sherwood
- Updated to new Chapel tasking API
- Works with Chapel multi-node
- Better Chapel behavior (e.g. bigger default stack)
- Faster qsort
- Even faster thread spawns
- Better syncvar behavior under contention
- qthread_num_workers() works independent of scheduler
- Better debug-output control
- Tree-based qloop spawning
- Better behavior in shepherd over-subscription case
- Use hash to improve qtimer_fastrand() output (still not cryptographically sound)
- Suppress "Forced X" messages unless QTHREAD_INFO is set
- Prevents errors if qthread_finalize() is called from the wrong thread
Bugfixes:
- Fix 64-bit increment detection on 32-bit architectures
- Fix valgrind building
- Fix PPC building
- Fix I/O subsystem queueing
- Many stability improvements
--- 1.6 ---
New Features:
- Brand new C++ qloop interface
- Supports multi-locale Chapel
- New scheduling queue options (faster FIFO, and new LIFO)
- Experimental support for RCRDaemon
- Experimental support for OpenMP affinity extensions
- Experimental support for system call intercepting
Improvements:
- Support for external threads using qthread blocking operations
- Renamed internal context functions, to improve dynamic library loading
- Chapel interface built as a secondary (static-only) library
- Reduced exported symbols
- Updated documentation
Bugfixes:
- Fixed critical fastcontext bug on 32-bit x86
- Fixed incorrect rounding in atomic operations on old compilers in 32-bit systems
- Eliminated the faulty test_internal_mod test
- Fixed qloop memory leak
- Fixed compiling on 32-bit SparcV9
--- 1.5.1 ---
New Features:
- Chapel shims
Improvements:
- Compiles under Cygwin
- Better OpenMP nested loop behavior (more efficient barriers)
- Use memory affinity interface in hwloc (if available)
- Better CPU affinity behavior for non-default shepherd counts
Bugfixes:
- Better behavior when void functions are used as syncvar-returning threads
- Pthread_id() now does not change between qthread_initialize() and qthread_finalize()
- Affinity of calling thread restored during qthread_finalize()
--- 1.5 ---
New Features:
- Multithreaded shepherds (non-default)
- Qtimer-based fast "random" function
- Syncvar_t incrF function
- New qthread_readstate() function, helps support Chapel
- Convenience functions for storing data in syncvar_t's
- Removed all library API reliance on qthread_t
Improvements:
- Speed up the qt_feb_barrier with syncvar_t's
- Speed up context switch on x86/ppc Linux with handwritten one
- Semi-non-native mode in SST
- Remove external dependency on C++ and/or cprops
- Stop using setrlimit unnecessarily on Linux
- Built-in cache-aware hash for speed and portability
- Syncvar_t initializers (static and dynamic)
- Faster thread creation and lower memory footprint
- MANY improvements to ROSE OpenMP code
- Faster syncvar_t performance under load
- Removed deprecated functions
Bugfixes:
- Fixed the race condition in the qt_feb_barrier
- Fixed C++ headers
- Match new SST layout
- Fixed syncvar_t bugs, race-conditions, corner-cases
- Better behavior with CPU subsets, workaround libnuma bugs
--- 1.4 ---
New Features:
- qarray_dist_like() to match one qarray to another
- qarray_set_shepof() to specify a qarray segment's location
- qarray_iter_loopaccum() to accumulate a function over a qarray
- qt_feb_barrier() to allow threads to engage in a basic barrier
- qtimer functions, for high-resolution low-overhead cross-platform timing
- qt_loop_queue* functions, for self-schedule loops
- syncvar_t datatype can support high-speed asynchronous FEB operations (but
cannot be treated as a generic integer type); with C++ wrapper
Improvements:
- Renamed DIST_REG_FIELDS and DIST_REG_STRIPES qarray distributions
- Added FIXED_FIELDS to qarray distributions
- Increased size of qthread_shepherd_id_t
- Fixed qthread_dincr() on ppc32
- Fixed qt_cas() on ppc64
- Better workaround for broken MacOS X swapcontext()
- Fixed logic in debug output to avoid floating point
- Fixed bug in lock profiling code (credit to David Evensky)
- Better detection of compiler options
- More ROSE support
- More valgrind support
- Work around libnuma brokenness
- Speed up FEB operations in single-thread case
- Reworked qt_loop* interface to support loop increments of >1
- Speed up qloop functions
- Support hwloc
- Improved support for parallel build
Bugfixes:
- Fixed the profiling code
- Fixed the race condition in some qt_loop functions.
- Properly handle heterogeneous node layouts
--- 1.3.1 ---
Improvements:
- Documentation
- Fix GCC 4.4 compiler error
- More ROSE support
- Better compiler detection
--- 1.3 ---
New Features:
- Optional guard pages, to assist in debugging
- No longer need to call qthread_finalize()
- Environment variable QTHREAD_NUM_SHEPHERDS to force a specific number of
shepherds
- Shepherds an be enabled/disabled at runtime; threads are forcibly evicted
from disabled shepherds
- Support for Tilera systems and Tilera locality libraries
- A primitive queue-based for-loop to handle imbalanced loop iterations, the qqloop
Improvements:
- Significant speed improvements (and full support) for systems without
hardware atomic operations
- Now prefer qthread_initialize(), which doesn't take arguments;
qthread_init() is deprecated (but still supported)
- Threads have second-order affinity, only established via explicit programmer
placement
- Better detection of insufficient compile environment
- Better detection of cacheline size
- Better detection of CPUs on MacOS X and Linux, via sysctl and sysconf
- Better compatibility with Intel compiler
- Better Sun Studio ASM support (still fighting ICE)
- Better interaction with libnuma to detect desired number of shepherds
- Added a more lightweight qsort() implementation
- Test codes use environment variables instead of arguments
- Compatibility with glibc's new qsort_r() function
- Use autoconf 2.64's C++ "restrict" tests, rather than the broken 2.59 ones
- Removed small qarray memory leak
- Standardized C++ detection of integer arguments
- Fixed error detection in qthread_fork_to()
- Additional documentation
- Restored ability to avoid C++ by using cprops library
- Can now be re-initialized safely
- No longer relies on PTHREAD_RECURSIVE
- Use clock_gettime() for timing, if available, instead of gettimeofday()
--- 1.2 ---
New Features:
- Distributed Data Structures: array (qarray), queue (qdqueue), memory pool (qpool)
- Lock-free Data Structures: queue (qlfqueue)
- CPU Affinity for shepherds/threads on most systems (not OSX 10.4)
- Machine topology information can be queried
- qthread_num_shepherds()
- qthread_distance()
- qthread_sorted_sheps()/_remote()
- qthread_init(0) creates 1 shepherd per location
- Portable CAS
- future_fork_to()
- qthread_cacheline() returns the machine's cache line size in bytes
Improvements:
- Portability improvements
- Several bugs in queue management fixed
- Workarounds for bad compiler management of volatile operations
- Avoid ABA problems in lock-free memory pools
- Better memory barriers on Sparc
- Better valgrind support
- Removed dependency on external cprops library by depending on C++ hash maps
--- 1.1.20090123 ---
New Features:
- Can now force 64-bit alignment
- Can now migrate threads between shepherds
- More atomic increments (floating point)
- Aligned_t qsort()
- C++ template wrappers to standard FEB functions
- Environment variable controls stack size
- SST support
- High-resolution timers for profiling
Improvements:
- Lock-free internal memory pooling
- Lock-free thread queueing
- Better Apple PPC64/PPC/x86/x86_64 support
- Better Sparc support
- Better architecture detection
- Documentation fixes
- Detects more makecontext() irregularities
- Better C++ portability/compatibility
- Better compatibility with unusual compilers
- Better test cases
- Uses compiler-builtin atomic operations when available
- Better 64-bit support
- Real serial mode
- Faster qt_loop_balance synchronization
- Fixed atomic increment volatility declaration
- Thread counting also checks hash stripe balance
--- 1.0 ---
New Features:
- Added error handling.
Improvements:
- Reorganized osx_compat stuff
- Released under BSD OSS license with Sandia's blessing!
- Fixed some portability issues with increment functions.
--- 0.8 ---
New Features:
- Added qloop functions to provide precisely balanced loop spawning. Still
relatively primitive, but the interface allows for bare-minimum overhead (in
terms of context-swaps and memory footprint) with more advanced scheduling.
- Configurable setrlimit and atomic increment use
Improvements:
- Atomic increments are more widely used throughout library.
- Added qutil documentation
- Improved futurelib documentation
- Added workaround for GCC's broken gcse
- Bugfixes and portability improvements on architectures/environments where
atomic increments are unsupported
- Minor improvements to qalloc (from Vitus)
- Eliminated memory imbalancing by reintroducing locks. (Basic testing shows
it does not dramatically increase overhead.)
--- 0.7 ---
New Features:
- qthread_incr() can be used for atomic increments
--- 0.6 ---
Improvements:
- Threads now use pthread thread-local (TSD) memory instead of doing lookups
into bottleneck data-structures
- Futurelib and qthreads are now more tightly coupled, which obviates some
bottleneck data-structures
- Removed locks when spawning qthreads and/or futures from a qthread (locks
are now only needed in special cases)
- qthread_lock() and qthread_unlock() are more parallel
- Better futurelib documentation (see README)
--- 0.5 ---
New Features:
- Threads can have return values, which obey FEB semantics
- qthread_stackleft() returns the number of bytes left in the stack (with some
inaccuracy)
Improvements:
- Functions are now anonymous, and there is no major distinction between
detached and undetached threads
- Removed qthread_join(); use qthread_readFF on a thread's return value
instead
- FutureLib uses behavior-templates rather than type-templates
Bugfixes:
- Corrected a race condition in the FEB handling that could lead to deadlock
--- 0.4 ---
New Features:
- man pages for all major functions
- added qthread_feb_status()
Improvements:
- changed the qthread_f prototype to be easier to use
--- 0.3 ---
New Features:
- added qthread_writeF(), which as the same arguments and effect as writeEF,
but does not block
- added qthread_prepare(), qthread_schedule(), and associated functions to
decouple thread creation from thread scheduling
Improvements:
- added information about compiling with the PGI compiler
- added support for "make check" to test most of the major functionality in
the library
- made qthread_fork() (and related functions) significantly faster if called
from within a qthread by making it possible to avoid using mutexes to
protect memory pools
- qthread_shep() may now take a NULL if you don't have a qthread_t handy
(qthread_shep(NULL) is faster than qthread_shep(qthread_self()))
Bug fixes:
- corrected the behavior of qthread_readFF() and readFE() (they were
dereferencing things too many times)
- qthread_unlock() will now function correctly if unlocking something that's
already unlocked (it could get into a deadlock situation before, because it
wasn't cleaning up after itself)
- fixed a typo in qalloc_dynmalloc() that could cause deadlock on some
architectures
- corrected memory pooling to eliminate assertion failures