[L0] Updated Driver In order lists check and required version #2491

Draft
wants to merge 2 commits into
base: main

@nrspruit (Contributor) commented Dec 19, 2024

  • Cleaned up the checks for driver in-order lists and moved the check
    to the platform.
  • Updated the required driver version to match the version that contains the fixes.
  • Fixed the type of the in-order flag for the sync immediate command list.

@nrspruit nrspruit requested review from a team as code owners December 19, 2024 17:18
@nrspruit nrspruit requested a review from hdelan December 19, 2024 17:18
@github-actions github-actions bot added the level-zero (L0 adapter specific issues) and command-buffer (Command Buffer feature addition/changes/specification) labels Dec 19, 2024
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from 6e4f1a9 to f072da3 Compare December 19, 2024 17:22
@nrspruit (Contributor Author)

This is not targeted at v0.11.x; it will be for v0.12.x.

@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from f072da3 to f0c556b Compare January 14, 2025 16:08
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from f0c556b to d13db51 Compare January 14, 2025 16:51
nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 15, 2025
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from d13db51 to 23a9979 Compare January 15, 2025 16:57
nrspruit added 4 commits to nrspruit/llvm that referenced this pull request Jan 15, 2025
@nrspruit (Contributor Author)

Some SYCL tests have suddenly started failing with this enabled. I am going to set this PR to "draft" until I can determine why they started to fail.

@nrspruit nrspruit marked this pull request as draft January 15, 2025 23:02
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from 23a9979 to 16e070f Compare January 24, 2025 00:08
nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 24, 2025
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch 3 times, most recently from 108d976 to a0da40d Compare January 28, 2025 17:46
nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 28, 2025
@nrspruit nrspruit marked this pull request as ready for review January 28, 2025 17:48

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13016551883

Compute Benchmarks level_zero run (--env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0 ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13020807896
Job status: success. Test status: success.

Summary

Total 38 benchmarks in mean.
Geomean 99.215%.
Improved 3 Regressed 7 (threshold 2.00%)

(the better result in each row is the one shown with six decimal places)

Performance change in benchmark groups

Relative perf in group memory (3): 100.529%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 219.402000 μs | 221.653 μs | 101.03% | 1.03% | . |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 252.469000 μs | 253.936 μs | 100.58% | 0.58% | . |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.841 μs | 5.840000 μs | 99.98% | -0.02% | . |

Relative perf in group api (2): 100.772%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.676000 μs | 1.694 μs | 101.07% | 1.07% | . |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.124000 μs | 2.134 μs | 100.47% | 0.47% | . |

Relative perf in group Velocity-Bench (1): 99.583%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| Velocity-Bench dl-mnist | 2.400 s | 2.390000 s | 99.58% | -0.42% | . |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 100.513%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2068.760000 ns | 2161.990 ns | 104.51% | 4.51% | +++ |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 290.580000 ns | 292.911 ns | 100.80% | 0.80% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3119.410 ns | 3110.730000 ns | 99.72% | -0.28% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2701.690 ns | 2624.960000 ns | 97.16% | -2.84% | -- |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 99.198%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 701.560000 ns | 706.810 ns | 100.75% | 0.75% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 194.197000 ns | 194.563 ns | 100.19% | 0.19% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 212.487 ns | 208.270000 ns | 98.02% | -1.98% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 274.260 ns | 268.430000 ns | 97.87% | -2.13% | - |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 99.186%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1918.170000 ns | 1990.550 ns | 103.77% | 3.77% | ++ |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3304.840 ns | 3265.110000 ns | 98.80% | -1.20% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 256.172 ns | 251.344000 ns | 98.12% | -1.88% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1478.380 ns | 1422.410000 ns | 96.21% | -3.79% | -- |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 95.384%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 206.016 ns | 205.968000 ns | 99.98% | -0.02% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 191.082 ns | 189.983000 ns | 99.42% | -0.58% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 310.891 ns | 306.784000 ns | 98.68% | -1.32% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 886.996 ns | 748.528000 ns | 84.39% | -15.61% | ---------- |

Relative perf in group alloc/min (4): 100.437%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 946.629000 ns | 965.220 ns | 101.96% | 1.96% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1039.820000 ns | 1043.990 ns | 100.40% | 0.40% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 800.897 ns | 800.188000 ns | 99.91% | -0.09% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 177.011 ns | 176.103000 ns | 99.49% | -0.51% | . |

Relative perf in group multiple (12): 99.084%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 32242.400000 ns | 33998.100 ns | 105.45% | 5.45% | +++ |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 161596.000000 ns | 164458.000 ns | 101.77% | 1.77% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 14800.100000 ns | 14852.400 ns | 100.35% | 0.35% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25668.400000 ns | 25689.000 ns | 100.08% | 0.08% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4258.360 ns | 4247.790000 ns | 99.75% | -0.25% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 140987.000 ns | 139811.000000 ns | 99.17% | -0.83% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 74516.100 ns | 73718.800000 ns | 98.93% | -1.07% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 42679.000 ns | 42068.500000 ns | 98.57% | -1.43% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1192480.000 ns | 1169360.000000 ns | 98.06% | -1.94% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1197700.000 ns | 1157480.000000 ns | 96.64% | -3.36% | -- |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 32646.200 ns | 31315.500000 ns | 95.92% | -4.08% | --- |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 148020.000 ns | 140253.000000 ns | 94.75% | -5.25% | --- |
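
The relative-performance figures above look consistent with "baseline time divided by this PR's time", with each group summarized by a geometric mean. The report does not state the formula, so this is an assumption; the sketch below checks it against the three rows of the memory group. The dictionary keys and function names are illustrative and not taken from the benchmark scripts.

```python
from math import prod

# (this PR, baseline) timings in microseconds, copied from the memory group above.
memory_group = {
    "QueueInOrderMemcpy Host->Device": (219.402, 221.653),
    "QueueInOrderMemcpy Device->Device": (252.469, 253.936),
    "QueueMemcpy Device->Device": (5.841, 5.840),
}


def relative_perf(pr_time, baseline_time):
    """Assumed formula: baseline / PR, so a value above 1.0 means the PR is faster."""
    return baseline_time / pr_time


def geomean(values):
    """Geometric mean of a sequence of positive ratios."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


ratios = {name: relative_perf(pr, base) for name, (pr, base) in memory_group.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%} (change {ratio - 1:+.2%})")
print(f"group geomean: {geomean(ratios.values()):.3%}")
```

Under that assumption the script yields 101.03%, 100.58% and 99.98% for the three rows and 100.529% for the group, matching the reported values.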

Details

Benchmark details - environment, command...
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Velocity-Bench dl-mnist

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0
NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv
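
Each entry above pairs a set of environment variables with a command line. As a rough illustration of reproducing one entry locally, the sketch below builds the argument vector and environment for the QueueMemcpy entry without running anything. The helper name `build_invocation` is hypothetical, and the binary path is the CI runner's path, which will differ on other machines.

```python
import os
import shlex


def build_invocation(command, env_overrides, base_env=None):
    """Return (argv, env) for one benchmark entry; nothing is executed here."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(env_overrides)  # overrides win over inherited values
    return shlex.split(command), env


argv, env = build_invocation(
    "/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl "
    "--test=QueueMemcpy --csv --noHeaders --iterations=10000 "
    "--sourcePlacement=Device --destinationPlacement=Device --size=1024",
    {"UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS": "0"},
)

print(argv[0])
print(env["UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS"])
```

On a machine where the binary exists, the pair can be passed to `subprocess.run(argv, env=env, capture_output=True, text=True, check=True)`.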

Compute Benchmarks level_zero run (with params: --iterations 10):
https://github.com/oneapi-src/unified-runtime/actions/runs/13021025779

Compute Benchmarks level_zero run (--iterations 10):
https://github.com/oneapi-src/unified-runtime/actions/runs/13021025779
Job status: success. Test status: success.

Summary

Total 38 benchmarks in mean.
Geomean 99.004%.
Improved 5 Regressed 12 (threshold 2.00%)

(the better result in each row is the one shown with six decimal places)

Performance change in benchmark groups

Relative perf in group memory (3): 99.694%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 219.613000 μs | 221.653 μs | 100.93% | 0.93% | . |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.875 μs | 5.840000 μs | 99.40% | -0.60% | . |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 257.119 μs | 253.936000 μs | 98.76% | -1.24% | . |

Relative perf in group api (2): 99.941%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.129000 μs | 2.134 μs | 100.23% | 0.23% | . |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.700 μs | 1.694000 μs | 99.65% | -0.35% | . |

Relative perf in group Velocity-Bench (1): 100.000%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| Velocity-Bench dl-mnist | 2.390000 s | 2.390 s | 100.00% | 0.00% | . |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 98.848%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2090.440000 ns | 2161.990 ns | 103.42% | 3.42% | ++ |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 299.499 ns | 292.911000 ns | 97.80% | -2.20% | - |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3191.670 ns | 3110.730000 ns | 97.46% | -2.54% | - |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2710.490 ns | 2624.960000 ns | 96.84% | -3.16% | -- |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 99.123%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 705.154000 ns | 706.810 ns | 100.23% | 0.23% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 194.123000 ns | 194.563 ns | 100.23% | 0.23% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 270.546 ns | 268.430000 ns | 99.22% | -0.78% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 215.041 ns | 208.270000 ns | 96.85% | -3.15% | -- |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 102.129%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1242.110000 ns | 1422.410 ns | 114.52% | 14.52% | +++++++ |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1913.590000 ns | 1990.550 ns | 104.02% | 4.02% | ++ |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 261.884 ns | 251.344000 ns | 95.98% | -4.02% | -- |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3431.250 ns | 3265.110000 ns | 95.16% | -4.84% | -- |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 94.448%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 205.508000 ns | 205.968 ns | 100.22% | 0.22% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 307.908 ns | 306.784000 ns | 99.63% | -0.37% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 191.088 ns | 189.983000 ns | 99.42% | -0.58% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 933.896 ns | 748.528000 ns | 80.15% | -19.85% | ---------- |

Relative perf in group alloc/min (4): 98.987%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1016.030000 ns | 1043.990 ns | 102.75% | 2.75% | + |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 959.777000 ns | 965.220 ns | 100.57% | 0.57% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 178.788 ns | 176.103000 ns | 98.50% | -1.50% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 848.298 ns | 800.188000 ns | 94.33% | -5.67% | --- |

Relative perf in group multiple (12): 99.141%

| Benchmark | This PR | baseline | Relative perf | Change | - |
| --- | --- | --- | --- | --- | --- |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 30229.700000 ns | 31315.500 ns | 103.59% | 3.59% | ++ |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 163357.000000 ns | 164458.000 ns | 100.67% | 0.67% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25579.800000 ns | 25689.000 ns | 100.43% | 0.43% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 34018.900 ns | 33998.100000 ns | 99.94% | -0.06% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 140836.000 ns | 139811.000000 ns | 99.27% | -0.73% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 42387.800 ns | 42068.500000 ns | 99.25% | -0.75% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4283.980 ns | 4247.790000 ns | 99.16% | -0.84% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1180170.000 ns | 1157480.000000 ns | 98.08% | -1.92% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15187.500 ns | 14852.400000 ns | 97.79% | -2.21% | - |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 75417.400 ns | 73718.800000 ns | 97.75% | -2.25% | - |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1203720.000 ns | 1169360.000000 ns | 97.15% | -2.85% | - |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 144864.000 ns | 140253.000000 ns | 96.82% | -3.18% | -- |
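
The "Improved 5 Regressed 12 (threshold 2.00%)" line in this run's summary can be cross-checked against the Change column above. The sketch below assumes a benchmark counts as improved when its change exceeds +2% and as regressed when it falls below -2%; the report does not spell out the rule, so that is an assumption, and the `classify` helper is illustrative only.

```python
from collections import Counter

THRESHOLD_PCT = 2.0  # "threshold 2.00%" from the summary

# "Change" column of the --iterations 10 run, in table order (percent).
changes_pct = [
    0.93, -0.60, -1.24,                     # memory
    0.23, -0.35,                            # api
    0.00,                                   # Velocity-Bench
    3.42, -2.20, -2.54, -3.16,              # alloc/size:10000/0/4096, threads:4
    0.23, 0.23, -0.78, -3.15,               # alloc/size:10000/0/4096, threads:1
    14.52, 4.02, -4.02, -4.84,              # alloc/size:10000/100000/4096, threads:4
    0.22, -0.37, -0.58, -19.85,             # alloc/size:10000/100000/4096, threads:1
    2.75, 0.57, -1.50, -5.67,               # alloc/min
    3.59, 0.67, 0.43, -0.06, -0.73, -0.75,  # multiple_malloc_free
    -0.84, -1.92, -2.21, -2.25, -2.85, -3.18,
]


def classify(change_pct, threshold=THRESHOLD_PCT):
    """Assumed rule: strictly beyond the threshold in either direction."""
    if change_pct > threshold:
        return "improved"
    if change_pct < -threshold:
        return "regressed"
    return "unchanged"


counts = Counter(classify(c) for c in changes_pct)
print(len(changes_pct), counts["improved"], counts["regressed"])
```

With the 38 values listed, this rule gives 5 improved and 12 regressed, in line with the summary.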

Details

Benchmark details - environment, command...
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

@pbalcer
Contributor

pbalcer commented Jan 29, 2025

Many of the benchmarks failed to run:

2025-01-28T23:08:39.8287006Z terminate called after throwing an instance of 'sycl::_V1::exception'
2025-01-28T23:08:39.8290660Z   what():  The program was built for 1 devices
2025-01-28T23:08:39.8294383Z Build program log for 'Intel(R) Data Center GPU Max 1100':
2025-01-28T23:08:39.9512534Z FAILED assertion EXPECT_UR_RESULT_SUCCESS(urKernelSetArgPointer(kernel, 0, nullptr, usm[i][j]))
2025-01-28T23:08:39.9512866Z 	value: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
2025-01-28T23:08:39.9513329Z 	Location: /home/pmdk/bench_workdir/compute-benchmarks-repo/source/benchmarks/multithread_benchmark/implementations/ur/memcpy_execute_interleaved.cpp:113
2025-01-28T23:08:39.9513712Z 

It looks like the new feature triggered a failure in the driver.

EDIT: I've restarted the system and updated the UMD (user-mode driver) to https://github.com/intel/compute-runtime/releases/tag/24.52.32224.8.

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13028914356

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13028914356
Job status: failure. Test status: failure.

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13034414623

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13034414623
Job status: failure. Test status: skipped.

@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from 24239e3 to c759860 Compare January 29, 2025 17:14
nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 29, 2025

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13050652218

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13050652218
Job status: success. Test status: success.

Summary

Total 146 benchmarks in mean.
Geomean 104.379%.
Improved 41 Regressed 20 (threshold 2.00%)

(result is better)

Performance change in benchmark groups

Relative perf in group api (12): 101.983%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel in order | 14.907000 μs | 16.785 μs | 112.60% | 12.60% | + |
| api_overhead_benchmark_sycl SubmitKernel in order | 22.673000 μs | 24.407 μs | 107.65% | 7.65% | + |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | 115983.000000 instr | 123166.000 instr | 106.19% | 6.19% | . |
| api_overhead_benchmark_sycl SubmitKernel out of order | 22.952000 μs | 23.506 μs | 102.41% | 2.41% | . |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | 107820.000000 instr | 110006.000 instr | 102.03% | 2.03% | . |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.139000 μs | 2.149 μs | 100.47% | 0.47% | . |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.813000 μs | 15.866 μs | 100.34% | 0.34% | . |
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | 104883.000 instr | 104663.000000 instr | 99.79% | -0.21% | . |
| api_overhead_benchmark_l0 SubmitKernel in order | 11.532 μs | 11.395000 μs | 98.81% | -1.19% | . |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | 21.785 μs | 21.495000 μs | 98.67% | -1.33% | . |
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.572 μs | 11.369000 μs | 98.25% | -1.75% | . |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.713 μs | 1.673000 μs | 97.66% | -2.34% | . |

Relative perf in group memory (4): 122.123%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 120.704000 μs | 219.832 μs | 182.12% | 82.12% | ++++++ |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 225.251000 μs | 252.914 μs | 112.28% | 12.28% | + |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.631000 μs | 5.900 μs | 104.78% | 4.78% | . |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.187000 GB/s | 3.070 GB/s | 103.81% | 3.81% | . |

Relative perf in group miscellaneous (1): 99.966%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| miscellaneous_benchmark_sycl VectorSum | 858.316 bw GB/s | 858.023000 bw GB/s | 99.97% | -0.03% | . |

Relative perf in group multithread (10): 140.490%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | 844.435000 μs | 2047.766 μs | 242.50% | 142.50% | ++++++++++ |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | 19833.135000 μs | 46811.855 μs | 236.03% | 136.03% | ++++++++++ |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | 826.301000 μs | 1199.669 μs | 145.19% | 45.19% | +++ |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | 19553.215000 μs | 27030.035 μs | 138.24% | 38.24% | +++ |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 6985.391000 μs | 8883.578 μs | 127.17% | 27.17% | ++ |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | 5695.653000 μs | 6896.127 μs | 121.08% | 21.08% | + |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | 6676.246000 μs | 7766.797 μs | 116.33% | 16.33% | + |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | 97661.135000 μs | 112408.658 μs | 115.10% | 15.10% | + |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 15245.222000 μs | 17165.065 μs | 112.59% | 12.59% | + |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | 37927.968000 μs | 42602.254 μs | 112.32% | 12.32% | + |

Relative perf in group graph (10): 123.687%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | 2453.321000 μs | 5621.320 μs | 229.13% | 129.13% | +++++++++ |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | 24771.753000 μs | 56454.921 μs | 227.90% | 127.90% | +++++++++ |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | 69362.686000 μs | 72583.103 μs | 104.64% | 4.64% | . |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 54.588000 μs | 55.253 μs | 101.22% | 1.22% | . |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 349518.135000 μs | 353086.695 μs | 101.02% | 1.02% | . |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | 676.543000 μs | 677.203 μs | 100.10% | 0.10% | . |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | 72113.761 μs | 71746.038000 μs | 99.49% | -0.51% | . |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 358934.461 μs | 353349.563000 μs | 98.44% | -1.56% | . |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | - | 62.493000 μs | | | |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | - | 5631.730000 μs | | | |

Relative perf in group Velocity-Bench (9): 100.207%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| Velocity-Bench Bitcracker | 35.203600 s | 38.359 s | 108.96% | 8.96% | + |
| Velocity-Bench Hashtable | 370.681652 M keys/sec | 363.340 M keys/sec | 102.02% | 2.02% | . |
| Velocity-Bench CudaSift | 200.893000 ms | 203.947 ms | 101.52% | 1.52% | . |
| Velocity-Bench QuickSilver | 118.230000 MMS/CTT | 116.460 MMS/CTT | 101.52% | 1.52% | . |
| Velocity-Bench Sobel Filter | 602.999000 ms | 603.076 ms | 100.01% | 0.01% | . |
| Velocity-Bench dl-cifar | 23.634 s | 23.630300 s | 99.98% | -0.02% | . |
| Velocity-Bench dl-mnist | 2.720 s | 2.710000 s | 99.63% | -0.37% | . |
| Velocity-Bench Easywave | 228.000 ms | 227.000000 ms | 99.56% | -0.44% | . |
| Velocity-Bench svm | 0.152 s | 0.135900 s | 89.64% | -10.36% | - |

Relative perf in group Runtime (8): 98.029%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor | 274.697000 ms | 276.461 ms | 100.64% | 0.64% | . |
| Runtime_IndependentDAGTaskThroughput_SingleTask | 258.398000 ms | 259.444 ms | 100.40% | 0.40% | . |
| Runtime_IndependentDAGTaskThroughput_BasicParallelFor | 275.886 ms | 274.274000 ms | 99.42% | -0.58% | . |
| Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor | 277.595 ms | 275.173000 ms | 99.13% | -0.87% | . |
| Runtime_DAGTaskThroughput_HierarchicalParallelFor | 1762.919 ms | 1710.439000 ms | 97.02% | -2.98% | . |
| Runtime_DAGTaskThroughput_NDRangeParallelFor | 1732.018 ms | 1673.462000 ms | 96.62% | -3.38% | . |
| Runtime_DAGTaskThroughput_SingleTask | 1720.056 ms | 1648.643000 ms | 95.85% | -4.15% | . |
| Runtime_DAGTaskThroughput_BasicParallelFor | 1788.297 ms | 1704.436000 ms | 95.31% | -4.69% | . |

Relative perf in group MicroBench (14): 100.343%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | 4.495000 ms | 4.909 ms | 109.21% | 9.21% | + |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | 4.683000 ms | 4.940 ms | 105.49% | 5.49% | . |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | 4.402000 ms | 4.585 ms | 104.16% | 4.16% | . |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | 4.617000 ms | 4.716 ms | 102.14% | 2.14% | . |
| MicroBench_LocalMem_fp32_4096 | 29.858000 ms | 29.902 ms | 100.15% | 0.15% | . |
| MicroBench_HostDeviceBandwidth_2D_D2H_Strided | 617.256 ms | 616.834000 ms | 99.93% | -0.07% | . |
| MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous | 617.890 ms | 617.437000 ms | 99.93% | -0.07% | . |
| MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous | 617.898 ms | 617.442000 ms | 99.93% | -0.07% | . |
| MicroBench_HostDeviceBandwidth_3D_D2H_Strided | 617.240 ms | 616.784000 ms | 99.93% | -0.07% | . |
| MicroBench_LocalMem_int32_4096 | 29.899 ms | 29.862000 ms | 99.88% | -0.12% | . |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | 4.407 ms | 4.376000 ms | 99.30% | -0.70% | . |
| MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous | 4.488 ms | 4.456000 ms | 99.29% | -0.71% | . |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | 4.518 ms | 4.276000 ms | 94.64% | -5.36% | . |
| MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous | 4.921 ms | 4.526000 ms | 91.97% | -8.03% | - |

Relative perf in group Pattern (10): 103.611%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| Pattern_Reduction_Hierarchical_int32 | 13.626000 ms | 16.339 ms | 119.91% | 19.91% | + |
| Pattern_Reduction_NDRange_int32 | 13.789000 ms | 16.339 ms | 118.49% | 18.49% | + |
| Pattern_SegmentedReduction_NDRange_fp32 | 2.163000 ms | 2.168 ms | 100.23% | 0.23% | . |
| Pattern_SegmentedReduction_NDRange_int64 | 2.335000 ms | 2.337 ms | 100.09% | 0.09% | . |
| Pattern_SegmentedReduction_NDRange_int32 | 2.164000 ms | 2.165 ms | 100.05% | 0.05% | . |
| Pattern_SegmentedReduction_NDRange_int16 | 2.264000 ms | 2.265 ms | 100.04% | 0.04% | . |
| Pattern_SegmentedReduction_Hierarchical_int64 | 11.780000 ms | 11.782 ms | 100.02% | 0.02% | . |
| Pattern_SegmentedReduction_Hierarchical_int32 | 11.588000 ms | 11.588 ms | 100.00% | 0.00% | . |
| Pattern_SegmentedReduction_Hierarchical_int16 | 11.800 ms | 11.796000 ms | 99.97% | -0.03% | . |
| Pattern_SegmentedReduction_Hierarchical_fp32 | 11.592 ms | 11.587000 ms | 99.96% | -0.04% | . |

Relative perf in group ScalarProduct (6): 99.900%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| ScalarProduct_Hierarchical_int32 | 10.525000 ms | 10.541 ms | 100.15% | 0.15% | . |
| ScalarProduct_Hierarchical_fp32 | 10.153000 ms | 10.167 ms | 100.14% | 0.14% | . |
| ScalarProduct_Hierarchical_int64 | 11.492 ms | 11.490000 ms | 99.98% | -0.02% | . |
| ScalarProduct_NDRange_fp32 | 3.754 ms | 3.749000 ms | 99.87% | -0.13% | . |
| ScalarProduct_NDRange_int32 | 3.777 ms | 3.765000 ms | 99.68% | -0.32% | . |
| ScalarProduct_NDRange_int64 | 5.448 ms | 5.425000 ms | 99.58% | -0.42% | . |

Relative perf in group USM (7): 100.489%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch | 1.810000 ms | 1.893 ms | 104.59% | 4.59% | . |
| USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch | 1.206000 ms | 1.258 ms | 104.31% | 4.31% | . |
| USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch | 1.046000 ms | 1.087 ms | 103.92% | 3.92% | . |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.679000 ms | 1.737 ms | 103.45% | 3.45% | . |
| USM_Allocation_latency_fp32_host | 37.723 ms | 37.623000 ms | 99.73% | -0.27% | . |
| USM_Allocation_latency_fp32_device | 0.068 ms | 0.065000 ms | 95.59% | -4.41% | . |
| USM_Allocation_latency_fp32_shared | 0.067 ms | 0.062000 ms | 92.54% | -7.46% | - |

Relative perf in group VectorAddition (3): 99.986%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| VectorAddition_int64 | 3.050000 ms | 3.088 ms | 101.25% | 1.25% | . |
| VectorAddition_fp32 | 1.482 ms | 1.480000 ms | 99.87% | -0.13% | . |
| VectorAddition_int32 | 1.494 ms | 1.477000 ms | 98.86% | -1.14% | . |

Relative perf in group Polybench (3): 99.525%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| Polybench_2mm | 1.040 ms | 1.039000 ms | 99.90% | -0.10% | . |
| Polybench_3mm | 1.482 ms | 1.477000 ms | 99.66% | -0.34% | . |
| Polybench_Atax | 6.466 ms | 6.402000 ms | 99.01% | -0.99% | . |

Relative perf in group Kmeans (1): 100.035%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| Kmeans_fp32 | 14.106000 ms | 14.111 ms | 100.04% | 0.04% | . |

Relative perf in group LinearRegressionCoeff (1): 102.147%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| LinearRegressionCoeff_fp32 | 863.378000 ms | 881.915 ms | 102.15% | 2.15% | . |

Relative perf in group MolecularDynamics (1): 103.448%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| MolecularDynamics | 0.029000 ms | 0.030 ms | 103.45% | 3.45% | . |

Relative perf in group llama.cpp (6): 100.822%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| llama.cpp Text Generation Batched 128 | 63.536107 token/s | 62.791 token/s | 101.19% | 1.19% | . |
| llama.cpp Text Generation Batched 512 | 63.489263 token/s | 62.789 token/s | 101.12% | 1.12% | . |
| llama.cpp Text Generation Batched 256 | 63.438981 token/s | 62.777 token/s | 101.05% | 1.05% | . |
| llama.cpp Prompt Processing Batched 256 | 884.944770 token/s | 878.291 token/s | 100.76% | 0.76% | . |
| llama.cpp Prompt Processing Batched 128 | 835.018224 token/s | 830.097 token/s | 100.59% | 0.59% | . |
| llama.cpp Prompt Processing Batched 512 | 436.704848 token/s | 435.724 token/s | 100.23% | 0.23% | . |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (5): 97.882%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2076.010000 ns | 2113.560 ns | 101.81% | 1.81% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy | 2733.570 ns | 2688.530000 ns | 98.35% | -1.65% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3180.660 ns | 3097.620000 ns | 97.39% | -2.61% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2536.280 ns | 2464.050000 ns | 97.15% | -2.85% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 303.386 ns | 287.722000 ns | 94.84% | -5.16% | . |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (5): 98.785%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 271.812000 ns | 272.237 ns | 100.16% | 0.16% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 210.890 ns | 208.759000 ns | 98.99% | -1.01% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 193.600 ns | 191.313000 ns | 98.82% | -1.18% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy | 717.514 ns | 705.635000 ns | 98.34% | -1.66% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 715.357 ns | 698.410000 ns | 97.63% | -2.37% | . |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (5): 98.840%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1804.180000 ns | 2038.360 ns | 112.98% | 12.98% | + |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3358.930 ns | 3338.690000 ns | 99.40% | -0.60% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 263.205 ns | 261.553000 ns | 99.37% | -0.63% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1361.300 ns | 1274.570000 ns | 93.63% | -6.37% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 1358.000 ns | 1226.080000 ns | 90.29% | -9.71% | - |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (5): 99.740%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 304.225000 ns | 310.903 ns | 102.20% | 2.20% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 195.845000 ns | 196.551 ns | 100.36% | 0.36% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 191.724 ns | 189.545000 ns | 98.86% | -1.14% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 715.548 ns | 706.907000 ns | 98.79% | -1.21% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy | 717.984 ns | 707.467000 ns | 98.54% | -1.46% | . |

Relative perf in group alloc/min (6): 97.723%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 947.697000 ns | 958.800 ns | 101.17% | 1.17% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 177.754 ns | 177.130000 ns | 99.65% | -0.35% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 840.056 ns | 832.725000 ns | 99.13% | -0.87% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 176.648 ns | 174.753000 ns | 98.93% | -1.07% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 813.400 ns | 797.092000 ns | 98.00% | -2.00% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1074.330 ns | 965.779000 ns | 89.90% | -10.10% | - |

Relative perf in group multiple (16): 100.751%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 14541.500000 ns | 16418.600 ns | 112.91% | 12.91% | + |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 40144.600000 ns | 41438.000 ns | 103.22% | 3.22% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 30082.600000 ns | 30910.300 ns | 102.75% | 2.75% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4235.480000 ns | 4283.690 ns | 101.14% | 1.14% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy | 29937.200000 ns | 30121.800 ns | 100.62% | 0.62% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25408.000000 ns | 25525.500 ns | 100.46% | 0.46% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 145903.000000 ns | 146423.000 ns | 100.36% | 0.36% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy | 4195.380000 ns | 4208.520 ns | 100.31% | 0.31% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 137977.000000 ns | 138360.000 ns | 100.28% | 0.28% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 75255.500000 ns | 75451.700 ns | 100.26% | 0.26% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy | 141210.000 ns | 140162.000000 ns | 99.26% | -0.74% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1184930.000 ns | 1174970.000000 ns | 99.16% | -0.84% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 164311.000 ns | 162279.000000 ns | 98.76% | -1.24% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy | 27946.400 ns | 27477.700000 ns | 98.32% | -1.68% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 33924.000 ns | 33153.600000 ns | 97.73% | -2.27% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1193200.000 ns | 1162100.000000 ns | 97.39% | -2.61% | . |
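
For readers who want to sanity-check the figures above, here is a minimal sketch of how the per-row and per-group numbers appear to be derived. It rests on assumptions inferred from the reported rows, not on the benchmark scripts themselves: relative perf is taken as baseline divided by this PR for lower-is-better metrics such as time (and the inverse for throughput), a group figure is the geometric mean of its rows, and the 2.00% threshold is applied to the change from 100%. The function names `relative_perf`, `geomean`, and `classify` are illustrative only.

```python
from math import prod


def relative_perf(pr: float, baseline: float, lower_is_better: bool = True) -> float:
    """Relative performance of this PR versus baseline, in percent.

    Assumption: for time-like metrics a smaller PR value is an improvement,
    so the ratio is baseline / pr; for throughput it is pr / baseline.
    """
    ratio = baseline / pr if lower_is_better else pr / baseline
    return 100.0 * ratio


def geomean(values) -> float:
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def classify(rel_perf_percent: float, threshold: float = 2.0) -> str:
    """Label a result using the change from 100% (assumed threshold rule)."""
    change = rel_perf_percent - 100.0
    if change > threshold:
        return "improved"
    if change < -threshold:
        return "regressed"
    return "unchanged"


# (This PR, baseline) times in ms, copied from the VectorAddition group.
vector_addition = {
    "VectorAddition_int64": (3.050, 3.088),
    "VectorAddition_fp32": (1.482, 1.480),
    "VectorAddition_int32": (1.494, 1.477),
}

perf = {name: relative_perf(pr, base) for name, (pr, base) in vector_addition.items()}
group_perf = geomean(perf.values())

for name, value in perf.items():
    print(f"{name}: {value:.2f}% ({classify(value)})")
print(f"group geomean: {group_perf:.3f}%")
```

Under these assumptions the sketch reproduces the 99.986% reported for the VectorAddition group, and none of its three rows crosses the 2.00% threshold.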

Details

Benchmark details - environment, command...
api_overhead_benchmark_l0 SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_l0 SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=StreamMemory --csv --noHeaders --iterations=10000 --type=Triad --size=10240 --memoryPlacement=Device --useEvents=0 --contents=Zeros --multiplier=1

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

miscellaneous_benchmark_sycl VectorSum

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=0 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=1 --NumOpsPerThread=4096 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=0 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=4 --NumOpsPerThread=4096 --iterations=10 --SrcUSM=0 --DstUSM=1

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=10 --withGraphs=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=10 --withGraphs=1

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=100 --withGraphs=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=100 --withGraphs=1

graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=0 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=1 --numKernels=100

graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=0 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=1 --numKernels=100

api_overhead_benchmark_ur SubmitKernel out of order CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=1 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order with measure completion

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=1 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Velocity-Bench Hashtable

Environment Variables:

Command:

/home/pmdk/bench_workdir/hashtable/hashtable_sycl --no-verify

Velocity-Bench Bitcracker

Environment Variables:

Command:

/home/pmdk/bench_workdir/bitcracker/bitcracker -f /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Velocity-Bench Easywave

Environment Variables:

Command:

/home/pmdk/bench_workdir/easywave/easyWave_sycl -grid /home/pmdk/bench_workdir/data/easywave/examples/e2Asean.grd -source /home/pmdk/bench_workdir/data/easywave/examples/BengkuluSept2007.flt -time 120

Velocity-Bench QuickSilver

Environment Variables:

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Velocity-Bench svm

Environment Variables:

Command:

/home/pmdk/bench_workdir/svm/svm_sycl /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Runtime_IndependentDAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_DAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_LocalMem_int32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

MicroBench_LocalMem_fp32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

Pattern_Reduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Pattern_Reduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

ScalarProduct_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

USM_Allocation_latency_fp32_device

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_host

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_shared

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

VectorAddition_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Polybench_2mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/2mm.csv --size=512

Polybench_3mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/3mm.csv --size=512

Polybench_Atax

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Atax.csv --size=8192

Kmeans_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Kmeans.csv --size=700000000

LinearRegressionCoeff_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/lin_reg_coeff --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LinearRegressionCoeff.csv --size=1638400000

MolecularDynamics

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/MolecularDynamics.csv --size=8196

llama.cpp Prompt Processing Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

@pbalcer
Contributor

pbalcer commented Jan 30, 2025

Perf looks good, but I had to use compute runtime from yesterday.

@nrspruit
Contributor Author

> Perf looks good, but I had to use compute runtime from yesterday.

Thanks. I am trying to determine why Graph/RecordReplay/barrier_multi_queue.cpp fails only in CI; I cannot reproduce the failure outside of CI, even with the same versions of the driver and compiler libraries.

For example, see:
https://github.com/intel/llvm/actions/runs/13020477272/job/36320344367?pr=16431

I think this may be failing due to a memory leak that I fixed in this PR, but why it is failing now is not clear.

Once I can ensure that is not an issue or resolve it, then this patch should be ready to go.

@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from c759860 to 03574ae Compare January 30, 2025 16:50
nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 30, 2025
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from 03574ae to e322cd7 Compare February 7, 2025 18:24
nrspruit added a commit to nrspruit/llvm that referenced this pull request Feb 7, 2025
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from e322cd7 to 4ebc29a Compare February 7, 2025 21:32
@github-actions github-actions bot added loader Loader related feature/bug conformance Conformance test suite issues. specification Changes or additions to the specification experimental Experimental feature additions/changes/specification cuda CUDA adapter specific issues hip HIP adapter specific issues opencl OpenCL adapter specific issues native-cpu Native CPU adapter specific issues labels Feb 7, 2025
- Cleaned up the checks for driver in order lists and migrated the check
  to platform.
- Updated version needed to match version with fixes.
- Fixed sync Immediate command List in order flag type.

Signed-off-by: Neil R. Spruit <[email protected]>
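
For readers unfamiliar with this kind of change, the idea in the commit message (gate driver in-order lists on a minimum driver version, and decide that once at the platform level) can be sketched roughly as follows. This is only an illustration in Python, not the adapter's actual code: the names `parse_driver_version`, `driver_supports_in_order_lists`, `Platform`, the `major.minor.build` version format, and the threshold `(1, 6, 0)` are all hypothetical assumptions, and the real minimum version is whatever the PR's source defines.

```python
from typing import Tuple

Version = Tuple[int, int, int]

# Hypothetical threshold for illustration only; the real required
# version is defined in the PR's source, not here.
MIN_IN_ORDER_DRIVER_VERSION: Version = (1, 6, 0)


def parse_driver_version(text: str) -> Version:
    """Parse an assumed 'major.minor.build' string into a tuple of ints."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 'major.minor.build', got {text!r}")
    major, minor, build = (int(p) for p in parts)
    return (major, minor, build)


def driver_supports_in_order_lists(
    driver_version: str,
    minimum: Version = MIN_IN_ORDER_DRIVER_VERSION,
) -> bool:
    """Return True if the driver is at least the minimum version.

    Tuples compare element by element, so (1, 6, 0) >= (1, 5, 99999).
    """
    return parse_driver_version(driver_version) >= minimum


class Platform:
    """Hypothetical platform object that evaluates the check once.

    Queues and command lists would read the cached flag instead of
    repeating the version comparison themselves.
    """

    def __init__(self, driver_version: str) -> None:
        self.driver_in_order_lists_supported = driver_supports_in_order_lists(
            driver_version
        )
```

Under these assumptions, `Platform("1.5.99999").driver_in_order_lists_supported` would be `False`, while a platform reporting `"1.6.0"` or newer would enable the feature.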
@nrspruit nrspruit force-pushed the enable_driver_in_order_compat_check branch from 4ebc29a to 1fd368a Compare February 8, 2025 00:13
Labels
command-buffer Command Buffer feature addition/changes/specification conformance Conformance test suite issues. cuda CUDA adapter specific issues experimental Experimental feature additions/changes/specification hip HIP adapter specific issues level-zero L0 adapter specific issues loader Loader related feature/bug native-cpu Native CPU adapter specific issues opencl OpenCL adapter specific issues specification Changes or additions to the specification

4 participants