[L0] Updated Driver In order lists check and required version #2491

nrspruit · 2024-12-19T17:18:16Z

Cleaned up the checks for driver in-order lists and migrated the check
to the platform.
Updated the required driver version to match the version that contains the fixes.
Fixed the in-order flag type for the sync immediate command list.
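
For readers unfamiliar with this kind of gate, here is a minimal sketch of a "driver version is at least the required version" check. It is written in Python for brevity rather than in the adapter's C++, and every name and version number in it is hypothetical; it does not reproduce the code or the minimum version set by this PR.

```python
from typing import Tuple

# Hypothetical layout: (major, minor, build).
Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Split a dotted 'major.minor.build' string into integers."""
    major, minor, build = (int(part) for part in text.split("."))
    return (major, minor, build)


def driver_at_least(reported: Version, required: Version) -> bool:
    """Tuples compare field by field, so this is a lexicographic check."""
    return reported >= required


# Placeholder value only; the real minimum is whatever the PR configures.
REQUIRED_FOR_IN_ORDER_LISTS = parse_version("1.2.300")


def use_driver_in_order_lists(reported_text: str) -> bool:
    """Enable the feature only when the reported driver is new enough."""
    return driver_at_least(parse_version(reported_text),
                           REQUIRED_FOR_IN_ORDER_LISTS)


print(use_driver_in_order_lists("1.2.299"))  # older build: feature stays off
print(use_driver_in_order_lists("1.3.0"))    # newer minor: feature enabled
```

Keeping the comparison in one place, as the move to the platform suggests, means every caller asks the same question instead of repeating the version arithmetic.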

nrspruit · 2024-12-19T17:27:56Z

Not targeted at v0.11.x; this will be for v0.12.x.

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

source/adapters/level_zero/common.hpp

nrspruit · 2025-01-15T23:02:40Z

Some SYCL tests have suddenly started failing with this enabled. I am going to set this PR to "draft" until I can determine why they started to fail.

github-actions · 2025-01-28T17:56:39Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13016551883

github-actions · 2025-01-28T22:58:56Z

Compute Benchmarks level_zero run (--env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=0 ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13020807896
Job status: success. Test status: success.

Summary

Total 38 benchmarks in mean.
Geomean 99.215%.
Improved 3 Regressed 7 (threshold 2.00%)

(result is better)
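
The "Improved 3 Regressed 7" line can be reproduced from the Change column of the tables in this report. The sketch below assumes a strict comparison against the 2.00% threshold; the function name is illustrative and not taken from the benchmark scripts.

```python
from collections import Counter

THRESHOLD_PCT = 2.00  # threshold quoted in the summary


def classify(change_pct: float, threshold: float = THRESHOLD_PCT) -> str:
    """Bucket one 'Change' value; strictness at the boundary is an assumption."""
    if change_pct > threshold:
        return "improved"
    if change_pct < -threshold:
        return "regressed"
    return "unchanged"


# 'Change' column of this report, group by group (38 rows in total).
changes = [
    1.03, 0.58, -0.02,                      # memory
    1.07, 0.47,                             # api
    -0.42,                                  # Velocity-Bench
    4.51, 0.80, -0.28, -2.84,               # alloc size:10000/0, threads:4
    0.75, 0.19, -1.98, -2.13,               # alloc size:10000/0, threads:1
    3.77, -1.20, -1.88, -3.79,              # alloc size:10000/100000, threads:4
    -0.02, -0.58, -1.32, -15.61,            # alloc size:10000/100000, threads:1
    1.96, 0.40, -0.09, -0.51,               # alloc/min
    5.45, 1.77, 0.35, 0.08, -0.25, -0.83,   # multiple
    -1.07, -1.43, -1.94, -3.36, -4.08, -5.25,
]

counts = Counter(classify(c) for c in changes)
print(len(changes), counts["improved"], counts["regressed"])
```

Note that the -1.98% row stays inside the threshold, which is why it carries no regression marker.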

Performance change in benchmark groups

Relative perf in group memory (3): 100.529%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	219.402000 μs	221.653 μs	101.03%	1.03%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	252.469000 μs	253.936 μs	100.58%	0.58%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.841 μs	5.840000 μs	99.98%	-0.02%	.
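
For these time-based rows the "Relative perf" column is consistent with baseline time divided by PR time, and the group figure with the geometric mean of those ratios. A minimal sketch using the three rows above (function names are illustrative, not from the benchmark scripts):

```python
import math


def relative_perf(pr_time: float, baseline_time: float) -> float:
    """For a lower-is-better metric, a ratio above 1.0 means the PR is faster."""
    return baseline_time / pr_time


def geomean(values):
    """Geometric mean computed via the mean of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# (This PR, baseline) in microseconds, copied from the memory group above.
memory_group = [
    (219.402, 221.653),  # QueueInOrderMemcpy from Host to Device
    (252.469, 253.936),  # QueueInOrderMemcpy from Device to Device
    (5.841, 5.840),      # QueueMemcpy from Device to Device
]

ratios = [relative_perf(pr, base) for pr, base in memory_group]
for ratio in ratios:
    print(f"{ratio * 100:.2f}%")
print(f"group geomean: {geomean(ratios) * 100:.3f}%")
```

This reproduces the 101.03%, 100.58% and 99.98% rows and the 100.529% group value.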

Relative perf in group api (2): 100.772%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.676000 μs	1.694 μs	101.07%	1.07%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.124000 μs	2.134 μs	100.47%	0.47%	.

Relative perf in group Velocity-Bench (1): 99.583%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench dl-mnist	2.400 s	2.390000 s	99.58%	-0.42%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 100.513%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2068.760000 ns	2161.990 ns	104.51%	4.51%	+++
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	290.580000 ns	292.911 ns	100.80%	0.80%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3119.410 ns	3110.730000 ns	99.72%	-0.28%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2701.690 ns	2624.960000 ns	97.16%	-2.84%	--

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 99.198%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	701.560000 ns	706.810 ns	100.75%	0.75%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	194.197000 ns	194.563 ns	100.19%	0.19%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	212.487 ns	208.270000 ns	98.02%	-1.98%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	274.260 ns	268.430000 ns	97.87%	-2.13%	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 99.186%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1918.170000 ns	1990.550 ns	103.77%	3.77%	++
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3304.840 ns	3265.110000 ns	98.80%	-1.20%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	256.172 ns	251.344000 ns	98.12%	-1.88%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1478.380 ns	1422.410000 ns	96.21%	-3.79%	--

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 95.384%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	206.016 ns	205.968000 ns	99.98%	-0.02%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	191.082 ns	189.983000 ns	99.42%	-0.58%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	310.891 ns	306.784000 ns	98.68%	-1.32%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	886.996 ns	748.528000 ns	84.39%	-15.61%	----------

Relative perf in group alloc/min (4): 100.437%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	946.629000 ns	965.220 ns	101.96%	1.96%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1039.820000 ns	1043.990 ns	100.40%	0.40%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	800.897 ns	800.188000 ns	99.91%	-0.09%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	177.011 ns	176.103000 ns	99.49%	-0.51%	.

Relative perf in group multiple (12): 99.084%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32242.400000 ns	33998.100 ns	105.45%	5.45%	+++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	161596.000000 ns	164458.000 ns	101.77%	1.77%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	14800.100000 ns	14852.400 ns	100.35%	0.35%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25668.400000 ns	25689.000 ns	100.08%	0.08%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4258.360 ns	4247.790000 ns	99.75%	-0.25%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	140987.000 ns	139811.000000 ns	99.17%	-0.83%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	74516.100 ns	73718.800000 ns	98.93%	-1.07%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	42679.000 ns	42068.500000 ns	98.57%	-1.43%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1192480.000 ns	1169360.000000 ns	98.06%	-1.94%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1197700.000 ns	1157480.000000 ns	96.64%	-3.36%	--
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	32646.200 ns	31315.500000 ns	95.92%	-4.08%	---
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	148020.000 ns	140253.000000 ns	94.75%	-5.25%	---

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

github-actions · 2025-01-28T22:59:13Z

Compute Benchmarks level_zero run (with params: --iterations 10):
https://github.com/oneapi-src/unified-runtime/actions/runs/13021025779

github-actions · 2025-01-28T23:16:45Z

Compute Benchmarks level_zero run (--iterations 10):
https://github.com/oneapi-src/unified-runtime/actions/runs/13021025779
Job status: success. Test status: success.

Summary

Total 38 benchmarks in mean.
Geomean 99.004%.
Improved 5 Regressed 12 (threshold 2.00%)

(result is better)

Performance change in benchmark groups

Relative perf in group memory (3): 99.694%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	219.613000 μs	221.653 μs	100.93%	0.93%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.875 μs	5.840000 μs	99.40%	-0.60%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	257.119 μs	253.936000 μs	98.76%	-1.24%	.

Relative perf in group api (2): 99.941%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.129000 μs	2.134 μs	100.23%	0.23%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.700 μs	1.694000 μs	99.65%	-0.35%	.

Relative perf in group Velocity-Bench (1): 100.000%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench dl-mnist	2.390000 s	2.390 s	100.00%	0.00%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 98.848%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2090.440000 ns	2161.990 ns	103.42%	3.42%	++
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	299.499 ns	292.911000 ns	97.80%	-2.20%	-
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3191.670 ns	3110.730000 ns	97.46%	-2.54%	-
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2710.490 ns	2624.960000 ns	96.84%	-3.16%	--

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 99.123%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	705.154000 ns	706.810 ns	100.23%	0.23%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	194.123000 ns	194.563 ns	100.23%	0.23%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	270.546 ns	268.430000 ns	99.22%	-0.78%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	215.041 ns	208.270000 ns	96.85%	-3.15%	--

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 102.129%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1242.110000 ns	1422.410 ns	114.52%	14.52%	+++++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1913.590000 ns	1990.550 ns	104.02%	4.02%	++
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	261.884 ns	251.344000 ns	95.98%	-4.02%	--
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3431.250 ns	3265.110000 ns	95.16%	-4.84%	--

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 94.448%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	205.508000 ns	205.968 ns	100.22%	0.22%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	307.908 ns	306.784000 ns	99.63%	-0.37%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	191.088 ns	189.983000 ns	99.42%	-0.58%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	933.896 ns	748.528000 ns	80.15%	-19.85%	----------

Relative perf in group alloc/min (4): 98.987%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1016.030000 ns	1043.990 ns	102.75%	2.75%	+
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	959.777000 ns	965.220 ns	100.57%	0.57%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	178.788 ns	176.103000 ns	98.50%	-1.50%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	848.298 ns	800.188000 ns	94.33%	-5.67%	---

Relative perf in group multiple (12): 99.141%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	30229.700000 ns	31315.500 ns	103.59%	3.59%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	163357.000000 ns	164458.000 ns	100.67%	0.67%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25579.800000 ns	25689.000 ns	100.43%	0.43%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	34018.900 ns	33998.100000 ns	99.94%	-0.06%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	140836.000 ns	139811.000000 ns	99.27%	-0.73%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	42387.800 ns	42068.500000 ns	99.25%	-0.75%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4283.980 ns	4247.790000 ns	99.16%	-0.84%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1180170.000 ns	1157480.000000 ns	98.08%	-1.92%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15187.500 ns	14852.400000 ns	97.79%	-2.21%	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75417.400 ns	73718.800000 ns	97.75%	-2.25%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1203720.000 ns	1169360.000000 ns	97.15%	-2.85%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	144864.000 ns	140253.000000 ns	96.82%	-3.18%	--

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

pbalcer · 2025-01-29T09:36:26Z

Many of the benchmarks failed to run:

2025-01-28T23:08:39.8287006Z terminate called after throwing an instance of 'sycl::_V1::exception'
2025-01-28T23:08:39.8290660Z   what():  The program was built for 1 devices
2025-01-28T23:08:39.8294383Z Build program log for 'Intel(R) Data Center GPU Max 1100':

2025-01-28T23:08:39.9512534Z FAILED assertion EXPECT_UR_RESULT_SUCCESS(urKernelSetArgPointer(kernel, 0, nullptr, usm[i][j]))
2025-01-28T23:08:39.9512866Z 	value: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
2025-01-28T23:08:39.9513329Z 	Location: /home/pmdk/bench_workdir/compute-benchmarks-repo/source/benchmarks/multithread_benchmark/implementations/ur/memcpy_execute_interleaved.cpp:113
2025-01-28T23:08:39.9513712Z

Looks like something with the new feature bugged out the drivers.

EDIT: I've restarted the system and I updated the UMD to https://github.com/intel/compute-runtime/releases/tag/24.52.32224.8.

github-actions · 2025-01-29T10:05:31Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13028914356

github-actions · 2025-01-29T10:25:41Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13028914356
Job status: failure. Test status: failure.

github-actions · 2025-01-29T15:15:19Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13034414623

github-actions · 2025-01-29T15:18:07Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13034414623
Job status: failure. Test status: skipped.

github-actions · 2025-01-30T10:44:35Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13050652218

github-actions · 2025-01-30T11:44:47Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13050652218
Job status: success. Test status: success.

Summary

Total 146 benchmarks in mean.
Geomean 104.379%.
Improved 41 Regressed 20 (threshold 2.00%)

(result is better)

Performance change in benchmark groups

Relative perf in group api (12): 101.983%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_ur SubmitKernel in order	14.907000 μs	16.785 μs	112.60%	12.60%	+
api_overhead_benchmark_sycl SubmitKernel in order	22.673000 μs	24.407 μs	107.65%	7.65%	+
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	115983.000000 instr	123166.000 instr	106.19%	6.19%	.
api_overhead_benchmark_sycl SubmitKernel out of order	22.952000 μs	23.506 μs	102.41%	2.41%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	107820.000000 instr	110006.000 instr	102.03%	2.03%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.139000 μs	2.149 μs	100.47%	0.47%	.
api_overhead_benchmark_ur SubmitKernel out of order	15.813000 μs	15.866 μs	100.34%	0.34%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	104883.000 instr	104663.000000 instr	99.79%	-0.21%	.
api_overhead_benchmark_l0 SubmitKernel in order	11.532 μs	11.395000 μs	98.81%	-1.19%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.785 μs	21.495000 μs	98.67%	-1.33%	.
api_overhead_benchmark_l0 SubmitKernel out of order	11.572 μs	11.369000 μs	98.25%	-1.75%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.713 μs	1.673000 μs	97.66%	-2.34%	.

Relative perf in group memory (4): 122.123%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	120.704000 μs	219.832 μs	182.12%	82.12%	++++++
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	225.251000 μs	252.914 μs	112.28%	12.28%	+
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.631000 μs	5.900 μs	104.78%	4.78%	.
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.187000 GB/s	3.070 GB/s	103.81%	3.81%	.
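
This group mixes latency rows (μs) with a bandwidth row (GB/s), so the direction of the ratio flips with the metric. The sketch below passes the direction in explicitly per row; the `lower_is_better` flag and the function name are assumptions for illustration, not taken from the benchmark scripts.

```python
def relative_perf(pr_value: float, baseline_value: float,
                  lower_is_better: bool) -> float:
    """Return a ratio where values above 1.0 always mean the PR is better."""
    if lower_is_better:
        return baseline_value / pr_value
    return pr_value / baseline_value


# (name, This PR, baseline, lower_is_better), values copied from the table above.
rows = [
    ("QueueInOrderMemcpy from Host to Device", 120.704, 219.832, True),
    ("StreamMemory Triad", 3.187, 3.070, False),
]

for name, pr, base, lower in rows:
    print(f"{name}: {relative_perf(pr, base, lower) * 100:.2f}%")
```

The two printed values match the 182.12% and 103.81% entries in the table. The flag is passed per row rather than inferred from the unit string, since the unit alone does not always indicate which direction the report treats as better.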

Relative perf in group miscellaneous (1): 99.966%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	858.316 bw GB/s	858.023000 bw GB/s	99.97%	-0.03%	.

Relative perf in group multithread (10): 140.490%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	844.435000 μs	2047.766 μs	242.50%	142.50%	++++++++++
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	19833.135000 μs	46811.855 μs	236.03%	136.03%	++++++++++
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	826.301000 μs	1199.669 μs	145.19%	45.19%	+++
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	19553.215000 μs	27030.035 μs	138.24%	38.24%	+++
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	6985.391000 μs	8883.578 μs	127.17%	27.17%	++
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	5695.653000 μs	6896.127 μs	121.08%	21.08%	+
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	6676.246000 μs	7766.797 μs	116.33%	16.33%	+
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	97661.135000 μs	112408.658 μs	115.10%	15.10%	+
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	15245.222000 μs	17165.065 μs	112.59%	12.59%	+
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	37927.968000 μs	42602.254 μs	112.32%	12.32%	+

Relative perf in group graph (10): 123.687%

Benchmark	This PR	baseline	Relative perf	Change	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	2453.321000 μs	5621.320 μs	229.13%	129.13%	+++++++++
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	24771.753000 μs	56454.921 μs	227.90%	127.90%	+++++++++
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	69362.686000 μs	72583.103 μs	104.64%	4.64%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	54.588000 μs	55.253 μs	101.22%	1.22%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	349518.135000 μs	353086.695 μs	101.02%	1.02%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	676.543000 μs	677.203 μs	100.10%	0.10%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	72113.761 μs	71746.038000 μs	99.49%	-0.51%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	358934.461 μs	353349.563000 μs	98.44%	-1.56%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	62.493000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5631.730000 μs

Relative perf in group Velocity-Bench (9): 100.207%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench Bitcracker	35.203600 s	38.359 s	108.96%	8.96%	+
Velocity-Bench Hashtable	370.681652 M keys/sec	363.340 M keys/sec	102.02%	2.02%	.
Velocity-Bench CudaSift	200.893000 ms	203.947 ms	101.52%	1.52%	.
Velocity-Bench QuickSilver	118.230000 MMS/CTT	116.460 MMS/CTT	101.52%	1.52%	.
Velocity-Bench Sobel Filter	602.999000 ms	603.076 ms	100.01%	0.01%	.
Velocity-Bench dl-cifar	23.634 s	23.630300 s	99.98%	-0.02%	.
Velocity-Bench dl-mnist	2.720 s	2.710000 s	99.63%	-0.37%	.
Velocity-Bench Easywave	228.000 ms	227.000000 ms	99.56%	-0.44%	.
Velocity-Bench svm	0.152 s	0.135900 s	89.64%	-10.36%	-

Relative perf in group Runtime (8): 98.029%

Benchmark	This PR	baseline	Relative perf	Change	-
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	274.697000 ms	276.461 ms	100.64%	0.64%	.
Runtime_IndependentDAGTaskThroughput_SingleTask	258.398000 ms	259.444 ms	100.40%	0.40%	.
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	275.886 ms	274.274000 ms	99.42%	-0.58%	.
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	277.595 ms	275.173000 ms	99.13%	-0.87%	.
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1762.919 ms	1710.439000 ms	97.02%	-2.98%	.
Runtime_DAGTaskThroughput_NDRangeParallelFor	1732.018 ms	1673.462000 ms	96.62%	-3.38%	.
Runtime_DAGTaskThroughput_SingleTask	1720.056 ms	1648.643000 ms	95.85%	-4.15%	.
Runtime_DAGTaskThroughput_BasicParallelFor	1788.297 ms	1704.436000 ms	95.31%	-4.69%	.

Relative perf in group MicroBench (14): 100.343%

Benchmark	This PR	baseline	Relative perf	Change	-
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	4.495000 ms	4.909 ms	109.21%	9.21%	+
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	4.683000 ms	4.940 ms	105.49%	5.49%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.402000 ms	4.585 ms	104.16%	4.16%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	4.617000 ms	4.716 ms	102.14%	2.14%	.
MicroBench_LocalMem_fp32_4096	29.858000 ms	29.902 ms	100.15%	0.15%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.256 ms	616.834000 ms	99.93%	-0.07%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	617.890 ms	617.437000 ms	99.93%	-0.07%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	617.898 ms	617.442000 ms	99.93%	-0.07%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.240 ms	616.784000 ms	99.93%	-0.07%	.
MicroBench_LocalMem_int32_4096	29.899 ms	29.862000 ms	99.88%	-0.12%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.407 ms	4.376000 ms	99.30%	-0.70%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	4.488 ms	4.456000 ms	99.29%	-0.71%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.518 ms	4.276000 ms	94.64%	-5.36%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.921 ms	4.526000 ms	91.97%	-8.03%	-

Relative perf in group Pattern (10): 103.611%

Benchmark	This PR	baseline	Relative perf	Change	-
Pattern_Reduction_Hierarchical_int32	13.626000 ms	16.339 ms	119.91%	19.91%	+
Pattern_Reduction_NDRange_int32	13.789000 ms	16.339 ms	118.49%	18.49%	+
Pattern_SegmentedReduction_NDRange_fp32	2.163000 ms	2.168 ms	100.23%	0.23%	.
Pattern_SegmentedReduction_NDRange_int64	2.335000 ms	2.337 ms	100.09%	0.09%	.
Pattern_SegmentedReduction_NDRange_int32	2.164000 ms	2.165 ms	100.05%	0.05%	.
Pattern_SegmentedReduction_NDRange_int16	2.264000 ms	2.265 ms	100.04%	0.04%	.
Pattern_SegmentedReduction_Hierarchical_int64	11.780000 ms	11.782 ms	100.02%	0.02%	.
Pattern_SegmentedReduction_Hierarchical_int32	11.588000 ms	11.588 ms	100.00%	0.00%	.
Pattern_SegmentedReduction_Hierarchical_int16	11.800 ms	11.796000 ms	99.97%	-0.03%	.
Pattern_SegmentedReduction_Hierarchical_fp32	11.592 ms	11.587000 ms	99.96%	-0.04%	.

Relative perf in group ScalarProduct (6): 99.900%

Benchmark	This PR	baseline	Relative perf	Change	-
ScalarProduct_Hierarchical_int32	10.525000 ms	10.541 ms	100.15%	0.15%	.
ScalarProduct_Hierarchical_fp32	10.153000 ms	10.167 ms	100.14%	0.14%	.
ScalarProduct_Hierarchical_int64	11.492 ms	11.490000 ms	99.98%	-0.02%	.
ScalarProduct_NDRange_fp32	3.754 ms	3.749000 ms	99.87%	-0.13%	.
ScalarProduct_NDRange_int32	3.777 ms	3.765000 ms	99.68%	-0.32%	.
ScalarProduct_NDRange_int64	5.448 ms	5.425000 ms	99.58%	-0.42%	.

Relative perf in group USM (7): 100.489%

Benchmark	This PR	baseline	Relative perf	Change	-
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.810000 ms	1.893 ms	104.59%	4.59%	.
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.206000 ms	1.258 ms	104.31%	4.31%	.
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.046000 ms	1.087 ms	103.92%	3.92%	.
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.679000 ms	1.737 ms	103.45%	3.45%	.
USM_Allocation_latency_fp32_host	37.723 ms	37.623000 ms	99.73%	-0.27%	.
USM_Allocation_latency_fp32_device	0.068 ms	0.065000 ms	95.59%	-4.41%	.
USM_Allocation_latency_fp32_shared	0.067 ms	0.062000 ms	92.54%	-7.46%	-

Relative perf in group VectorAddition (3): 99.986%

Benchmark	This PR	baseline	Relative perf	Change	-
VectorAddition_int64	3.050000 ms	3.088 ms	101.25%	1.25%	.
VectorAddition_fp32	1.482 ms	1.480000 ms	99.87%	-0.13%	.
VectorAddition_int32	1.494 ms	1.477000 ms	98.86%	-1.14%	.

Relative perf in group Polybench (3): 99.525%

Benchmark	This PR	baseline	Relative perf	Change	-
Polybench_2mm	1.040 ms	1.039000 ms	99.90%	-0.10%	.
Polybench_3mm	1.482 ms	1.477000 ms	99.66%	-0.34%	.
Polybench_Atax	6.466 ms	6.402000 ms	99.01%	-0.99%	.

Relative perf in group Kmeans (1): 100.035%

Benchmark	This PR	baseline	Relative perf	Change	-
Kmeans_fp32	14.106000 ms	14.111 ms	100.04%	0.04%	.

Relative perf in group LinearRegressionCoeff (1): 102.147%

Benchmark	This PR	baseline	Relative perf	Change	-
LinearRegressionCoeff_fp32	863.378000 ms	881.915 ms	102.15%	2.15%	.

Relative perf in group MolecularDynamics (1): 103.448%

Benchmark	This PR	baseline	Relative perf	Change	-
MolecularDynamics	0.029000 ms	0.030 ms	103.45%	3.45%	.

Relative perf in group llama.cpp (6): 100.822%

Benchmark	This PR	baseline	Relative perf	Change	-
llama.cpp Text Generation Batched 128	63.536107 token/s	62.791 token/s	101.19%	1.19%	.
llama.cpp Text Generation Batched 512	63.489263 token/s	62.789 token/s	101.12%	1.12%	.
llama.cpp Text Generation Batched 256	63.438981 token/s	62.777 token/s	101.05%	1.05%	.
llama.cpp Prompt Processing Batched 256	884.944770 token/s	878.291 token/s	100.76%	0.76%	.
llama.cpp Prompt Processing Batched 128	835.018224 token/s	830.097 token/s	100.59%	0.59%	.
llama.cpp Prompt Processing Batched 512	436.704848 token/s	435.724 token/s	100.23%	0.23%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (5): 97.882%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2076.010000 ns	2113.560 ns	101.81%	1.81%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	2733.570 ns	2688.530000 ns	98.35%	-1.65%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3180.660 ns	3097.620000 ns	97.39%	-2.61%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2536.280 ns	2464.050000 ns	97.15%	-2.85%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	303.386 ns	287.722000 ns	94.84%	-5.16%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (5): 98.785%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	271.812000 ns	272.237 ns	100.16%	0.16%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	210.890 ns	208.759000 ns	98.99%	-1.01%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	193.600 ns	191.313000 ns	98.82%	-1.18%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	717.514 ns	705.635000 ns	98.34%	-1.66%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	715.357 ns	698.410000 ns	97.63%	-2.37%	.

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (5): 98.840%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1804.180000 ns	2038.360 ns	112.98%	12.98%	+
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3358.930 ns	3338.690000 ns	99.40%	-0.60%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	263.205 ns	261.553000 ns	99.37%	-0.63%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1361.300 ns	1274.570000 ns	93.63%	-6.37%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	1358.000 ns	1226.080000 ns	90.29%	-9.71%	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (5): 99.740%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	304.225000 ns	310.903 ns	102.20%	2.20%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	195.845000 ns	196.551 ns	100.36%	0.36%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	191.724 ns	189.545000 ns	98.86%	-1.14%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	715.548 ns	706.907000 ns	98.79%	-1.21%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	717.984 ns	707.467000 ns	98.54%	-1.46%	.

Relative perf in group alloc/min (6): 97.723%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	947.697000 ns	958.800 ns	101.17%	1.17%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	177.754 ns	177.130000 ns	99.65%	-0.35%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	840.056 ns	832.725000 ns	99.13%	-0.87%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	176.648 ns	174.753000 ns	98.93%	-1.07%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	813.400 ns	797.092000 ns	98.00%	-2.00%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1074.330 ns	965.779000 ns	89.90%	-10.10%	-

Relative perf in group multiple (16): 100.751%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	14541.500000 ns	16418.600 ns	112.91%	12.91%	+
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	40144.600000 ns	41438.000 ns	103.22%	3.22%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	30082.600000 ns	30910.300 ns	102.75%	2.75%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4235.480000 ns	4283.690 ns	101.14%	1.14%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	29937.200000 ns	30121.800 ns	100.62%	0.62%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25408.000000 ns	25525.500 ns	100.46%	0.46%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	145903.000000 ns	146423.000 ns	100.36%	0.36%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	4195.380000 ns	4208.520 ns	100.31%	0.31%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	137977.000000 ns	138360.000 ns	100.28%	0.28%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75255.500000 ns	75451.700 ns	100.26%	0.26%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	141210.000 ns	140162.000000 ns	99.26%	-0.74%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1184930.000 ns	1174970.000000 ns	99.16%	-0.84%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	164311.000 ns	162279.000000 ns	98.76%	-1.24%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	27946.400 ns	27477.700000 ns	98.32%	-1.68%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	33924.000 ns	33153.600000 ns	97.73%	-2.27%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1193200.000 ns	1162100.000000 ns	97.39%	-2.61%	.

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

pbalcer · 2025-01-30T11:48:58Z

Perf looks good, but I had to use compute runtime from yesterday.

nrspruit · 2025-01-30T16:06:28Z

> Perf looks good, but I had to use compute runtime from yesterday.

Thanks. I am trying to determine why Graph/RecordReplay/barrier_multi_queue.cpp is failing in CI only; I cannot reproduce the failure outside of CI, even with the same versions of the driver and compiler libraries.

For example, here:
https://github.com/intel/llvm/actions/runs/13020477272/job/36320344367?pr=16431

I think this may be failing due to a memory leak that I fixed in this PR, but why it is failing now is not clear.

Once I can ensure that is not an issue or resolve it, then this patch should be ready to go.

- Cleaned up the checks for driver in order lists and migrated the check to platform.
- Updated version needed to match version with fixes.
- Fixed sync Immediate command List in order flag type.

Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit requested review from a team as code owners December 19, 2024 17:18

nrspruit requested a review from hdelan December 19, 2024 17:18

github-actions bot added the level-zero (L0 adapter specific issues) and command-buffer (Command Buffer feature addition/changes/specification) labels Dec 19, 2024

nrspruit force-pushed the enable_driver_in_order_compat_check branch from 6e4f1a9 to f072da3 December 19, 2024 17:22

nrspruit added a commit to nrspruit/llvm that referenced this pull request Dec 19, 2024

[UR][L0] Updated Driver In order lists check and required version

e33eb4a

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit mentioned this pull request Dec 19, 2024

[UR][L0] Updated Driver In order lists check and required version intel/llvm#16431

Draft

hdelan reviewed Jan 6, 2025

source/adapters/level_zero/common.hpp (outdated)

nrspruit force-pushed the enable_driver_in_order_compat_check branch from f072da3 to f0c556b January 14, 2025 16:08

EwanC approved these changes Jan 14, 2025

nrspruit force-pushed the enable_driver_in_order_compat_check branch from f0c556b to d13db51 January 14, 2025 16:51

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 15, 2025

[UR][L0] Updated Driver In order lists check and required version

0c1b459

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit force-pushed the enable_driver_in_order_compat_check branch from d13db51 to 23a9979 January 15, 2025 16:57

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 15, 2025

[UR][L0] Updated Driver In order lists check and required version

cb7946f

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 15, 2025

[UR][L0] Updated Driver In order lists check and required version

9daa780

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 15, 2025

[UR][L0] Updated Driver In order lists check and required version

143c700

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 15, 2025

[UR][L0] Updated Driver In order lists check and required version

11098e7

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit marked this pull request as draft January 15, 2025 23:02

nrspruit force-pushed the enable_driver_in_order_compat_check branch from 23a9979 to 16e070f January 24, 2025 00:08

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 24, 2025

[UR][L0] Updated Driver In order lists check and required version

f0f7107

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

pbalcer approved these changes Jan 28, 2025

nrspruit force-pushed the enable_driver_in_order_compat_check branch 3 times, most recently from 108d976 to a0da40d January 28, 2025 17:46

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 28, 2025

[UR][L0] Updated Driver In order lists check and required version

156acfa

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit marked this pull request as ready for review January 28, 2025 17:48

nrspruit force-pushed the enable_driver_in_order_compat_check branch from 24239e3 to c759860 January 29, 2025 17:14

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 29, 2025

[UR][L0] Updated Driver In order lists check and required version

cb279ad

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit force-pushed the enable_driver_in_order_compat_check branch from c759860 to 03574ae January 30, 2025 16:50

nrspruit added a commit to nrspruit/llvm that referenced this pull request Jan 30, 2025

[UR][L0] Updated Driver In order lists check and required version

ecd1fc6

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit force-pushed the enable_driver_in_order_compat_check branch from 03574ae to e322cd7 February 7, 2025 18:24

nrspruit added a commit to nrspruit/llvm that referenced this pull request Feb 7, 2025

[UR][L0] Updated Driver In order lists check and required version

046fe99

-pre-commit PR for oneapi-src/unified-runtime#2491 Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit force-pushed the enable_driver_in_order_compat_check branch from e322cd7 to 4ebc29a February 7, 2025 21:32

nrspruit added 2 commits February 7, 2025 16:13

Removed unnecessary driver workaround

1fd368a

Signed-off-by: Neil R. Spruit <[email protected]>

nrspruit force-pushed the enable_driver_in_order_compat_check branch from 4ebc29a to 1fd368a February 8, 2025 00:13
