Architecture: Oryon-1
Setting: 4 cores @ 3.4 GHz + 8 cores @ 4.0 GHz
For single core:
PS C:\Data\cpufp> .\cpufp.exe --thread_pool=[4]
Number Threads: 1
Thread Pool Binding: 4
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| i8mm            | mmla(s32,s8,s8)         | 442.36 GOPS      |
| i8mm            | mmla(u32,u8,u8)         | 434.67 GOPS      |
| i8mm            | mmla(s32,u8,s8)         | 437.35 GOPS      |
| i8mm            | dp4a.vs(s32,s8,u8)      | 520.02 GOPS      |
| i8mm            | dp4a.vs(s32,u8,s8)      | 525.78 GOPS      |
| i8mm            | dp4a.vv(s32,u8,s8)      | 515.6 GOPS       |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 510.91 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 516.89 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 518 GOPS         |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 514.3 GOPS       |
| bf16            | mmla(f32,bf16,bf16)     | 223.53 GFLOPS    |
| bf16            | dp2a.vs(f32,bf16,bf16)  | 256.44 GFLOPS    |
| bf16            | dp2a.vv(f32,bf16,bf16)  | 252.13 GFLOPS    |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 260.4 GFLOPS     |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 259.04 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 127.29 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 125.67 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 65.2 GFLOPS      |
| asimd           | fmla.vv(f64,f64,f64)    | 65.195 GFLOPS    |
----------------------------------------------------------------
For 12 cores:
PS C:\Data\cpufp> .\cpufp.exe --thread_pool=[0-11]
Number Threads: 12
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| i8mm            | mmla(s32,s8,s8)         | 4.3971 TOPS      |
| i8mm            | mmla(u32,u8,u8)         | 4.3813 TOPS      |
| i8mm            | mmla(s32,u8,s8)         | 4.3889 TOPS      |
| i8mm            | dp4a.vs(s32,s8,u8)      | 5.1953 TOPS      |
| i8mm            | dp4a.vs(s32,u8,s8)      | 5.221 TOPS       |
| i8mm            | dp4a.vv(s32,u8,s8)      | 5.209 TOPS       |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 5.2081 TOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 5.2275 TOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 5.222 TOPS       |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 5.2146 TOPS      |
| bf16            | mmla(f32,bf16,bf16)     | 2.2578 TFLOPS    |
| bf16            | dp2a.vs(f32,bf16,bf16)  | 2.6124 TFLOPS    |
| bf16            | dp2a.vv(f32,bf16,bf16)  | 2.6172 TFLOPS    |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 2.6051 TFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 2.6035 TFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 1.3028 TFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 1.3032 TFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 654.67 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 654.44 GFLOPS    |
----------------------------------------------------------------
Architecture: Taishan V110
Setting: 2 * 64 cores
For single core:
$ ./cpufp --thread_pool=[1]
Number Threads: 1
Thread Pool Binding: 1
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 166.3 GOPS       |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 166.32 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 166.31 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 166.29 GOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 83.161 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 83.151 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 41.576 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 41.579 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 10.395 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 10.394 GFLOPS    |
----------------------------------------------------------------
For 32 cores:
$ ./cpufp --thread_pool=[0-31]
Number Threads: 32
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 5.304 TOPS       |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 5.3108 TOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 5.307 TOPS       |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 5.3123 TOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 2.6555 TFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 2.6564 TFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 1.3252 TFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 1.328 TFLOPS     |
| asimd           | fmla.vs(f64,f64,f64)    | 331.95 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 331.98 GFLOPS    |
----------------------------------------------------------------
For 64 cores:
$ ./cpufp --thread_pool=[0-63]
Number Threads: 64
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 10.601 TOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 10.586 TOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 10.587 TOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 10.593 TOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 5.2966 TFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 5.2975 TFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 2.6551 TFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 2.6557 TFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 663.98 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 663.73 GFLOPS    |
----------------------------------------------------------------
For 128 cores:
$ ./cpufp --thread_pool=[0-127]
Number Threads: 128
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 20.951 TOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 20.27 TOPS       |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 19.736 TOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 16.495 TOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 10.481 TFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 10.514 TFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 5.1993 TFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 4.117 TFLOPS     |
| asimd           | fmla.vs(f64,f64,f64)    | 1.2754 TFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 1.049 TFLOPS     |
----------------------------------------------------------------
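One way to read the Taishan V110 runs together is to compare each multi-threaded peak against the thread count times the single-core peak. The sketch below does that for fmla.vs(f32,f32,f32) using the figures reported above; the function name `scaling_efficiency` and the choice of this one kernel are illustrative assumptions, not part of cpufp.

```python
# fmla.vs(f32,f32,f32) peaks on Taishan V110, in GFLOPS,
# copied from the single-core, 32-, 64- and 128-thread runs above.
single_core = 41.576
measured = {32: 1325.2, 64: 2655.1, 128: 5199.3}


def scaling_efficiency(n_threads, multi_gflops, single_gflops):
    """Measured throughput divided by n_threads times the single-core figure."""
    return multi_gflops / (n_threads * single_gflops)


efficiency = {
    n: scaling_efficiency(n, gflops, single_core)
    for n, gflops in measured.items()
}

for n, eff in efficiency.items():
    print(f"{n:>3} threads: {eff:.1%} of linear scaling")
```

By this measure the 32- and 64-thread runs stay above 99% of linear scaling, while the 128-thread run falls to a little under 98% for this kernel. Some other rows of the 128-thread table, such as fmla.vv(f32,f32,f32), drop further.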
Architecture: Neoverse V1
Setting: Virtual 1 Core
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| i8mm            | mmla(s32,s8,s8)         | 332.34 GOPS      |
| i8mm            | mmla(u32,u8,u8)         | 332.46 GOPS      |
| i8mm            | mmla(s32,u8,s8)         | 332.46 GOPS      |
| i8mm            | dp4a.vs(s32,s8,u8)      | 166.23 GOPS      |
| i8mm            | dp4a.vs(s32,u8,s8)      | 166.17 GOPS      |
| i8mm            | dp4a.vv(s32,u8,s8)      | 166.14 GOPS      |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 166.18 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 166.22 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 166.22 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 166.22 GOPS      |
| bf16            | mmla(f32,bf16,bf16)     | 166.18 GFLOPS    |
| bf16            | dp2a.vs(f32,bf16,bf16)  | 83.085 GFLOPS    |
| bf16            | dp2a.vv(f32,bf16,bf16)  | 83.111 GFLOPS    |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 83.105 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 83.113 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 41.549 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 41.542 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 35.96 GFLOPS     |
| asimd           | fmla.vv(f64,f64,f64)    | 20.779 GFLOPS    |
----------------------------------------------------------------
Setting: 4 Cortex-A72 Cores
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 11.958 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32) | 11.958 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64) | 5.9792 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 5.9792 GFLOPS    |
-------------------------------------------------------------
For 4 cores:
$ ./cpufp --thread_pool=[0-3]
Number Threads: 4
Thread Pool Binding: 0 1 2 3
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 47.883 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32) | 47.88 GFLOPS     |
| asimd           | fmla.vs(f64,f64,f64) | 23.933 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 23.943 GFLOPS    |
-------------------------------------------------------------
Setting: 4 Cortex-A76 Cores
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 153.48 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 153.48 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 153.47 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 153.48 GOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 76.738 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 76.738 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 38.369 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 38.369 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 19.185 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 19.185 GFLOPS    |
----------------------------------------------------------------
For 4 cores:
$ ./cpufp --thread_pool=[0-3]
Number Threads: 4
Thread Pool Binding: 0 1 2 3
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 613.79 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 614.02 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 613.98 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 613.99 GOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 306.88 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 306.98 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 153.48 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 153.5 GFLOPS     |
| asimd           | fmla.vs(f64,f64,f64)    | 74.513 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 76.751 GFLOPS    |
----------------------------------------------------------------
Setting: 4 Cortex-A76(big) Cores + 4 Cortex-A55(Little) Cores
For single Little core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 58.379 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 58.371 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 58.369 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 58.382 GOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 29.193 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 29.192 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 14.593 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 14.596 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 7.2971 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 7.2972 GFLOPS    |
----------------------------------------------------------------
For 4 Little cores:
$ ./cpufp --thread_pool=[0-3]
Number Threads: 4
Thread Pool Binding: 0 1 2 3
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 233.08 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 233.05 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 233.06 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 233.05 GOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 116.54 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 116.51 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 58.261 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 58.258 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 29.13 GFLOPS     |
| asimd           | fmla.vv(f64,f64,f64)    | 29.126 GFLOPS    |
----------------------------------------------------------------
For single big core:
$ ./cpufp --thread_pool=[4]
Number Threads: 1
Thread Pool Binding: 4
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 152.1 GOPS       |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 152.1 GOPS       |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 152.06 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 152.08 GOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 76.022 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 76.027 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 38.012 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 38.008 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 19.004 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 19.004 GFLOPS    |
----------------------------------------------------------------
For 4 big cores:
$ ./cpufp --thread_pool=[4-7]
Number Threads: 4
Thread Pool Binding: 4 5 6 7
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 601.71 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 602.2 GOPS       |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 602.22 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 602.2 GOPS       |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 300.97 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 300.93 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 149.79 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 150.15 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 75.222 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 75.215 GFLOPS    |
----------------------------------------------------------------
Setting: 2 Cortex-A72(big) Cores + 4 Cortex-A53(Little) Cores
For single Little core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 11.255 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32) | 11.255 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64) | 5.6275 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 5.6277 GFLOPS    |
-------------------------------------------------------------
For 4 Little cores:
$ ./cpufp --thread_pool=[0-3]
Number Threads: 4
Thread Pool Binding: 0 1 2 3
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 45.029 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32) | 45.027 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64) | 22.509 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 22.513 GFLOPS    |
-------------------------------------------------------------
For single big core:
$ ./cpufp --thread_pool=[4]
Number Threads: 1
Thread Pool Binding: 4
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 14.348 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32) | 14.348 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64) | 7.1744 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 7.1743 GFLOPS    |
-------------------------------------------------------------
For 2 big cores:
$ ./cpufp --thread_pool=[4,5]
Number Threads: 2
Thread Pool Binding: 4 5
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 28.698 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32) | 28.698 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64) | 14.349 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 14.347 GFLOPS    |
-------------------------------------------------------------
Setting: 8 FTC663 Cores
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 18.376 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32) | 18.375 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64) | 9.1877 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 9.1891 GFLOPS    |
-------------------------------------------------------------
For 4 cores:
$ ./cpufp --thread_pool=[0-3]
Number Threads: 4
Thread Pool Binding: 0 1 2 3
-------------------------------------------------------------
| Instruction Set | Core Computation     | Peak Performance |
| asimd           | fmla.vs(f32,f32,f32) | 73.51 GFLOPS     |
| asimd           | fmla.vv(f32,f32,f32) | 73.51 GFLOPS     |
| asimd           | fmla.vs(f64,f64,f64) | 36.755 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64) | 36.747 GFLOPS    |
-------------------------------------------------------------
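Because the results mix GOPS/GFLOPS and TOPS/TFLOPS, comparing runs is easier once every row is on one scale. The following is a minimal sketch of a parser for the row layout shown in the tables above (instruction set, core computation, value with unit). The names `ROW` and `parse_cpufp` are hypothetical helpers written for this example and are not part of cpufp; the sketch assumes the unit is always one of GOPS, TOPS, GFLOPS or TFLOPS.

```python
import re

# Row layout assumed from the tables above:
#   | <instruction set> | <core computation> | <value> <G|T><OPS|FLOPS> |
ROW = re.compile(
    r"\|\s*([^|\s]+)\s*\|\s*([^|\s]+)\s*\|\s*([0-9.]+)\s*([GT])(OPS|FLOPS)\s*\|"
)


def parse_cpufp(text):
    """Return a list of (isa, kernel, giga_value, kind) tuples.

    Values reported with a T prefix are scaled by 1000 so that every
    row is expressed in GOPS or GFLOPS.
    """
    rows = []
    for isa, kernel, value, prefix, kind in ROW.findall(text):
        scale = 1000.0 if prefix == "T" else 1.0
        rows.append((isa, kernel, float(value) * scale, kind))
    return rows


sample = """
| Instruction Set | Core Computation | Peak Performance |
| asimd | fmla.vs(f32,f32,f32) | 1.3028 TFLOPS |
| asimd | fmla.vs(f64,f64,f64) | 654.67 GFLOPS |
"""

for row in parse_cpufp(sample):
    print(row)
```

The header row is skipped because its cells contain spaces, and the pattern does not depend on line breaks, so it also works on output that has been pasted as a single line.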