perf: implement Conv1D with im2col and GEMM #1597

Merged: 1 commit into OpenNMT:master on Feb 9, 2024

Conversation

@ebraraktas (Contributor) commented on Jan 11, 2024

This PR implements the Conv1D operation on CPU with im2col and GEMM, bringing up to a 16x speedup. The test results below show the total time spent in the 2 Conv1D layers of the whisper-tiny encoder (a short sketch of the technique follows the tables).

On a MacBook Pro with M2 Pro (10 cores, GemmBackend=Accelerate):

| thread_count | master | im2col + GEMM | speedup |
| --- | --- | --- | --- |
| 1 | 80.5 ms | 4.8 ms | 16.7x |
| 2 | 41.6 ms | 4.4 ms | 9.4x |
| 4 | 22.0 ms | 4.5 ms | 4.9x |
| 8 | 20.1 ms | 4.9 ms | 4.1x |
| 10 | 18.7 ms | 4.5 ms | 4.2x |

On Android (Samsung Galaxy S21, GemmBackend=RUY):

| thread_count | master | im2col + GEMM | speedup |
| --- | --- | --- | --- |
| 1 | 627.7 ms | 39.3 ms | 16.0x |
| 2 | 350.9 ms | 28.1 ms | 12.5x |
| 4 | 192.6 ms | 38.6 ms | 5.0x |

The test results above were measured with batch_size=1, but the speedup ratio is approximately the same with batch_size=4 (on the MacBook):

| thread_count | master | im2col + GEMM | speedup |
| --- | --- | --- | --- |
| 1 | 320.9 ms | 18.3 ms | 17.6x |
| 2 | 162.6 ms | 17.5 ms | 9.3x |
| 4 | 82.2 ms | 15.9 ms | 5.2x |
| 8 | 63.8 ms | 16.0 ms | 4.0x |
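
For context, here is a minimal NumPy sketch of the technique (illustrative only, not the PR's C++ code; the function name and the naive reference check are my own): im2col unfolds every kernel window of the input into a column of a buffer, so the whole convolution reduces to a single GEMM per batch item.

```python
import numpy as np

def conv1d_im2col(x, w, stride=1, padding=0):
    """x: (batch, in_channels, in_length), w: (out_channels, in_channels, kernel_size)."""
    batch, in_c, _ = x.shape
    out_c, _, k = w.shape
    if padding:
        x = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_len = (x.shape[2] - k) // stride + 1
    # im2col buffer: batch_size * (in_channels * kernel_size) * out_length elements;
    # this is the temporary tensor whose size is discussed below.
    col = np.empty((batch, in_c * k, out_len), dtype=x.dtype)
    for i in range(k):  # col[b, c * k + i, t] = x[b, c, t * stride + i]
        col[:, i::k, :] = x[:, :, i : i + out_len * stride : stride]
    # GEMM: (out_c, in_c * k) @ (batch, in_c * k, out_len) -> (batch, out_c, out_len)
    return np.matmul(w.reshape(out_c, in_c * k), col)

# Sanity check against a naive direct convolution on toy sizes
# (stride=2, padding=1 mirrors the second whisper encoder conv).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 10)).astype(np.float32)
w = rng.standard_normal((4, 3, 3)).astype(np.float32)
out = conv1d_im2col(x, w, stride=2, padding=1)
xp = np.pad(x, ((0, 0), (0, 0), (1, 1)))
ref = np.zeros_like(out)
for b in range(2):
    for o in range(4):
        for t in range(out.shape[2]):
            ref[b, o, t] = (xp[b, :, 2 * t : 2 * t + 3] * w[o]).sum()
assert np.allclose(out, ref, atol=1e-5)
```

The out_length values in the tables follow from the whisper encoder convolutions: kernel_size=3 with padding=1 keeps out_length=3000 in the first layer, and stride=2 halves it to 1500 in the second.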

EDIT: I want to add some information about the memory usage of this implementation.

The previous implementation applies a Transpose to both the input and the weight, and, if I am not mistaken, it allocates temporary buffers of the same size as the input and the weight.

This one creates a temporary im2col tensor with batch_size * (in_channels * kernel_size) * out_length elements. Hence it usually allocates slightly more than the previous implementation:

old_usage = out_channel * in_channel * kernel_size + batch_size * in_channel * in_length
new_usage = batch_size * in_channel * kernel_size * out_length
Here are example numbers for whisper-tiny (out_channel = 384, kernel_size = 3):

| batch_size | in_channel | in_length | out_length | old_usage (KB) | new_usage (KB) | relative_impact |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 80 | 3000 | 3000 | 324 | 703 | +379 KB (+117%) |
| 1 | 384 | 3000 | 1500 | 1557 | 1687 | +130 KB (+8%) |
| 2 | 80 | 3000 | 3000 | 558 | 1406 | +848 KB (+152%) |
| 2 | 384 | 3000 | 1500 | 2682 | 3375 | +693 KB (+26%) |
| 4 | 80 | 3000 | 3000 | 1027 | 2812 | +1785 KB (+174%) |
| 4 | 384 | 3000 | 1500 | 4932 | 6750 | +1818 KB (+37%) |
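
For reproducibility, here is a tiny helper (my own, not code from the PR) that evaluates the two formulas above; the KB columns match the raw element counts divided by 1024, i.e. they appear to count one byte per element:

```python
# Reproduce the usage columns from the two formulas above
# (my helper, not from the PR; "KB" here = element count / 1024).
def usage_kb(batch_size, in_channel, in_length, out_channel, out_length, kernel_size=3):
    old = out_channel * in_channel * kernel_size + batch_size * in_channel * in_length
    new = batch_size * in_channel * kernel_size * out_length
    return old // 1024, new // 1024

print(usage_kb(1, 80, 3000, 384, 3000))   # (324, 703)   -> first row
print(usage_kb(4, 384, 3000, 384, 1500))  # (4932, 6750) -> last row
```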

@minhthuc2502 (Collaborator) commented:

Thank you for your PR. Do you think that with larger models the increase in memory will be significant? @nguyendc-systran @vince62s What do you think?

@ebraraktas (Contributor, Author) commented:

You can find the allocation sizes for large-v2 below:

| batch_size | in_channel | in_length | out_channel | out_length | old_usage (KB) | new_usage (KB) | relative_impact |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 80 | 3000 | 1280 | 3000 | 534 | 703 | +169 KB (+32%) |
| 1 | 1280 | 3000 | 1280 | 1500 | 8550 | 5625 | -2925 KB (-34%) |
| 2 | 80 | 3000 | 1280 | 3000 | 768 | 1406 | +638 KB (+83%) |
| 2 | 1280 | 3000 | 1280 | 1500 | 12300 | 11250 | -1050 KB (-9%) |
| 4 | 80 | 3000 | 1280 | 3000 | 1237 | 2812 | +1575 KB (+127%) |
| 4 | 1280 | 3000 | 1280 | 1500 | 19800 | 22500 | +2700 KB (+14%) |

Note that these are temporary tensors, and I don't think an extra ~3 MB for batch_size=4 is a big amount.

@ebraraktas (Contributor, Author) commented:

Are there any blockers to merging this?

@minhthuc2502 (Collaborator) commented:

The extra temporary memory required is not significant. It looks good to me. I will merge this.

@minhthuc2502 merged commit ce47032 into OpenNMT:master on Feb 9, 2024 (17 checks passed).
@ebraraktas deleted the perf/conv1d-with-im2col branch on Mar 26, 2024.