perf: implement Conv1D with im2col and GEMM #1597

Merged: 1 commit into OpenNMT:master on Feb 9, 2024

Conversation

@ebraraktas (Contributor) commented on Jan 11, 2024

This PR implements the Conv1D operation on CPU with im2col and GEMM, bringing up to a 16x speedup. The test results below show the total time spent in the 2 Conv1D layers of the whisper-tiny encoder (a short sketch of the technique follows the tables).

On a MacBook Pro with M2 Pro (10 cores, GemmBackend=Accelerate):

| thread_count | master | im2col + GEMM | speedup |
| --- | --- | --- | --- |
| 1 | 80.5 ms | 4.8 ms | 16.7x |
| 2 | 41.6 ms | 4.4 ms | 9.4x |
| 4 | 22.0 ms | 4.5 ms | 4.9x |
| 8 | 20.1 ms | 4.9 ms | 4.1x |
| 10 | 18.7 ms | 4.5 ms | 4.2x |

On Android (Samsung Galaxy S21, GemmBackend=RUY):

| thread_count | master | im2col + GEMM | speedup |
| --- | --- | --- | --- |
| 1 | 627.7 ms | 39.3 ms | 16.0x |
| 2 | 350.9 ms | 28.1 ms | 12.5x |
| 4 | 192.6 ms | 38.6 ms | 5.0x |

The test results above were measured with batch_size=1, but the speedup ratio is approximately the same with batch_size=4 (on the MacBook):

| thread_count | master | im2col + GEMM | speedup |
| --- | --- | --- | --- |
| 1 | 320.9 ms | 18.3 ms | 17.6x |
| 2 | 162.6 ms | 17.5 ms | 9.3x |
| 4 | 82.2 ms | 15.9 ms | 5.2x |
| 8 | 63.8 ms | 16.0 ms | 4.0x |
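
For context, here is a minimal NumPy sketch of the technique (illustrative only, not the PR's C++ code; the function name and the naive reference check are my own): im2col unfolds every kernel window of the input into a column of a buffer, so the whole convolution reduces to a single GEMM per batch item.

```python
import numpy as np

def conv1d_im2col(x, w, stride=1, padding=0):
    """x: (batch, in_channels, in_length), w: (out_channels, in_channels, kernel_size)."""
    batch, in_c, _ = x.shape
    out_c, _, k = w.shape
    if padding:
        x = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_len = (x.shape[2] - k) // stride + 1
    # im2col buffer: batch_size * (in_channels * kernel_size) * out_length elements;
    # this is the temporary tensor whose size is discussed below.
    col = np.empty((batch, in_c * k, out_len), dtype=x.dtype)
    for i in range(k):  # col[b, c * k + i, t] = x[b, c, t * stride + i]
        col[:, i::k, :] = x[:, :, i : i + out_len * stride : stride]
    # GEMM: (out_c, in_c * k) @ (batch, in_c * k, out_len) -> (batch, out_c, out_len)
    return np.matmul(w.reshape(out_c, in_c * k), col)

# Sanity check against a naive direct convolution on toy sizes
# (stride=2, padding=1 mirrors the second whisper encoder conv).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 10)).astype(np.float32)
w = rng.standard_normal((4, 3, 3)).astype(np.float32)
out = conv1d_im2col(x, w, stride=2, padding=1)
xp = np.pad(x, ((0, 0), (0, 0), (1, 1)))
ref = np.zeros_like(out)
for b in range(2):
    for o in range(4):
        for t in range(out.shape[2]):
            ref[b, o, t] = (xp[b, :, 2 * t : 2 * t + 3] * w[o]).sum()
assert np.allclose(out, ref, atol=1e-5)
```

The out_length values in the tables follow from the whisper encoder convolutions: kernel_size=3 with padding=1 keeps out_length=3000 in the first layer, and stride=2 halves it to 1500 in the second.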

EDIT: I want to add some information about the memory usage of this implementation.

The previous implementation applies a Transpose to both the input and the weight, and, if I am not mistaken, it allocates temporary buffers of the same size as the input and the weight.

This one creates a temporary im2col tensor with batch_size * (in_channels * kernel_size) * out_length elements. Hence it usually allocates slightly more than the previous implementation:

old_usage = out_channel * in_channel * kernel_size + batch_size * in_channel * in_length
new_usage = batch_size * in_channel * kernel_size * out_length
Here are example numbers for whisper-tiny (out_channel = 384, kernel_size = 3):

| batch_size | in_channel | in_length | out_length | old_usage (KB) | new_usage (KB) | relative_impact |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 80 | 3000 | 3000 | 324 | 703 | +379 KB (+117%) |
| 1 | 384 | 3000 | 1500 | 1557 | 1687 | +130 KB (+8%) |
| 2 | 80 | 3000 | 3000 | 558 | 1406 | +848 KB (+152%) |
| 2 | 384 | 3000 | 1500 | 2682 | 3375 | +693 KB (+26%) |
| 4 | 80 | 3000 | 3000 | 1027 | 2812 | +1785 KB (+174%) |
| 4 | 384 | 3000 | 1500 | 4932 | 6750 | +1818 KB (+37%) |
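
For reproducibility, here is a tiny helper (my own, not code from the PR) that evaluates the two formulas above; the KB columns match the raw element counts divided by 1024, i.e. they appear to count one byte per element:

```python
# Reproduce the usage columns from the two formulas above
# (my helper, not from the PR; "KB" here = element count / 1024).
def usage_kb(batch_size, in_channel, in_length, out_channel, out_length, kernel_size=3):
    old = out_channel * in_channel * kernel_size + batch_size * in_channel * in_length
    new = batch_size * in_channel * kernel_size * out_length
    return old // 1024, new // 1024

print(usage_kb(1, 80, 3000, 384, 3000))   # (324, 703)   -> first row
print(usage_kb(4, 384, 3000, 384, 1500))  # (4932, 6750) -> last row
```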

@minhthuc2502 (Collaborator) commented:

Thank you for your PR. Do you think that with larger models the increase in memory will be significant? @nguyendc-systran @vince62s What do you think?

@ebraraktas (Contributor, Author) commented:

You can find the allocation sizes for large-v2 below:

| batch_size | in_channel | in_length | out_channel | out_length | old_usage (KB) | new_usage (KB) | relative_impact |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 80 | 3000 | 1280 | 3000 | 534 | 703 | +169 KB (+32%) |
| 1 | 1280 | 3000 | 1280 | 1500 | 8550 | 5625 | -2925 KB (-34%) |
| 2 | 80 | 3000 | 1280 | 3000 | 768 | 1406 | +638 KB (+83%) |
| 2 | 1280 | 3000 | 1280 | 1500 | 12300 | 11250 | -1050 KB (-9%) |
| 4 | 80 | 3000 | 1280 | 3000 | 1237 | 2812 | +1575 KB (+127%) |
| 4 | 1280 | 3000 | 1280 | 1500 | 19800 | 22500 | +2700 KB (+14%) |

Note that these are temporary tensors, and I don't think an extra ~3 MB for batch_size=4 is a big amount.

@ebraraktas (Contributor, Author) commented:

Are there any blockers to merging this?

@minhthuc2502 (Collaborator) commented:

The extra temporary memory required is not significant. It looks good to me. I will merge this.

@minhthuc2502 merged commit ce47032 into OpenNMT:master on Feb 9, 2024 (17 checks passed).
@ebraraktas deleted the perf/conv1d-with-im2col branch on Mar 26, 2024.