Thanks for open-sourcing this great work. While trying the code, I found that training is ~3x slower than Swin Transformer. For example, quadtree-b2, which has similar FLOPs to Swin-T, takes ~2.5s per batch to train. It is even slower (3s/batch) when I align its macro design (depths, embedding dims, etc.) with Swin-T.
Could you share some insight into why this happens?
Not exactly. There are two reasons: 1) we implement quadtree attention in raw CUDA without much optimization, and we would expect a speedup if it were implemented with torch.geometry; 2) the sparse nature of quadtree attention makes it unfriendly to hardware, and this cannot be solved at the code level.
I suggest that the easiest fix is to reduce top K: you can achieve a significant speedup without much performance loss.
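To illustrate why a smaller top K helps, here is a minimal NumPy sketch of the top-K selection step (not the authors' CUDA kernel; `topk_attention` and its shapes are my own simplification). Each query attends only to its K highest-scoring keys, so the softmax and value aggregation cost O(N·K) instead of O(N²), but the gather makes memory access irregular, which is the hardware-unfriendliness mentioned above:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_attention(q, k, v, topk):
    """Sparse attention keeping only the top-K keys per query.

    q, k, v: arrays of shape (N, D). Returns (N, D).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])                      # (N, N)
    # Indices of the K largest scores per row (unsorted order is fine).
    idx = np.argpartition(-scores, topk - 1, axis=-1)[:, :topk]  # (N, K)
    sel = np.take_along_axis(scores, idx, axis=-1)               # (N, K)
    attn = softmax(sel, axis=-1)                                 # softmax over K only
    # Gather the selected values (N, K, D) and aggregate per query.
    return np.einsum('nk,nkd->nd', attn, v[idx])                 # (N, D)
```

With `topk` equal to the full sequence length this reduces to dense attention; shrinking `topk` trades a small amount of accuracy for proportionally less compute per query.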