Transformer Model Compression for Edge Deployment
tags : model compression, transformers, edge learning, federated learning, iwslt translation, english, german, deep learning, pytorch
Transformer architectures and their extensions, such as BERT and GPT, have revolutionized natural language, speech, and image processing. However, their large parameter counts and computation cost prevent transformer models from being deployed on edge devices such as smartphones. In this work, we explore model compression for transformer architectures through quantization. Quantization not only reduces the memory footprint but also improves energy efficiency: research has shown that an 8-bit quantized model uses 4x less memory and 18x less energy. Compressing transformer models therefore reduces storage, memory footprint, and compute power requirements. We show that transformer models can be compressed with no loss of performance, and in some cases improved performance, on the IWSLT English-German translation task. We specifically explore quantization-aware training of the linear layers and report results for 8-bit, 4-bit, 2-bit, and 1-bit (binary) quantization. We find the linear layers of the attention network to be highly resilient to quantization; they can be compressed aggressively. A detailed description of the quantization algorithms and an analysis of the results are available in the Report.
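As a rough illustration of what k-bit quantization does to a set of weights, the sketch below implements a generic symmetric uniform quantizer and a sign-based binarizer in plain Python. It is not the algorithm described in the Report; the function names (quantize_uniform, binarize, compression_ratio) and the 32-bit floating-point baseline are assumptions made for this example.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization (illustrative, not the Report's method).

    Each weight is snapped to the nearest of the evenly spaced levels
    in [-max|w|, +max|w|]. With 2 bits this yields the three levels
    {-max, 0, +max}.
    """
    if bits < 2:
        raise ValueError("use binarize() for 1-bit weights")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1      # e.g. 127 positive levels for 8 bits
    scale = max_abs / levels          # width of one quantization step
    return [round(w / scale) * scale for w in weights]


def binarize(weights):
    """1-bit quantization: keep the sign, scaled by the mean magnitude."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def compression_ratio(bits, baseline_bits=32):
    """Storage saving relative to an assumed 32-bit float baseline."""
    return baseline_bits / bits


if __name__ == "__main__":
    weights = [0.6, -1.0, 0.2, -0.05]
    for bits in (8, 4, 2):
        print(bits, "bits:", quantize_uniform(weights, bits),
              f"({compression_ratio(bits):.0f}x smaller)")
    print("1 bit :", binarize(weights), f"({compression_ratio(1):.0f}x smaller)")
```

Under the 32-bit baseline assumption, compression_ratio(8) reproduces the 4x memory figure quoted above. In quantization-aware training, a function of this kind is typically applied to the weights in the forward pass while full-precision copies are kept for the gradient update.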
This project was built with
- Python v3.8.3
- PyTorch v1.5
- The environment used for developing this project is available at environment.yml.
Clone the repository into a local machine and enter the src directory using
git clone https://github.com/vineeths96/Compressed-Transformers
cd Compressed-Transformers/src
Create a new conda environment and install all the libraries by running the following command
conda env create -f environment.yml
The dataset used in this project (IWSLT English-German translation) will be automatically downloaded and set up in the src/data directory during execution.
The training_script.py requires arguments to be passed (check default values in code):
- --batch_size - set to a maximum value that won't raise CUDA out of memory error
- --language_direction - pick between E2G and G2E
- --binarize - binarize attention module linear layers during training
- --binarize_all_linear - binarize all linear layers during training
- --quantize - quantize attention module linear layers during training
- --quantize_bits - number of bits of quantization
- --quantize_all_linear - quantize all linear layers during training
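To make the shape of these arguments concrete, here is a minimal argparse sketch that accepts the same flags. It is not the parser from training_script.py: the str2bool helper, the build_parser name, and every default value below are placeholders assumed for illustration, so check the actual defaults in the code.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' strings, matching commands like `--binarize True`."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # All defaults are hypothetical placeholders, not the repository's values.
    parser = argparse.ArgumentParser(description="Train a (compressed) transformer")
    parser.add_argument("--batch_size", type=int, default=1500)
    parser.add_argument("--language_direction", choices=["E2G", "G2E"], default="E2G")
    parser.add_argument("--binarize", type=str2bool, default=False)
    parser.add_argument("--binarize_all_linear", type=str2bool, default=False)
    parser.add_argument("--quantize", type=str2bool, default=False)
    parser.add_argument("--quantize_bits", type=int, default=8)
    parser.add_argument("--quantize_all_linear", type=str2bool, default=False)
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv for demonstration.
    example = ["--batch_size", "1500", "--quantize", "True", "--quantize_bits", "4"]
    print(build_parser().parse_args(example))
```

Boolean flags are parsed from strings here because the training commands pass an explicit True value rather than using a bare switch.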
To train the transformer model without any compression,
python training_script.py --batch_size 1500
To train the transformer model with binarization of attention linear layers,
python training_script.py --batch_size 1500 --binarize True
To train the transformer model with binarization of all linear layers,
python training_script.py --batch_size 1500 --binarize_all_linear True
To train the transformer model with quantization of attention linear layers,
python training_script.py --batch_size 1500 --quantize True --quantize_bits 8
To train the transformer model with quantization of all linear layers,
python training_script.py --batch_size 1500 --quantize_all_linear True --quantize_bits 8
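The commands above differ only in a few flags, so a bit-width sweep can be scripted. The sketch below builds the same quantization commands for several bit widths; build_command and run_sweep are hypothetical helper names, and the script assumes it is run from the src directory. By default it only prints the commands (dry_run=True) and launches nothing.

```python
import subprocess
import sys


def build_command(bits, all_linear=False, batch_size=1500):
    """Assemble one training command using only the flags documented above."""
    flag = "--quantize_all_linear" if all_linear else "--quantize"
    return [
        sys.executable, "training_script.py",
        "--batch_size", str(batch_size),
        flag, "True",
        "--quantize_bits", str(bits),
    ]


def run_sweep(bit_widths=(8, 4, 2), all_linear=False, dry_run=True):
    """Print (and optionally run) one training job per bit width."""
    commands = [build_command(bits, all_linear) for bits in bit_widths]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # Runs the jobs one after another; stops at the first failure.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_sweep()                    # attention linear layers only
    run_sweep(all_linear=True)     # all linear layers
```

Pass dry_run=False to actually start the training runs.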
The transformer architecture implemented follows the seminal paper Attention Is All You Need by Vaswani et al. The architecture of the model is shown below.
We evaluate the baseline model and the proposed quantization methods on the IWSLT dataset, using the Bilingual Evaluation Understudy (BLEU) score as the metric. More detailed results and inferences are available in the report.
Model | BLEU Score |
---|---|
Baseline (Uncompressed) | 27.9 |
Binary Quantization (All Linear) | 13.2 |
Binary Quantization (Attention Linear) | 26.87 |
Quantized - 8 Bit (Attention Linear) | 29.83 |
Quantized - 4 Bit (Attention Linear) | 29.76 |
Quantized - 2 Bit (Attention Linear) | 28.72 |
Quantized - 1 Bit (Attention Linear) | 24.32 |
Quantized - 8 Bit (Attention + Embedding) | 21.26 |
Quantized - 8 Bit (All Linear) | 27.19 |
Quantized - 4 Bit (All Linear) | 27.72 |
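To read the table relative to the uncompressed baseline, the short script below computes the percentage change in BLEU for each row. The scores are copied from the table; the relative_change helper is just a name chosen for this example.

```python
BASELINE_BLEU = 27.9

# BLEU scores copied from the results table.
bleu_scores = {
    "Binary Quantization (All Linear)": 13.2,
    "Binary Quantization (Attention Linear)": 26.87,
    "Quantized - 8 Bit (Attention Linear)": 29.83,
    "Quantized - 4 Bit (Attention Linear)": 29.76,
    "Quantized - 2 Bit (Attention Linear)": 28.72,
    "Quantized - 1 Bit (Attention Linear)": 24.32,
    "Quantized - 8 Bit (Attention + Embedding)": 21.26,
    "Quantized - 8 Bit (All Linear)": 27.19,
    "Quantized - 4 Bit (All Linear)": 27.72,
}


def relative_change(bleu, baseline=BASELINE_BLEU):
    """Percentage change in BLEU with respect to the uncompressed baseline."""
    return 100.0 * (bleu - baseline) / baseline


if __name__ == "__main__":
    for name, bleu in bleu_scores.items():
        print(f"{name:<45} {bleu:6.2f} {relative_change(bleu):+7.2f}%")
```

Rows with a positive change are the configurations that score above the baseline, namely the 8-bit, 4-bit, and 2-bit quantization of the attention linear layers.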
The proposed method can also be used for post-training quantization with minimal performance loss (< 1%) on pretrained BERT models. (Results are not shown due to lack of space).
Distributed under the MIT License. See LICENSE for more information.
Vineeth S
Project Link: https://github.com/vineeths96/Compressed-Transformers