Contlo On-Campus Placement Assignment: Implementation of GPT-2 Architecture
GPT-2, short for "Generative Pre-trained Transformer 2," is a Transformer-based language model developed by OpenAI. This repository provides a simplified implementation of the GPT-2 architecture, along with architectural modifications and a large-scale training strategy. The Contlo placement assignment asked for the architecture to be developed together with these modifications. The code focuses on understanding the Transformer architecture, modifying its structure for improved performance, and implementing efficient training loops suitable for distributed training across multiple GPUs. The tasks are described as follows:
Implement the GPT2-small model (with 125 million parameters) using Python and PyTorch. Touch upon the key aspects of the model, such as the multi-head self-attention mechanism, feed-forward networks, and positional encoding.
Key points:
- Follow the original GPT-2 design of using both token and positional embeddings.
- Implement the transformer layers with multi-head self-attention and point-wise feed-forward network.
- Do not use pre-built transformer libraries.
To validate your implementation, load the original GPT-2 125M model checkpoints and run a sample prediction.
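The key points above can be sketched in plain PyTorch as follows. This is a minimal illustration under stated assumptions, not the repository's code: the class names `CausalSelfAttention`, `Block`, and `TinyGPT` are hypothetical, and any sizes passed in are up to the caller (GPT-2 small uses 12 layers, 12 heads, a 768-dimensional hidden state, and a 1024-token context).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each token sees only itself and earlier tokens."""

    def __init__(self, d_model: int, n_head: int, max_len: int):
        super().__init__()
        assert d_model % n_head == 0, "d_model must be divisible by n_head"
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused query/key/value projection
        self.proj = nn.Linear(d_model, d_model)     # output projection
        mask = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        self.register_buffer("mask", mask)          # lower-triangular causal mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class Block(nn.Module):
    """Pre-LayerNorm transformer block: attention, then point-wise feed-forward."""

    def __init__(self, d_model: int, n_head: int, max_len: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_head, max_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))


class TinyGPT(nn.Module):
    """Token + learned positional embeddings, a stack of blocks, and a tied output head."""

    def __init__(self, vocab_size: int, d_model: int, n_head: int, n_layer: int, max_len: int):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.wpe = nn.Embedding(max_len, d_model)     # positional embeddings
        self.blocks = nn.ModuleList([Block(d_model, n_head, max_len) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.wte.weight            # weight tying, as in GPT-2

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                # (B, T, vocab_size) logits
```

Loading the original checkpoint into such a model would additionally require mapping the checkpoint's parameter names and layouts onto these modules, which is not shown here.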
Alter the original GPT-2 model architecture to experiment with and assess potential improvements. Here's what you need to do:
- Rotary Positional Embedding: Replace the original positional embeddings in the GPT-2 model with Rotary embeddings. You may refer to Su et al., RoFormer.
- Group Query Attention: Equip your model with the Group Query Attention mechanism, following the insights from Ainslie et al., GQA: Training Generalized Multi-Query Transformer. Analyze how this mechanism changes the model's operation compared to the standard attention mechanism.
- Sliding Window Attention: Incorporate the Sliding Window Attention mechanism in your model and observe its effects on model performance. Refer to the work by Beltagy et al., Longformer, for a better understanding of its implementation and advantages.
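The three modifications can each be illustrated with a small helper. The sketch below is an assumption-laden illustration rather than the repository's implementation: the function names `apply_rope`, `grouped_query_attention`, and `sliding_window_causal_mask` are hypothetical, the rotary variant pairs adjacent feature dimensions, and the window mask is a causal (left-only) variant suited to a decoder, whereas Longformer's window is two-sided.

```python
import torch


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate adjacent feature pairs of x (batch, heads, seq, head_dim) by position-dependent angles."""
    T, D = x.shape[-2], x.shape[-1]
    assert D % 2 == 0, "head_dim must be even"
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2, device=x.device).float() / D))  # (D/2,)
    angles = torch.arange(T, device=x.device).float()[:, None] * inv_freq[None, :]   # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def sliding_window_causal_mask(T: int, window: int) -> torch.Tensor:
    """Boolean (T, T) mask: query i may attend key j when j <= i and i - j < window."""
    i = torch.arange(T)[:, None]
    j = torch.arange(T)[None, :]
    return (j <= i) & (i - j < window)


def grouped_query_attention(q, k, v, mask=None):
    """q: (B, n_q_heads, T, D); k, v: (B, n_kv_heads, T, D), n_q_heads divisible by n_kv_heads.

    Each group of query heads shares one key/value head, which shrinks the KV projections
    and cache compared with standard multi-head attention.
    """
    n_rep = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(n_rep, dim=1)  # expand shared KV heads to match query heads
    v = v.repeat_interleave(n_rep, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

In a model, `apply_rope` would be applied to the queries and keys (not the values) before the attention scores are computed, replacing the learned positional embedding; after rotation, the score between two positions depends only on their relative offset.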
Finally, create a training loop that meets the following requirements:
- Single GPU Training Loop: Your base implementation should be equipped to train your model on a single GPU setup.
- Distributed Data Parallel (DDP): Extend the single GPU training loop to support training across multiple GPUs using DDP. Refer to PyTorch's DDP tutorial for guidance.
- Fully Sharded Data Parallel (FSDP): Implement FSDP as a part of the training loop to shard the model parameters, gradients, and optimizer state. You can follow Gupta et al., 2020, Training GPT-3 Like Models on a Single Machine for a comprehensive understanding of it.
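A rough sketch of how the three requirements relate: the same training step can serve all setups, with only the model wrapping changing. The helper names `train_step` and `wrap_for_multi_gpu` are hypothetical, and the wrapper assumes the process group has already been initialised (for example by launching with `torchrun`, which sets the `LOCAL_RANK` environment variable); it is not the repository's actual loop.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, inputs, targets):
    """One next-token-prediction update; returns the scalar loss for logging."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    logits = model(inputs)  # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    loss.backward()         # under DDP/FSDP, gradient communication happens here
    optimizer.step()
    return loss.item()


def wrap_for_multi_gpu(model, local_rank: int, use_fsdp: bool = False):
    """Wrap a model for multi-GPU training.

    Assumes torch.distributed.init_process_group() was already called in this process.
    Create the optimizer AFTER wrapping so it sees the wrapped (possibly sharded) parameters.
    """
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    if use_fsdp:
        # FSDP shards parameters, gradients, and optimizer state across ranks.
        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
        return FSDP(model)
    # DDP keeps a full replica per GPU and all-reduces gradients.
    from torch.nn.parallel import DistributedDataParallel as DDP
    return DDP(model, device_ids=[local_rank])
```

On a single GPU, `train_step` is called directly on the unwrapped model. In the multi-GPU cases each process would also typically draw its own shard of the data, for instance through `torch.utils.data.distributed.DistributedSampler`.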
To ensure a consistent environment, it is recommended to use conda for managing dependencies. Follow these steps to set up the environment using the provided environment.yml file.
- Clone the repository:
    git clone https://github.com/omm-prakash/GPT-2.git
    cd GPT-2
- Create the conda environment:
    conda env create -f environment.yml
- Activate the environment:
    conda activate gpt2
- Create the directories and download the pretrained weights:
    mkdir data assets
    curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin && mv gpt2-pytorch_model.bin ./assets/
To generate text with the proposed GPT-2 architecture, run python main.py with the arguments below.
- --text: Input text, required.
- --nsamples: Number of samples, default value is 1.
- --unconditional: If true, unconditional generation. Default is false.
- --temperature: Temperature for sampling, default value is 0.7.
- --batch_size: Batch size for generation, default value is -1.
- --length: Length of generated text, default value is -1.
- --config: Path to config file, default value is 'config.yml'.
- --top_k: Value for top-k sampling, default value is 40.
- --load_pretrained: If true, load pretrained model. Default is false.
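To make the temperature and top-k arguments concrete, here is a minimal sketch of how one sampling step commonly combines them. The function name `sample_next_token` is hypothetical and the repository's own sampling code may differ; the defaults simply mirror the values listed above.

```python
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 0.7, top_k: int = 40) -> torch.Tensor:
    """Sample one token id per row from last-position logits of shape (batch, vocab)."""
    logits = logits / temperature  # temperature < 1 sharpens the distribution, > 1 flattens it
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, k, dim=-1).values[..., -1, None]  # k-th largest logit per row
        logits = logits.masked_fill(logits < kth, float("-inf"))  # drop everything below it
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1)
```

With `top_k=1` this reduces to greedy decoding, while larger values trade determinism for diversity.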
To train the model on a text corpus, use the arguments below.
- --config: Path to configuration file, default value is 'config.yml'.
- --load_pretrained: If true, load pretrained model. Default is false.
- --data_path: Path to data file, default value is 'data/data.txt'.
- --fsdp: If true, use FSDP (Fully Sharded Data Parallel). Default is false.
- --dpp: If true, use DDP (Distributed Data Parallel). Default is false.
- --seed: Random seed, default value is 1.
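For orientation, the training arguments could be declared with `argparse` roughly as follows. This is only a sketch: the function name `build_train_parser` is hypothetical, treating the boolean options as `store_true` switches is an assumption about how the script parses them, and the `--dpp` flag is spelled exactly as listed above.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented training arguments and defaults."""
    parser = argparse.ArgumentParser(description="Train the GPT-2 model (illustrative sketch)")
    parser.add_argument("--config", default="config.yml", help="path to configuration file")
    parser.add_argument("--load_pretrained", action="store_true", help="load pretrained model")
    parser.add_argument("--data_path", default="data/data.txt", help="path to data file")
    parser.add_argument("--fsdp", action="store_true", help="use Fully Sharded Data Parallel")
    parser.add_argument("--dpp", action="store_true", help="use Distributed Data Parallel")
    parser.add_argument("--seed", type=int, default=1, help="random seed")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_train_parser().parse_args(["--fsdp", "--seed", "7"])
print(args.fsdp, args.dpp, args.seed, args.data_path)
```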
Ensure that you have updated the necessary configurations in the config.yml file before starting the training process.
This project has not yet been licensed.
The gpt2 implementation is based on GPT-2.
Please contact me at [email protected] or [email protected] with any questions about the code.