Building GPT

This repo is based on build-nanogpt by Andrej Karpathy. Thanks!

Fig a. GPT architecture

Papers 📄

I am reading these papers:
Language Models are Unsupervised Multitask Learners
Language Models are Few-Shot Learners
Attention is All You Need
Gaussian Error Linear Units (GELUs)
Using the Output Embedding to Improve Language Models
☑️ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
☑️ FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
☑️ Online normalizer calculation for softmax
HellaSwag: Can a Machine Really Finish Your Sentence?

Goals 🎯

✅ Read the GPT-2 paper as a baseline for the model architecture.
✅ Inspect the source code of the GPT-2 model from OpenAI & HuggingFace.
✅ Prepare a notebook to experiment with the model and its outputs, as done by Andrej Karpathy.
✅ Implement the GPT-2 model from scratch with diagrams and explanations.
✅ Implement the transformer block of the model with attention & FFN.
✅ Read the GELU paper for the activation function used in GPT-2.
✅ Implement the FeedForwardBlock of the model with GELU activation.
✅ Implement the MultiHeadAttentionBlock of the model from scratch.
✅ Load the GPT-2 model checkpoints from HuggingFace using a custom GPT architecture class.
✅ Implement the inference script to generate text from the model.
✅ Implement the data loading pipeline for the model and play with the data.
✅ Work on initializing the random weights as described in the GPT-2 paper.
✅ Learn and understand the concept of weight tying in the Transformer model.
✅ Read about Automatic Mixed Precision for enabling mixed precision training and automatic type casting.
✅ Enable mixed precision training (TF32 & BF16) in the model training.
✅ Read the documentation of torch.compile from PyTorch.
✅ Use torch.compile and flash attention in the model training.
✅ Implement and understand global gradient clipping in the model training.
✅ Implement and understand the cosine learning rate scheduling in the model training.
✅ Implement and understand the weight decay using AdamW optimizer and fused optimizer.
✅ Implement and understand the gradient accumulation in the model training.
☑️ Read the documentation of DistributedDataParallel from PyTorch.
✅ Implement the distributed training using DDP in the model training.
✅ Read the FineWeb blog post for preparing the dataset at scale.
✅ Implement the script to download and preprocess the FineWeb dataset for training.
✅ Implement the code for the validation loop and sample generation from the model while training.
✅ Code the training script for GPT-2 model.
✅ Work on optimization and training the model on FineWeb dataset.

GitHub Repositories

🌐 nanoGPT - Implementation by Andrej Karpathy.
🌐 build-nanogpt - Implementation by Andrej Karpathy.
🌐 gpt-2 - TensorFlow implementation of GPT-2 by OpenAI.
🌐 modeling-gpt2 - PyTorch implementation of GPT-2 by HuggingFace.
🌐 Meta-llama - Implementation of Llama by Thinam Tamang.
🌐 flash-attention - Implementation of Flash Attention by Tri Dao.
🌐 hellaswag - Implementation of HellaSwag by Rowan Zellers.

Important Notes 🍀

💡 If we have a pretrained model, we can plot the weights of the positional embeddings and of the token embeddings. If the plotted weights show a lot of fluctuation and noise, the model is likely undertrained or has not converged yet. A smooth plot suggests that the model is well trained.

💡 We can estimate a reasonable starting loss for a randomly initialized model from the vocabulary size and the loss function. GPT-2 has a vocabulary of 50257 tokens, so at initialization every token should get a probability of roughly 1/50257. Since the cross entropy loss is the negative log likelihood of the true class, the expected loss is -log(1/50257) ≈ 10.82. A freshly initialized model should therefore start with a loss of around 10.82.
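The arithmetic in this note can be checked with a few lines of plain Python. This is a minimal sketch, not code from the repo: the helper names `uniform_cross_entropy` and `cross_entropy` are made up for illustration, and all-zero logits stand in for a model that assigns every token the same probability.

```python
import math

VOCAB_SIZE = 50257  # GPT-2 vocabulary size, as stated in the note


def uniform_cross_entropy(vocab_size: int) -> float:
    """Loss when every token gets probability 1 / vocab_size."""
    return -math.log(1.0 / vocab_size)


def cross_entropy(logits, target: int) -> float:
    """Cross entropy of one prediction from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


expected = uniform_cross_entropy(VOCAB_SIZE)

# Equal logits mean a uniform distribution over the vocabulary,
# so the loss matches the closed-form value for any target token.
uniform_loss = cross_entropy([0.0] * VOCAB_SIZE, target=123)

print(f"{expected:.2f}")      # 10.82
print(f"{uniform_loss:.2f}")  # 10.82
```

If the first training step reports a loss far above this value, the initialization is probably putting too much probability mass on a few tokens.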

💡 The input embedding at the bottom of the Transformer and the output projection (linear) layer at the top hold 2D weight tensors of exactly the same shape, with elements pointing to the same data pointer, i.e. they are one shared tensor. This is known as the weight tying scheme. The intuition is that if two tokens are semantically very similar, we would expect them to get similar probabilities at the output of the Transformer. This implementation can be found in the original GPT-2 source code from OpenAI.
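Weight tying can be sketched in PyTorch as follows. This is an illustrative toy, not the model class from this repo: `TinyTiedLM` is a hypothetical name, the sizes are deliberately small, and the transformer blocks between the embedding and the output head are omitted.

```python
import torch
import torch.nn as nn


class TinyTiedLM(nn.Module):
    """Toy language model showing only the tied input/output weights."""

    def __init__(self, vocab_size: int = 100, n_embd: int = 16):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)               # weight: (vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # weight: (vocab_size, n_embd)
        # Weight tying: both modules now reference the same Parameter.
        self.lm_head.weight = self.wte.weight

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        x = self.wte(idx)        # (B, T, n_embd); transformer blocks omitted here
        return self.lm_head(x)   # (B, T, vocab_size)


model = TinyTiedLM(vocab_size=100, n_embd=16)

same_shape = model.wte.weight.shape == model.lm_head.weight.shape
same_storage = model.wte.weight.data_ptr() == model.lm_head.weight.data_ptr()

# parameters() yields a shared Parameter only once, so the tied matrix
# is counted a single time: 100 * 16 = 1600.
n_params = sum(p.numel() for p in model.parameters())

logits = model(torch.zeros(2, 5, dtype=torch.long))
print(same_shape, same_storage, n_params, tuple(logits.shape))
```

Because the two layers share a single tensor, gradients from both the embedding lookup and the output projection accumulate into the same parameter during training.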