A small (enough) transformer model trained on the Tiny Shakespeare dataset.
Goals:
- Understand the minimal configuration that makes an autoregressive LM "work".
Implementations:
- A simple ASCII tokenizer using Python's `ord()` and `chr()` (see the sketch below).
- Top-k sampling
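A minimal sketch of such a tokenizer (illustrative; the function names are hypothetical, not necessarily the ones used in this repo):

```python
# ASCII tokenizer sketch: one character per token, token id = ASCII code point.
def encode(text: str) -> list[int]:
    return [ord(ch) for ch in text]

def decode(token_ids: list[int]) -> str:
    return "".join(chr(i) for i in token_ids)

# Round-trip example.
ids = encode("To be, or not to be")
assert decode(ids) == "To be, or not to be"
```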
Notes:
- Does a decreasing and converging training loss mean the implementation is correct or the model is successful?
- No! There was a bug in the attention mask calculation, yet the LM could still overfit the training data with that bug; the text the model generated was complete nonsense.
- Is a position embedding really necessary?
- No! It is not mandatory for an autoregressive model to work, as long as a causal mask is used; the causal mask itself provides an implicit positional signal, since each position attends over a prefix of a different length.
- Most common implementation mistakes
- Must use `-inf` instead of `0` to mask logits, e.g. in the attention mask and in top-k sampling (see the sketches after this list).
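To make the attention-mask point concrete, here is a minimal causal self-attention sketch, assuming a PyTorch implementation (illustrative only, not this repo's actual code). Future positions are masked with `-inf` before the softmax so they receive exactly zero attention weight; masking with `0` would still leave them with nonzero probability.

```python
# Causal self-attention sketch (assumes PyTorch; illustrative, not the repo's code).
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim) tensors for a single head."""
    seq_len, head_dim = q.size(1), q.size(-1)
    # Scaled dot-product scores: (batch, seq_len, seq_len).
    scores = q @ k.transpose(-2, -1) / (head_dim ** 0.5)
    # True above the diagonal = positions in the future of the query token.
    future = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    # Mask with -inf, NOT 0: a score of 0 still gets nonzero probability after softmax.
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # (batch, seq_len, head_dim)
```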
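The same rule applies when restricting sampling to the top k logits; a minimal sketch under the same assumptions (PyTorch; the helper name is hypothetical):

```python
# Top-k sampling sketch (assumes PyTorch; illustrative, not the repo's code).
import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """logits: (batch, vocab_size) for the last position; returns (batch, 1) token ids."""
    topk_vals, _ = torch.topk(logits, k, dim=-1)
    cutoff = topk_vals[..., -1, None]  # k-th largest logit per row
    # Logits below the cutoff are set to -inf so softmax assigns them exactly
    # zero probability; setting them to 0 would not remove them from sampling.
    masked = logits.masked_fill(logits < cutoff, float("-inf"))
    probs = F.softmax(masked, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```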
References:
- The code was written from scratch, but several bugs were found by comparing it with Andrej Karpathy's nanoGPT repo.