A small (enough) transformer model trained on the Tiny Shakespeare dataset.
Goals:
- Understand the minimal configuration that makes an autoregressive LM "work".
Implementations:
- A simple ASCII tokenizer using Python's `ord()` and `chr()` (see the sketch below).
- Top-k sampling
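A minimal sketch of such a tokenizer (illustrative; the function names are hypothetical, not necessarily the ones used in this repo):

```python
# ASCII tokenizer sketch: one character per token, token id = ASCII code point.
def encode(text: str) -> list[int]:
    return [ord(ch) for ch in text]

def decode(token_ids: list[int]) -> str:
    return "".join(chr(i) for i in token_ids)

# Round-trip example.
ids = encode("To be, or not to be")
assert decode(ids) == "To be, or not to be"
```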
Notes:
- Does a decreasing and converging training loss mean the implementation is correct or the model is successful?
- No! There was a bug in the attention mask calculation, yet the LM could still overfit the training data with that bug; the text the model generated was complete nonsense.
- Is a position embedding really necessary?
- No! It is not mandatory for an autoregressive model to work, as long as a causal mask is used; the causal mask itself provides an implicit positional signal, since each position attends over a prefix of a different length.
- Most common implementation mistakes
- Must use `-inf` instead of `0` to mask logits, e.g. in the attention mask and in top-k sampling (see the sketches after this list).
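To make the attention-mask point concrete, here is a minimal causal self-attention sketch, assuming a PyTorch implementation (illustrative only, not this repo's actual code). Future positions are masked with `-inf` before the softmax so they receive exactly zero attention weight; masking with `0` would still leave them with nonzero probability.

```python
# Causal self-attention sketch (assumes PyTorch; illustrative, not the repo's code).
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim) tensors for a single head."""
    seq_len, head_dim = q.size(1), q.size(-1)
    # Scaled dot-product scores: (batch, seq_len, seq_len).
    scores = q @ k.transpose(-2, -1) / (head_dim ** 0.5)
    # True above the diagonal = positions in the future of the query token.
    future = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    # Mask with -inf, NOT 0: a score of 0 still gets nonzero probability after softmax.
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # (batch, seq_len, head_dim)
```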
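The same rule applies when restricting sampling to the top k logits; a minimal sketch under the same assumptions (PyTorch; the helper name is hypothetical):

```python
# Top-k sampling sketch (assumes PyTorch; illustrative, not the repo's code).
import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """logits: (batch, vocab_size) for the last position; returns (batch, 1) token ids."""
    topk_vals, _ = torch.topk(logits, k, dim=-1)
    cutoff = topk_vals[..., -1, None]  # k-th largest logit per row
    # Logits below the cutoff are set to -inf so softmax assigns them exactly
    # zero probability; setting them to 0 would not remove them from sampling.
    masked = logits.masked_fill(logits < cutoff, float("-inf"))
    probs = F.softmax(masked, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```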
References:
- The code was written from scratch, but several bugs were found by comparing it with Andrej Karpathy's nanoGPT repo.