- Prompt: the initial text or instruction given to the model.
- Prompt Phase (Prefill Phase): the phase that processes the entire prompt and generates the first token.
- Generation Phase (Decoding Phase): the phase that generates the next token based on the prompt and the previously generated tokens, in a token-by-token manner (see the decoding sketch after this list).
- Autoregressive: predicting one token at a time, conditioned on the previously generated tokens.
- KV (Key-Value) Cache: caching the attention keys and values of previous tokens during the Generation Phase, eliminating their recomputation at every step.
- Continuous Batching: as opposed to static batching (which groups requests together and starts processing only when all requests within the batch are ready), admits new requests into the running batch as soon as earlier ones finish, improving GPU and memory utilization.
- Offloading: transferring data between GPU memory and main memory or NVMe storage, as GPU memory is limited.
- Post-training quantization: quantizing the weights and activations of the model after the model has been trained (see the quantization sketch after this list).
- Quantization-Aware Training: incorporating quantization considerations during training.
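
The two phases and the KV cache fit together in a short loop. Below is a minimal sketch of greedy autoregressive generation, assuming a hypothetical `model.forward(token_ids, kv_cache)` interface (illustrative only; real frameworks expose equivalents, e.g. `past_key_values` in Hugging Face Transformers):

```python
def generate(model, prompt_ids, max_new_tokens):
    # Prompt (Prefill) Phase: process the whole prompt in one pass,
    # filling the KV cache and producing the first new token.
    logits, kv_cache = model.forward(prompt_ids, kv_cache=None)
    next_token = logits.argmax()  # greedy decoding for simplicity
    generated = [next_token]

    # Generation (Decoding) Phase: autoregressive, token by token.
    # Thanks to the KV cache, each step feeds only the newest token;
    # keys and values of earlier tokens are reused, not recomputed.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = logits.argmax()
        generated.append(next_token)
    return generated
```

And a minimal sketch of symmetric per-tensor int8 post-training quantization (a toy illustration of the idea, not the algorithm of any particular library):

```python
import numpy as np

def quantize_int8(weights):
    # Map the largest weight magnitude to the int8 range [-127, 127].
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small reconstruction error
```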
Name | Hardware | Org |
---|---|---|
Transformers | CPU / NVIDIA GPU / TPU / AMD GPU | Hugging Face |
Text Generation Inference | CPU / NVIDIA GPU / AMD GPU | Hugging Face |
gpt-fast | CPU / NVIDIA GPU / AMD GPU | PyTorch |
TensorRT-LLM | NVIDIA GPU | NVIDIA |
vLLM | NVIDIA GPU | UC Berkeley |
llama.cpp / ggml | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | ggml |
ctransformers | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | Ravindra Marella |
DeepSpeed | CPU / NVIDIA GPU | Microsoft |
FastChat | CPU / NVIDIA GPU / Apple Silicon | lmsys.org |
MLC-LLM | CPU / NVIDIA GPU | MLC |
LightLLM | CPU / NVIDIA GPU | SenseTime |
LMDeploy | CPU / NVIDIA GPU | Shanghai AI Lab & SenseTime |
OpenLLM | CPU / NVIDIA GPU / AMD GPU | BentoML |
OpenPPL.nn / OpenPPL.nn.llm | CPU / NVIDIA GPU | OpenMMLab & SenseTime |
ScaleLLM | NVIDIA GPU | Vectorch |
RayLLM | CPU / NVIDIA GPU / AMD GPU | Anyscale |
Xorbits Inference | CPU / NVIDIA GPU / AMD GPU | Xorbits |
Name | Paper Title | Paper Link | Artifact | Keywords | Recommend |
---|---|---|---|---|---|
LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | Pre-training | ⭐️⭐️⭐️⭐️⭐️ |
Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
Multi-Query | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Architecture | ⭐️⭐️⭐️ |
Grouped-Query | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Architecture | ⭐️⭐️⭐️ |
RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Position Encoding | ⭐️⭐️⭐️⭐️ |
Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Parallelism | ⭐️⭐️⭐️⭐️⭐️ |
Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Parallelism | ⭐️⭐️⭐️⭐️ |
FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Efficient Attention / GPU | ⭐️⭐️⭐️⭐️⭐️ |
Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | Efficient Attention / Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Sampling | ⭐️⭐️⭐️⭐️ |
LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Quantization | ⭐️⭐️⭐️⭐️ |
Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Parallelism | ⭐️⭐️⭐️ |
GPipe | GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism | arXiv 19 | | Parallelism | ⭐️⭐️⭐️⭐️ |
Beam Search | Beam Search Strategies for Neural Machine Translation | arXiv 17 | | Sampling | ⭐️⭐️⭐️ |