Llama 3.3 + Dynamic 4bit Quants

@danielhanchen danielhanchen released this 04 Dec 13:59
· 457 commits to main since this release
9dc399a

We provide dynamic 4bit quants, which use a bit more memory but vastly improve accuracy for finetuning and inference. Unsloth will now default to these versions! See https://unsloth.ai/blog/dynamic-4bit for more details.
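As a minimal sketch of loading a pre-quantized 4bit checkpoint with Unsloth: `FastLanguageModel.from_pretrained` is Unsloth's loader, but the repository name and `max_seq_length` below are assumptions for illustration only. Check https://huggingface.co/unsloth for the checkpoints that actually exist.

```python
# Keyword arguments passed to Unsloth's loader.
LOAD_KWARGS = {
    # Assumed repo name, used for illustration only; pick a real one from
    # https://huggingface.co/unsloth before running.
    "model_name": "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    "max_seq_length": 2048,   # assumed value; raise it for long-context runs
    "load_in_4bit": True,     # load the 4bit quantized weights
}

def load_model(**overrides):
    """Load a model and tokenizer with Unsloth (needs a GPU and `pip install unsloth`)."""
    # Imported lazily so this file can be read and imported without a GPU.
    from unsloth import FastLanguageModel
    kwargs = {**LOAD_KWARGS, **overrides}
    return FastLanguageModel.from_pretrained(**kwargs)

# Usage (on a machine with a suitable GPU):
# model, tokenizer = load_model()
```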

Llama 3.3 is out now! Read our blog: https://unsloth.ai/blog/llama3-3

  • You can now fine-tune Llama 3.3 (70B) with context lengths of up to 90,000 tokens with Unsloth on an 80GB GPU, which is 13x longer than the 6,900 tokens Hugging Face + FA2 supports.
  • For Llama 3.1 (8B), Unsloth can now do a whopping 342,000 context length, which exceeds the 128K context length Llama 3.1 natively supports. HF + FA2 can only do 28,000 on an 80GB GPU, so Unsloth supports 12x longer context lengths.
  • 70B models can now fit on 41GB of VRAM - nearly 40GB!
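The multipliers quoted above can be checked with a few lines of arithmetic. This sketch uses only the figures from the bullets; the dictionary layout and function name are ours.

```python
# Context lengths (in tokens) quoted in the release notes for an 80GB GPU.
claims = {
    "Llama 3.3 (70B)": {"unsloth": 90_000, "hf_fa2": 6_900},
    "Llama 3.1 (8B)": {"unsloth": 342_000, "hf_fa2": 28_000},
}

def context_ratio(unsloth_tokens: int, baseline_tokens: int) -> float:
    """How many times longer the Unsloth context is than the baseline."""
    return unsloth_tokens / baseline_tokens

ratios = {
    name: context_ratio(c["unsloth"], c["hf_fa2"]) for name, c in claims.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x longer context")
```

Rounded to whole numbers, the ratios match the 13x and 12x figures in the bullets.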

All notebooks now use these dynamic quants:

Experiments

[Image: Train]
Quantizing all layers of Qwen2-VL-2B-Instruct down to 4 bits breaks the model entirely.

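| Qwen2-VL-2B-Instruct    | Description                                                        | Size   | Result    |
|-------------------------|--------------------------------------------------------------------|--------|-----------|
| 16bit                   | The image shows a train traveling on tracks.                       | 4.11GB | Correct   |
| Default 4bit all layers | The image depicts a vibrant and colorful scene of a coastal area.  | 1.36GB | Incorrect |
| Unsloth quant           | The image shows a train traveling on tracks.                       | 1.81GB | Correct   |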
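To put the "a bit more memory" trade-off in numbers, here is a small sketch that compares the three sizes from the table above. Only the sizes come from the release notes; the variable names are ours.

```python
# Model sizes in GB, taken from the table above.
sizes_gb = {"16bit": 4.11, "default_4bit": 1.36, "unsloth_quant": 1.81}

def fraction_of_16bit(name: str) -> float:
    """Size of a variant relative to the 16bit model."""
    return sizes_gb[name] / sizes_gb["16bit"]

# Extra memory the Unsloth quant costs over the default all-layer 4bit quant.
extra_gb = sizes_gb["unsloth_quant"] - sizes_gb["default_4bit"]

print(f"default 4bit:  {fraction_of_16bit('default_4bit'):.0%} of 16bit")
print(f"Unsloth quant: {fraction_of_16bit('unsloth_quant'):.0%} of 16bit")
print(f"extra memory:  {extra_gb:.2f} GB")
```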

Merging to 16bit now works as expected.

Fixed a major bug which caused merges to not function correctly for vision models.
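A minimal sketch of a 16bit merge, assuming `model` and `tokenizer` were loaded and finetuned with Unsloth. `save_pretrained_merged` with `save_method="merged_16bit"` is Unsloth's merge call; the wrapper function and the output directory name are ours.

```python
def save_merged_16bit(model, tokenizer, out_dir="merged_model"):
    """Merge LoRA adapters into the base weights and save them in 16bit.

    `model` is expected to be an Unsloth model exposing
    `save_pretrained_merged`; `out_dir` is an illustrative default.
    """
    model.save_pretrained_merged(out_dir, tokenizer, save_method="merged_16bit")
    return out_dir

# Usage, after finetuning with Unsloth:
# save_merged_16bit(model, tokenizer, "llama-merged-16bit")
```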

Llama.cpp GGUF saving now uses cmake.

All saving modules are also updated inside of Unsloth!
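A minimal sketch of a GGUF export, again assuming an Unsloth-loaded `model` and `tokenizer`. `save_pretrained_gguf` is Unsloth's export call; the wrapper, the directory name, and the `q4_k_m` default are illustrative choices.

```python
def save_gguf(model, tokenizer, out_dir="gguf_model", quantization_method="q4_k_m"):
    """Export the model to GGUF through Unsloth's llama.cpp integration.

    `out_dir` and the default quantization method are illustrative;
    the first call may build llama.cpp, which takes a while.
    """
    model.save_pretrained_gguf(
        out_dir, tokenizer, quantization_method=quantization_method
    )
    return out_dir

# Usage, after finetuning with Unsloth:
# save_gguf(model, tokenizer, "llama-gguf")
```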

Apple Cut Cross Entropy

We worked with Apple to add Cut Cross Entropy into Unsloth, which reduces VRAM use and increases context length further.
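The idea behind the memory saving can be sketched in plain Python. This is not the Unsloth or Apple implementation, which is a fused GPU kernel; it is a toy illustration, with names of our own choosing, of computing the cross-entropy loss over the vocabulary in chunks so that the full logit vector is never held in memory at once.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_loss(embedding, classifier, target):
    """Cross-entropy that materializes every vocabulary logit at once."""
    logits = [dot(embedding, row) for row in classifier]
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

def chunked_loss(embedding, classifier, target, chunk_size=4):
    """Same loss, but only `chunk_size` logits exist at any one time."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(classifier), chunk_size):
        chunk = [dot(embedding, row) for row in classifier[start:start + chunk_size]]
        new_max = max(running_max, max(chunk))
        # Rescale the old partial sum to the new maximum, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(z - new_max) for z in chunk)
        running_max = new_max
    log_sum_exp = running_max + math.log(running_sum)
    # The target logit needs a single dot product, not the whole logit vector.
    return log_sum_exp - dot(embedding, classifier[target])

random.seed(0)
hidden, vocab = 8, 50
embedding = [random.gauss(0, 1) for _ in range(hidden)]
classifier = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(vocab)]
target = 7

full = naive_loss(embedding, classifier, target)
chunked = chunked_loss(embedding, classifier, target)
print(math.isclose(full, chunked, rel_tol=1e-9, abs_tol=1e-9))
```

Both functions return the same loss up to floating-point rounding. With large vocabularies and long sequences, the logits dominate memory, which is why avoiding their full materialization leaves room for longer contexts.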

QwQ 4bit quants and GGUFs

Try out an O1-style test-time compute LLM! See https://huggingface.co/unsloth

What's Changed

Full Changelog: November-2024...December-2024