llamafile v0.4
llamafile lets you distribute and run LLMs with a single file
This release features Mixtral support. Support has also been added for Qwen
models, along with new flags such as --chatml and --samplers.
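
As a rough sketch, assuming you have a command-line style Mixtral llamafile on hand (the filename and the sampler order string below are placeholders), the new flags are passed directly on the invocation:

```sh
# Placeholder filename; substitute whichever .llamafile you downloaded.
# --chatml selects the ChatML prompt format; --samplers sets the sampler
# order (the semicolon-separated sequence here is only illustrative).
./mixtral-8x7b-instruct.Q5_K_M.llamafile \
  --chatml \
  --samplers "top_k;top_p;temp" \
  -p "Write a haiku about single-file LLMs."
```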
- 820d42d Synchronize with llama.cpp upstream
GPU now works out of the box on Windows. You still need to pass the
-ngl 35 flag, but you're no longer required to install CUDA/MSVC.
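
For instance, a GPU run on Windows might now look like this (the filename is a placeholder; the .exe suffix is needed so Windows will execute the file):

```sh
# Placeholder filename; rename your downloaded llamafile so it ends in
# .exe before running it on Windows. -ngl 35 offloads layers to the GPU,
# and installing CUDA or MSVC is no longer required.
./mistral-7b-instruct.llamafile.exe -ngl 35 -p "Why is the sky blue?"
```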
- a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
- 9d85a72 Improve GEMM performance by nearly 2x (#93)
- 72e1c72 Support CUDA without cuBLAS (#82)
- 2849b08 Make it possible for CUDA to extract prebuilt DSOs
Additional fixes and improvements:
- c236a71 Improve markdown and syntax highlighting in server (#88)
- 69ec1e4 Update the llamafile manual
- 782c81c Add SD ops, kernels
- 93178c9 Polyfill $HOME on some Windows systems
- fcc727a Write log to /dev/null when main.log fails to open
- 77cecbe Fix handling of characters that span multiple tokens when streaming
Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here: