llamafile v0.4
llamafile lets you distribute and run LLMs with a single file
This release features Mixtral support. Support has also been added for Qwen
models, along with new flags such as --chatml and --samplers.
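
As a rough sketch, assuming you have a command-line style Mixtral llamafile on hand (the filename and the sampler order string below are placeholders), the new flags are passed directly on the invocation:

```sh
# Placeholder filename; substitute whichever .llamafile you downloaded.
# --chatml selects the ChatML prompt format; --samplers sets the sampler
# order (the semicolon-separated sequence here is only illustrative).
./mixtral-8x7b-instruct.Q5_K_M.llamafile \
  --chatml \
  --samplers "top_k;top_p;temp" \
  -p "Write a haiku about single-file LLMs."
```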
- 820d42d Synchronize with llama.cpp upstream
GPU now works out of the box on Windows. You still need to pass the
-ngl 35 flag, but you're no longer required to install CUDA/MSVC.
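
For instance, a GPU run on Windows might now look like this (the filename is a placeholder; the .exe suffix is needed so Windows will execute the file):

```sh
# Placeholder filename; rename your downloaded llamafile so it ends in
# .exe before running it on Windows. -ngl 35 offloads layers to the GPU,
# and installing CUDA or MSVC is no longer required.
./mistral-7b-instruct.llamafile.exe -ngl 35 -p "Why is the sky blue?"
```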
- a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
- 9d85a72 Improve GEMM performance by nearly 2x (#93)
- 72e1c72 Support CUDA without cuBLAS (#82)
- 2849b08 Make it possible for CUDA to extract prebuilt DSOs
Additional fixes and improvements:
- c236a71 Improve markdown and syntax highlighting in server (#88)
- 69ec1e4 Update the llamafile manual
- 782c81c Add SD ops, kernels
- 93178c9 Polyfill $HOME on some Windows systems
- fcc727a Write log to /dev/null when main.log fails to open
- 77cecbe Fix handling of characters that span multiple tokens when streaming
Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here: