
llamafile v0.4

@jart released this 14 Dec 09:23
188f7fc

llamafile lets you distribute and run LLMs with a single file

[line drawing of llama animal head in front of slightly open manila folder filled with files]

This release features Mixtral support. Support has also been added for Qwen
models, and new flags such as --chatml and --samplers are available (see the
usage sketch after the commit below).

  • 820d42d Synchronize with llama.cpp upstream
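
A minimal sketch of how the new flags might be used, assuming a hypothetical
Qwen llamafile named qwen-7b-chat.llamafile; the --samplers argument shown is
only illustrative, so check --help for the exact sampler names and separator:

    # enable ChatML prompting and an explicit sampler order (illustrative values)
    ./qwen-7b-chat.llamafile --chatml --samplers "top_k;top_p;temp" -p "Write a haiku about llamas."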

GPU support now works out of the box on Windows. You still need to pass the
-ngl 35 flag, but you're no longer required to install CUDA or MSVC (see the
example after the list below).

  • a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
  • 9d85a72 Improve GEMM performance by nearly 2x (#93)
  • 72e1c72 Support CUDA without cuBLAS (#82)
  • 2849b08 Make it possible for CUDA to extract prebuilt DSOs
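
A minimal sketch of GPU use on Windows, assuming a hypothetical model
llamafile that has been renamed to end in .exe so Windows will execute it;
only the -ngl 35 flag comes from this release's notes:

    # offload 35 layers to the GPU; no CUDA or MSVC toolchain needed
    .\mistral-7b-instruct.exe -ngl 35 -p "Why is the sky blue?"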

Additional fixes and improvements:

  • c236a71 Improve markdown and syntax highlighting in server (#88)
  • 69ec1e4 Update the llamafile manual
  • 782c81c Add SD ops, kernels
  • 93178c9 Polyfill $HOME on some Windows systems
  • fcc727a Write log to /dev/null when main.log fails to open
  • 77cecbe Fix handling of characters that span multiple tokens when streaming

Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can re-download them here: