Update to use llama.cpp/master-aacdbd4 #8

Open
wants to merge 355 commits into base: v2
Conversation

alexrozanski
Owner

Updates the implementation to wrap master-aacdbd4.

Status: still some work to do.

  • Need to ensure the Metal path is compiled and used.
  • Need to handle old models gracefully (currently we just run to the end of the file for old models); one possible up-front check is sketched below.
  • Need to decide how to handle new params added since the last merge from llama.cpp.
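
As a rough illustration of the "old models" item above, one option is to sniff the file magic before handing the path to llama.cpp and fail with a clear error instead of reading to EOF. This is a C++ sketch only; the magic constants are taken from llama.cpp's historical `ggml`/`ggjt` formats but should be treated as assumptions, and the error handling is not the actual llama.swift code.

```cpp
// Sketch only: reject unsupported model files up front instead of reading to EOF.
// Magic values are assumptions based on llama.cpp's historical file formats.
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

enum class ModelFormat { Unversioned, Versioned, Unknown };

ModelFormat sniff_model_format(const std::string &path) {
    std::FILE *f = std::fopen(path.c_str(), "rb");
    if (!f) throw std::runtime_error("cannot open " + path);

    uint32_t magic = 0;
    const bool ok = std::fread(&magic, sizeof(magic), 1, f) == 1;
    std::fclose(f);
    if (!ok) return ModelFormat::Unknown;

    // 'ggml' = old unversioned format, 'ggjt' = newer versioned format (assumed values).
    if (magic == 0x67676d6c) return ModelFormat::Unversioned;
    if (magic == 0x67676a74) return ModelFormat::Versioned;
    return ModelFormat::Unknown;
}
```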

xloem and others added 30 commits May 1, 2023 15:58
* cuBLAS: refactor, convert fp16 to fp32 on device

* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16

* fix build

* cuBLAS: update block_q5_1
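
To illustrate the "multiple streams" part of the cuBLAS commit above, here is a minimal host-side sketch of round-robining work across CUDA streams so uploads and GEMMs can overlap. It is not the actual llama.cpp code; the shapes, stream count, and buffer handling are placeholders.

```cpp
// Sketch: overlap host-to-device copies and GEMMs by round-robining CUDA streams.
// For real overlap the host buffers should be pinned (e.g. via cudaHostAlloc).
#include <cublas_v2.h>
#include <cuda_runtime.h>

void multi_stream_gemms(cublasHandle_t handle,
                        const float *hA, const float *hB, float *dA, float *dB, float *dC,
                        int n, int batches) {
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&streams[i]);

    const float alpha = 1.0f, beta = 0.0f;
    const size_t mat = (size_t) n * n;

    for (int b = 0; b < batches; ++b) {
        cudaStream_t s = streams[b % kStreams];
        // Copies and the GEMM queued on the same stream stay ordered with respect
        // to each other, but can overlap with work issued on the other streams.
        cudaMemcpyAsync(dA + b * mat, hA + b * mat, mat * sizeof(float), cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(dB + b * mat, hB + b * mat, mat * sizeof(float), cudaMemcpyHostToDevice, s);
        cublasSetStream(handle, s);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA + b * mat, n, dB + b * mat, n, &beta, dC + b * mat, n);
    }
    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```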
…1232)

* Add git-based build information for better issue tracking

* macOS fix

* "build (hash)" and "CMAKE_SOURCE_DIR" changes

* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages

* Fix conditional dependency on missing target

* Broke out build-info.cmake, added a find_package fallback, added build info to all examples, and added dependencies to the Makefile

* 4 space indenting for cmake, attempt to clean up my mess in Makefile

* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it
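
A hedged sketch of how the generated header can be consumed at runtime; the macro names below match what llama.cpp generates around this point in history, but treat them as assumptions. Because the header is only rewritten when the short hash actually changes, incremental builds don't recompile everything that includes it.

```cpp
// Sketch: report the git-derived build information at startup.
// BUILD_NUMBER / BUILD_COMMIT are assumed to be defined by the generated build-info.h.
#include <cstdio>
#include "build-info.h"

void print_build_info() {
    std::fprintf(stderr, "build = %d (%s)\n", BUILD_NUMBER, BUILD_COMMIT);
}
```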
* ggml: add names to tensors

* minor improvements to dot file formatting
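
For the "names to tensors" commit above, a minimal sketch of the API as exposed by ggml; the graph and tensor shapes are arbitrary.

```cpp
// Sketch: attach human-readable names to ggml tensors so they show up in
// debug output and exported dot graphs.
#include "ggml.h"

void build_named_graph() {
    struct ggml_init_params params = { /* mem_size   */ 16 * 1024 * 1024,
                                       /* mem_buffer */ nullptr,
                                       /* no_alloc   */ false };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor *b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    ggml_set_name(a, "weights");
    ggml_set_name(b, "activations");

    struct ggml_tensor *c = ggml_mul_mat(ctx, a, b);
    ggml_set_name(c, "logits");

    ggml_free(ctx);
}
```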
…ggerganov#1284)

* Fix ppc64le build issue

* Added support to detect ppc64* processors
* make git build info work with submodules

---------

Co-authored-by: Green Sky <[email protected]>
* llama : only copy used KV cache in get / set state

* switch to ggml for copying k, v

* avoid designated initializers
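
A hedged sketch of the idea behind the KV-cache commit above: when serializing state, copy only the cells that are actually in use rather than the whole allocated buffer. The struct and field names here are hypothetical, not llama.cpp's; the real code does the copy through ggml (per the messages above), which also handles layouts where the used region is not one contiguous block.

```cpp
// Sketch only: kv_cache_view and its fields are hypothetical.
#include <cstdint>
#include <cstring>
#include <vector>

struct kv_cache_view {
    const uint8_t *k;        // contiguous K data
    const uint8_t *v;        // contiguous V data
    size_t bytes_per_cell_k; // bytes per cached token in K
    size_t bytes_per_cell_v; // bytes per cached token in V
    size_t n_cells;          // capacity (n_ctx)
    size_t n_used;           // tokens actually cached so far
};

// Serialize only the used prefix of the cache instead of all n_cells.
std::vector<uint8_t> copy_used_kv(const kv_cache_view &kv) {
    const size_t k_bytes = kv.n_used * kv.bytes_per_cell_k;
    const size_t v_bytes = kv.n_used * kv.bytes_per_cell_v;
    std::vector<uint8_t> out(k_bytes + v_bytes);
    std::memcpy(out.data(),           kv.k, k_bytes);
    std::memcpy(out.data() + k_bytes, kv.v, v_bytes);
    return out;
}
```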
* fix dan.txt

* miku prompt improvements

* use common characters
…rganov#1203)

* python script to verify the checksum of the llama models

Added Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability.

* Update README.md

update the README for improved readability and to explain the usage of the Python checksum verification script

* update the verification script

I've extended the script based on suggestions by @prusnak.

The script now checks the available RAM; if there is enough to hash the file at once, it will do so. If not, the file is read in chunks.

* minor improvement

small change so that the available RAM is checked and not the total RAM

* remove the part of the code that reads the file at once if enough RAM is available

Based on suggestions from @prusnak, I removed the part of the code that checks whether the user has enough RAM to read the entire model at once. The file is now always read in chunks.

* Update verify-checksum-models.py

quick fix to pass the git check
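
The verification script itself is Python, but the chunked-hashing approach it settled on translates directly. Here is a hedged C++ sketch of the same idea using OpenSSL's EVP API; the 1 MiB chunk size is arbitrary and not taken from the script.

```cpp
// Sketch: hash a large model file in fixed-size chunks so memory use stays
// bounded regardless of file size (mirrors what the Python script does).
#include <openssl/evp.h>
#include <cstdio>
#include <string>
#include <vector>

std::string sha256_file(const std::string &path) {
    std::FILE *f = std::fopen(path.c_str(), "rb");
    if (!f) return "";

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);

    std::vector<unsigned char> buf(1 << 20); // 1 MiB chunks (arbitrary)
    size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        EVP_DigestUpdate(ctx, buf.data(), n);
    }
    std::fclose(f);

    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int md_len = 0;
    EVP_DigestFinal_ex(ctx, md, &md_len);
    EVP_MD_CTX_free(ctx);

    static const char *hex = "0123456789abcdef";
    std::string out;
    for (unsigned int i = 0; i < md_len; ++i) {
        out += hex[md[i] >> 4];
        out += hex[md[i] & 0x0f];
    }
    return out;
}
```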
* fix reverse prompt and multi line

* Code Formatting

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* change immintrin.h to intrin.h for compatibility

Building on Windows 11 ARM throws an error on this line. Seems like using intrin.h covers both x86 and ARM.

* conditional def of intrin.h

* fix typo in ggml.c
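
A hedged sketch of the conditional include described above; the exact preprocessor guards in ggml.c may differ.

```cpp
// Sketch: pick the intrinsics header per toolchain so Windows-on-ARM builds
// don't try to pull in the x86-only <immintrin.h>.
#if defined(_MSC_VER)
    // MSVC's <intrin.h> covers both x86/x64 and ARM targets.
    #include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>
#endif
```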
* adding --in-suffix option

* print input suffix before generation
howard0su and others added 25 commits June 18, 2023 07:29
…v#1826)

* metal : handle buffers larger than device's maxBufferLength

* metal : print more verbose device info + handle errors

* metal : fix prints for overlapping views

* metal : minimize view overlap to try to utilize device memory better
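
For the maxBufferLength commit above, a hedged sketch of the splitting arithmetic only, with no Metal API calls: carve one large allocation into views that each fit under the device limit. The real change also lets views overlap (per the commit messages), which this sketch intentionally ignores.

```cpp
// Sketch: split one large host allocation into device-sized views.
// Overlap handling from the actual commit is not modelled here.
#include <algorithm>
#include <cstddef>
#include <vector>

struct buffer_view {
    size_t offset; // byte offset into the original allocation
    size_t size;   // bytes covered by this view
};

std::vector<buffer_view> split_into_views(size_t total_size, size_t max_buffer_length) {
    std::vector<buffer_view> views;
    size_t offset = 0;
    while (offset < total_size) {
        const size_t size = std::min(max_buffer_length, total_size - offset);
        views.push_back({offset, size});
        offset += size;
    }
    return views;
}
```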
…of 256 (ggerganov#1921)

* Fix examples/metal

* k-quants: prevent usage when tensor size is not divisible by 256

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
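
A hedged sketch of the guard described in the commit above: k-quants operate on 256-element super-blocks (QK_K in llama.cpp), so tensors that cannot be covered by whole super-blocks need a non-k fallback type. The same constraint motivates the later "Only use Q6_K for output weights if tensor size is multiple of 256" commit. The fallback choice below is illustrative, not the actual selection logic.

```cpp
// Sketch: fall back to a non-k quantization type when a tensor cannot be
// covered by whole 256-element super-blocks. The fallback type is illustrative.
#include <cstdint>

constexpr int64_t QK_K = 256; // k-quant super-block size

enum class qtype { Q6_K, Q5_1 /* illustrative fallback */ };

qtype choose_output_weight_type(int64_t n_elements) {
    if (n_elements % QK_K != 0) {
        return qtype::Q5_1; // k-quants need whole super-blocks
    }
    return qtype::Q6_K;
}
```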
Add steps for using Termux on Android devices to prevent common errors.
* Convert vector to f16 for dmmv

* compile option

* Added compilation option description to README

* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
* ggml : sync latest ggml repo

* ggml : remove unused comments

* ggml : asserts
* k_quants: hopefully much faster Q4_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!

* k_quants: hopefully much faster Q3_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!

* k_quants: faster Q2_K on older GPUs

It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I have written in the faster
k-quants PR.

* k_quants: faster Q5_K on older GPUs

68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.

It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
…f 256 (ggerganov#1932)

* Only use Q6_K for output weights if tensor size is multiple of 256

* Fixed copy/paste mistake

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
…essions (ggerganov#1934)

* fixed issue: memory is not guaranteed to be aligned properly during the ggml_init call when loading saved sessions

* removed commented-out old code from fix
* updated another instance of the same issue below the original
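
A hedged sketch of the alignment point in the session-loading fix above: ggml_init can be handed a caller-owned mem_buffer, and that buffer needs suitable alignment rather than an arbitrary offset into a session blob. The 32-byte alignment below is an assumed, deliberately conservative value.

```cpp
// Sketch: give ggml_init an explicitly aligned scratch buffer rather than a
// pointer into an arbitrarily-packed session blob. 32-byte alignment is an
// assumption here, chosen to be conservative.
#include <cstdlib>
#include "ggml.h"

struct ggml_context *init_ctx_from_session(size_t mem_size) {
    const size_t align = 32;
    // round the size up to a multiple of the alignment, as aligned_alloc requires
    const size_t padded = (mem_size + align - 1) / align * align;
    void *buf = std::aligned_alloc(align, padded);

    struct ggml_init_params params = { /* mem_size   */ padded,
                                       /* mem_buffer */ buf,
                                       /* no_alloc   */ false };
    return ggml_init(params); // caller frees buf after ggml_free(ctx)
}
```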
* Add back embedding feature

* Update README
* Work around struct misalignment during value-copy

Signed-off-by: mudler <[email protected]>

* Move booleans to the bottom of the structure

Signed-off-by: mudler <[email protected]>

* Add comment

Signed-off-by: mudler <[email protected]>

---------

Signed-off-by: mudler <[email protected]>
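
To illustrate the value-copy workaround above: interleaving bool fields with larger members introduces padding in the middle of the struct, which foreign-language bindings can mis-model when the struct is passed by value; grouping the booleans at the end keeps the layout more predictable. The structs below are a generic example, not llama.cpp's actual params struct.

```cpp
// Sketch: field ordering changes where padding lands, which matters when a
// struct is value-copied across an FFI boundary. Generic example structs only.
#include <cstdint>
#include <cstdio>

struct params_interleaved {
    int32_t n_ctx;
    bool    use_mmap;   // padding typically follows this field
    int32_t seed;
    bool    use_mlock;  // more padding before the next 4-byte member
    int32_t n_batch;
};

struct params_bools_last {
    int32_t n_ctx;
    int32_t seed;
    int32_t n_batch;
    bool    use_mmap;   // booleans grouped at the end: padding only at the tail
    bool    use_mlock;
};

int main() {
    std::printf("interleaved: %zu bytes, bools last: %zu bytes\n",
                sizeof(params_interleaved), sizeof(params_bools_last));
    return 0;
}
```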
@kohlivarun5

Hi, checking if this is something that is planned for llama.swift or something that would only be supported in CameLLM?

@aehlke

aehlke commented Sep 24, 2023

llama.cpp has advanced quite a lot for Apple platforms; it would be good to update to the latest again.
