Update to use llama.cpp/master-aacdbd4 #8

Open
wants to merge 355 commits into base: v2
Conversation

alexrozanski
Owner

Updates the implementation to wrap master-aacdbd4.

Status: still some work to do.

  • Need to ensure the Metal path is compiled and used.
  • Need to handle old models gracefully (currently we just run to the end of the file for old models); one possible up-front check is sketched below.
  • Need to decide how to handle new params added since the last merge from llama.cpp.
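
As a rough illustration of the "old models" item above, one option is to sniff the file magic before handing the path to llama.cpp and fail with a clear error instead of reading to EOF. This is a C++ sketch only; the magic constants are taken from llama.cpp's historical `ggml`/`ggjt` formats but should be treated as assumptions, and the error handling is not the actual llama.swift code.

```cpp
// Sketch only: reject unsupported model files up front instead of reading to EOF.
// Magic values are assumptions based on llama.cpp's historical file formats.
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

enum class ModelFormat { Unversioned, Versioned, Unknown };

ModelFormat sniff_model_format(const std::string &path) {
    std::FILE *f = std::fopen(path.c_str(), "rb");
    if (!f) throw std::runtime_error("cannot open " + path);

    uint32_t magic = 0;
    const bool ok = std::fread(&magic, sizeof(magic), 1, f) == 1;
    std::fclose(f);
    if (!ok) return ModelFormat::Unknown;

    // 'ggml' = old unversioned format, 'ggjt' = newer versioned format (assumed values).
    if (magic == 0x67676d6c) return ModelFormat::Unversioned;
    if (magic == 0x67676a74) return ModelFormat::Versioned;
    return ModelFormat::Unknown;
}
```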

xloem and others added 30 commits May 1, 2023 15:58
* cuBLAS: refactor, convert fp16 to fp32 on device

* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16

* fix build

* cuBLAS: update block_q5_1
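
To illustrate the "multiple streams" part of the cuBLAS commit above, here is a minimal host-side sketch of round-robining work across CUDA streams so uploads and GEMMs can overlap. It is not the actual llama.cpp code; the shapes, stream count, and buffer handling are placeholders.

```cpp
// Sketch: overlap host-to-device copies and GEMMs by round-robining CUDA streams.
// For real overlap the host buffers should be pinned (e.g. via cudaHostAlloc).
#include <cublas_v2.h>
#include <cuda_runtime.h>

void multi_stream_gemms(cublasHandle_t handle,
                        const float *hA, const float *hB, float *dA, float *dB, float *dC,
                        int n, int batches) {
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&streams[i]);

    const float alpha = 1.0f, beta = 0.0f;
    const size_t mat = (size_t) n * n;

    for (int b = 0; b < batches; ++b) {
        cudaStream_t s = streams[b % kStreams];
        // Copies and the GEMM queued on the same stream stay ordered with respect
        // to each other, but can overlap with work issued on the other streams.
        cudaMemcpyAsync(dA + b * mat, hA + b * mat, mat * sizeof(float), cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(dB + b * mat, hB + b * mat, mat * sizeof(float), cudaMemcpyHostToDevice, s);
        cublasSetStream(handle, s);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA + b * mat, n, dB + b * mat, n, &beta, dC + b * mat, n);
    }
    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```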
…1232)

* Add git-based build information for better issue tracking

* macOS fix

* "build (hash)" and "CMAKE_SOURCE_DIR" changes

* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages

* Fix conditional dependency on missing target

* Broke out build-info.cmake, added a find_package fallback, added build info to all examples, and added dependencies to the Makefile

* 4 space indenting for cmake, attempt to clean up my mess in Makefile

* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it
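
A hedged sketch of how the generated header can be consumed at runtime; the macro names below match what llama.cpp generates around this point in history, but treat them as assumptions. Because the header is only rewritten when the short hash actually changes, incremental builds don't recompile everything that includes it.

```cpp
// Sketch: report the git-derived build information at startup.
// BUILD_NUMBER / BUILD_COMMIT are assumed to be defined by the generated build-info.h.
#include <cstdio>
#include "build-info.h"

void print_build_info() {
    std::fprintf(stderr, "build = %d (%s)\n", BUILD_NUMBER, BUILD_COMMIT);
}
```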
* ggml: add names to tensors

* minor improvements to dot file formatting
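
For the "names to tensors" commit above, a minimal sketch of the API as exposed by ggml; the graph and tensor shapes are arbitrary.

```cpp
// Sketch: attach human-readable names to ggml tensors so they show up in
// debug output and exported dot graphs.
#include "ggml.h"

void build_named_graph() {
    struct ggml_init_params params = { /* mem_size   */ 16 * 1024 * 1024,
                                       /* mem_buffer */ nullptr,
                                       /* no_alloc   */ false };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor *b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    ggml_set_name(a, "weights");
    ggml_set_name(b, "activations");

    struct ggml_tensor *c = ggml_mul_mat(ctx, a, b);
    ggml_set_name(c, "logits");

    ggml_free(ctx);
}
```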
…ggerganov#1284)

* Fix ppc64le build issue

* Added support to detect ppc64* processors
* make git build info work with submodules

---------

Co-authored-by: Green Sky <[email protected]>
* llama : only copy used KV cache in get / set state

* switch to ggml for copying k, v

* avoid designated initializers
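
A hedged sketch of the idea behind the KV-cache commit above: when serializing state, copy only the cells that are actually in use rather than the whole allocated buffer. The struct and field names here are hypothetical, not llama.cpp's; the real code does the copy through ggml (per the messages above), which also handles layouts where the used region is not one contiguous block.

```cpp
// Sketch only: kv_cache_view and its fields are hypothetical.
#include <cstdint>
#include <cstring>
#include <vector>

struct kv_cache_view {
    const uint8_t *k;        // contiguous K data
    const uint8_t *v;        // contiguous V data
    size_t bytes_per_cell_k; // bytes per cached token in K
    size_t bytes_per_cell_v; // bytes per cached token in V
    size_t n_cells;          // capacity (n_ctx)
    size_t n_used;           // tokens actually cached so far
};

// Serialize only the used prefix of the cache instead of all n_cells.
std::vector<uint8_t> copy_used_kv(const kv_cache_view &kv) {
    const size_t k_bytes = kv.n_used * kv.bytes_per_cell_k;
    const size_t v_bytes = kv.n_used * kv.bytes_per_cell_v;
    std::vector<uint8_t> out(k_bytes + v_bytes);
    std::memcpy(out.data(),           kv.k, k_bytes);
    std::memcpy(out.data() + k_bytes, kv.v, v_bytes);
    return out;
}
```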
* fix dan.txt

* miku prompt improvements

* use common characters
…rganov#1203)

* python script to verify the checksum of the llama models

Added Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability.

* Update README.md

update the README for improved readability and to explain the usage of the Python checksum verification script

* update the verification script

I've extended the script based on suggestions by @prusnak.

The script now checks the available RAM; if there is enough to hash the file at once, it will do so. If not, the file is read in chunks.

* minor improvement

small change so that the available RAM is checked and not the total RAM

* remove the part of the code that reads the file at once if enough RAM is available

Based on suggestions from @prusnak, I removed the part of the code that checks whether the user has enough RAM to read the entire model at once. The file is now always read in chunks.

* Update verify-checksum-models.py

quick fix to pass the git check
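
The verification script itself is Python, but the chunked-hashing approach it settled on translates directly. Here is a hedged C++ sketch of the same idea using OpenSSL's EVP API; the 1 MiB chunk size is arbitrary and not taken from the script.

```cpp
// Sketch: hash a large model file in fixed-size chunks so memory use stays
// bounded regardless of file size (mirrors what the Python script does).
#include <openssl/evp.h>
#include <cstdio>
#include <string>
#include <vector>

std::string sha256_file(const std::string &path) {
    std::FILE *f = std::fopen(path.c_str(), "rb");
    if (!f) return "";

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);

    std::vector<unsigned char> buf(1 << 20); // 1 MiB chunks (arbitrary)
    size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        EVP_DigestUpdate(ctx, buf.data(), n);
    }
    std::fclose(f);

    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int md_len = 0;
    EVP_DigestFinal_ex(ctx, md, &md_len);
    EVP_MD_CTX_free(ctx);

    static const char *hex = "0123456789abcdef";
    std::string out;
    for (unsigned int i = 0; i < md_len; ++i) {
        out += hex[md[i] >> 4];
        out += hex[md[i] & 0x0f];
    }
    return out;
}
```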
* fix reverse prompt and multi line

* Code Formatting

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* change immintrin.h to intrin.h for compatibility

Building on Windows 11 ARM throws an error on this line. Seems like using intrin.h covers both x86 and ARM.

* conditional def of intrin.h

* fix typo in ggml.c
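
A hedged sketch of the conditional include described above; the exact preprocessor guards in ggml.c may differ.

```cpp
// Sketch: pick the intrinsics header per toolchain so Windows-on-ARM builds
// don't try to pull in the x86-only <immintrin.h>.
#if defined(_MSC_VER)
    // MSVC's <intrin.h> covers both x86/x64 and ARM targets.
    #include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>
#endif
```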
* adding --in-suffix option

* print input suffix before generation
howard0su and others added 25 commits June 18, 2023 07:29
…v#1826)

* metal : handle buffers larger than device's maxBufferLength

* metal : print more verbose device info + handle errors

* metal : fix prints for overlapping views

* metal : minimize view overlap to try to utilize device memory better
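
For the maxBufferLength commit above, a hedged sketch of the splitting arithmetic only, with no Metal API calls: carve one large allocation into views that each fit under the device limit. The real change also lets views overlap (per the commit messages), which this sketch intentionally ignores.

```cpp
// Sketch: split one large host allocation into device-sized views.
// Overlap handling from the actual commit is not modelled here.
#include <algorithm>
#include <cstddef>
#include <vector>

struct buffer_view {
    size_t offset; // byte offset into the original allocation
    size_t size;   // bytes covered by this view
};

std::vector<buffer_view> split_into_views(size_t total_size, size_t max_buffer_length) {
    std::vector<buffer_view> views;
    size_t offset = 0;
    while (offset < total_size) {
        const size_t size = std::min(max_buffer_length, total_size - offset);
        views.push_back({offset, size});
        offset += size;
    }
    return views;
}
```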
…of 256 (ggerganov#1921)

* Fix examples/metal

* k-quants: prevent usage when tensor size is not divisible by 256

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
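
A hedged sketch of the guard described in the commit above: k-quants operate on 256-element super-blocks (QK_K in llama.cpp), so tensors that cannot be covered by whole super-blocks need a non-k fallback type. The same constraint motivates the later "Only use Q6_K for output weights if tensor size is multiple of 256" commit. The fallback choice below is illustrative, not the actual selection logic.

```cpp
// Sketch: fall back to a non-k quantization type when a tensor cannot be
// covered by whole 256-element super-blocks. The fallback type is illustrative.
#include <cstdint>

constexpr int64_t QK_K = 256; // k-quant super-block size

enum class qtype { Q6_K, Q5_1 /* illustrative fallback */ };

qtype choose_output_weight_type(int64_t n_elements) {
    if (n_elements % QK_K != 0) {
        return qtype::Q5_1; // k-quants need whole super-blocks
    }
    return qtype::Q6_K;
}
```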
Add steps for using Termux on Android devices to prevent common errors.
* Convert vector to f16 for dmmv

* compile option

* Added compilation option description to README

* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
* ggml : sync latest ggml repo

* ggml : remove unused comments

* ggml : asserts
* k_quants: hopefully much faster Q4_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!

* k_quants: hopefully much faster Q3_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!

* k_quants: faster Q2_K on older GPUs

It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I have written in the faster
k-quants PR.

* k_quants: faster Q5_K on older GPUs

68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.

It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
…f 256 (ggerganov#1932)

* Only use Q6_K for output weights if tensor size is multiple of 256

* Fixed copy/paste mistake

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
…essions (ggerganov#1934)

* fixed issue: memory is not guaranteed to be aligned properly during the ggml_init call when loading saved sessions

* removed commented-out old code from fix
* updated another instance of the same issue below the original
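
A hedged sketch of the alignment point in the session-loading fix above: ggml_init can be handed a caller-owned mem_buffer, and that buffer needs suitable alignment rather than an arbitrary offset into a session blob. The 32-byte alignment below is an assumed, deliberately conservative value.

```cpp
// Sketch: give ggml_init an explicitly aligned scratch buffer rather than a
// pointer into an arbitrarily-packed session blob. 32-byte alignment is an
// assumption here, chosen to be conservative.
#include <cstdlib>
#include "ggml.h"

struct ggml_context *init_ctx_from_session(size_t mem_size) {
    const size_t align = 32;
    // round the size up to a multiple of the alignment, as aligned_alloc requires
    const size_t padded = (mem_size + align - 1) / align * align;
    void *buf = std::aligned_alloc(align, padded);

    struct ggml_init_params params = { /* mem_size   */ padded,
                                       /* mem_buffer */ buf,
                                       /* no_alloc   */ false };
    return ggml_init(params); // caller frees buf after ggml_free(ctx)
}
```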
* Add back embedding feature

* Update README
* Work around struct misalignment during value-copy

Signed-off-by: mudler <[email protected]>

* Move booleans to the bottom of the structure

Signed-off-by: mudler <[email protected]>

* Add comment

Signed-off-by: mudler <[email protected]>

---------

Signed-off-by: mudler <[email protected]>
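
To illustrate the value-copy workaround above: interleaving bool fields with larger members introduces padding in the middle of the struct, which foreign-language bindings can mis-model when the struct is passed by value; grouping the booleans at the end keeps the layout more predictable. The structs below are a generic example, not llama.cpp's actual params struct.

```cpp
// Sketch: field ordering changes where padding lands, which matters when a
// struct is value-copied across an FFI boundary. Generic example structs only.
#include <cstdint>
#include <cstdio>

struct params_interleaved {
    int32_t n_ctx;
    bool    use_mmap;   // padding typically follows this field
    int32_t seed;
    bool    use_mlock;  // more padding before the next 4-byte member
    int32_t n_batch;
};

struct params_bools_last {
    int32_t n_ctx;
    int32_t seed;
    int32_t n_batch;
    bool    use_mmap;   // booleans grouped at the end: padding only at the tail
    bool    use_mlock;
};

int main() {
    std::printf("interleaved: %zu bytes, bools last: %zu bytes\n",
                sizeof(params_interleaved), sizeof(params_bools_last));
    return 0;
}
```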
@kohlivarun5

Hi, checking if this is something that is planned for llama.swift or something that would only be supported in CameLLM?

@aehlke

aehlke commented Sep 24, 2023

llama.cpp has advanced quite a lot for Apple platforms; it would be good to update to the latest again.
