forked from ggerganov/llama.cpp
Update to use llama.cpp/master-aacdbd4 #8
Open
alexrozanski wants to merge 355 commits into v2 from update-llama-cpp-aacdbd4
Conversation
Co-authored-by: John Doe <[email protected]>
* cuBLAS: refactor, convert fp16 to fp32 on device
* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16
* fix build
* cuBLAS: update block_q5_1
…1232)
* Add git-based build information for better issue tracking
* macOS fix
* "build (hash)" and "CMAKE_SOURCE_DIR" changes
* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages
* Fix conditional dependency on missing target
* Broke out build-info.cmake, added find_package fallback, added build info to all examples, added dependencies to Makefile
* 4-space indenting for cmake, attempt to clean up my mess in Makefile
* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change
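For context, the change generates a small header at configure time that the binaries can log at startup. A minimal sketch of what the generated build-info.h and its use might look like (the values, and the exact file contents, are assumptions, not copied from the commit):

```cpp
// build-info.h -- sketch of a generated header; build-info.cmake would fill
// these in from `git rev-list --count HEAD` and the short commit hash
#define BUILD_NUMBER 555        // hypothetical value
#define BUILD_COMMIT "aacdbd4"  // hypothetical value

#include <cstdio>

int main() {
    // logging the build identity lets issue reports pin down the exact commit
    std::printf("build %d (%s)\n", BUILD_NUMBER, BUILD_COMMIT);
    return 0;
}
```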
* ggml: add names to tensors
* minor improvements to dot file formatting
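For context, naming tensors is what makes the dot-file graph dumps readable. A minimal sketch of the usage, assuming the `ggml_set_name` helper from the vendored ggml header:

```cpp
#include "ggml.h"

int main() {
    struct ggml_init_params params = { 16*1024*1024, NULL, false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * c = ggml_mul(ctx, a, b);

    // named tensors show up as "a", "b", "a*b" in graph dumps instead of
    // anonymous node ids
    ggml_set_name(a, "a");
    ggml_set_name(b, "b");
    ggml_set_name(c, "a*b");

    ggml_free(ctx);
    return 0;
}
```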
…v#979) Signed-off-by: deadprogram <[email protected]>
…ggerganov#1284)
* Fix ppc64le build issue
* Added support to detect ppc64* processors
) Signed-off-by: deadprogram <[email protected]>
…v#1290) Signed-off-by: deadprogram <[email protected]>
* make git build info work with submodules
---------
Co-authored-by: Green Sky <[email protected]>
* llama : only copy used KV cache in get / set state
* switch to ggml for copying k, v
* avoid designated initializers
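For context, the first commit's idea is that the KV buffers are sized for the full context but only a prefix holds live data, so state save/load only needs that prefix. A rough sketch with hypothetical names (the real layout is per-layer, and the commit ultimately uses ggml ops for the copy rather than raw memcpy):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct kv_cache_sketch {
    std::vector<float> k, v;  // each laid out as [n_ctx * n_embd]
    int n_embd = 0;
    int n_used = 0;           // tokens actually evaluated so far
};

// copy only the used prefix of the cache into `out`, returning bytes written
size_t kv_copy_used(const kv_cache_sketch & kv, std::vector<float> & out) {
    const size_t n = (size_t) kv.n_used * kv.n_embd;
    out.resize(2 * n);
    std::memcpy(out.data(),     kv.k.data(), n * sizeof(float)); // used K rows
    std::memcpy(out.data() + n, kv.v.data(), n * sizeof(float)); // used V rows
    return 2 * n * sizeof(float);
}

int main() {
    kv_cache_sketch kv;
    kv.n_embd = 8;
    kv.n_used = 4;
    kv.k.assign(32 * 8, 0.0f);  // sized for a hypothetical n_ctx of 32
    kv.v.assign(32 * 8, 0.0f);
    std::vector<float> out;
    return kv_copy_used(kv, out) == 2u * 4 * 8 * sizeof(float) ? 0 : 1;
}
```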
* fix dan.txt
* miku prompt improvements
* use common characters
…rganov#1203)
* python script to verify the checksum of the llama models
  Added a Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability.
* Update README.md
  Update the readme for improved readability and to explain the usage of the python checksum verification script.
* update the verification script
  I've extended the script based on suggestions by @prusnak. The script now checks the available RAM; if there is enough to check the file at once, it will do so. If not, the file is read in chunks.
* minor improvement
  Small change so that the available RAM is checked and not the total RAM.
* remove the part of the code that reads the file at once if enough ram is available
  Based on suggestions from @prusnak, I removed the part of the code that checks whether the user has enough RAM to read the entire model at once. The file is now always read in chunks.
* Update verify-checksum-models.py
  Quick fix to pass the git check.
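For context, the script's final strategy is to always read in fixed-size chunks, so multi-gigabyte model files never need to fit in RAM. A C++ rendering of that loop (the `sha256_*` interface below is a hypothetical stand-in, not a real binding; the actual script uses Python's hashlib):

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

struct sha256_ctx { /* hypothetical incremental-hash state */ };

void sha256_update(sha256_ctx &, const char * /*data*/, size_t /*len*/) {
    // stub -- feed a real incremental hasher here
}

void hash_file_in_chunks(const char * path, sha256_ctx & ctx) {
    std::ifstream f(path, std::ios::binary);
    std::vector<char> buf(1 << 20);  // 1 MiB per read, independent of file size
    while (f) {
        f.read(buf.data(), buf.size());
        const std::streamsize n = f.gcount();
        if (n > 0) {
            sha256_update(ctx, buf.data(), (size_t) n);
        }
    }
}

int main() {
    sha256_ctx ctx;
    hash_file_in_chunks("model.bin", ctx);  // hypothetical file name
    return 0;
}
```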
* fix reverse prompt and multi line
* Code Formatting
  Co-authored-by: Georgi Gerganov <[email protected]>
---------
Co-authored-by: Georgi Gerganov <[email protected]>
* change immintrin.h to intrin.h for compatibility
  Building on Windows 11 ARM throws an error on this line. It seems intrin.h covers both x86 and ARM.
* conditional def of intrin.h
* fix typo in ggml.c
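For context, a sketch of the conditional include this kind of fix typically lands on (a simplification of the real guards in ggml.c): MSVC exposes its intrinsics through intrin.h on both x86 and ARM, while immintrin.h is x86-only.

```cpp
#if defined(_MSC_VER)
    // MSVC: intrin.h works for both x86 and ARM targets
    #include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
    // GCC/Clang on x86: immintrin.h pulls in the SIMD intrinsics
    #include <immintrin.h>
#endif

int main() { return 0; }
```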
* adding --in-suffix option
* print input suffix before generation
Co-authored-by: Pavol Rusnak <[email protected]>
…v#1826)
* metal : handle buffers larger than device's maxBufferLength
* metal : print more verbose device info + handle errors
* metal : fix prints for overlapping views
* metal : minimize view overlap to try to utilize device memory better
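For context, the first commit's idea in sketch form (hypothetical helper, not the actual ggml-metal code): a region bigger than the device's maxBufferLength is mapped as several views, each under the limit. The real code additionally overlaps views so no tensor straddles a boundary, which is what the last commit then minimizes; the overlap is omitted here.

```cpp
#include <cstddef>
#include <vector>

struct buf_view { size_t offs, size; };

// split [0, total) into views no larger than max_len (overlap omitted)
std::vector<buf_view> split_views(size_t total, size_t max_len) {
    std::vector<buf_view> views;
    for (size_t offs = 0; offs < total; ) {
        const size_t sz = (total - offs < max_len) ? total - offs : max_len;
        views.push_back({offs, sz});
        offs += sz;
    }
    return views;
}

int main() {
    // e.g. a 3.5 GiB region on a device with a 1 GiB buffer limit -> 4 views
    auto views = split_views(3584ull << 20, 1024ull << 20);
    return (int) views.size() - 4;  // exits 0 if the split is as expected
}
```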
Co-authored-by: Iwan Kawrakow <[email protected]>
…of 256 (ggerganov#1921)
* Fix examples/metal
* k-quants: prevent usage when tensor size is not divisible by 256
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
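For context, k-quants pack weights into 256-element super-blocks (QK_K), so the guard amounts to a divisibility check on the row length. A minimal sketch of the rule (the example row length is illustrative):

```cpp
#include <cstdint>

constexpr int64_t QK_K = 256;  // k-quant super-block size

// a k-quant type only applies when each row divides into whole super-blocks
bool can_use_k_quant(int64_t ne0 /* row length */) {
    return ne0 % QK_K == 0;
}

int main() {
    // e.g. a hidden size like 3200 is not divisible by 256, so such a
    // tensor has to fall back to a non-k quantization
    return can_use_k_quant(3200) ? 1 : 0;
}
```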
Add steps for using Termux on Android devices to prevent common errors.
* Convert vector to f16 for dmmv
* compile option
* Added compilation option description to README
* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
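For context, the first commit converts the activation vector to f16 before the dequantize-mul-mat-vec (dmmv) kernel reads it, halving that kernel's bandwidth. A host-side sketch of the same conversion using ggml's row helper (the real change performs it on the GPU):

```cpp
#include "ggml.h"
#include <vector>

int main() {
    std::vector<float>       x(4096, 1.0f);  // hypothetical activation vector
    std::vector<ggml_fp16_t> y(x.size());

    // narrow f32 -> f16: the mat-vec kernel then reads half as many bytes
    ggml_fp32_to_fp16_row(x.data(), y.data(), (int) x.size());
    return 0;
}
```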
Co-authored-by: Georgi Gerganov <[email protected]>
* ggml : sync latest ggml repo
* ggml : remove unused comments
* ggml : asserts
* k_quants: hopefully much faster Q4_K on older GPUs
  On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs
  On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs
  It looks like I didn't need to change anything compared to what we already had, so this just adds clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I had written in the faster k-quants PR.
* k_quants: faster Q5_K on older GPUs
  68.5 ms/tok -> 62.0 ms/tok on the GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K. It is also more difficult to measure because for Q5_K_S only 32 layers fit on the GTX-1660, so the output, token embeddings and KV cache are done on the CPU.
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
…f 256 (ggerganov#1932)
* Only use Q6_K for output weights if tensor size is a multiple of 256
* Fixed copy/paste mistake
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
…essions (ggerganov#1934)
* fixed issue: memory is not guaranteed to be aligned properly during the ggml_init call when loading saved sessions
* removed commented-out old code from the fix; updated another instance of the same issue below the original
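For context, ggml requires the memory buffer handed to ggml_init to be aligned, and a buffer sliced out of a saved-session blob can violate that. A sketch of the usual remedy (hypothetical code, not the actual patch):

```cpp
#include <cstdint>

constexpr uintptr_t GGML_MEM_ALIGN = 16;  // ggml's buffer alignment

// round a pointer up to the next GGML_MEM_ALIGN boundary before passing the
// buffer to ggml_init (over-allocate by GGML_MEM_ALIGN - 1 to leave room)
void * align_up(void * p) {
    return (void *) (((uintptr_t) p + GGML_MEM_ALIGN - 1) & ~(GGML_MEM_ALIGN - 1));
}

int main() {
    char raw[64];
    void * aligned = align_up(raw + 1);  // deliberately misaligned input
    return ((uintptr_t) aligned % GGML_MEM_ALIGN) == 0 ? 0 : 1;
}
```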
* Add back embedding feature
* Update README
* Workaround struct misalignment during value-copy
  Signed-off-by: mudler <[email protected]>
* Move booleans to the bottom of the structure
  Signed-off-by: mudler <[email protected]>
* Add comment
  Signed-off-by: mudler <[email protected]>
---------
Signed-off-by: mudler <[email protected]>
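For context, the issue is C struct padding: bool members interleaved with wider fields create interior padding holes, which breaks by-value copies across bindings that assume a tighter layout. A sketch with a hypothetical struct (field names are illustrative, not llama.cpp's actual ones):

```cpp
#include <cstdint>

struct params_bad {
    int32_t n_ctx;
    bool    use_mmap;   // 3 bytes of padding follow before n_batch
    int32_t n_batch;
    bool    use_mlock;  // 3 more padding bytes at the end
};

struct params_good {
    int32_t n_ctx;
    int32_t n_batch;
    // booleans grouped at the bottom: no interior holes
    bool    use_mmap;
    bool    use_mlock;
};

static_assert(sizeof(params_good) <= sizeof(params_bad), "layout check");

int main() { return 0; }
```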
Hi, checking if this is something that is planned for …
llama.cpp has advanced quite a lot for Apple platforms; it would be good to update to the latest again.
Updates the implementation to wrap master-aacdbd4.

Status: Still some work to go.