llama/ggml: add LLM training support #10544
base: master
Conversation
I pushed a version that I think is in a state where it could be merged.
My immediate next goals will be:
On a somewhat related note, it may make sense to refactor the file
@JohannesGaessler you may see #10902
The link doesn't work.
@JohannesGaessler sorry, #10902
I've started working on this again; I rebased my local branch onto master and am currently adding the missing ops for CUDA training. This PR is getting quite large; in terms of reviewing, would you prefer if I split off things that can be reviewed and merged on their own?
If you can separate things in standalone PRs, it's always helpful (maybe the CUDA ops can be in a standalone PR).
Force-pushed from 866da9b to a315cac
I pushed an update where the finetuning of Stories 260k and, more relevantly, LLaMA 3.2 1b works either on CPU or with CUDA and 24 GB VRAM. For LLaMA 3.2 1b one epoch over the Wikitext-2 test set takes ~3 minutes on an RTX 4090 and ~15 hours on an Epyc 7742. The finetuned model should then have a lower perplexity score when given the text it was finetuned on again. For Stories 260k the speed is mostly the same due to its diminutive size.

I will soon have more time for llama.cpp and will try to get this PR into a state where it can be merged. My goal is simply to have finetuning technically functional for CPU and CUDA with a single GPU and max. GPU layers; I will work on partial GPU layers and multi-GPU in later PRs. My immediate next goal after having a technically functional finetuning setup will be to implement methods for actually evaluating the quality of a finetuned model using language model benchmarks such as MMLU.
I'm kind of confused: was training removed and is it now being added back? I just want to train Qwen 7B on a dataset of Japanese sentence grammar explanations. This seems outdated.
There was at some point limited training support that was single-threaded and only worked with the CPU backend. It was removed because it was broken and unmaintained. I am currently working on adding back training support in a way that is compatible with all backends.
Neato, I'll give it a go then and see if anything explodes. Does it support all models llama.cpp already supports?
No, the support is currently extremely limited and I think you will just waste your time trying to use the current state of the code for anything other than testing.
Commits:
- more compact progress bar
- refactor: llama_prepare_sbatch/ubatch
- llama_save_model_to_file
- gqa_mode arg for repeat_back
- llama_opt_param_filter
- ggml_graph_dup force_grads
- refactor ggml_opt, fix test-opt
Force-pushed from a315cac to c255573
From my end I would now consider this PR ready to be merged. Things are still relatively janky but I don't think that will change in a reasonable time frame. My next goals will be better support for model quality evaluation and then better performance for training. I can already work on these things regardless of what happens with this PR, so it's fine if you just proceed in a way that's convenient for you. Question regarding the header files: right now I put the
IMO it's fine as it is. We can split the header in the future if it becomes too heavy, but for now I think it is still quite manageable.
See ggerganov/ggml#1025, except I decided to implement the training directly in llama.cpp after all because the GPT-2 GGML example is already pretty complex, it would require a significant amount of effort to refactor, and I'm not familiar with the codebase at all.
The goal of this PR is to add general training support to llama.cpp using `ggml_opt`. CPU training seems to work; other backends are missing support for some GGML ops. It's currently not possible to actually save the finetuned model to disk, but you can confirm that the finetuning works by doing one epoch over the input text prior to perplexity calculation (or by observing how the loss goes down with the new finetune example). One epoch over the test set of Wikitext-2 (with the stride chosen in such a way that each token is used twice per epoch) currently takes ~1 minute with Stories 260k or ~20 hours and ~100 GB RAM with LLaMA 3 8b.

For the user-facing API my concrete plans are (see the sketch after this list):

- `n_ctx` determines the max. sequence length with which the model is trained.
- `n_batch` determines how many tokens are consumed per optimizer step.
- `n_ubatch` determines the number of tokens processed in parallel, enables a speed <-> memory use tradeoff, and should have no effect on the result except for differences in floating point rounding error.
- `std::vector<llama_token>`: currently I have this as part of `llama.h` but maybe this would make more sense to put in `common.h`?
- `llama_opt_init` that prepares a `llama_context` for training and lets the user define things like the learning rate or which tensors should be trainable parameters.
- `llama_opt_epoch` that performs one epoch over a `ggml_opt_dataset`, equivalent to `ggml_opt_epoch`.
- `llama_opt_fit`, equivalent to `ggml_opt_fit`, that is even more high-level?
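To make the intended flow concrete, here is a rough usage sketch of the API proposed above. None of these signatures exist yet; the dataset helper name, the parameter list of `llama_opt_init`, and the argument lists in general are hypothetical and only illustrate the intended call order.

```cpp
// Hypothetical usage sketch of the proposed training API; signatures are
// illustrative stand-ins, not the final interface.
#include <cstdint>
#include <vector>

struct llama_context;                                  // opaque, from llama.h
typedef int32_t llama_token;                           // as in llama.h
typedef struct ggml_opt_dataset * ggml_opt_dataset_t;  // opaque handle, as in ggml-opt.h

// Proposed additions (names/signatures made up for illustration):
ggml_opt_dataset_t llama_opt_dataset_from_tokens(const std::vector<llama_token> & tokens); // llama.h or common.h?
void llama_opt_init (llama_context * ctx, float learning_rate /*, trainable-parameter filter, ... */);
void llama_opt_epoch(llama_context * ctx, ggml_opt_dataset_t dataset /*, results/callbacks as in ggml_opt_epoch */);

void finetune_one_epoch(llama_context * ctx, const std::vector<llama_token> & tokens) {
    // n_ctx:    max. sequence length used for training
    // n_batch:  tokens consumed per optimizer step
    // n_ubatch: tokens evaluated in parallel (speed <-> memory tradeoff,
    //           same result up to floating point rounding)
    ggml_opt_dataset_t dataset = llama_opt_dataset_from_tokens(tokens);

    llama_opt_init (ctx, /*learning_rate =*/ 1e-4f); // set hyperparameters, mark trainable tensors
    llama_opt_epoch(ctx, dataset);                   // one pass over the data, like ggml_opt_epoch
}
```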
Currently, while functional, the PR is in a bad state in terms of software design and is in need of a refactor. The reason I'm already opening it now is that I want to ask for advice regarding how to best implement `llama_opt_epoch`. My current approach was to try and hijack the first half of `llama_decode_internal`, but I found that in the end all I needed from it was the generation of the next `llama_ubatch` and the corresponding manipulation of the KV cache. But maybe it would make more sense to instead write a function like `llama_prepare_next_ubatch` and to use that function in both `llama_decode_internal` and `llama_opt_epoch`?
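For what it's worth, the shared-helper variant could look roughly like the toy below. All types and names are simplified stand-ins rather than actual llama.cpp internals; the point is only that both the decode path and the training epoch would obtain the next micro-batch, plus the matching KV-cache bookkeeping, from a single function instead of duplicating that logic.

```cpp
// Toy illustration of the factoring discussed above; every type here is a
// simplified stand-in for the real llama.cpp internals.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct toy_ubatch  { std::vector<int32_t> tokens; };  // stand-in for llama_ubatch
struct toy_context {                                  // stand-in for llama_context
    std::vector<int32_t> pending;                     // tokens not yet processed
    size_t               n_ubatch = 128;              // micro-batch size
};

// Shared helper in the spirit of the proposed llama_prepare_next_ubatch:
// it yields the next micro-batch and is where the KV cache updates would live.
static toy_ubatch prepare_next_ubatch(toy_context & ctx) {
    const size_t n = std::min(ctx.n_ubatch, ctx.pending.size());
    toy_ubatch ub { std::vector<int32_t>(ctx.pending.begin(), ctx.pending.begin() + n) };
    ctx.pending.erase(ctx.pending.begin(), ctx.pending.begin() + n);
    // ... KV cache manipulation would happen here ...
    return ub;
}

// Decode path: forward pass only.
static void toy_decode(toy_context & ctx) {
    while (!ctx.pending.empty()) {
        const toy_ubatch ub = prepare_next_ubatch(ctx);
        (void) ub; // ... build the graph and run the forward pass over ub ...
    }
}

// Training path: forward + backward pass and an optimizer step per micro-batch.
static void toy_opt_epoch(toy_context & ctx) {
    while (!ctx.pending.empty()) {
        const toy_ubatch ub = prepare_next_ubatch(ctx);
        (void) ub; // ... forward + backward pass and optimizer step over ub ...
    }
}
```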