update parsing of dataset_config.file to prevent custom-function-name from clobbering data-collator name. #829
What does this PR do?
Makes a minor fix to the parsing of the `--custom_dataset.file` flag. The documentation says you can add a colon in this value to specify a custom name to replace the `get_custom_dataset` function. Unfortunately, the string after the colon is ALSO used to set a custom data collator name. This update forces the data collator name to always be "get_data_collator", and updates the documentation to reflect that.
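The intended behavior can be illustrated with a minimal sketch. This is not the actual llama-recipes code: the helper names `split_dataset_file_arg`, `load_module`, and `resolve_functions` are hypothetical, and the sketch assumes POSIX-style paths that contain no colon of their own. It only shows the idea that the name after the colon selects the dataset function, while the data collator is always looked up under the fixed name `get_data_collator`.

```python
import importlib.util
from pathlib import Path

# Fixed lookup name for the collator, as described in this PR.
DATA_COLLATOR_FUNC = "get_data_collator"
# Default dataset function name when no ":name" suffix is given.
DEFAULT_DATASET_FUNC = "get_custom_dataset"


def split_dataset_file_arg(value):
    """Split 'path/to/module.py[:func_name]' into (Path, dataset function name)."""
    if ":" in value:
        module_path, func_name = value.split(":", 1)
    else:
        module_path, func_name = value, DEFAULT_DATASET_FUNC
    return Path(module_path), func_name


def load_module(py_path):
    """Import a Python file from an arbitrary path (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(py_path.stem, py_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def resolve_functions(value):
    """Return (dataset_func, collator_func_or_None) for a flag value."""
    module_path, dataset_func_name = split_dataset_file_arg(value)
    module = load_module(module_path)
    # The custom name only affects the dataset function.
    dataset_func = getattr(module, dataset_func_name)
    # The collator name is never taken from the flag; None means "use the default".
    collator_func = getattr(module, DATA_COLLATOR_FUNC, None)
    return dataset_func, collator_func
```

With this shape, passing a value such as `custom_dataset.py:my_loader` (a made-up function name) selects `my_loader` for the dataset, and the collator falls back to the default unless the module defines `get_data_collator`.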
Fixes #828
Feature/Issue validation/testing
Use the "custom_dataset" provided with the recipes. Copy it to a local directory:
cp ../llama-recipes/recipes/quickstart/finetuning/datasets/custom_dataset.py .
[...] `get_custom_dataset`. Notice that the function `get_custom_dataset` is called by the codepath that is actually trying to call the data collator.
[...] the `get_data_collator` function, instead of `get_custom_dataset`, does not find it, and uses the default. Then it successfully fine-tunes the model using the custom dataset.
Before submitting
Pull Request section?
to it if that's the case.
Thanks for contributing 🎉!