# Hf data preprocessing: Truncate dataset if `max_samples` is provided (#1561)
## Describe your changes
- For huggingface data preprocessing (except text-gen, which has its own logic), truncate the dataset before tokenization if `max_samples` is provided.
- There is no need to tokenize and process the whole dataset if only a subset is going to be used. This is useful for large datasets where the tokenized data might be too large to fit in memory. (A sketch of this idea follows below.)

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR.

## (Optional) Issue link
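For illustration, here is a minimal sketch of the truncate-before-tokenize idea using the Hugging Face `datasets` library. The function and parameter names below are illustrative placeholders, not Olive's actual preprocessing API:

```python
from datasets import load_dataset


def preprocess(dataset_name: str, tokenizer, max_samples: int | None = None):
    """Illustrative sketch: truncate a dataset before tokenization.

    `preprocess`, its signature, and the "text" column are assumptions
    for this example, not Olive's real implementation.
    """
    ds = load_dataset(dataset_name, split="train")

    # Truncate first, so tokenization only runs over the subset that
    # will actually be used. For large datasets this avoids holding
    # the full tokenized output in memory.
    if max_samples is not None:
        ds = ds.select(range(min(max_samples, len(ds))))

    # Tokenize only the (possibly truncated) subset.
    return ds.map(lambda batch: tokenizer(batch["text"]), batched=True)
```

The ordering is the point of the change: `select` on the raw dataset is cheap, whereas mapping the tokenizer over all rows and then discarding most of them wastes both time and memory.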