Adding Batching After Every Block #484
Conversation
Signed-off-by: eshwarprasadS <[email protected]>
d94c659 to 2b999bf
Let's make sure we pull the jinja template change into a separate PR so that we can track/merge that fix separately from this batching change.
Got it, will revert that change so it can be addressed separately.
This reverts commit c7066c3. Signed-off-by: eshwarprasadS <[email protected]>
@eshwarprasadS If you need to reproduce the ruff errors locally that happened during the CI run, see https://github.com/instructlab/sdg/blob/d59731f2eb4c1dec66549d0ce4bf031ead8d7088/CONTRIBUTING.md#running-code-formatting-and-linting-checks-locally on how to do that. Basically,
Just a small question around how we check if batching is enabled, but otherwise the logic looks good to me and the new tests look great. Thanks for the thoroughness in fixing and testing this!
@eshwarprasadS Please don't include references to RHEL AI Jira tickets in PRs - changed the title of this one
Looks good - thanks for getting this fix in!
LGTM!
The Problem:
Resolves #480, related to #424.

Key Changes Implemented:
- Re-batch After Each Block: After processing the dataset through a block, split it into batches based on the current batch_size (reuses `Pipeline._split_dataset`).
- Sequential Execution Within Each Block: Batches are processed sequentially within each block; no additional threads of execution are spawned.
- Recombine Batch Outputs: Combine the processed batches back into a single dataset before passing it to the next block (reuses `concatenate_datasets()`).