Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix logging the number of generated samples for each leaf node #445

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

bbrowning
Copy link
Contributor

We were logging the length of our generated_data list instead of the number of newly generated samples in each leaf node, causing confusion as this is typically 1, 2, 3, etc instead of 50, 500, 1000, etc that users would expect.

Fixes #227

We were logging the length of our `generated_data` list instead of the
number of newly generated samples in each leaf node, causing confusion
as this is typically 1, 2, 3, etc instead of 50, 500, 1000, etc that
users would expect.

Fixes instructlab#227

Signed-off-by: Ben Browning <[email protected]>
@bbrowning
Copy link
Contributor Author

CI runs are showing this is fixed with this PR - seeing log messages such like this which have the intended output:

2024-12-10T23:40:10.3629160Z INFO 2024-12-10 23:40:10,329 instructlab.sdg.generate_data:429: Generated 149 samples
2024-12-10T23:40:10.3629885Z DEBUG 2024-12-10 23:40:10,329 instructlab.sdg.generate_data:430: Generated data: Dataset({
2024-12-10T23:40:10.3630610Z     features: ['task_description', 'seed_question', 'seed_response', 'output'],
2024-12-10T23:40:10.3631105Z     num_rows: 149
2024-12-10T23:40:10.3631335Z })

@bbrowning bbrowning requested a review from a team December 11, 2024 01:18
Copy link
Member

@jaideepr97 jaideepr97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbrowning should we also emit the leaf node that this number corresponds to? so users actually know what the leaf node is and how many sampels were generated for it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ilab data generate does not specify the correct num of samples generated
2 participants