-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dataset formatting for pipeline differences #57
Conversation
When generating samples for a knowledge pipeline, we have to chunk the document down to a size that will fit within the model's context size. There was a hack in place that only used a single chunk. The code now iterates over all chunks of the document for creating samples to send through the pipeline. The commit also separates the code for knowledge and skills since the differences between the formats is growing. Closes instructlab#52 Signed-off-by: Russell Bryant <[email protected]>
PR instructlab#50 changed the format used in the full knowledge pipeline. Change the simple pipelines to match. Part of issue instructlab#55. Signed-off-by: Russell Bryant <[email protected]>
The full skills pipelines expect a single seed question and response in each sample in the dataset. Change the simple skills pipelines to match and update the code to generate the samples in the expected format. Closes instructlab#55 (the short term needs at least) Signed-off-by: Russell Bryant <[email protected]>
Looks good to me as a short term fix It does all seem arbitrarily divergent from the taxonomy format though (as you say in #52) - e.g. icl vs seed, query/response vs question/answer Since the taxonomy format is basically |
Totally agree. I was thinking of splitting that part out of #55 into a new issue to handle as another PR. |
I filed #59 to follow-up on naming and structure of the data |
samples = [{}] | ||
|
||
# document is the same for the whole leaf node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a bit of clarification here --
Are we expecting just one document at a time? Because in a leaf node, we could have multiple documents as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunk_document()
handles multiple documents. take a look here:
sdg/src/instructlab/sdg/utils/chunking.py
Lines 46 to 48 in 1f71fb6
for docs in documents: | |
temp = text_splitter.create_documents([docs]) | |
content.extend([item.page_content for item in temp]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aakankshaduggal i'm going too hold off merging until you confirm this makes sense to you -- code may not be super clear, but I do think it's handling multiple documents properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay thanks for clarifying @russellb!
Makes sense. 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @russellb
Just had a question regarding the document input. Otherwise 🚢
This PR
dataset of seed examples.
Closes #52
Closes #55
8112123 Re-introduce document chunking for knowledge
15ae2b9 Change question/response to icl_query/icl_response
bba13d3 Create a sample per seed example for skills
commit 8112123
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:02:12 2024 -0400
commit 15ae2b9
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:08:39 2024 -0400
commit bba13d3ae7eb96846ef8aa830a787d49aa693b55
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:25:25 2024 -0400