Fix dataset formatting for pipeline differences #57

Merged
russellb merged 3 commits into instructlab:main on Jul 1, 2024

Conversation

russellb
Member

This PR

  • re-introduces document chunk handling for knowledge docs
  • fixes a mismatch between pipeline expectations and the formatting of the
    dataset of seed examples.

Closes #52
Closes #55

8112123 Re-introduce document chunking for knowledge
15ae2b9 Change question/response to icl_query/icl_response
bba13d3 Create a sample per seed example for skills

commit 8112123
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:02:12 2024 -0400

Re-introduce document chunking for knowledge

When generating samples for a knowledge pipeline, we have to chunk the
document down to a size that will fit within the model's context size.
There was a hack in place that only used a single chunk. The code now
iterates over all chunks of the document for creating samples to send
through the pipeline.

The commit also separates the code for knowledge and skills since the
differences between the formats are growing.

Closes #52

Signed-off-by: Russell Bryant <[email protected]>
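
The per-chunk iteration described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helpers `chunk_words` and `build_knowledge_samples`, the word-count budget (a crude stand-in for a token limit), and the `document` and `domain` keys are all assumptions made for the example.

```python
def chunk_words(text, max_words):
    """Split text into pieces of at most max_words whitespace-separated words.

    A word count is only a rough proxy for a model's token budget.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_knowledge_samples(document, seed_fields, max_words=100):
    """Create one sample per chunk instead of keeping only the first chunk."""
    samples = []
    for chunk in chunk_words(document, max_words):
        sample = dict(seed_fields)  # copy so samples do not share state
        sample["document"] = chunk  # hypothetical key name
        samples.append(sample)
    return samples


doc = "word " * 250
samples = build_knowledge_samples(doc, {"domain": "example"}, max_words=100)
print(len(samples))  # 3 chunks: 100 + 100 + 50 words
```

The point of the sketch is only the loop: every chunk yields its own sample, so no part of a long document is silently dropped.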

commit 15ae2b9
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:08:39 2024 -0400

Change question/response to icl_query/icl_response

PR #50 changed the format used in the full knowledge pipeline. Change
the simple pipelines to match.

Part of issue #55.

Signed-off-by: Russell Bryant <[email protected]>
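
As a rough illustration of the rename, the sketch below maps the old keys onto the new ones for a single sample. Only the key names `question`, `response`, `icl_query`, and `icl_response` come from the commit title; the helper `rename_keys`, the extra `task_description` field, and the sample contents are made up for the example, and the real pipelines may apply the rename in configuration rather than in code like this.

```python
# Old key -> new key, as named in the commit title.
RENAMES = {"question": "icl_query", "response": "icl_response"}


def rename_keys(sample):
    """Return a copy of sample with the old keys replaced by the new ones."""
    return {RENAMES.get(key, key): value for key, value in sample.items()}


old = {"question": "What is 2 + 2?", "response": "4", "task_description": "math"}
new = rename_keys(old)
print(sorted(new))  # ['icl_query', 'icl_response', 'task_description']
```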

commit bba13d3ae7eb96846ef8aa830a787d49aa693b55
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:25:25 2024 -0400

Create a sample per seed example for skills

The full skills pipelines expect a single seed question and response
in each sample in the dataset. Change the simple skills pipelines to
match and update the code to generate the samples in the expected
format.

Closes #55 (the short term needs at least)

Signed-off-by: Russell Bryant <[email protected]>
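
A minimal sketch of the "one sample per seed example" shape, under stated assumptions: the function name `build_skill_samples` and the output keys `task_description`, `seed_question`, and `seed_response` are hypothetical, and the input follows the `question`/`answer` layout of taxonomy seed examples mentioned later in this conversation.

```python
def build_skill_samples(seed_examples, task_description):
    """Emit one sample per seed example rather than one sample holding them all."""
    samples = []
    for example in seed_examples:
        samples.append(
            {
                "task_description": task_description,  # shared by every sample
                "seed_question": example["question"],  # hypothetical key name
                "seed_response": example["answer"],  # hypothetical key name
            }
        )
    return samples


seeds = [
    {"question": "Write a haiku about rain.", "answer": "Soft rain on the roof"},
    {"question": "Write a haiku about snow.", "answer": "Snow hushes the street"},
]
skill_samples = build_skill_samples(seeds, "Writing haiku")
print(len(skill_samples))  # 2, one per seed example
```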

russellb added 3 commits June 30, 2024 14:27
@markmc
Contributor

markmc commented Jul 1, 2024

Looks good to me as a short term fix

It does all seem arbitrarily divergent from the taxonomy format though (as you say in #52) - e.g. icl vs seed, query/response vs question/answer

Since the taxonomy format is basically seed_examples: [context, question, answer] I think settling on seed_{context,question,answer}(_N) for both knowledge and skills would be a nice cleanup, even in the short-term
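
One possible reading of the proposed naming, sketched in Python. This is an interpretation, not something the PR implements: the helper `flatten_seed_examples`, the 1-based numbering, and treating `(_N)` as a numeric suffix per seed example are all assumptions.

```python
def flatten_seed_examples(seed_examples):
    """Flatten taxonomy-style seed examples into seed_<field>_<N> keys."""
    flat = {}
    for n, example in enumerate(seed_examples, start=1):
        for field in ("context", "question", "answer"):
            if field in example:  # context is not always present
                flat[f"seed_{field}_{n}"] = example[field]
    return flat


flat = flatten_seed_examples(
    [
        {"context": "c1", "question": "q1", "answer": "a1"},
        {"question": "q2", "answer": "a2"},
    ]
)
print(sorted(flat))
```

The same key scheme would then apply to both knowledge and skills samples, which is the cleanup the comment argues for.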

@russellb
Member Author

russellb commented Jul 1, 2024

> Looks good to me as a short term fix
>
> It does all seem arbitrarily divergent from the taxonomy format though (as you say in #52) - e.g. icl vs seed, query/response vs question/answer
>
> Since the taxonomy format is basically seed_examples: [context, question, answer] I think settling on seed_{context,question,answer}(_N) for both knowledge and skills would be a nice cleanup, even in the short-term

Totally agree. I was thinking of splitting that part out of #55 into a new issue to handle as another PR.

@russellb
Member Author

russellb commented Jul 1, 2024

I filed #59 to follow up on the naming and structure of the data

@russellb russellb mentioned this pull request Jul 1, 2024
samples = [{}]

# document is the same for the whole leaf node
Member

Need a bit of clarification here --
Are we expecting just one document at a time? Because in a leaf node, we could have multiple documents as well.

Member Author

chunk_document() handles multiple documents. take a look here:

    for docs in documents:
        temp = text_splitter.create_documents([docs])
        content.extend([item.page_content for item in temp])
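
To see why that loop copes with several documents, here is a small runnable sketch. `FixedWidthSplitter` and `Chunk` are hypothetical stand-ins written for this example (they are not the splitter used by the project); they only mimic the two things the loop relies on, a `create_documents()` method and a `page_content` attribute.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page_content: str


class FixedWidthSplitter:
    """Hypothetical stand-in splitter: cuts each text every `width` characters."""

    def __init__(self, width):
        self.width = width

    def create_documents(self, texts):
        return [
            Chunk(text[i : i + self.width])
            for text in texts
            for i in range(0, len(text), self.width)
        ]


text_splitter = FixedWidthSplitter(width=4)
documents = ["aaaabbbb", "ccccdd"]  # two documents from one leaf node

content = []
for docs in documents:
    temp = text_splitter.create_documents([docs])
    content.extend([item.page_content for item in temp])

print(content)  # ['aaaa', 'bbbb', 'cccc', 'dd']
```

Because `content.extend()` appends each document's chunks to the same list, the result is one flat list of chunks covering every document in the leaf node.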

Member Author

@aakankshaduggal I'm going to hold off merging until you confirm this makes sense to you -- the code may not be super clear, but I do think it's handling multiple documents properly

Member

Okay thanks for clarifying @russellb!
Makes sense. 💯

Member

@aakankshaduggal aakankshaduggal left a comment

Looks good @russellb
Just had a question regarding the document input. Otherwise 🚢

@russellb russellb merged commit 45ecc73 into instructlab:main Jul 1, 2024
11 checks passed
@russellb russellb added this to the 0.1.0 milestone Jul 8, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024