Fix dataset formatting for pipeline differences #57

russellb · 2024-06-30T18:31:33Z

This PR:

- re-introduces document chunk handling for knowledge docs
- fixes a mismatch between pipeline expectations and the formatting of the
  dataset of seed examples.

Closes #52
Closes #55

8112123 Re-introduce document chunking for knowledge
15ae2b9 Change question/response to icl_query/icl_response
bba13d3 Create a sample per seed example for skills

commit 8112123
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:02:12 2024 -0400

Re-introduce document chunking for knowledge

When generating samples for a knowledge pipeline, we have to chunk the
document down to a size that will fit within the model's context size.
There was a hack in place that only used a single chunk. The code now
iterates over all chunks of the document for creating samples to send
through the pipeline.

The commit also separates the code for knowledge and skills since the
differences between the formats are growing.

Closes #52

Signed-off-by: Russell Bryant <[email protected]>
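
To illustrate the change this commit describes, here is a minimal sketch, not the actual code in the repository, that builds one sample per chunk instead of using only the first chunk. The names `chunk_text` and `build_knowledge_samples` and the `document` and `domain` fields are hypothetical, and a naive word-count splitter stands in for the real chunker.

```python
from typing import Dict, List


def chunk_text(text: str, max_words: int) -> List[str]:
    """Naive stand-in chunker: pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_knowledge_samples(
    document: str, domain: str, max_words: int = 50
) -> List[Dict[str, str]]:
    """Create one sample per chunk, rather than a single sample from the first chunk."""
    return [
        {"document": chunk, "domain": domain}
        for chunk in chunk_text(document, max_words)
    ]


# A five-word document with a two-word limit yields three samples.
samples = build_knowledge_samples("one two three four five", "demo", max_words=2)
```

Every chunk becomes its own sample, so no part of a long document is silently dropped before it reaches the pipeline.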

commit 15ae2b9
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:08:39 2024 -0400

Change question/response to icl_query/icl_response

PR #50 changed the format used in the full knowledge pipeline. Change
the simple pipelines to match.

Part of issue #55.

Signed-off-by: Russell Bryant <[email protected]>
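
The rename in this commit could be sketched as a simple key mapping over each sample. This is an illustration only: `rename_fields` and the `task_description` field are hypothetical, and only the old and new key names come from the commit title.

```python
from typing import Dict

# Old key -> new key, as named in the commit title.
RENAMES = {"question": "icl_query", "response": "icl_response"}


def rename_fields(sample: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of sample with the old keys renamed; other keys pass through."""
    return {RENAMES.get(key, key): value for key, value in sample.items()}


old_sample = {
    "task_description": "demo",  # hypothetical extra field
    "question": "What is 2 + 2?",
    "response": "4",
}
new_sample = rename_fields(old_sample)
```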

commit bba13d3ae7eb96846ef8aa830a787d49aa693b55
Author: Russell Bryant [email protected]
Date: Sun Jun 30 14:25:25 2024 -0400

Create a sample per seed example for skills

The full skills pipelines expect a single seed question and response
in each sample in the dataset. Change the simple skills pipelines to
match and update the code to generate the samples in the expected
format.

Closes #55 (the short term needs at least)

Signed-off-by: Russell Bryant <[email protected]>
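
As a rough sketch of "one sample per seed example" (an assumption about the shape, not the code from this PR), each seed example below becomes its own sample with a single question and response. The function name `skill_samples` and the output keys `task_description`, `seed_question`, and `seed_response` are hypothetical; the input keys `question` and `answer` follow the taxonomy format mentioned later in this thread.

```python
from typing import Dict, List


def skill_samples(
    seed_examples: List[Dict[str, str]], task_description: str
) -> List[Dict[str, str]]:
    """Emit one sample per seed example, each with a single question and response."""
    return [
        {
            "task_description": task_description,
            "seed_question": example["question"],
            "seed_response": example["answer"],
        }
        for example in seed_examples
    ]


seeds = [
    {"question": "Q1", "answer": "A1"},
    {"question": "Q2", "answer": "A2"},
]
samples = skill_samples(seeds, "demo task")
```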


markmc · 2024-07-01T09:56:14Z

Looks good to me as a short term fix

It does all seem arbitrarily divergent from the taxonomy format though (as you say in #52) - e.g. icl vs seed, query/response vs question/answer

Since the taxonomy format is basically seed_examples: [context, question, answer] I think settling on seed_{context,question,answer}(_N) for both knowledge and skills would be a nice cleanup, even in the short-term
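
One possible reading of the `seed_{context,question,answer}(_N)` naming, sketched under assumptions: the helper `flatten_seed_examples` is hypothetical, and numbering from 1 is a guess, since the suggestion does not say where `N` starts.

```python
from typing import Dict, List

FIELDS = ("context", "question", "answer")


def flatten_seed_examples(seed_examples: List[Dict[str, str]]) -> Dict[str, str]:
    """Flatten seed examples into seed_<field>_<N> keys, skipping absent fields."""
    flat: Dict[str, str] = {}
    for n, example in enumerate(seed_examples, start=1):  # start=1 is an assumption
        for field in FIELDS:
            if field in example:
                flat[f"seed_{field}_{n}"] = example[field]
    return flat


flat = flatten_seed_examples(
    [
        {"context": "c1", "question": "q1", "answer": "a1"},
        {"question": "q2", "answer": "a2"},  # skills may have no context
    ]
)
```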

russellb · 2024-07-01T12:42:58Z

> Looks good to me as a short term fix
>
> It does all seem arbitrarily divergent from the taxonomy format though (as you say in #52) - e.g. icl vs seed, query/response vs question/answer
>
> Since the taxonomy format is basically seed_examples: [context, question, answer] I think settling on seed_{context,question,answer}(_N) for both knowledge and skills would be a nice cleanup, even in the short-term

Totally agree. I was thinking of splitting that part out of #55 into a new issue to handle as another PR.

russellb · 2024-07-01T12:54:03Z

I filed #59 to follow-up on naming and structure of the data

aakankshaduggal · 2024-07-01T14:16:21Z

src/instructlab/sdg/utils/taxonomy.py

    samples = [{}]

+    # document is the same for the whole leaf node


Need a bit of clarification here --
Are we expecting just one document at a time? Because in a leaf node, we could have multiple documents as well.

chunk_document() handles multiple documents. Take a look here:

sdg/src/instructlab/sdg/utils/chunking.py

Lines 46 to 48 in 1f71fb6

    for docs in documents:
        temp = text_splitter.create_documents([docs])
        content.extend([item.page_content for item in temp])

@aakankshaduggal I'm going to hold off merging until you confirm this makes sense to you -- the code may not be super clear, but I do think it's handling multiple documents properly

Okay thanks for clarifying @russellb!
Makes sense. 💯
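
The multi-document behaviour discussed in this thread can be sketched without the real splitter. This is a simplified stand-in, not the code in chunking.py: `split_text` and `chunk_documents` are hypothetical names, and fixed-size character windows replace the `text_splitter` used in the referenced lines.

```python
from typing import List


def split_text(text: str, chunk_size: int) -> List[str]:
    """Stand-in for a real text splitter: fixed-size character windows."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_documents(documents: List[str], chunk_size: int) -> List[str]:
    """Chunk every document in turn and collect the pieces in one flat list."""
    content: List[str] = []
    for doc in documents:
        content.extend(split_text(doc, chunk_size))
    return content


# Two documents in one leaf node: chunks from both end up in the same list.
chunks = chunk_documents(["abcdef", "xyz"], chunk_size=4)
```

Because the loop extends a single list, a leaf node with several documents yields chunks from all of them, and no chunk spans a document boundary.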

aakankshaduggal

Looks good @russellb
Just had a question regarding the document input. Otherwise 🚢

russellb added 3 commits June 30, 2024 14:27


russellb force-pushed the chunking branch from bba13d3 to e606811 Compare June 30, 2024 19:36

russellb mentioned this pull request Jul 1, 2024

updates to grounded flow #53

Merged

aakankshaduggal reviewed Jul 1, 2024


aakankshaduggal approved these changes Jul 1, 2024


aakankshaduggal requested a review from oindrillac July 1, 2024 15:00

russellb merged commit 45ecc73 into instructlab:main Jul 1, 2024
11 checks passed

russellb added this to the 0.1.0 milestone Jul 8, 2024

jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024

Merge pull request instructlab#57 from mrutkows/domain-candidates

12e6398
