Add v3 knowledge schema support #161

Merged Jul 23, 2024 (9 commits)

@russellb (Member) commented Jul 17, 2024

This PR includes the changes necessary to move to the v3 schema version for
knowledge taxonomy files.

See epic #160 for more information.

b4e8bf1 full: Adjust knowledge prompt for v3 schema
33abe1e simple: Adapt simple knowledge pipeline for v3 schema
94a7a5e utils: Update taxonomy reading code to handle knowledge v3
aaf0283 requirements: use schema version that includes v3
7d56e88 utils: drop an unused import
b228f3f e2e: Temporarily install instructlab from a PR
c83220a generate_data: Account for knowledge format when generating test data
5eac452 Report error if knowledge taxonomy version is < 3

commit b4e8bf1
Author: Russell Bryant [email protected]
Date: Wed Jul 17 17:48:53 2024 -0400

full: Adjust knowledge prompt for v3 schema

For more information on the v3 schema, see this issue:

  https://github.com/instructlab/sdg/issues/160

This change to the prompt does a couple of important things:

- Make use of document-specific context for the provided sample q&a.

- Add the new `document_outline` field which provides a summary of the
  document.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>

commit 33abe1e
Author: Russell Bryant [email protected]
Date: Wed Jul 17 19:19:07 2024 -0400

simple: Adapt simple knowledge pipeline for v3 schema

The v3 knowledge schema includes context specific to the sample
questions and answers. Include that context in the prompt.

The schema also includes a new `document_outline`. Give it prior to
the document chunk in the same way as the `full` pipeline prompt,
`generate_questions_responses.yaml`.

Signed-off-by: Russell Bryant <[email protected]>
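
As a rough illustration of the ordering described in the two prompt commits above, the sketch below assembles a prompt in which the seed example's context accompanies its sample question and answer, and the document outline comes before the document chunk. The template wording, the function name `build_knowledge_prompt`, and every parameter name are hypothetical; the real prompts live in the pipeline's prompt configuration and are not reproduced here.

```python
# Hypothetical template: only the ordering (outline before chunk, context
# alongside the sample Q&A) is taken from the commit messages above.
PROMPT_TEMPLATE = """\
Here is an example question and answer grounded in its own context.

[Example context]
{context}

[Example question]
{question}

[Example answer]
{answer}

Now write a new question and answer for the document below.

[Document outline]
{document_outline}

[Document]
{document_chunk}
"""


def build_knowledge_prompt(
    document_outline: str,
    document_chunk: str,
    context: str,
    question: str,
    answer: str,
) -> str:
    """Fill the template so the outline is given prior to the chunk."""
    return PROMPT_TEMPLATE.format(
        context=context,
        question=question,
        answer=answer,
        document_outline=document_outline,
        document_chunk=document_chunk,
    )
```

Keeping the outline next to the chunk gives the model a summary of the whole document even when it only sees one chunk at a time.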

commit 94a7a5e
Author: Russell Bryant [email protected]
Date: Thu Jul 18 15:30:16 2024 -0400

utils: Update taxonomy reading code to handle knowledge v3

This is part of https://github.com/instructlab/sdg/issues/160

The changes here originated from https://github.com/aakankshaduggal/sdg/commit/5baf6dfde8334fa52a4ffe38e9dc9121dfb468aa

There are two major changes here.

- When parsing a `qna.yaml` file from a taxonomy tree, adjust for the
  new schema for knowledge. There is no attempt to maintain
  compatibility with prior versions of the schema (v1, v2).

- Change how we translate the taxonomy data into the dataset sent into
  the pipeline as input. Instead of implementing a sliding window
  approach of 3 sample qna pairs at a time over all chunks of the
  document, we now create a row per seed_example (context and
  associated qna pairs) for each chunk of knowledge docs.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
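
The second change in this commit, one row per seed example for each document chunk, can be sketched as a nested loop. This is a minimal sketch and not the code from `taxonomy.py`: the function name `build_knowledge_rows`, the output column names, and the assumed input layout (each seed example holding a `context` string and a `questions_and_answers` list) are illustrative assumptions.

```python
def build_knowledge_rows(document_outline, chunks, seed_examples):
    """Return one row per (chunk, seed_example) pair.

    Assumed input layout (hypothetical, for illustration only):
      seed_examples = [
          {"context": "...",
           "questions_and_answers": [{"question": "...", "answer": "..."}]},
      ]
    """
    rows = []
    for chunk in chunks:
        for seed in seed_examples:
            row = {
                "document_outline": document_outline,
                "document": chunk,
                # The context travels with its own Q&A pairs instead of
                # being mixed with pairs from other seed examples.
                "icl_document": seed["context"],
            }
            pairs = seed["questions_and_answers"]
            for i, qna in enumerate(pairs, start=1):
                row[f"icl_query_{i}"] = qna["question"]
                row[f"icl_response_{i}"] = qna["answer"]
            rows.append(row)
    return rows
```

With 2 chunks and 5 seed examples this yields 10 rows, whereas a sliding window over the Q&A pairs would not keep each pair tied to the context it was written for.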

commit aaf0283
Author: Russell Bryant [email protected]
Date: Thu Jul 18 15:37:43 2024 -0400

requirements: use schema version that includes v3

instructlab-schema 0.3.1 includes the v3 schema for knowledge that is
now required.

Fix a typo in a comment in this file while we're at it.

Signed-off-by: Russell Bryant <[email protected]>

commit 7d56e88
Author: Russell Bryant [email protected]
Date: Thu Jul 18 16:15:25 2024 -0400

utils: drop an unused import

Signed-off-by: Russell Bryant <[email protected]>

commit b228f3f
Author: Russell Bryant [email protected]
Date: Thu Jul 18 16:28:11 2024 -0400

e2e: Temporarily install instructlab from a PR

This PR includes the changes to account for schema v3.
This should be removed later.

Signed-off-by: Russell Bryant <[email protected]>

commit c83220a
Author: Russell Bryant [email protected]
Date: Fri Jul 19 14:10:54 2024 -0400

generate_data: Account for knowledge format when generating test data

_gen_test_data() needed fixes to account for the different format
coming from a knowledge doc vs skills.

Signed-off-by: Russell Bryant <[email protected]>

commit 5eac452005ba455e771a82a02729bdf163cc0311
Author: Russell Bryant [email protected]
Date: Mon Jul 22 17:38:09 2024 -0400

Report error if knowledge taxonomy version is < 3

We no longer support versions 1 and 2, so detect and give a clear
error message when this occurs.

Signed-off-by: Russell Bryant <[email protected]>
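
A version check of the kind this commit describes might look like the sketch below. It is an illustration only: the exception class, the function name `check_knowledge_version`, the error wording, and the choice to treat a missing `version` key as version 1 are all assumptions, not the merged implementation.

```python
MIN_KNOWLEDGE_VERSION = 3


class UnsupportedTaxonomyVersion(ValueError):
    """Hypothetical error type for knowledge files older than v3."""


def check_knowledge_version(contents: dict, path: str) -> None:
    """Raise a clear error if a parsed knowledge qna.yaml predates v3."""
    # Assumption: a file without a `version` key is treated as version 1.
    version = int(contents.get("version", 1))
    if version < MIN_KNOWLEDGE_VERSION:
        raise UnsupportedTaxonomyVersion(
            f"{path}: knowledge taxonomy version {version} is not supported; "
            f"version {MIN_KNOWLEDGE_VERSION} or later is required"
        )
```

Failing early with the file path in the message lets a user find the outdated file without reading a parsing traceback.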

@russellb (Member, Author)

More changes are needed. I just wanted to get a draft PR up that we can add stuff to and collaborate on.

@russellb (Member, Author)

This now passes CI, but it's blocked on an instructlab-schema release that includes instructlab/schema#39


E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


e2e workflow failed on this PR: View run, please investigate.

@russellb (Member, Author)

> e2e workflow failed on this PR: View run, please investigate.

This is expected to fail because it's not using my instructlab/instructlab branch.

An equivalent job is running on the instructlab repo with this branch and the instructlab repo changes here: https://github.com/instructlab/instructlab/actions/runs/10042419982/job/27752975101

russellb and others added 7 commits July 22, 2024 17:50
@mergify mergify bot added the ci-failure label Jul 22, 2024
@russellb russellb requested a review from markmc July 22, 2024 22:02
@russellb russellb marked this pull request as ready for review July 22, 2024 22:02
@mergify mergify bot added ci-failure and removed ci-failure labels Jul 22, 2024
russellb and others added 2 commits July 23, 2024 10:25
We no longer support versions 1 and 2, so detect and give a clear
error message when this occurs.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
This is just unnecessarily confusing.

Signed-off-by: Mark McLoughlin <[email protected]>
@derekhiggins (Contributor)

Trying this out, I got the same error CI is now getting:

The conflict is caused by:
instructlab 0.18.0a5.dev17 depends on instructlab-schema 0.1.dev60 (from git+https://github.com/russellb/instructlab-schema.git@v3)
instructlab-sdg 0.1.dev265 depends on instructlab-schema>=0.3.1
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

Once that's resolved it LGTM, although I do wonder about the additional overhead required of users: they now need to manually extract 5 sections of the knowledge, put them into the YAML, and provide 3 sample questions on each. This may be a lot of overhead for small documents, but I guess it's less of a chore for large documents.
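
To see why pip cannot resolve this, compare the version built from the git branch (`0.1.dev60`) with the minimum that instructlab-sdg requires (`0.3.1`). The sketch below is a deliberately simplified comparison and not pip's resolver: it looks only at the leading numeric release segment and ignores the dev, pre-release, and post-release ordering rules of real version specifiers. The helper names are hypothetical.

```python
import re


def release_tuple(version: str) -> tuple:
    """Leading numeric release segment, e.g. '0.1.dev60' -> (0, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def satisfies_minimum(installed: str, minimum: str) -> bool:
    """Simplified '>=' check on release segments only (suffixes ignored)."""
    a, b = release_tuple(installed), release_tuple(minimum)
    # Pad with zeros so that (0, 3) and (0, 3, 0) compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# The two requirements from the error output cannot both hold:
print(satisfies_minimum("0.1.dev60", "0.3.1"))
```

Because the branch build reports a release segment of 0.1, it falls below 0.3.1, so one of the two requirements has to change before installation can succeed.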

@markmc (Contributor) commented Jul 23, 2024

ah, need to change this in instructlab/instructlab#1790:

# TODO: change this once instructlab-schema 0.3.0 is released
#instructlab-schema>=0.3.0
instructlab-schema @ git+https://github.com/russellb/instructlab-schema.git@v3

@mergify mergify bot removed the ci-failure label Jul 23, 2024
@markmc markmc merged commit a1a7599 into instructlab:main Jul 23, 2024
12 checks passed
4 participants