Add v3 knowledge schema support #161
More changes are needed. I just wanted to get a draft PR up that we can add stuff to and collaborate on.
This now passes CI, but it's blocked on an instructlab-schema release that includes instructlab/schema#39
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
This is expected to fail because it's not using my An equivalent job is running on the
For more information on the v3 schema, see this issue: instructlab#160

This change to the prompt does a couple of important things:

- Make use of document-specific context for the provided sample q&a.
- Add the new `document_outline` field which provides a summary of the document.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
The v3 knowledge schema includes context specific to the sample questions and answers. Include that context in the prompt.

The schema also includes a new `document_outline`. Give it prior to the document chunk in the same way as the `full` pipeline prompt, `generate_questions_responses.yaml`.

Signed-off-by: Russell Bryant <[email protected]>
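As a rough illustration of the ordering described above (sample context and q&a first, then `document_outline`, then the document chunk), here is a minimal sketch. The template text, the `render_prompt` helper, and the row keys are all hypothetical; the real prompts live in the pipeline's YAML configs and are not reproduced here.

```python
# Hypothetical template, for illustration only. It is NOT the text of the
# actual prompt files in this PR.
PROMPT_TEMPLATE = (
    "Example document context:\n{icl_document}\n\n"
    "Example question: {icl_query}\n"
    "Example answer: {icl_response}\n\n"
    "Document outline: {document_outline}\n\n"
    "Document:\n{document}\n"
)


def render_prompt(row: dict) -> str:
    """Fill the template from one dataset row (keys are assumed names)."""
    return PROMPT_TEMPLATE.format(**row)


example_row = {
    "icl_document": "Context paragraph that the sample q&a refers to.",
    "icl_query": "What does the context paragraph describe?",
    "icl_response": "It describes the sample topic.",
    "document_outline": "A short summary of the whole document.",
    "document": "One chunk of the knowledge document.",
}
prompt = render_prompt(example_row)
```

The point of the sketch is only the ordering: the outline is rendered before the chunk, so the model sees the summary of the whole document before the excerpt it is asked about.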
This is part of instructlab#160

The changes here originated from aakankshaduggal@5baf6df

There are two major changes here.

- When parsing a `qna.yaml` file from a taxonomy tree, adjust for the new schema for knowledge. There is no attempt to maintain compatibility with prior versions of the schema (v1, v2).
- Change how we translate the taxonomy data into the dataset sent into the pipeline as input. Instead of implementing a sliding window approach of 3 sample qna pairs at a time over all chunks of the document, we now create a row per seed_example (context and associated qna pairs) for each chunk of knowledge docs.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
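The second change above (one row per seed_example for each document chunk) can be sketched roughly as follows. This is not the code from the commit: the `build_rows` helper and the output column names (`icl_document`, `icl_query_N`, `icl_response_N`) are assumptions made for illustration, and the input layout (`context` plus `questions_and_answers`) follows my understanding of the v3 knowledge schema.

```python
def build_rows(seed_examples, chunks, document_outline):
    """Create one input row per (chunk, seed_example) pair.

    seed_examples: list of dicts with a "context" string and a
        "questions_and_answers" list of {"question", "answer"} dicts.
    chunks: list of text chunks from the knowledge document.
    Column names in the returned rows are hypothetical.
    """
    rows = []
    for chunk in chunks:
        for example in seed_examples:
            row = {
                "document": chunk,
                "document_outline": document_outline,
                "icl_document": example["context"],
            }
            # Each seed_example carries its own q&a pairs alongside its context.
            for i, qna in enumerate(example["questions_and_answers"], start=1):
                row[f"icl_query_{i}"] = qna["question"]
                row[f"icl_response_{i}"] = qna["answer"]
            rows.append(row)
    return rows


seed_examples = [
    {
        "context": "Context A",
        "questions_and_answers": [
            {"question": "QA1?", "answer": "AA1"},
            {"question": "QA2?", "answer": "AA2"},
        ],
    },
    {
        "context": "Context B",
        "questions_and_answers": [
            {"question": "QB1?", "answer": "AB1"},
        ],
    },
]
rows = build_rows(seed_examples, ["chunk 1", "chunk 2"], "Outline")
```

With 2 chunks and 2 seed examples this yields 4 rows, each pairing one chunk with one context and that context's own q&a pairs, rather than sliding a window of 3 pairs across all chunks.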
instructlab-schema 0.3.1 includes the v3 schema for knowledge that is now required.

Fix a typo in a comment in this file while we're at it.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
The instructlab PR installed here includes the changes to account for schema v3. This temporary install should be removed later.

Signed-off-by: Russell Bryant <[email protected]>
_gen_test_data() needed fixes to account for the different format coming from a knowledge doc vs skills.

Signed-off-by: Russell Bryant <[email protected]>
We no longer support versions 1 and 2, so detect and give a clear error message when this occurs.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
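A check of this kind might look like the sketch below. The function name, exception class, and error wording are hypothetical and not taken from the commit; treating a missing `version` key as version 1 is also an assumption.

```python
MIN_KNOWLEDGE_VERSION = 3  # v1 and v2 knowledge files are rejected


class UnsupportedKnowledgeVersion(ValueError):
    """Raised when a knowledge qna.yaml uses a schema version below 3."""


def check_knowledge_version(contents: dict, path: str) -> int:
    """Return the schema version, or raise with a clear message if too old.

    Assumption: a file with no "version" key is treated as version 1.
    """
    version = contents.get("version", 1)
    if not isinstance(version, int) or version < MIN_KNOWLEDGE_VERSION:
        raise UnsupportedKnowledgeVersion(
            f"{path}: knowledge taxonomy files must use schema version "
            f"{MIN_KNOWLEDGE_VERSION} or later, found version {version!r}"
        )
    return version
```

Failing early with the file path in the message lets a user see which `qna.yaml` needs to be migrated, instead of hitting a less obvious parsing error further into the pipeline.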
This is just unnecessarily confusing.

Signed-off-by: Mark McLoughlin <[email protected]>
trying this out I got the same error CI is now getting
Once that's resolved it lgtm, although I do wonder about the additional overhead required of users: they now need to manually extract 5 sections of the knowledge, put them into the yaml, and provide 3 sample questions on each. This may be a lot of overhead for small documents, but I guess it's less of a chore for large documents.
ah, need to change this in instructlab/instructlab#1790:
This PR includes the changes necessary to move to the v3 schema version for
knowledge taxonomy files.
See epic #160 for more information.
b4e8bf1 full: Adjust knowledge prompt for v3 schema
33abe1e simple: Adapt simple knowledge pipeline for v3 schema
94a7a5e utils: Update taxonomy reading code to handle knowledge v3
aaf0283 requirements: use schema version that includes v3
7d56e88 utils: drop an unused import
b228f3f e2e: Temporarily install instructlab from a PR
c83220a generate_data: Account for knowledge format when generating test data
5eac452 Report error if knowledge taxonomy version is < 3
commit b4e8bf1
Author: Russell Bryant [email protected]
Date: Wed Jul 17 17:48:53 2024 -0400
commit 33abe1e
Author: Russell Bryant [email protected]
Date: Wed Jul 17 19:19:07 2024 -0400
commit 94a7a5e
Author: Russell Bryant [email protected]
Date: Thu Jul 18 15:30:16 2024 -0400
commit aaf0283
Author: Russell Bryant [email protected]
Date: Thu Jul 18 15:37:43 2024 -0400
commit 7d56e88
Author: Russell Bryant [email protected]
Date: Thu Jul 18 16:15:25 2024 -0400
commit b228f3f
Author: Russell Bryant [email protected]
Date: Thu Jul 18 16:28:11 2024 -0400
commit c83220a
Author: Russell Bryant [email protected]
Date: Fri Jul 19 14:10:54 2024 -0400
commit 5eac452005ba455e771a82a02729bdf163cc0311
Author: Russell Bryant [email protected]
Date: Mon Jul 22 17:38:09 2024 -0400