E2e knowledge Pipeline Run #21

abhi1092 · 2024-07-24T17:51:44Z

Fixed edge-case handling in RAFT dataset creation
Fixed a bug in the auxiliary dataset
Fixed SDG init to use save_freq, num_worker and batch size; the values for these are currently hardcoded

This incorporates fixes from aakankshaduggal#21 to how we select expanded contexts during knowledge data generation. It cleans up the logic for choosing which other documents to include in the expanded context in edge cases where there are few unique documents.

Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
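The context-selection fix described above (and in instructlab/sdg#200: take the set of all contexts, sample without replacement, compare text directly rather than by `row_idx`) can be sketched roughly as follows. This is an illustrative sketch only, not the code from the PR; the function name `pick_distractors` and its parameters are hypothetical.

```python
import random


def pick_distractors(all_contexts, golden, k, seed=None):
    """Pick up to k distinct contexts whose text differs from the golden one.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    rng = random.Random(seed)
    # Deduplicate by comparing text directly (not row indices) and drop the
    # golden context. dict.fromkeys keeps first-seen order, so a fixed seed
    # gives repeatable picks.
    candidates = [c for c in dict.fromkeys(all_contexts) if c != golden]
    # Edge case: fewer unique documents than requested. Cap k instead of
    # looping until enough documents are found.
    k = min(k, len(candidates))
    # random.sample draws without replacement, so no context is repeated.
    return rng.sample(candidates, k)
```

Capping `k` at the number of unique candidates is one way to avoid looping forever when the corpus has too few distinct documents; whether the PR resolves the edge case this way is an assumption.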

This adds support for generating auxiliary datasets during knowledge data generation. An auxiliary dataset is one where we ask the model to generate additional data samples with a different prompt than the standard dataset, along with extra instruction prompts that are matched to the auxiliary generated samples and used during training.

Parts of this are extracted and rebased from aakankshaduggal#4 and aakankshaduggal#21.

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Ben Browning <[email protected]>

This adds support for generating auxiliary datasets during knowledge data generation. An auxiliary dataset is one where we ask the model to generate additional data samples with a different prompt than the standard dataset, along with extra instruction prompts that are matched to the auxiliary generated samples and used during training.

The auxiliary instructions are a new part of the pipeline config, as they are tightly coupled to it. In the example below, note that the `spellcheck` value has to match across both the pipeline blocks and the new auxiliary instructions, so we just list both in the same config file.

    version: "1.0"
    blocks:
    ...
    - name: flatten_auxiliary_columns
      type: FlattenColumnsBlock
      config:
        var_cols:
          - spellcheck
          - base_document
        value_name: corrected_document
        var_name: dataset_type
    ...
    datamixing:
      auxiliary_instructions:
        spellcheck:
          - Correct any spelling errors in the document and output the corrected version.
          - Rewrite the document to remove any spelling errors.

Parts of this are extracted and rebased from aakankshaduggal#4 and aakankshaduggal#21.

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
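To make the coupling between `dataset_type` and `auxiliary_instructions` concrete, here is a minimal sketch of how flattened rows might be matched to an instruction. It assumes each flattened row carries a `dataset_type` key (from `var_name`) and a `corrected_document` key (from `value_name`); the function `attach_instructions`, the row shape, and the random choice of instruction are assumptions for illustration, not the SDG implementation.

```python
import random

# Mirrors the `datamixing.auxiliary_instructions` section of the example config.
AUXILIARY_INSTRUCTIONS = {
    "spellcheck": [
        "Correct any spelling errors in the document and output the corrected version.",
        "Rewrite the document to remove any spelling errors.",
    ],
}


def attach_instructions(samples, instructions, seed=None):
    """Pair each auxiliary sample with one instruction for its dataset type.

    Hypothetical helper: the real pipeline may match instructions differently.
    """
    rng = random.Random(seed)
    matched = []
    for sample in samples:
        options = instructions.get(sample["dataset_type"])
        if not options:
            # Rows such as "base_document" have no auxiliary instruction here.
            continue
        matched.append({**sample, "instruction": rng.choice(options)})
    return matched
```

If the key under `auxiliary_instructions` did not match a `var_cols` entry, no rows would be paired in this sketch, which illustrates why the commit keeps both in one config file.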

abhi1092 added 3 commits July 24, 2024 17:10

minor fix, fixing edge cases in raft function, removing infinity loop…

391ba60

…, adding print statements for data validations Signed-off-by: abhi1092 <[email protected]>

logic bug fix in the raft dataset

389d857

Signed-off-by: abhi1092 <[email protected]>

adding new parameter to sdg init, hardcoded the values for them

a06fc68

Signed-off-by: abhi1092 <[email protected]>

abhi1092 requested review from aakankshaduggal and shivchander July 24, 2024 17:53

shivchander mentioned this pull request Jul 24, 2024

data mixing - Fix duplicate context issue by taking set of all context, using sampling without replacement, and comparing text directly instead of row_idx instructlab/sdg#200

Closed

aakankshaduggal approved these changes Jul 24, 2024

aakankshaduggal merged commit f80675e into main Jul 24, 2024
7 of 9 checks passed

bbrowning mentioned this pull request Jul 25, 2024

Add support for auxiliary dataset generation instructlab/sdg#204

Merged

bbrowning mentioned this pull request Jul 25, 2024

Incorporate knowledge generation context selection improvements instructlab/sdg#215

Merged
