E2e knowledge Pipeline Run #21

Merged
merged 3 commits into from
Jul 24, 2024

abhi1092 (Collaborator) commented on Jul 24, 2024

  • Fixed edge-case handling in RAFT dataset creation
  • Fixed a bug in auxiliary dataset generation
  • Fixed SDG init to use save_freq, num_worker, and batch size. The values for these are currently hardcoded

aakankshaduggal merged commit f80675e into main on Jul 24, 2024
7 of 9 checks passed
bbrowning added a commit to bbrowning/instructlab-sdg that referenced this pull request Jul 25, 2024
This incorporates fixes from aakankshaduggal#21 to the way we select
expanded contexts during knowledge data generation. It cleans up the
logic for choosing which other documents to include in the expanded
context in edge cases where there are only a few unique documents.

Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
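
The edge case described in the commit message above (too few unique documents to fill an expanded context) can be illustrated with a minimal sketch. This is not the code from aakankshaduggal#21; the function name `select_context_documents` and its parameters are hypothetical, and the clamping strategy is an assumption about one reasonable way to handle the case.

```python
import random


def select_context_documents(target_doc, all_docs, num_extra, seed=None):
    """Pick up to `num_extra` other documents to accompany `target_doc`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # De-duplicate while preserving order, and never pick the target itself.
    candidates = [d for d in dict.fromkeys(all_docs) if d != target_doc]
    # Edge case: fewer unique documents than requested. Clamp the sample
    # size instead of letting random.sample raise ValueError.
    k = min(num_extra, len(candidates))
    return rng.sample(candidates, k)
```

With only one unique document available, the sketch returns an empty list rather than failing, so the caller can decide whether to fall back to the target document alone.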
bbrowning added a commit to bbrowning/instructlab-sdg that referenced this pull request Jul 26, 2024
This adds support for generating auxiliary datasets during knowledge
data generation. An auxiliary dataset is where we ask the model to
generate some additional data samples with a different prompt than the
standard dataset, along with some extra instruction prompts that will
get matched to the auxiliary generated samples and used during
training.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
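
The matching of instruction prompts to auxiliary samples described in the commit message above might look roughly like the sketch below. It is an illustration under stated assumptions, not the instructlab-sdg implementation: `attach_instruction` is a hypothetical helper, the `dataset_type` and `corrected_document` field names and the `spellcheck` prompts are borrowed from the config example later in this thread, and skipping samples that have no instruction is an assumption.

```python
import random

# Instruction prompts keyed by auxiliary dataset type (taken from the
# config example in this thread).
AUXILIARY_INSTRUCTIONS = {
    "spellcheck": [
        "Correct any spelling errors in the document and output the corrected version.",
        "Rewrite the document to remove any spelling errors.",
    ],
}


def attach_instruction(sample, instructions, rng):
    """Return a copy of `sample` with an instruction for its dataset type.

    Hypothetical helper. Returns None when no instruction is configured
    for the sample's type (an assumption made for this sketch).
    """
    prompts = instructions.get(sample["dataset_type"])
    if not prompts:
        return None
    return {**sample, "instruction": rng.choice(prompts)}


rng = random.Random(0)
samples = [
    {"dataset_type": "spellcheck", "corrected_document": "The quick brown fox."},
    {"dataset_type": "base_document", "corrected_document": "The quik brown fox."},
]
matched = [
    s for s in (attach_instruction(x, AUXILIARY_INSTRUCTIONS, rng) for x in samples)
    if s is not None
]
```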
markmc pushed a commit to bbrowning/instructlab-sdg that referenced this pull request Jul 29, 2024
bbrowning added a commit to bbrowning/instructlab-sdg that referenced this pull request Jul 29, 2024
This adds support for generating auxiliary datasets during knowledge
data generation. An auxiliary dataset is where we ask the model to
generate some additional data samples with a different prompt than the
standard dataset, along with some extra instruction prompts that will
get matched to the auxiliary generated samples and used during
training.

The auxiliary instructions are a new part of the pipeline config
because they are tightly coupled to it. In the example below, note that
the `spellcheck` value has to match between the block definition in the
pipeline config and the new auxiliary instructions, so we just list
both in the same config file.

version: "1.0"
blocks:
...
  - name: flatten_auxiliary_columns
    type: FlattenColumnsBlock
    config:
      var_cols:
        - spellcheck
        - base_document
      value_name: corrected_document
      var_name: dataset_type
...
datamixing:
  auxiliary_instructions:
    spellcheck:
      - Correct any spelling errors in the document and output the corrected version.
      - Rewrite the document to remove any spelling errors.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
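
The coupling between `var_cols` and `auxiliary_instructions` in the config above can be sketched in plain Python. This is a minimal illustration, not the actual `FlattenColumnsBlock` code: the flattening semantics are inferred from the config key names (`var_cols`, `value_name`, `var_name`), and `flatten_columns`, `check_auxiliary_instructions`, and the sample row are hypothetical.

```python
def flatten_columns(rows, var_cols, value_name, var_name):
    """Turn each row's `var_cols` into separate rows (a "melt").

    Assumed semantics: one output row per (input row, var_col) pair, with
    the column name stored under `var_name` and its value under `value_name`.
    """
    flattened = []
    for row in rows:
        # Columns that are not being flattened are carried over unchanged.
        shared = {k: v for k, v in row.items() if k not in var_cols}
        for col in var_cols:
            flattened.append({**shared, var_name: col, value_name: row[col]})
    return flattened


def check_auxiliary_instructions(var_cols, auxiliary_instructions):
    """Raise if an instruction key does not name a flattened column."""
    unknown = sorted(set(auxiliary_instructions) - set(var_cols))
    if unknown:
        raise ValueError(f"auxiliary_instructions keys not in var_cols: {unknown}")


block_config = {
    "var_cols": ["spellcheck", "base_document"],
    "value_name": "corrected_document",
    "var_name": "dataset_type",
}
auxiliary_instructions = {
    "spellcheck": [
        "Correct any spelling errors in the document and output the corrected version.",
        "Rewrite the document to remove any spelling errors.",
    ],
}

# Fails fast if the instruction keys drift away from the block config.
check_auxiliary_instructions(block_config["var_cols"], auxiliary_instructions)

rows = [{"document": "doc-1", "spellcheck": "fixed text", "base_document": "orignal text"}]
flat = flatten_columns(rows, **block_config)
```

Each flattened row carries a `dataset_type` of either `spellcheck` or `base_document`, which is the value an instruction key would need to match.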