
Add a YAML based file format for pipelines #86

Merged (28 commits) into instructlab:main on Jul 13, 2024

Conversation

@markmc (Contributor) commented Jul 5, 2024

See instructlab/dev-docs#109

In order to support custom pipelines, add a YAML-based file format.

To make the default pipelines easier to reason about and develop, convert them to the YAML file format as well.

This changes the top-level API from:

```python
mmlu_block_configs = MMLUBenchFlow().get_flow()
knowledge_block_configs = SynthKnowledgeFlow().get_flow()
knowledge_pipe = Pipeline(ctx, mmlu_block_configs + knowledge_block_configs)
```

to:

```python
knowledge_pipe = Pipeline.from_flows(
    ctx, [pipeline.MMLU_BENCH_FLOW, pipeline.SYNTH_KNOWLEDGE_FLOW]
)
```
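
For a sense of what the format looks like, here is a minimal sketch of a pipeline YAML assembled from snippets that appear later in this thread - treat the exact schema as illustrative rather than authoritative:

```yaml
version: "1.0"
block_configs:
  - block_type: LLMBlock
    block_config:
      block_name: gen_questions
      config_path: configs/skills/freeform_questions.yaml
      add_num_samples: True
      output_cols:
        - question
    drop_duplicates:
      - question
```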

mergify bot (Contributor) commented Jul 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @markmc please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb (Member) commented Jul 7, 2024

I looked at the rebase. It's mostly because of #82, but a little bit from #78.

I pushed a rebase here - https://github.com/russellb/sdg/tree/pr-86-rebase - I'm not pushing to the PR branch since I haven't reviewed the code yet to know that I got it right.

@russellb (Member) left a comment

This is looking really nice!

I have one question about the PipelineContext. This moves the teacher model_id to the pipeline context, though one feature we want to support is having a different model ID on a per-LLMBlock basis. This indeed is not needed by any of our current flows in the tree, but we have a need for it in a downstream pipeline. We want to support having vllm running a model, as well as a model + an adapter. Those two cases will be served up under different model IDs.

The simplest option seems to be to push that field back to be per-LLMBlock. Let me know if I've missed another way to accomplish this.

The other big thing is that I'm eager to run this in a CI job that can test the full pipeline. I've been working on that a lot in the last week. I think I'm close.

@markmc markmc force-pushed the flow-config-file-format branch from 43b7b2b to 68a714d Compare July 8, 2024 10:37
@mergify mergify bot removed the needs-rebase label Jul 8, 2024
@markmc (Contributor, Author) commented Jul 8, 2024

> I looked at the rebase. It's mostly because of #82, but a little bit from #78.
>
> I pushed a rebase here - https://github.com/russellb/sdg/tree/pr-86-rebase - I'm not pushing to the PR branch since I haven't reviewed the code yet to know that I got it right.

Thanks, I took that and rebased again.

Commit 5c46881 is significant - it relates to #82.

@markmc markmc force-pushed the flow-config-file-format branch from 68a714d to 3d94e9b Compare July 8, 2024 10:39
@mergify mergify bot added the ci-failure label Jul 8, 2024
@markmc markmc force-pushed the flow-config-file-format branch from 3d94e9b to 0d6b125 Compare July 8, 2024 13:27
@mergify mergify bot added the testing label (Relates to testing) and removed the ci-failure label Jul 8, 2024
@oindrillac (Contributor) left a comment

I like the general direction, but some of the designs being changed here (YAML vs. Python) were design proposals we previously agreed upon, or adapted our designs toward from an earlier format (see #98).

At this point, I would prioritize core functionality changes over major changes to the library design, especially since this blocks and delays the required features. I suggest going through a dev-docs proposal for such changes.

@markmc (Contributor, Author) commented Jul 8, 2024

> I like the general direction, but some of the designs being changed here (YAML vs. Python) were design proposals we previously agreed upon, or adapted our designs toward from an earlier format (see #98).
>
> At this point, I would prioritize core functionality changes over major changes to the library design, especially since this blocks and delays the required features. I suggest going through a dev-docs proposal for such changes.

Yes, see instructlab/dev-docs#109 for that design discussion.

My understanding is that custom flows are a "required feature" - see the "Alternative Approaches" section in the dev-doc for some of the thinking on that.

See also instructlab/instructlab#1546.

@markmc (Contributor, Author) commented Jul 8, 2024

> I have one question about the PipelineContext. This moves the teacher model_id to the pipeline context, though one feature we want to support is having a different model ID on a per-LLMBlock basis. This indeed is not needed by any of our current flows in the tree, but we have a need for it in a downstream pipeline. We want to support having vllm running a model, as well as a model + an adapter. Those two cases will be served up under different model IDs.

It's important to consider that we have two personas - the author of a pipeline config, and the user executing the pipeline (e.g. with `ilab data generate`).

Using the model name/ID in the pipeline config is a problem because the user can choose a different model, so we can't require the pipeline author to have knowledge of it.

I'm trying to avoid going down a path where the pipeline author has to do some templating to take user-defined runtime parameters into account.

I think the use case here is likely to be:

  1. Most LLMBlock definitions will use the default teacher model - we can make the semantics such that if the pipeline author doesn't specify a model, the default is used, and we get the default model name from the PipelineContext.
  2. In cases where a model+adapter is served, it will hopefully be the same pipeline author defining both its serving configuration and its use in an LLMBlock. So the pipeline author can name it in the serving config and reference it by that name in the pipeline config.

We can add support for (2) when we add model serving config support.

Does that make sense?

> The simplest option seems to be to push that field back to be per-LLMBlock. Let me know if I've missed another way to accomplish this.

Thank you - I've added a comment in the dev-doc as a todo to capture the above detail.
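
As a minimal sketch of the default-model semantics in (1) above - the helper name is hypothetical, though `ctx.model_id` does appear in the diff later in this PR:

```python
def resolve_model_id(block_config: dict, ctx) -> str:
    # Hypothetical helper: if the pipeline author omitted "model" in the
    # block config, fall back to the default teacher model carried on
    # the PipelineContext.
    return block_config.get("model", ctx.model_id)
```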

@russellb (Member) commented Jul 8, 2024

@markmc thanks, that mostly makes sense, but one question:

> We can add support for (2) when we add model serving config support

But how about someone working on a downstream pipeline already making use of different model IDs per block? I don't want to break them. Do you mean we'd implement this before merging?

mergify bot (Contributor) commented Jul 8, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @markmc please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 8, 2024
@markmc (Contributor, Author) commented Jul 8, 2024

> @markmc thanks, that mostly makes sense, but one question:

I may be totally missing something!

> > We can add support for (2) when we add model serving config support
>
> But how about someone working on a downstream pipeline already making use of different model IDs per block? I don't want to break them. Do you mean we'd implement this before merging?

I guess I'd like to know how those non-default model IDs are currently defined ... oh, is this assuming someone running vLLM manually?

In that case, I think the pipeline author needs to be able to define required models, and the user needs some way to specify a mapping of model IDs to those required models.

Something like:

```yaml
version: "1.0"
models:
  - name: myfunkyadaptor
    description: a funky adaptor for generating questions
block_configs:
  - block_type: LLMBlock
    block_config:
      block_name: gen_questions
      config_path: configs/skills/freeform_questions.yaml
      add_num_samples: True
      model: myfunkyadaptor
      output_cols:
        - question
    drop_duplicates:
      - question
```

and then something like:

```
$ ilab data generate --model=myfunkyadaptor:model1
```

Or just require the user to ensure the model is called myfunkyadaptor.

The crucial point IMO is ... the pipeline author has to define their expectations ... right?
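
To make the mapping concrete, here is a minimal sketch of how a `--model=<name>:<id>` option could be parsed - hypothetical code, not part of this PR:

```python
def parse_model_mapping(args: list[str]) -> dict[str, str]:
    # Hypothetical: each "<name>:<id>" pair maps a model name declared in
    # the pipeline YAML (e.g. "myfunkyadaptor") to the model ID actually
    # being served (e.g. by vLLM).
    mapping = {}
    for arg in args:
        name, sep, model_id = arg.partition(":")
        # With no explicit ID, assume the served model uses the declared name.
        mapping[name] = model_id if sep else name
    return mapping
```

For example, `parse_model_mapping(["myfunkyadaptor:model1"])` yields `{"myfunkyadaptor": "model1"}`.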

@russellb (Member) commented Jul 8, 2024

> > @markmc thanks, that mostly makes sense, but one question:
>
> I may be totally missing something!
>
> > > We can add support for (2) when we add model serving config support
> >
> > But how about someone working on a downstream pipeline already making use of different model IDs per block? I don't want to break them. Do you mean we'd implement this before merging?
>
> I guess I'd like to know how those non-default model IDs are currently defined ... oh, is this assuming someone running vLLM manually?

Yeah. It would have to be manual right now.

> In that case, I think the pipeline author needs to be able to define required models, and the user needs some way to specify a mapping of model IDs to those required models.
>
> Something like:
>
> ```yaml
> version: "1.0"
> models:
>   - name: myfunkyadaptor
>     description: a funky adaptor for generating questions
> block_configs:
>   - block_type: LLMBlock
>     block_config:
>       block_name: gen_questions
>       config_path: configs/skills/freeform_questions.yaml
>       add_num_samples: True
>       model: myfunkyadaptor
>       output_cols:
>         - question
>     drop_duplicates:
>       - question
> ```
>
> and then something like:
>
> ```
> $ ilab data generate --model=myfunkyadaptor:model1
> ```
>
> Or just require the user to ensure the model is called myfunkyadaptor.
>
> The crucial point IMO is ... the pipeline author has to define their expectations ... right?

Yes, that makes sense, and I think the proposed method here makes sense. We could assume the model ID is what the pipeline author put in the config, but an option like you propose could optionally provide a mapping to something else. That seems like a good solution: it provides the flexibility, but doesn't require any additional config in the best case.

Looking at some custom WIP code, it looks like it's taking two model IDs into the pipeline context instead of one (effectively an implementation detail based on older code ...).


github-actions bot commented Jul 8, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


github-actions bot commented Jul 8, 2024

e2e workflow failed on this PR: View run, please investigate.

@markmc markmc force-pushed the flow-config-file-format branch from 16e927d to 694f2e0 Compare July 9, 2024 16:09
```diff
@@ -0,0 +1,65 @@
+version: "1.0"
+block_configs:
+- block_type: LLMBlock
```
Member (review comment):

This block needs:

```yaml
drop_duplicates:
  - context
```

Comment on lines 117 to 129:

```python
if major > _FLOW_PARSER_MAJOR:
    raise FlowParserError(
        "The custom flow file format is from a future major version."
    )
if major <= _FLOW_PARSER_MAJOR and minor > _FLOW_PARSER_MINOR:
    logger.warning(
        "The custom flow file may have new features that will be ignored."
    )

if "block_configs" not in content:
    raise FlowParserError(
        "The custom flow file contains no 'block_configs' section"
    )
```
Member (review comment):

This is not a blocking thing, but a nice minor improvement to these Exception messages would be to include the version specified in the file we loaded, as well as the latest version the code understands.
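
A sketch of that suggestion, reusing the names from the snippet above (illustrative wording only):

```python
if major > _FLOW_PARSER_MAJOR:
    raise FlowParserError(
        f"The custom flow file format is from a future major version: "
        f"the file specifies {major}.{minor}, but the latest supported "
        f"version is {_FLOW_PARSER_MAJOR}.{_FLOW_PARSER_MINOR}."
    )
```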

@russellb (Member) left a comment

I caught a couple of things that were lost in the YAML conversion. The other comments are non-blocking. Awesome work!

Prior to converting to yaml format, we were setting `n` to the value
of `num_instructions_to_generate`. It was dropped from the yaml since
it's a runtime configuration value. We need to set it here so it's set
like it was before.

Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
@russellb russellb force-pushed the flow-config-file-format branch from ec68413 to 5a0b7a6 Compare July 12, 2024 18:52
The full grounded skills pipeline begins by generating context. This
block had "drop_duplicates: context" in its config, but it was
accidentally dropped in the conversion to yaml.

Signed-off-by: Russell Bryant <[email protected]>
Update the documentation for the parameters to reflect the updated types (strings) after the move to YAML-based block configuration.

While we're at it, document a list of operations that make sense to use with this block, and include examples for the cases that warrant more detail:

- The `contains` operation only works with strings.

- All operations can take multiple candidates for the right side of the operation (the filter value), and the block will check all of them, treating the result as True if any are true. Expressions have the shape `filter_column operator filter_value`.

Signed-off-by: Russell Bryant <[email protected]>
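
As an illustration of both points, here is a hypothetical filter block config; the block type and field names are assumed from the `filter_column operator filter_value` shape above, not taken from the actual schema:

```yaml
- block_type: FilterByValueBlock   # block type name is assumed
  block_config:
    block_name: filter_questions
    filter_column: question
    operator: contains          # string-only operation
    # Multiple candidates for the right side: the row passes if ANY match.
    filter_value:
      - "What"
      - "How"
```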
@russellb russellb force-pushed the flow-config-file-format branch from 0c4cdc2 to 04f7baa Compare July 12, 2024 18:59

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


e2e workflow succeeded on this PR: View run, congrats!

```diff
@@ -78,7 +78,6 @@ def __init__(
         "model": self.ctx.model_id,
         "temperature": 0,
         "max_tokens": 12000,
-        "n": self.ctx.num_instructions_to_generate,
```
Member (review comment):

Quoting the commit message I made this change in:

> I made some past changes to how we set `n` that were not correct. The fixes here include:
>
> - Re-add the one place where `n` was hard coded to 10. This was intentional and should be kept as-is.
>
> - Fix the `n` logic to be:
>   - use what's specified for `n` in config if present
>   - otherwise set it to 1
>
> We never want to specify n > 1 when also using a prompt that makes use of `num_samples`, as we effectively end up with `n` * `num_samples` results.
>
> This restores the intended behavior of the `full` pipeline, but it also breaks applying `--num-instructions` from the CLI as the `n` value used with the simple pipeline. That needs to be fixed in a follow-up commit.

So now I need to figure out something for the simple pipelines so that `n` is set based on runtime configuration (`num_instructions_to_generate`), but only for the specific blocks where we want that to occur.
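
The fixed logic boils down to something like this sketch (hypothetical helper and variable names):

```python
def resolve_n(gen_config: dict) -> int:
    # Use "n" if the pipeline author set it in the block config;
    # otherwise default to 1.  Setting n > 1 together with a prompt
    # that uses num_samples would yield n * num_samples results.
    return gen_config.get("n", 1)
```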

Member (review comment):

Follow-up issue filed here so it doesn't have to block merging this if everything else is working: #130


E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

russellb and others added 3 commits July 13, 2024 01:24
I made some past changes to how we set `n` that were not correct. The
fixes here include:

 - Re-add the one place where `n` was hard coded to 10. This was
   intentional and should be kept as-is.

 - Fix the `n` logic to be:

   - use what's specified for `n` in config if present
   - otherwise set it to 1

We never want to specify n>1 when also using a prompt that makes use
of `num_samples`, as we effectively end up with `n` * `num_samples`
results.

This restores the intended behavior of the `full` pipeline, but it also
breaks applying `--num-instructions` from the CLI as the `n` value
used with the simple pipeline. That needs to be fixed in a follow-up
commit.

Signed-off-by: Russell Bryant <[email protected]>
The choice of number of samples turns out to be a pipeline author
thing, and shouldn't be affected by runtime parameters. Restore
the original behavior.

Signed-off-by: Mark McLoughlin <[email protected]>
Given --pipeline=/some/random/dir/for/pipelines, it doesn't make sense for
config_path to be relative to /some/random/dir/ - the obvious thing you'd
expect is for it to be relative to /some/random/dir/for/pipelines.

This means the config looks like this:

```
  - name: gen_questions
    type: LLMBlock
    config:
      config_path: ../../configs/skills/freeform_questions.yaml
```

Signed-off-by: Mark McLoughlin <[email protected]>
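
A sketch of the path resolution the commit describes (hypothetical helper and file names):

```python
import os

def resolve_config_path(pipeline_yaml_path: str, config_path: str) -> str:
    # Resolve config_path relative to the directory containing the
    # pipeline YAML file, not the current working directory.
    base_dir = os.path.dirname(pipeline_yaml_path)
    return os.path.normpath(os.path.join(base_dir, config_path))
```

For example, with a pipeline file at `/some/random/dir/for/pipelines/pipeline.yaml`, the `config_path` above resolves to `/some/random/dir/configs/skills/freeform_questions.yaml`.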
@markmc markmc force-pushed the flow-config-file-format branch from b1b3b4b to 804ee3a Compare July 13, 2024 00:25

e2e workflow succeeded on this PR: View run, congrats!

Fix a couple of calls where it's being passed as a positional arg,
and the second positional arg is `with_indices`.

Signed-off-by: Mark McLoughlin <[email protected]>
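
The commit doesn't name the misplaced argument, but as an illustration of the bug class - assuming a Hugging Face `datasets` call, where `with_indices` is the second positional parameter of `Dataset.map()`:

```python
from datasets import Dataset

ds = Dataset.from_dict({"x": [1, 2, 3]})

def add_one(row):
    return {"x": row["x"] + 1}

# BUG: the second positional parameter of Dataset.map() is with_indices,
# so a value intended for another parameter silently lands there:
#     ds.map(add_one, 4)
# Fixed: pass everything after the function by keyword:
ds = ds.map(add_one, num_proc=1)
```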
`field: YES` will be parsed as a boolean in YAML, resulting in
`"field": True` in Python. This makes any use of "field" as a
string problematic in the code. This commit fixes the bug by
quoting the value properly.

Signed-off-by: Kai Xu <[email protected]>
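
A quick demonstration of the gotcha with PyYAML, which implements YAML 1.1 where unquoted YES/Yes/yes (and no/on/off) resolve to booleans:

```python
import yaml

print(yaml.safe_load("field: YES"))    # {'field': True}  - a surprise boolean
print(yaml.safe_load('field: "YES"'))  # {'field': 'YES'} - quoted, stays a string
```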

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

@russellb (Member) commented Jul 13, 2024

As of this commit from @xukai92 (2c52770), the testing that compared the results with this fork (https://github.com/aakankshaduggal/sdg) was successful. @xukai92 confirmed that "with this we have parity" and that "#86 should be good to merge".

I'm waiting for CI to finish one more time (both e2e jobs). If they're successful, then we can merge.


e2e workflow succeeded on this PR: View run, congrats!

@russellb (Member) left a comment

ready to merge!

@russellb russellb merged commit 07a17ed into instructlab:main Jul 13, 2024
11 checks passed