
Add a YAML based file format for pipelines #86

Merged (28 commits) into instructlab:main on Jul 13, 2024

Conversation

@markmc (Contributor) commented Jul 5, 2024

See instructlab/dev-docs#109

In order to support custom pipelines, add a YAML-based file format.

To make the default pipelines easier to reason about and develop, convert them to the YAML file format as well.

This changes the top-level API from:

```python
mmlu_block_configs = MMLUBenchFlow().get_flow()
knowledge_block_configs = SynthKnowledgeFlow().get_flow()
knowledge_pipe = Pipeline(ctx, mmlu_block_configs + knowledge_block_configs)
```

to:

```python
knowledge_pipe = Pipeline.from_flows(
    ctx, [pipeline.MMLU_BENCH_FLOW, pipeline.SYNTH_KNOWLEDGE_FLOW]
)
```
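
For a sense of what the format looks like, here is a minimal sketch of a pipeline YAML assembled from snippets that appear later in this thread - treat the exact schema as illustrative rather than authoritative:

```yaml
version: "1.0"
block_configs:
  - block_type: LLMBlock
    block_config:
      block_name: gen_questions
      config_path: configs/skills/freeform_questions.yaml
      add_num_samples: True
      output_cols:
        - question
    drop_duplicates:
      - question
```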

mergify bot (Contributor) commented Jul 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @markmc please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb (Member) commented Jul 7, 2024

I looked at the rebase. It's mostly because of #82, but a little bit from #78.

I pushed a rebase here - https://github.com/russellb/sdg/tree/pr-86-rebase - I'm not pushing to the PR branch since I haven't reviewed the code yet to know that I got it right.

@russellb (Member) left a comment

This is looking really nice!

I have one question about the PipelineContext. This moves the teacher model_id to the pipeline context, though one feature we want to support is having a different model ID on a per-LLMBlock basis. This indeed is not needed by any of our current flows in the tree, but we have a need for it in a downstream pipeline. We want to support having vllm running a model, as well as a model + an adapter. Those two cases will be served up under different model IDs.

The simplest option seems to be to push that field back to be per-LLMBlock. Let me know if I've missed another way to accomplish this.

The other big thing is that I'm eager to run this in a CI job that can test the full pipeline. I've been working on that a lot in the last week. I think I'm close.

@markmc markmc force-pushed the flow-config-file-format branch from 43b7b2b to 68a714d Compare July 8, 2024 10:37
@mergify mergify bot removed the needs-rebase label Jul 8, 2024
@markmc (Contributor, Author) commented Jul 8, 2024

> I looked at the rebase. It's mostly because of #82, but a little bit from #78.
>
> I pushed a rebase here - https://github.com/russellb/sdg/tree/pr-86-rebase - I'm not pushing to the PR branch since I haven't reviewed the code yet to know that I got it right.

Thanks, I took that and rebased again.

Commit 5c46881 is significant - it relates to #82.

@markmc markmc force-pushed the flow-config-file-format branch from 68a714d to 3d94e9b Compare July 8, 2024 10:39
@mergify mergify bot added the ci-failure label Jul 8, 2024
@markmc markmc force-pushed the flow-config-file-format branch from 3d94e9b to 0d6b125 Compare July 8, 2024 13:27
@mergify mergify bot added the testing label (Relates to testing) and removed the ci-failure label Jul 8, 2024
@oindrillac (Contributor) left a comment

I like the general direction, but some of the designs being changed here (YAML vs. Python) were design proposals we previously agreed upon, or adapted our designs toward from an earlier format (see #98).

At this point, I would prioritize core functionality changes over major changes to the library design, especially since this blocks and delays the required features. I suggest going through a dev-docs proposal for such changes.

@markmc (Contributor, Author) commented Jul 8, 2024

> I like the general direction, but some of the designs being changed here (YAML vs. Python) were design proposals we previously agreed upon, or adapted our designs toward from an earlier format (see #98).
>
> At this point, I would prioritize core functionality changes over major changes to the library design, especially since this blocks and delays the required features. I suggest going through a dev-docs proposal for such changes.

Yes, see instructlab/dev-docs#109 for that design discussion.

My understanding is that custom flows are a "required feature" - see the "Alternative Approaches" section in the dev-doc for some of the thinking on that.

See also instructlab/instructlab#1546.

@markmc (Contributor, Author) commented Jul 8, 2024

> I have one question about the PipelineContext. This moves the teacher model_id to the pipeline context, though one feature we want to support is having a different model ID on a per-LLMBlock basis. This indeed is not needed by any of our current flows in the tree, but we have a need for it in a downstream pipeline. We want to support having vllm running a model, as well as a model + an adapter. Those two cases will be served up under different model IDs.

It's important to consider that we have two personas - the author of a pipeline config, and the user executing the pipeline (e.g. with `ilab data generate`).

Using the model name/ID in the pipeline config is a problem because the user can choose a different model, so we can't require the pipeline author to have knowledge of it.

I'm trying to avoid going down a path where the pipeline author has to do some templating to take user-defined runtime parameters into account.

I think the use case here is likely to be:

  1. Most LLMBlock definitions will use the default teacher model - we can make the semantics such that if the pipeline author doesn't specify a model, the default is used, and we get the default model name from the PipelineContext.
  2. In cases where a model+adapter is served, it will hopefully be the same pipeline author defining both its serving configuration and its use in an LLMBlock. So the pipeline author can name it in the serving config and reference it by that name in the pipeline config.

We can add support for (2) when we add model serving config support.

Does that make sense?

> The simplest option seems to be to push that field back to be per-LLMBlock. Let me know if I've missed another way to accomplish this.

Thank you - I've added a comment in the dev-doc as a todo to capture the above detail.
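
As a minimal sketch of the default-model semantics in (1) above - the helper name is hypothetical, though `ctx.model_id` does appear in the diff later in this PR:

```python
def resolve_model_id(block_config: dict, ctx) -> str:
    # Hypothetical helper: if the pipeline author omitted "model" in the
    # block config, fall back to the default teacher model carried on
    # the PipelineContext.
    return block_config.get("model", ctx.model_id)
```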

@russellb (Member) commented Jul 8, 2024

@markmc thanks, that mostly makes sense, but one question:

> We can add support for (2) when we add model serving config support

But how about someone working on a downstream pipeline already making use of different model IDs per block? I don't want to break them. Do you mean we'd implement this before merging?

mergify bot (Contributor) commented Jul 8, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @markmc please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 8, 2024
@markmc (Contributor, Author) commented Jul 8, 2024

> @markmc thanks, that mostly makes sense, but one question:

I may be totally missing something!

> > We can add support for (2) when we add model serving config support
>
> But how about someone working on a downstream pipeline already making use of different model IDs per block? I don't want to break them. Do you mean we'd implement this before merging?

I guess I'd like to know how those non-default model IDs are currently defined ... oh, is this assuming someone running vLLM manually?

In that case, I think the pipeline author needs to be able to define required models, and the user needs some way to specify a mapping of model IDs to those required models.

Something like:

```yaml
version: "1.0"
models:
  - name: myfunkyadaptor
    description: a funky adaptor for generating questions
block_configs:
  - block_type: LLMBlock
    block_config:
      block_name: gen_questions
      config_path: configs/skills/freeform_questions.yaml
      add_num_samples: True
      model: myfunkyadaptor
      output_cols:
        - question
    drop_duplicates:
      - question
```

and then something like:

```
$ ilab data generate --model=myfunkyadaptor:model1
```

Or just require the user to ensure the model is called myfunkyadaptor.

The crucial point IMO is ... the pipeline author has to define their expectations ... right?
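
To make the mapping concrete, here is a minimal sketch of how a `--model=<name>:<id>` option could be parsed - hypothetical code, not part of this PR:

```python
def parse_model_mapping(args: list[str]) -> dict[str, str]:
    # Hypothetical: each "<name>:<id>" pair maps a model name declared in
    # the pipeline YAML (e.g. "myfunkyadaptor") to the model ID actually
    # being served (e.g. by vLLM).
    mapping = {}
    for arg in args:
        name, sep, model_id = arg.partition(":")
        # With no explicit ID, assume the served model uses the declared name.
        mapping[name] = model_id if sep else name
    return mapping
```

For example, `parse_model_mapping(["myfunkyadaptor:model1"])` yields `{"myfunkyadaptor": "model1"}`.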

@russellb (Member) commented Jul 8, 2024

> > @markmc thanks, that mostly makes sense, but one question:
>
> I may be totally missing something!
>
> > > We can add support for (2) when we add model serving config support
> >
> > But how about someone working on a downstream pipeline already making use of different model IDs per block? I don't want to break them. Do you mean we'd implement this before merging?
>
> I guess I'd like to know how those non-default model IDs are currently defined ... oh, is this assuming someone running vLLM manually?

Yeah. It would have to be manual right now.

> In that case, I think the pipeline author needs to be able to define required models, and the user needs some way to specify a mapping of model IDs to those required models.
>
> Something like:
>
> ```yaml
> version: "1.0"
> models:
>   - name: myfunkyadaptor
>     description: a funky adaptor for generating questions
> block_configs:
>   - block_type: LLMBlock
>     block_config:
>       block_name: gen_questions
>       config_path: configs/skills/freeform_questions.yaml
>       add_num_samples: True
>       model: myfunkyadaptor
>       output_cols:
>         - question
>     drop_duplicates:
>       - question
> ```
>
> and then something like:
>
> ```
> $ ilab data generate --model=myfunkyadaptor:model1
> ```
>
> Or just require the user to ensure the model is called myfunkyadaptor.
>
> The crucial point IMO is ... the pipeline author has to define their expectations ... right?

Yes, that makes sense, and I think the proposed method here makes sense. We could assume the model ID is what the pipeline author put in the config, but an option like you propose could optionally provide a mapping to something else. That seems like a good solution: it provides the flexibility, but doesn't require any additional config in the best case.

Looking at some custom WIP code, it looks like it's taking two model IDs into the pipeline context instead of one (effectively an implementation detail based on older code ...).


github-actions bot commented Jul 8, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


github-actions bot commented Jul 8, 2024

e2e workflow failed on this PR: View run, please investigate.

@markmc markmc force-pushed the flow-config-file-format branch from 16e927d to 694f2e0 Compare July 9, 2024 16:09
```diff
@@ -0,0 +1,65 @@
+version: "1.0"
+block_configs:
+- block_type: LLMBlock
```
Member (review comment):

This block needs:

```yaml
drop_duplicates:
  - context
```

Comment on lines 117 to 129:

```python
if major > _FLOW_PARSER_MAJOR:
    raise FlowParserError(
        "The custom flow file format is from a future major version."
    )
if major <= _FLOW_PARSER_MAJOR and minor > _FLOW_PARSER_MINOR:
    logger.warning(
        "The custom flow file may have new features that will be ignored."
    )

if "block_configs" not in content:
    raise FlowParserError(
        "The custom flow file contains no 'block_configs' section"
    )
```
Member (review comment):

This is not a blocking thing, but a nice minor improvement to these Exception messages would be to include the version specified in the file we loaded, as well as the latest version the code understands.
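
A sketch of that suggestion, reusing the names from the snippet above (illustrative wording only):

```python
if major > _FLOW_PARSER_MAJOR:
    raise FlowParserError(
        f"The custom flow file format is from a future major version: "
        f"the file specifies {major}.{minor}, but the latest supported "
        f"version is {_FLOW_PARSER_MAJOR}.{_FLOW_PARSER_MINOR}."
    )
```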

@russellb (Member) left a comment

I caught a couple of things that were lost in the YAML conversion. The other comments are non-blocking. Awesome work!

Prior to converting to yaml format, we were setting `n` to the value
of `num_instructions_to_generate`. It was dropped from the yaml since
it's a runtime configuration value. We need to set it here so it's set
like it was before.

Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
@russellb russellb force-pushed the flow-config-file-format branch from ec68413 to 5a0b7a6 Compare July 12, 2024 18:52
The full grounded skills pipeline begins by generating context. This
block had "drop_duplicates: context" in its config, but it was
accidentally dropped in the conversion to yaml.

Signed-off-by: Russell Bryant <[email protected]>
Update the documentation for the parameters to reflect the updated types (strings) after the move to YAML-based block configuration.

While we're at it, document a list of operations that make sense to use with this block, and include examples for the cases that warrant more detail:

- The `contains` operation only works with strings.

- All operations can take multiple candidates for the right side of the operation (the filter value), and the block will check all of them, treating the result as True if any are true. Expressions have the shape `filter_column operator filter_value`.

Signed-off-by: Russell Bryant <[email protected]>
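
As an illustration of both points, here is a hypothetical filter block config; the block type and field names are assumed from the `filter_column operator filter_value` shape above, not taken from the actual schema:

```yaml
- block_type: FilterByValueBlock   # block type name is assumed
  block_config:
    block_name: filter_questions
    filter_column: question
    operator: contains          # string-only operation
    # Multiple candidates for the right side: the row passes if ANY match.
    filter_value:
      - "What"
      - "How"
```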
@russellb russellb force-pushed the flow-config-file-format branch from 0c4cdc2 to 04f7baa Compare July 12, 2024 18:59

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


e2e workflow succeeded on this PR: View run, congrats!

```diff
@@ -78,7 +78,6 @@ def __init__(
         "model": self.ctx.model_id,
         "temperature": 0,
         "max_tokens": 12000,
-        "n": self.ctx.num_instructions_to_generate,
```
Member (review comment):

Quoting the commit message I made this change in:

> I made some past changes to how we set `n` that were not correct. The fixes here include:
>
> - Re-add the one place where `n` was hard coded to 10. This was intentional and should be kept as-is.
>
> - Fix the `n` logic to be:
>   - use what's specified for `n` in config if present
>   - otherwise set it to 1
>
> We never want to specify n > 1 when also using a prompt that makes use of `num_samples`, as we effectively end up with `n` * `num_samples` results.
>
> This restores the intended behavior of the `full` pipeline, but it also breaks applying `--num-instructions` from the CLI as the `n` value used with the simple pipeline. That needs to be fixed in a follow-up commit.

So now I need to figure out something for the simple pipelines so that `n` is set based on runtime configuration (`num_instructions_to_generate`), but only for the specific blocks where we want that to occur.
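
The fixed logic boils down to something like this sketch (hypothetical helper and variable names):

```python
def resolve_n(gen_config: dict) -> int:
    # Use "n" if the pipeline author set it in the block config;
    # otherwise default to 1.  Setting n > 1 together with a prompt
    # that uses num_samples would yield n * num_samples results.
    return gen_config.get("n", 1)
```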

Member (review comment):

Follow-up issue filed here so it doesn't have to block merging this if everything else is working: #130


E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

russellb and others added 3 commits July 13, 2024 01:24
I made some past changes to how we set `n` that were not correct. The
fixes here include:

 - Re-add the one place where `n` was hard coded to 10. This was
   intentional and should be kept as-is.

 - Fix the `n` logic to be:

   - use what's specified for `n` in config if present
   - otherwise set it to 1

We never want to specify n>1 when also using a prompt that makes use
of `num_samples`, as we effectively end up with `n` * `num_samples`
results.

This restores the intended behavior of the `full` pipeline, but it also
breaks applying `--num-instructions` from the CLI as the `n` value
used with the simple pipeline. That needs to be fixed in a follow-up
commit.

Signed-off-by: Russell Bryant <[email protected]>
The choice of number of samples turns out to be a pipeline author
thing, and shouldn't be affected by runtime parameters. Restore
the original behavior.

Signed-off-by: Mark McLoughlin <[email protected]>
Given --pipeline=/some/random/dir/for/pipelines, it doesn't make sense for
config_path to be relative to /some/random/dir/ - the obvious thing you'd
expect is for it to be relative to /some/random/dir/for/pipelines.

This means the config looks like this:

```
  - name: gen_questions
    type: LLMBlock
    config:
      config_path: ../../configs/skills/freeform_questions.yaml
```

Signed-off-by: Mark McLoughlin <[email protected]>
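
A sketch of the path resolution the commit describes (hypothetical helper and file names):

```python
import os

def resolve_config_path(pipeline_yaml_path: str, config_path: str) -> str:
    # Resolve config_path relative to the directory containing the
    # pipeline YAML file, not the current working directory.
    base_dir = os.path.dirname(pipeline_yaml_path)
    return os.path.normpath(os.path.join(base_dir, config_path))
```

For example, with a pipeline file at `/some/random/dir/for/pipelines/pipeline.yaml`, the `config_path` above resolves to `/some/random/dir/configs/skills/freeform_questions.yaml`.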
@markmc markmc force-pushed the flow-config-file-format branch from b1b3b4b to 804ee3a Compare July 13, 2024 00:25

e2e workflow succeeded on this PR: View run, congrats!

Fix a couple of calls where it's being passed as a positional arg,
and the second positional arg is `with_indices`.

Signed-off-by: Mark McLoughlin <[email protected]>
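
The commit doesn't name the misplaced argument, but as an illustration of the bug class - assuming a Hugging Face `datasets` call, where `with_indices` is the second positional parameter of `Dataset.map()`:

```python
from datasets import Dataset

ds = Dataset.from_dict({"x": [1, 2, 3]})

def add_one(row):
    return {"x": row["x"] + 1}

# BUG: the second positional parameter of Dataset.map() is with_indices,
# so a value intended for another parameter silently lands there:
#     ds.map(add_one, 4)
# Fixed: pass everything after the function by keyword:
ds = ds.map(add_one, num_proc=1)
```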
`field: YES` will be parsed as a boolean in YAML, resulting in
`"field": True` in Python. This makes any use of "field" as a
string problematic in the code. This commit fixes the bug by
quoting the value properly.

Signed-off-by: Kai Xu <[email protected]>
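
A quick demonstration of the gotcha with PyYAML, which implements YAML 1.1 where unquoted YES/Yes/yes (and no/on/off) resolve to booleans:

```python
import yaml

print(yaml.safe_load("field: YES"))    # {'field': True}  - a surprise boolean
print(yaml.safe_load('field: "YES"'))  # {'field': 'YES'} - quoted, stays a string
```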

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

@russellb (Member) commented Jul 13, 2024

As of this commit from @xukai92 (2c52770), the testing that compared the results with this fork (https://github.com/aakankshaduggal/sdg) was successful. @xukai92 confirmed that "with this we have parity" and that "#86 should be good to merge".

I'm waiting for CI to finish one more time (both e2e jobs). If they're successful, then we can merge.


e2e workflow succeeded on this PR: View run, congrats!

@russellb (Member) left a comment

ready to merge!

@russellb russellb merged commit 07a17ed into instructlab:main Jul 13, 2024
11 checks passed