Add a YAML based file format for pipelines #86
Force-pushed from 3a07116 to 43b7b2b
This pull request has merge conflicts that must be resolved before it can be merged.
I looked at the rebase. It's mostly because of #82, but a little bit from #78. I pushed a rebase here - https://github.com/russellb/sdg/tree/pr-86-rebase - not pushing to the PR branch since I haven't reviewed the code yet to know that I got it right.
This is looking really nice!
I have one question about the PipelineContext. This moves the teacher model_id to the pipeline context, though one feature we want to support is having a different model id on a per LLMBlock basis. This indeed is not needed by any of our current flows in the tree, but we have a need for it in a downstream pipeline. We want to support having vllm running a model, as well as a model + an adapter. Those two cases will be served up under different model ids.
The simplest option seems to be to push that field back to be per-LLMBlock. Let me know if I've missed another way to accomplish this.
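A sketch of what the per-LLMBlock option could look like (illustrative only - the `model` key and the names below are hypothetical, not an agreed schema):

```yaml
# Hypothetical per-block override: this block targets an adapter-backed
# model id while other blocks fall back to the pipeline-wide teacher model.
- name: gen_with_adapter
  type: LLMBlock
  config:
    config_path: configs/skills/example.yaml  # hypothetical path
    model: teacher-with-adapter               # hypothetical per-block key
```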
The other big thing is that I'm eager to run this in a CI job that can test the full pipeline. I've been working on that a lot in the last week. I think I'm close.
Force-pushed from 43b7b2b to 68a714d
Thanks, I took that and rebased again. Commit 5c46881 is significant - it relates to #82.
Force-pushed from 68a714d to 3d94e9b
Force-pushed from 3d94e9b to 0d6b125
I like the general direction, but some of the designs being changed, such as the ones here (yaml vs py), were design proposals we previously agreed upon, or adapted our designs to, from a previous format in #98.
At this point, I would prioritize core functionality changes over major changes to the library design, especially since this blocks and delays the required features. I suggest going through a dev-docs proposal for such changes.
Yes, see instructlab/dev-docs#109 for that design discussion. See also instructlab/instructlab#1546.
It's important to consider that we have 2 personas - the author of a pipeline config, and the user executing the pipeline (e.g. with …).

Using the model name/id in the pipeline config is a problem because the user can choose a different model, so we can't require the pipeline author to have knowledge of it. I'm trying to avoid going down a path where the pipeline author has to do some templating to take user-defined runtime parameters into account.

I think the use case here is likely to be:
We can add support for (2) when we add model serving config support. Does that make sense?
Thank you - I've added a comment in the dev-doc as a todo to capture the above detail.
@markmc thanks, that mostly makes sense, but one question:
How about someone working on a downstream pipeline who is already making use of different model IDs per block? I don't want to break them. Do you mean we'd implement this before merging?
This pull request has merge conflicts that must be resolved before it can be merged.
I may be totally missing something!
I guess I'd like to know how those non-default model IDs are currently defined ... oh, is this assuming someone is running vLLM manually?

In that case, I think the pipeline author needs to be able to define required models, and the user needs some way to specify a mapping of model IDs to those required models. Something like:
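(The original snippet wasn't preserved here; the following is a hypothetical sketch - the `models` section, ids, and layout are illustrative assumptions, not an agreed format:)

```yaml
# Hypothetical: the pipeline author declares the model ids the pipeline
# expects to be served, rather than hard-coding a user-specific model name.
version: "1.0"
models:
  - id: teacher
  - id: teacher-with-adapter  # e.g. base model + adapter under its own id
block_configs:
  - block_type: LLMBlock
    model: teacher-with-adapter
```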
and then something like:
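(Again hypothetical - a user-side mapping from the pipeline's required ids to whatever ids their vLLM instance actually serves:)

```yaml
# Hypothetical user-side config: map the pipeline's required ids to served ids.
model_mapping:
  teacher: mistral-7b-instruct-v0.2
  teacher-with-adapter: mistral-7b-instruct-v0.2-skills-lora
```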
Or just require the user to ensure the model is called …

The crucial point IMO is ... the pipeline author has to define their expectations ... right?
Yeah. It would have to be manual right now.
Yes, that makes sense, and I think the proposed method here makes sense. We could assume the model ID is what the pipeline author put in the config, but an option like you propose could optionally provide a mapping to something else. That seems like a good solution to provide the flexibility without requiring any additional config in the best case.

Looking at some custom WIP code, it looks like it's taking two model IDs into the pipeline context instead of one (effectively, implementation details based on older code ...)
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
Force-pushed from 16e927d to 694f2e0
@@ -0,0 +1,65 @@
```yaml
version: "1.0"
block_configs:
  - block_type: LLMBlock
```
This block needs:
```yaml
drop_duplicates:
  - context
```
src/instructlab/sdg/pipeline.py
Outdated
```python
if major > _FLOW_PARSER_MAJOR:
    raise FlowParserError(
        "The custom flow file format is from a future major version."
    )
if major <= _FLOW_PARSER_MAJOR and minor > _FLOW_PARSER_MINOR:
    logger.warning(
        "The custom flow file may have new features that will be ignored."
    )

if "block_configs" not in content:
    raise FlowParserError(
        "The custom flow file contains no 'block_configs' section"
    )
```
This is not a blocking thing, but a nice minor improvement to these exception messages would be to include the version specified in the file we loaded, as well as the latest version the code understands.
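For example, a minimal sketch of that improvement (reusing the constants from the diff above; the exact wording is illustrative):

```python
# Sketch: surface both the file's declared version and the latest version
# this parser understands.
if major > _FLOW_PARSER_MAJOR:
    raise FlowParserError(
        f"The custom flow file format is from a future major version "
        f"({major}.{minor}); this code understands up to "
        f"{_FLOW_PARSER_MAJOR}.{_FLOW_PARSER_MINOR}."
    )
```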
I caught a couple of things lost in the yaml conversion. Other comments are non-blocking. Awesome work!
Prior to converting to the yaml format, we were setting `n` to the value of `num_instructions_to_generate`. It was dropped from the yaml since it's a runtime configuration value. We need to set it here so it's set like it was before.

Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Force-pushed from ec68413 to 5a0b7a6
The full grounded skills pipeline begins by generating context. This block had "drop_duplicates: context" in its config, but it was accidentally dropped in the conversion to yaml.

Signed-off-by: Russell Bryant <[email protected]>
Update the documentation for the parameters to reflect the updated types (strings) after the move to yaml based block configuration. While we're at it, document a list of operations that make sense to use with this block. Also include some examples for cases that warrant more detail:

- The `contains` operation only works with strings.
- All operations can take multiple candidates for the right side of the operation (the filter value); the block will check all of them and treat the result as True if any are true.
- filter_column operator filter_value

Signed-off-by: Russell Bryant <[email protected]>
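As an illustration of those points (a sketch based on the block format used elsewhere in this PR; the column name and values are made up):

```yaml
# Keep only rows whose "judgment" column contains one of the accepted
# strings; multiple filter values are ORed - the row passes if any match.
- name: filter_by_judgment
  type: FilterByValueBlock
  config:
    filter_column: judgment
    filter_value: ["YES", "Yes"]  # multiple candidates for the right side
    operation: contains           # string-only operation
```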
Force-pushed from 0c4cdc2 to 04f7baa
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
src/instructlab/sdg/llmblock.py
Outdated
```diff
@@ -78,7 +78,6 @@ def __init__(
     "model": self.ctx.model_id,
     "temperature": 0,
     "max_tokens": 12000,
-    "n": self.ctx.num_instructions_to_generate,
```
Quoting the commit message I made this change in ...
I made some past changes to how we set `n` that were not correct. The
fixes here include:
- Re-add the one place where `n` was hard coded to 10. This was
intentional and should be kept as-is.
- Fix the `n` logic to be:
- use what's specified for `n` in config if present
- otherwise set it to 1
We never want to specify n>1 when also using a prompt that makes use
of `num_samples`, as we effectively end up with `n` * `num_samples`
results.
This restores intended behavior of the `full` pipeline, but it also
breaks applying `--num-instructions` from the CLI to be the `n` value
used with the simple pipeline. That needs to be fixed in a follow-up
commit.
So now I need to figure out something for the simple pipelines so that `n`
is set based on runtime configuration (`num_instructions_to_generate`), but only for the specific blocks where we want that to occur (in the simple pipelines).
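A minimal sketch of the `n` selection described in that commit message (names are illustrative; `gen_kwargs` is assumed to come from the block's YAML config):

```python
# Honor an explicit "n" from the block config, otherwise default to 1 so
# prompts that use num_samples don't fan out into n * num_samples results.
def resolve_n(gen_kwargs: dict) -> int:
    return gen_kwargs.get("n", 1)
```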
Follow-up issue filed here so it doesn't have to block merging this if everything else is working -- #130
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
I made some past changes to how we set `n` that were not correct. The fixes here include:
- Re-add the one place where `n` was hard coded to 10. This was intentional and should be kept as-is.
- Fix the `n` logic to be:
  - use what's specified for `n` in config if present
  - otherwise set it to 1

We never want to specify n>1 when also using a prompt that makes use of `num_samples`, as we effectively end up with `n` * `num_samples` results.

This restores intended behavior of the `full` pipeline, but it also breaks applying `--num-instructions` from the CLI to be the `n` value used with the simple pipeline. That needs to be fixed in a follow-up commit.

Signed-off-by: Russell Bryant <[email protected]>
The choice of number of samples turns out to be a pipeline author thing, and shouldn't be affected by runtime parameters. Restore the original behavior.

Signed-off-by: Mark McLoughlin <[email protected]>
Given --pipeline=/some/random/dir/for/pipelines it doesn't make sense for config_path to be relative to /some/random/dir/ - the obvious thing you'd expect is for it to be relative to /some/random/dir/for/pipelines. This means config that looks like this:
```
- name: gen_questions
  type: LLMBlock
  config:
    config_path: ../../configs/skills/freeform_questions.yaml
```
Signed-off-by: Mark McLoughlin <[email protected]>
Force-pushed from b1b3b4b to 804ee3a
e2e workflow succeeded on this PR: View run, congrats!
Fix a couple of calls where it's being passed as a positional arg, and the second positional arg is `with_indices`.

Signed-off-by: Mark McLoughlin <[email protected]>
`field: YES` will be parsed as a boolean in YAML, resulting in `"field": True` in Python. This makes any use of "field" as a string problematic in the code. This commit fixes the bug by quoting it properly.

Signed-off-by: Kai Xu <[email protected]>
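A minimal illustration of the YAML 1.1 behavior behind this fix (PyYAML treats unquoted YES/NO as booleans):

```yaml
unquoted: YES    # YAML 1.1 parses this as a boolean -> Python True
quoted: "YES"    # quoting preserves it as the string "YES"
```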
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
As of this commit from @xukai92 -- 2c52770 -- the testing that compared the results with this fork (https://github.com/aakankshaduggal/sdg) was successful. @xukai92 confirmed that "with this we have parity" and "#86 should be good to merge". I'm waiting for CI to finish one more time (both e2e jobs). If they're successful, then we can merge.
e2e workflow succeeded on this PR: View run, congrats!
ready to merge!
See instructlab/dev-docs#109
In order to support custom pipelines, add a YAML based file format.
However, to make the default pipelines easier to reason about and develop,
also convert them to the YAML file format.
This changes the top-level API from:
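(The code block from the original description wasn't preserved; the "from" side was roughly a Python-defined flow. This sketch is a reconstruction - the flow class name and call signature are assumptions:)

```python
# Before (illustrative): flows defined as Python classes
flow = SynthSkillsFlow(client, model_family, model_id, num_instructions_to_generate)
pipeline = Pipeline(flow.get_flow())
```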
to:
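(Also a reconstruction - a PipelineContext plus a YAML pipeline file, per the discussion above; the exact method name and file path are assumptions:)

```python
# After (illustrative): runtime parameters live in a PipelineContext and
# the block sequence is loaded from a YAML pipeline file.
ctx = PipelineContext(client, model_family, model_id, num_instructions_to_generate)
pipeline = Pipeline.from_file(ctx, "pipelines/full/synth_skills.yaml")
```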