-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace Iterblock with LLMBlock #58
Conversation
src/instructlab/sdg/default_flows.py
Outdated
"gen_kwargs": { | ||
"max_tokens": 2048, | ||
"temperature": 0.7, | ||
"n": 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be self.num_iters
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is a vllm specific parameter - we are replacing the IterBlock with this. This parameter essentialy controls the number of return sequences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, these parameters get passed to the openai client directly as kwargs. Ref to openai's n
parameter https://platform.openai.com/docs/api-reference/completions/create#completions-create-n. In theory, openai's functionality of n
is the same as what we were doing with num_iters
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running ilab data generate
with this branch gives:
File "/home/markmc/sdg/src/instructlab/sdg/generate_data.py", line 257, in generate_data
sdg_knowledge, sdg_freeform_skill, sdg_grounded_skill = _sdg_init(
^^^^^^^^^^
File "/home/markmc/sdg/src/instructlab/sdg/generate_data.py", line 144, in _sdg_init
[
File "/home/markmc/sdg/src/instructlab/sdg/generate_data.py", line 146, in <listcomp>
flow_type(
TypeError: Flow.__init__() takes from 4 to 5 positional arguments but 6 were given
(after I fix the iterblock import error)
What should the num_instructions_to_generate
parameter to generate_data()
be used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shivchander or @oindrillac any thoughts here? As far as I know the generate_data.py
file still contains older sdg code right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems like generate_data.py
has been updated to use the newer pipeline structure. So I guess I can go ahead and fix the num_iter
in generate_data()
API call. Just want to get confirmation from @oindrillac / @aakankshaduggal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point is that num_iters
was a constructor arg to the Flow
class before and it's not clear why that's changed. Is it the intention to change this from being parameterized to being 1? If it's always 1 then there's no need to specify n
at all. If it stays as a param to the Flow class then it might be best to rename it since num_iters
isn't really correct.. should be something more like num_parallel_outputs_per_input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, after discussing with @shivchander, we think that the client num_instructions_to_generate
can be captured by openai's n
parameter. Currently, the num_samples
are hardcoded in the default_flows.py
which is overwriting the num_instructions_to_generate
/num_iters
passed from generate_data()
. It should be fixed in the latest commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does that n
parameter imply that the server supports batching? llama-cpp does not -- either way, just need to make sure it works with that backend too and not just vllm
src/instructlab/sdg/llmblock.py
Outdated
@@ -123,8 +123,11 @@ def generate(self, samples, **gen_kwargs) -> Dataset: | |||
outputs = [self._generate([sample], **gen_kwargs)[0] for sample in samples] | |||
logger.debug("Generated outputs: {}".format(outputs)) | |||
|
|||
num_parallel_samples = gen_kwargs.get("n", 1) | |||
n_samples = [item for item in samples for i in range(num_parallel_samples)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this super confusing
What is the "n" in n_samples supposed to stand for? It reads to me like "number of samples", suggesting an integer
And the nested list comprehension is ... too clever
Suggest
extended_samples = []
for item in samples:
extended_samples.extend([item] * num_parallel_samples)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sorry I couldn't come up with a better name. I agree extended_samples
is better.
Some context for this change:
Openai's n
parameter determines how many responses to generate for each input sample. This change fixes the zip
functionality we use just below this code.
If we have n
higher than 1 then we end up with scenarios something like this:
>>> input = ["s1", "s2"]
>>> output = ["o11", "o12", "o13", "o21", "o22", "o23"] # n==3
>>> c = zip(input, output)
>>> for item in c:
... print(item)
...
('s1', 'o11')
('s2', 'o12')
Instead, we want
>>> n_samples = [item for item in input for i in range(3)]
>>> n_samples
['s1', 's1', 's1', 's2', 's2', 's2']
>>> c = zip(n_samples, output)
>>>
>>> for item in c:
... print(item)
...
('s1', 'o11')
('s1', 'o12')
('s1', 'o13')
('s2', 'o21')
('s2', 'o22')
('s2', 'o23')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO just some code comments would help with this.
The commit log could be much improved. The logical changes I see, with the explanations I'd expect:
|
Ah, I see these three are from #53 |
@@ -39,12 +39,6 @@ def generate(self, dataset) -> Dataset: | |||
drop_duplicates_cols = block_prop.get("drop_duplicates", False) | |||
block = block_type(**block_config) | |||
|
|||
if block_type == IterBlock: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e2e is failing with:
File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/sdg/pipeline.py", line 6, in <module>
from .iterblock import IterBlock
ModuleNotFoundError: No module named 'instructlab.sdg.iterblock'
see above warnings
E0401: Unable to import 'instructlab.sdg.iterblock' (import-error)
W0611: Unused IterBlock imported from iterblock (unused-import)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, the pipeline file changes didn't get pushed, sigh!
07f876a
to
6bf23ce
Compare
@npalaska you should rebase this on the latest main branch or merge latest main into this branch, right now the diff/PR includes unrelated changes. |
Does the llama-cpp backend support this parameter? See the instructlab changelog:
i.e. the llama-cpp backend will need to be supported |
I don't think so, the best thing here would be to wrap the client in this case - the generate method can be overridden to just make sequential calls in a loop and return the equivalent response when |
Right, so something like this in IterBlock:
(I don't actually know the best way to detect whether the client supports this parameter though) |
This pull request has merge conflicts that must be resolved before it can be |
This is a step towards converting it to the final format (messages) required for training. Signed-off-by: Oindrilla Chatterjee <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]> Co-authored-by: Shiv <[email protected]>
Changed the prompt templates and alignment with expected outputs. Conducted stress testing across various leaf nodes to ensure accuracy and relevance. Signed-off-by: Oindrilla Chatterjee <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]> Co-authored-by: Shiv <[email protected]>
261cbbe
to
0c90804
Compare
|
Ruff checking failing:
|
It may only support a subset of parameters, we can't assume it supports
This PR should not merge without this testing Since
but also note:
Just a little more investigation is required to confirm that llama-cpp is handling this correctly |
Hmm, maybe this isn't so promising - it looks like we're creating 1 sample per seed example? There are 5 seed examples for the freeform skills test |
This pull request has merge conflicts that must be resolved before it can be |
@markmc yes, exactly, except that again "num_iters" might not be the best name since the fact that they are iters in one of the cases is an implementation detail. Also we could have a wrapper for the client that encapsulates this and thus isn't needed in the block impls (and wouldn't need to use an IterBlock). |
@markmc Yeah I tested with model served with llama.cpp and it seems like openai client supports it in a way that it doesn't break the request but the server ignores the
I guess the next thing would be to find out what is the best way to know whether the server supports the |
Yep. The most straightforward thing is to pass down a I'd prefer something more like |
Do you mean adding a new parameter |
no take it from the existing |
@markmc So we are doing it in a different way where we determine whether the backend server supports batching by making a test request. We don't need to pass any server configs. @njhill has opened a new PR #77 with these changes on top of changes from this PR and I will close this one. |
Replaced by #105 |
Update actionlint to v1.7.0
This PR is rebased on top of #53
Iterblock
and replaces it withLLMblock
.Iterblock
can be specified in the LLMBlock'sgen_kwargs
using openai'sn
parameter.n
parameter determines the number of output choices generated by the LLM service per sample in the batch.