Remove batch_kwargs #122
Conversation
In preparation for custom pipeline configuration files, do not require model_prompt as an LLMBlock param - it can have built-in knowledge of the correct prompt to use per model_family. Signed-off-by: Mark McLoughlin <[email protected]>
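To illustrate the idea, here is a minimal sketch of what built-in per-family prompt selection could look like. The dictionary name, prompt strings, and helper function are assumptions for illustration, not the actual implementation:

```python
# Illustrative sketch only; the real prompt templates and names may differ.
_MODEL_PROMPTS = {
    "mixtral": "<s> [INST] {prompt} [/INST]",  # assumed example template
    "merlinite": "<|system|>\n{prompt}",       # assumed example template
}

def prompt_for(model_family: str) -> str:
    # Fail loudly rather than guessing a template for an unknown family.
    try:
        return _MODEL_PROMPTS[model_family]
    except KeyError:
        raise ValueError(f"unknown model family: {model_family}")
```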
In order to prepare for pipeline definitions in YAML, move runtime parameters like the OpenAI client, model ID, and model family out of the pipeline definition and into a PipelineContext object that all blocks have access to. Signed-off-by: Mark McLoughlin <[email protected]>
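As a rough sketch of the shape this takes (the field names are assumptions pieced together from the commit messages in this stack, including the num_procs move below):

```python
from dataclasses import dataclass

from openai import OpenAI


@dataclass
class PipelineContext:
    # Runtime parameters that used to live in the pipeline definition.
    client: OpenAI      # the OpenAI client (holds an SSLContext internally)
    model_family: str   # e.g. "mixtral"; lets LLMBlock pick its prompt
    model_id: str       # the teacher model to request completions from
    num_procs: int = 8  # multiprocessing workers for map()/filter()
```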
This addresses issues with using num_proc>1 with Dataset.map() and Dataset.filter(). The first issue is:

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

What was happening here is that the entire FilterByValueBlock object was being serialized to send to the multiprocessing worker. And now that this includes PipelineContext, which includes the OpenAI client object, which includes SSLContext, we hit a known issue: uqfoundation/dill#308

The second issue is specific to map():

```
ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
```

It appears that in the datasets library, only in the case of num_proc>1, when we hit the "error converting dtype" case and set the column to None, the column is still considered a string column rather than the new dtype. This second issue deserves further investigation and may require a fix to the datasets library.

Signed-off-by: Mark McLoughlin <[email protected]>
Address the following issue with using num_proc>1 with Dataset.map():

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

The entire block object is being serialized to be sent to the multiprocessing worker. And now that this includes PipelineContext, which includes the OpenAI client object, which includes SSLContext, we hit a known issue: uqfoundation/dill#308

Signed-off-by: Mark McLoughlin <[email protected]>
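The usual way around this class of failure, shown here as an illustrative sketch rather than the project's actual fix, is to hand Dataset.map() a top-level function that carries no unpicklable state, passing plain values via fn_kwargs:

```python
from datasets import Dataset

def convert_score(sample, column):
    # Top-level function with only picklable arguments, so each worker
    # process can deserialize it without touching the OpenAI client.
    try:
        sample[column] = float(sample[column])
    except (ValueError, TypeError):
        sample[column] = None
    return sample

ds = Dataset.from_list([{"score": "0.5"}, {"score": "0.75"}])
# Only picklable arguments reach the worker processes.
ds = ds.map(convert_score, fn_kwargs={"column": "score"}, num_proc=2)
```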
This removes another runtime parameter from pipeline definitions, to allow us to move to using YAML files. Signed-off-by: Mark McLoughlin <[email protected]>
All Block subclasses except LLMBlock fail to pass the block_name from block_config down to the base class; instead, they incorrectly pass the block type as the name. Signed-off-by: Mark McLoughlin <[email protected]>
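As a hypothetical before/after of the bug (illustrative code, not the literal diff):

```python
class Block:
    def __init__(self, block_name):
        self.block_name = block_name

class FilterByValueBlock(Block):
    def __init__(self, block_config):
        # Buggy version: passes the block *type* as the name.
        # super().__init__(type(self).__name__)
        # Fixed version: passes the configured name down to the base class.
        super().__init__(block_config["block_name"])

block = FilterByValueBlock({"block_name": "filter_questions"})
print(block.block_name)  # filter_questions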
In every use of FilterByValue in the default flows, we use batch_kwargs to set num_proc=8. This doesn't appear to be a pipeline author concern, but rather a runtime parameter which should in future be based on the number of available CPUs and (perhaps) user configuration. For now, just move it from batch_kwargs to PipelineContext. Signed-off-by: Mark McLoughlin <[email protected]>
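A block would then read the value from the shared context instead of its own config, along these lines (a sketch under assumed names; PipelineContext here is a minimal stand-in):

```python
from datasets import Dataset

class PipelineContext:
    """Minimal stand-in for the real context; names are assumptions."""
    def __init__(self, num_procs=8):
        self.num_procs = num_procs

def keep_high_scores(sample):
    return sample["score"] >= 0.5

ctx = PipelineContext(num_procs=2)
ds = Dataset.from_list([{"score": 0.9}, {"score": 0.1}])
# num_procs comes from the runtime context, not from per-block batch_kwargs:
kept = ds.filter(keep_high_scores, num_proc=ctx.num_procs)
```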
Two pipelines include an LLMBlock which use `{num_samples}` in their instructions to the teacher model. There needs to be some way to configure the LLMBlock so that `num_samples` will be included, but as per instructlab#82 (commit a01b04e) the value of `num_samples` should be based on the `num_instructions_to_generate` parameter. Signed-off-by: Mark McLoughlin <[email protected]>
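For example, a block whose instruction template contains `{num_samples}` would have the value filled in at runtime from `num_instructions_to_generate` rather than hard-coded by the pipeline author. A hedged sketch, with illustrative names:

```python
config_template = "Generate {num_samples} questions about the following document."

num_instructions_to_generate = 10  # runtime parameter, per instructlab#82

# The value is supplied by the runtime, never by the pipeline definition.
prompt = config_template.format(num_samples=num_instructions_to_generate)
print(prompt)
```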
It's hard to spot, but this:

```
def __init__(self, ..., **batch_kwargs):
    ...
    self.batch_params = batch_kwargs.get("batch_kwargs", {})
```

is equivalent to this:

```
def __init__(self, ..., **kwargs):
    ...
    self.batch_params = kwargs.get("batch_kwargs", {})
```

which is equivalent to this:

```
def __init__(self, ..., batch_kwargs={}, **kwargs):
    ...
    self.batch_params = batch_kwargs
```

except that the trailing **kwargs meant we were silently accepting unknown block_config parameters.

Signed-off-by: Mark McLoughlin <[email protected]>
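A quick standalone demonstration of the silent-acceptance pitfall (not project code):

```python
def init_old(**batch_kwargs):
    # Only the nested "batch_kwargs" key is ever read; everything else
    # passed by a pipeline definition disappears without an error.
    return batch_kwargs.get("batch_kwargs", {})

print(init_old(batch_kwargs={"num_procs": 8}))  # {'num_procs': 8}
print(init_old(num_porcs=8))                    # {} -- misspelled key silently ignored
```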
This appears to be unused now - no pipeline definitions include it, and it's not used in LLMBlock anywhere. Signed-off-by: Mark McLoughlin <[email protected]>
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
The change looks fine, though I'd request putting the extra explanation from the PR description in the commit message, as well
Yeah, there's two commits in this PR, and the first one has that explanation. Easy to miss with the stack of PRs - see commit 18f1513
(Depends on #121)
This appears to be unused now - no pipeline definitions include it, and it's not used in LLMBlock anywhere.

This change might throw up cases where we're passing unknown configuration to LLMBlock. Here's why ...

It's hard to spot, but this:

```
def __init__(self, ..., **batch_kwargs):
    ...
    self.batch_params = batch_kwargs.get("batch_kwargs", {})
```

is equivalent to this:

```
def __init__(self, ..., **kwargs):
    ...
    self.batch_params = kwargs.get("batch_kwargs", {})
```

which is equivalent to this:

```
def __init__(self, ..., batch_kwargs={}, **kwargs):
    ...
    self.batch_params = batch_kwargs
```

except that the trailing **kwargs meant we were silently accepting unknown block_config parameters.