Add add_num_samples to LLMBlock config #121

Closed
markmc wants to merge 8 commits

Conversation

markmc (Contributor) commented Jul 11, 2024

(Depends on #120)

Two pipelines include an LLMBlock that uses {num_samples} in its instructions to the teacher model. There needs to be some way to configure the LLMBlock so that num_samples will be included, but as per #82 (commit a01b04e) the value of num_samples should be based on the num_instructions_to_generate parameter.
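
As a rough sketch of the intended behavior (the add_num_samples key comes from the PR title; the helper below and its wiring into LLMBlock are assumptions, not the actual implementation):

```
from datasets import Dataset

def add_num_samples_column(samples: Dataset, num_instructions_to_generate: int) -> Dataset:
    """Attach a num_samples column so a `{num_samples}` placeholder in the
    block's instruction template can be filled in per sample."""
    # map() merges the returned dict into each sample's existing columns
    return samples.map(lambda _: {"num_samples": num_instructions_to_generate})

# Hypothetical wiring inside LLMBlock, guarded by the new config key:
#   if self.block_config.get("add_num_samples"):
#       samples = add_num_samples_column(samples, self.ctx.num_instructions_to_generate)
```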

markmc added 6 commits July 11, 2024 23:32
In preparation for custom pipeline configuration files, do not require
model_prompt as an LLMBlock param - it can have built-in knowledge
of the correct prompt to use per model_family.

Signed-off-by: Mark McLoughlin <[email protected]>
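
Illustratively, that built-in knowledge could be a simple per-family lookup (the template strings and names below are placeholders, not the project's actual prompts):

```
# Placeholder prompt templates keyed by model family
_MODEL_PROMPTS = {
    "merlinite": "<|system|>\n{system}\n<|user|>\n{prompt}\n<|assistant|>\n",
    "mixtral": "<s> [INST] {system} {prompt} [/INST]",
}

def model_prompt_for(model_family: str) -> str:
    """Return the built-in prompt template for a model family."""
    try:
        return _MODEL_PROMPTS[model_family]
    except KeyError:
        raise ValueError(f"unknown model family: {model_family}") from None
```
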
In order to prepare for pipeline definitions in YAML, move runtime
parameters like the OpenAI client, model ID, and model family out of
the pipeline definition and into a PipelineContext object that all
blocks have access to.

Signed-off-by: Mark McLoughlin <[email protected]>
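
A minimal sketch of such a context object (field names are inferred from the commit message; the real class may differ):

```
from dataclasses import dataclass
from typing import Any

@dataclass
class PipelineContext:
    client: Any        # the OpenAI client, shared by all blocks
    model_family: str  # e.g. "merlinite" or "mixtral"
    model_id: str
    num_instructions_to_generate: int
```
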
This addresses issues with using num_proc>1 with Dataset.map()
and Dataset.filter().

The first issue is:

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

What was happening here is that the entire FilterByValueBlock
object was being serialized to send to the multiprocessing
worker. And now that this includes PipelineContext, which
includes the OpenAI client object, which includes SSLContext,
we hit a known issue: uqfoundation/dill#308
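
The underlying limitation is easy to demonstrate in isolation:

```
import pickle
import ssl

# SSLContext wraps OS-level state and cannot be pickled, so any object
# graph that reaches one (block -> PipelineContext -> OpenAI client ->
# SSLContext) fails the same way when shipped to a multiprocessing worker.
pickle.dumps(ssl.create_default_context())
# TypeError: cannot pickle 'SSLContext' object
```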

The second issue is specific to map():

```
ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
```

It appears that in the datasets library, only in the case of num_proc>1,
when we hit the "error converting dtype" case and set the column
to None, the column still ends up being considered a string column
rather than the new dtype.

This second issue deserves further investigation and may require
a fix to the datasets library.
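
A hedged sketch of the pattern described above (based on this description, not a verified minimal reproducer):

```
from datasets import Dataset

ds = Dataset.from_dict({"score": ["1.5", "not-a-number"]})

def convert(sample):
    # mirror the "error converting dtype" handling: fall back to None
    try:
        sample["score"] = float(sample["score"])
    except ValueError:
        sample["score"] = None
    return sample

# With num_proc=1 the score column is re-typed to float64; with
# num_proc>1 a shard that produced None can remain typed as string,
# and merging the shards raises the "features can't be aligned" error.
ds = ds.map(convert, num_proc=2)
```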

Signed-off-by: Mark McLoughlin <[email protected]>
Address the following issue with using num_proc>1 with Dataset.map():

```
File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

The entire block object is being serialized to be sent to the
multiprocessing worker. And now that this includes PipelineContext,
which includes the OpenAI client object, which includes SSLContext,
we hit a known issue: uqfoundation/dill#308

Signed-off-by: Mark McLoughlin <[email protected]>
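
A common way around this class of failure (a sketch of the general technique, not necessarily what this commit does) is to give map() a module-level function plus only picklable arguments, so the block object never gets serialized:

```
from datasets import Dataset

def _set_num_samples(sample: dict, num_samples: int) -> dict:
    sample["num_samples"] = num_samples
    return sample

ds = Dataset.from_dict({"question": ["a", "b"]})
# Only the function and fn_kwargs are pickled for the workers; the
# block, its PipelineContext, and the OpenAI client stay behind.
ds = ds.map(_set_num_samples, fn_kwargs={"num_samples": 5}, num_proc=2)
```
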
Remove another runtime parameter from pipeline definitions, in order
to allow us to move to using YAML files.

Signed-off-by: Mark McLoughlin <[email protected]>
All Block subclasses but LLMBlock are failing to pass the
block_name from block_config down to the base class; instead,
they incorrectly pass the block type as the name.

Signed-off-by: Mark McLoughlin <[email protected]>
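
Schematically, the bug and fix look like this (class and argument names are illustrative):

```
class Block:
    def __init__(self, block_name: str) -> None:
        self.block_name = block_name

class FilterByValueBlock(Block):
    def __init__(self, block_config: dict) -> None:
        # bug: passing the block *type* as the name
        #   super().__init__("filter_by_value")
        # fix: passing the configured name through
        super().__init__(block_config["block_name"])
```
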
mergify bot added the testing label Jul 11, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

markmc added 2 commits July 12, 2024 00:56
In every use of FilterByValue in the default flows, we use batch_kwargs
to set num_proc=8.

This doesn't appear to be a pipeline author concern, but rather a
runtime parameter which should in future be based on the number of
available CPUs and (perhaps) user configuration.

For now, just move it from batch_kwargs to PipelineContext.

Signed-off-by: Mark McLoughlin <[email protected]>
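
Extending the earlier PipelineContext sketch, the move might look like this (the CPU-count default reflects the commit's stated future direction and is an assumption):

```
import os
from dataclasses import dataclass, field

@dataclass
class PipelineContext:
    # ... client/model fields as sketched earlier ...
    num_procs: int = field(default_factory=lambda: os.cpu_count() or 1)

# Blocks would then call, e.g.:
#   samples.filter(self._filter_fn, num_proc=self.ctx.num_procs)
# instead of reading num_proc from per-block batch_kwargs.
```
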
Two pipelines include an LLMBlock that uses `{num_samples}` in its
instructions to the teacher model. There needs to be some way to
configure the LLMBlock so that `num_samples` will be included, but
as per instructlab#82 (commit a01b04e) the value of `num_samples` should be
based on the `num_instructions_to_generate` parameter.

Signed-off-by: Mark McLoughlin <[email protected]>

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


e2e workflow failed on this PR: View run, please investigate.


e2e workflow succeeded on this PR: View run, congrats!

markmc mentioned this pull request Jul 12, 2024
russellb (Member) left a comment

I'm OK with this as a step toward the need to pull runtime configuration out of the flow configuration.

I do wonder if it could be even simpler and num_samples is just always set instead of having it configured like this. Doing it this way seems to be the safer change for the moment, since it maintains existing behavior.

markmc (Contributor, Author) commented Jul 12, 2024

> I'm OK with this as a step toward the need to pull runtime configuration out of the flow configuration.
>
> I do wonder if it could be even simpler and num_samples is just always set instead of having it configured like this. Doing it this way seems to be the safer change for the moment, since it maintains existing behavior.

Yep, this is firmly in the realm of "I don't feel safe touching this right now" for me

markmc (Contributor, Author) commented Jul 12, 2024

Thanks for the review @russellb! Closing in favor of just merging #86 in one shot

markmc closed this Jul 12, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024:

Bump step-security/harden-runner from 2.8.1 to 2.9.0