Allow block_config.config_path to be relative #118

markmc · 2024-07-11T23:06:06Z

(Depends on #117)

This removes another runtime parameter from pipeline definitions, which helps us move to using YAML files.

In preparation for custom pipeline configuration files, do not require model_prompt as an LLMBlock param - it can have built-in knowledge of the correct prompt to use per model_family.

Signed-off-by: Mark McLoughlin <[email protected]>

In order to prepare for pipeline definitions in YAML, remove runtime parameters like the OpenAI client, model ID, and model family from the pipeline definition into a PipelineContext object that all blocks have access to.

Signed-off-by: Mark McLoughlin <[email protected]>
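As a rough illustration of the idea in this commit message (not the code from the PR itself), the runtime values can live in one context object that every block receives, so the per-block configuration holds only static settings. The field names follow the commit message; the `Block` base class shape, the `"gen_questions"` name, and the placeholder model values are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PipelineContext:
    """Runtime parameters shared by every block in a pipeline."""

    client: Any  # e.g. an OpenAI client; typed loosely in this sketch
    model_family: str
    model_id: str


class Block:
    """Hypothetical base class: each block receives the shared context first."""

    def __init__(self, ctx: PipelineContext, block_name: str) -> None:
        self.ctx = ctx
        self.block_name = block_name


# Static block settings that could come from a YAML file:
# no client or model values appear here.
block_config = {"block_name": "gen_questions"}

# Placeholder runtime values, supplied once when the pipeline is created.
ctx = PipelineContext(
    client=object(), model_family="example-family", model_id="example-model"
)
block = Block(ctx, **block_config)
```

With this split, changing the model or client touches only the context, while the block definitions stay declarative.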

This addresses issues with using num_proc>1 with Dataset.map() and Dataset.filter().

The first issue is:

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

What was happening here is that the entire FilterByValueBlock object was being serialized to send to the multiprocessing worker. And now that this includes PipelineContext, which includes the OpenAI client object, which includes SSLContext, we hit a known issue: uqfoundation/dill#308

The second issue is specific to map():

```
ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
```

It appears that in the datasets library, only in the case of num_proc>1, when we hit the "error converting dtype" case and set the column to None, the column is still considered a string column rather than the new dtype. This second issue deserves further investigation and may require a fix to the datasets library.

Signed-off-by: Mark McLoughlin <[email protected]>

Address the following issue with using num_proc>1 with Dataset.map():

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

The entire block object is being serialized to be sent to the multiprocessing worker. And now that this includes PipelineContext, which includes the OpenAI client object, which includes SSLContext, we hit a known issue: uqfoundation/dill#308

Signed-off-by: Mark McLoughlin <[email protected]>
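The failure mode described in these two commit messages can be reproduced in miniature with the standard library. This is a sketch under assumptions, not the fix from the PR: it uses plain `pickle` rather than dill, and `FakeContext`, `FilterBlock`, and `keep_sample` are hypothetical stand-ins. A bound method drags the whole object (and its context) into the pickle, while a module-level function that takes only plain data does not.

```python
import pickle
import ssl
from functools import partial


class FakeContext:
    """Stand-in for a pipeline context whose client holds an SSLContext."""

    def __init__(self) -> None:
        self.ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)


class FilterBlock:
    """Hypothetical block that keeps a reference to the context."""

    def __init__(self, ctx: FakeContext, column: str, value: float) -> None:
        self.ctx = ctx
        self.column = column
        self.value = value

    def keep(self, sample: dict) -> bool:
        # Bound method: pickling it pickles `self`, including self.ctx.
        return sample[self.column] == self.value


def keep_sample(sample: dict, column: str, value: float) -> bool:
    """Module-level equivalent that needs only plain data."""
    return sample[column] == value


def can_pickle(obj: object) -> bool:
    """Report whether `obj` survives pickle.dumps()."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


block = FilterBlock(FakeContext(), column="score", value=2.0)
bound = block.keep  # serializing this fails because of the SSLContext
plain = partial(keep_sample, column=block.column, value=block.value)
```

Here `can_pickle(bound)` is false and `can_pickle(plain)` is true, which is why handing the worker a function that does not reference the block object avoids the error.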

In order to remove another runtime parameter from pipeline definitions to allow us to move to using YAML files.

Signed-off-by: Mark McLoughlin <[email protected]>

github-actions · 2024-07-11T23:10:26Z

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

github-actions · 2024-07-11T23:26:30Z

e2e workflow succeeded on this PR: View run, congrats!

russellb

just one minor non-blocking thought. I don't know what the error would look like if the path doesn't exist. It'd be good to check and ensure it's clear enough from the CLI what was wrong. Maybe it needs a friendly log message like I suggested inline, or maybe something else ...

this isn't super relevant until later when custom configs are supported, though. we can do some testing and see what error handling looks like once that's in place

russellb · 2024-07-12T01:38:49Z

src/instructlab/sdg/block.py

        :param config_path: The path to the configuration file.
        :return: The loaded configuration.
        """
+        if not os.path.isabs(config_path):
+            config_path = os.path.join(self.ctx.sdg_base, config_path)


a non-blocking improvement here would be to check if config_path exists prior to open() on the next line. If it doesn't exist, we could log a friendly message about it.

or maybe the pythonic way is to catch the exception from open() and log from there (and then re-raise it)

Yep, similar to Dan's error handling work in instructlab/eval#61
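A minimal sketch of the second suggestion above (catch the exception from `open()`, log, and re-raise), combined with the relative-path handling from the diff. The function names, the `base_dir` parameter (standing in for `self.ctx.sdg_base`), and the log wording are assumptions for illustration; the sketch returns raw text rather than parsing the file.

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_config_path(base_dir: str, config_path: str) -> str:
    """Return config_path unchanged if absolute, else join it to base_dir."""
    if not os.path.isabs(config_path):
        config_path = os.path.join(base_dir, config_path)
    return config_path


def read_config_text(base_dir: str, config_path: str) -> str:
    """Read the config file, logging a friendly message if it is missing."""
    full_path = resolve_config_path(base_dir, config_path)
    try:
        with open(full_path, "r", encoding="utf-8") as config_file:
            return config_file.read()
    except FileNotFoundError:
        # Log both the resolved path and the value as written in the
        # pipeline definition, then let the original exception propagate.
        logger.error(
            "Block config file not found: %s (config_path=%r)",
            full_path,
            config_path,
        )
        raise
```

Re-raising keeps the original traceback intact, so callers that already handle `FileNotFoundError` are unaffected.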

markmc · 2024-07-12T07:56:22Z

> just one minor non-blocking thought. I don't know what the error would look like if the path doesn't exist. It'd be good to check and ensure it's clear enough from the CLI what was wrong. Maybe it needs a friendly log message like I suggested inline, or maybe something else ...
>
> this isn't super relevant until later when custom configs are supported, though. we can do some testing and see what error handling looks like once that's in place

In general, I think when there is an error with a block, we don't make it easy to know which block in which pipeline caused it

This case looks like

  File "/home/markmc/sdg/src/instructlab/sdg/sdg.py", line 19, in generate
    dataset = pipeline.generate(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/markmc/sdg/src/instructlab/sdg/pipeline.py", line 57, in generate
    block = block_type(self.ctx, **block_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/markmc/sdg/src/instructlab/sdg/llmblock.py", line 66, in __init__
    self.block_config = self._load_config(config_path)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/markmc/sdg/src/instructlab/sdg/block.py", line 54, in _load_config
    with open(config_path, "r", encoding="utf-8") as config_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/markmc/sdg/src/instructlab/sdg/configs/skills/freeform_questions.yaml.foobar'

Which isn't so bad, because you just have to find the block with that path

Filed #128
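One possible shape for the improvement discussed here, sketched under assumptions: wrap each block's execution and re-raise with the pipeline and block names attached, chaining the original exception. `BlockError` and `run_blocks` are hypothetical names invented for this example and do not describe what #128 ended up doing.

```python
from typing import Any, Callable, Iterable, Tuple


class BlockError(Exception):
    """Hypothetical error type that names the failing block and pipeline."""

    def __init__(self, pipeline_name: str, block_name: str) -> None:
        super().__init__(
            f"block {block_name!r} in pipeline {pipeline_name!r} failed"
        )
        self.pipeline_name = pipeline_name
        self.block_name = block_name


def run_blocks(
    pipeline_name: str,
    blocks: Iterable[Tuple[str, Callable[[Any], Any]]],
    dataset: Any,
) -> Any:
    """Run (name, callable) pairs in order, tagging any failure."""
    for block_name, block_fn in blocks:
        try:
            dataset = block_fn(dataset)
        except Exception as exc:
            # `from exc` preserves the original traceback as __cause__.
            raise BlockError(pipeline_name, block_name) from exc
    return dataset
```

With this wrapper, a `FileNotFoundError` like the one in the traceback above would surface as the cause of an error that already states which block and pipeline were involved.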

markmc · 2024-07-12T16:12:01Z

Thanks for the review @russellb! Closing in favor of just merging #86 in one shot

markmc added 5 commits July 11, 2024 23:32

Allow block_config.config_path to be relative

23dd08e

In order to remove another runtime parameter from pipeline definitions to allow us to move to using YAML files.

Signed-off-by: Mark McLoughlin <[email protected]>

mergify bot added the testing Relates to testing label Jul 11, 2024

markmc mentioned this pull request Jul 11, 2024

Fix block_name handling #119

Closed

russellb approved these changes Jul 12, 2024

markmc mentioned this pull request Jul 12, 2024

Error handling - identify which block in which pipeline triggers an exception #128

Closed

markmc mentioned this pull request Jul 12, 2024

Add a YAML based file format for pipelines #86

Merged

markmc closed this Jul 12, 2024
