Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block Name In Errors #155

Merged
merged 5 commits into from
Jul 18, 2024
Merged

Conversation

gabe-l-hart
Copy link
Contributor

Description

This PR adds try/except handling around the core loop in Pipeline. In the handler clause, it creates a wrapper exception, attempting to match the type of the internal error, with the block_name and block_type included.

Closes: #128

@mergify mergify bot added the testing Relates to testing label Jul 16, 2024
Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one suggestion to localize the try/except handling to only the parts calling block-specific code.

Also, do you have an example of what an error looked like before and after the change?

Comment on lines 75 to 82
block = block_type(self.ctx, self, block_name, **block_config)

logger.info("Running block: %s", block_name)
logger.info(dataset)

dataset = block.generate(dataset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these lines capture the block-specific code, so you could localize your addition to this part, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Do you think it's worth containing the config parsing as well? It seems like the use of direct indexing (vs .get) for "name", "type", and "config" implies that these are all required fields, but I don't see anywhere that they're validated (though that might happen elsewhere that I haven't found yet?). If it's not done elsewhere, it may be worth doing that validation in the __init__ to avoid getting part-way through the expensive pipeline operation before discovering configuration errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we now have some config validation, at least for the configs we have in the tree. See make validate-pipelines. It uses a jsonschema definition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, cool, makes sense!

@gabe-l-hart gabe-l-hart force-pushed the BlockNameInErrors-128 branch from 46f3e0d to 10955fe Compare July 16, 2024 19:40
@gabe-l-hart
Copy link
Contributor Author

Thanks for the suggestion! I've localized the error handling to just the generate call.

Also, do you have an example of what an error looked like before and after the change?

I don't have a concrete example of before/after, but this uses raise ... from ... syntax. The parent exception will attempt to match the exception type of the one raised in the block with a message "BLOCK ERROR [{block_type class name}/{block_name}]: {child exc message}". The original exception can be accessed via the __cause__ attribute of the wrapper exception.

@markmc
Copy link
Contributor

markmc commented Jul 17, 2024

Also, do you have an example of what an error looked like before and after the change?

Here's what an exception from LLMBlock.generate() looks like:

INFO 2024-07-17 07:08:14,120 pipeline.py:77: generate Running block: gen_questions
INFO 2024-07-17 07:08:14,120 pipeline.py:78: generate Dataset({
    features: ['task_description', 'seed_question', 'seed_response'],
    num_rows: 5
})
Traceback (most recent call last):
...
  File "/home/markmc/sdg/src/instructlab/sdg/sdg.py", line 20, in generate
    dataset = pipeline.generate(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/markmc/sdg/src/instructlab/sdg/pipeline.py", line 80, in generate
    dataset = block.generate(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/markmc/sdg/src/instructlab/sdg/llmblock.py", line 181, in generate
    raise Exception("TEST")
Exception: TEST

This isn't actually so problematic since we log which block is being executed, so the user has some context to use in order to debug the error

The example I gave in #128 is failing to load a config file when instantiating a block - an error with the pipeline definition, but also one that can't be detected by the schema

"""

def __init__(self, block: Block, exception: Exception):
self.block = block
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: underscore prefix this if you want just .block_name and .block_type to be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left that one "public" thinking that some might like to be able to actually muck with the block itself when handling the exception.

src/instructlab/sdg/pipeline.py Outdated Show resolved Hide resolved
src/instructlab/sdg/pipeline.py Outdated Show resolved Hide resolved
Copy link
Contributor

@markmc markmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @gabe-l-hart !

@markmc markmc merged commit 263372b into instructlab:main Jul 18, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error handling - identify which block in which pipeline triggers an exception
4 participants