Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hotfix for Filterblocks #72

Closed
wants to merge 8 commits into from

Conversation

shivchander
Copy link
Member

@shivchander shivchander commented Jul 3, 2024

Major changes:

  • Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.
  • Added a private method _fiter_invalid_values to filter out samples with invalid values based on the valid_values list.
  • Added a private method _convert_dtype to handle potential ValueError during data type conversion, setting the value to None if an error occurs.
  • Modified the generate method to include a call to _fiter_invalid_values after the data type conversion (if applicable), ensuring that only valid samples are processed further.

Minor changes:

  • Slightly better logging added to pipeline.py - added input and output logging and some new lines to make it more readable
  • Updated test script imports to use instructlab.sdg instead of importing from src
  • Fixed default flows to use ints and not floats as expected in the template

@mergify mergify bot added ci-failure and removed ci-failure labels Jul 3, 2024
src/instructlab/sdg/filterblock.py Show resolved Hide resolved
src/instructlab/sdg/default_flows.py Show resolved Hide resolved
@@ -194,6 +194,7 @@ def get_flow(self) -> list:
"block_name": "filter_faithfulness",
"filter_column": "judgment",
"filter_value": "YES",
"valid_values": ["YES", "NO"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filter currently only accepts judgement=YES ... so adding is a no-op?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We saw that the judgement was sometimes getting populated with values besides "YES" and "NO" and that was breaking the code. Hence it was important to add a set of valid values or expected values from the evaluation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change for introducing valid_values is not shortsighted for the current scope and flows. This is useful in making the filter blocks more generic. For instance if you need to filter the outputs from a block using the in operation to check if the generated output belongs in a list of strings.

As a general note and practice, language model output are not deterministic, restricting with these explicit valid_values makes it more generic and robust

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, if this is an "we don't need it now, but we think we'll need it later" argument - I'll refer again to the YAGNI principle :)

But my "we don't need it now" conclusion might be a misunderstanding. Here's pseudo-code showing my understanding:

    def filter_sample(sample):
        value = sample["judgment"]
        if value not in ["YES", "NO"]:
            return False
        if value != "YES":
            return False
        return True

Am I missing something?

(If I'm understanding correctly, I think it's important we avoid adding additional unnecessary complexity like this, because it's confusing enough already)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @markmc -- that pseudocode matches my understanding, and so I have the same "can you help me understand why we need it?" question

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will close this PR and reintroduce the retaining valid values as its own block (since we still need it). Thanks for isolating the bug fix

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@russellb it's a bugfix as Aakanksha already explained. Without having a valid_values filter, the code was breaking. I don't know the specifics but clearly it's offered as a bug fix, and Aakanksha tested it independently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, @mairin, but I've looked at this long enough that that just isn't correct. I fully believe there may be a use case for it and assume further discussion will help clarify that, but at least as used in the existing pipelines within this PR, the filter part is not fixing anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shivchander

I will close this PR and reintroduce the retaining valid values as its own block (since we still need it).

Sounds good. It looks like you can keep it one block, but just change the value the current block takes to be a set instead of a single value. Then both use cases can be served by the same block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shivchander here's what I had in mind to just augment the existing block -- 6c1f3ee

though note I've only tested via that unit test so far

src/instructlab/sdg/default_flows.py Show resolved Hide resolved
@markmc
Copy link
Contributor

markmc commented Jul 3, 2024

Thanks for the explanations - please include these in the commit messages though, so they show up in git log and git blame

Major changes:

  • Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.

Every instance of using this appears to be a no-op since it filters a subset of what filter_value is already filtering?

  • Added a private method _convert_dtype to handle potential ValueError during data type conversion, setting the value to None if an error occurs.

We shouldn't be suppressing errors IMO - it will make things really hard to debug

…ror + minor linting

The FilterByValueBlock class now handles ValueError exceptions when converting data types. If a ValueError occurs, the block logs an error message and fills the column with None to be filtered later.

Signed-off-by: shiv <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Jul 3, 2024
@markmc
Copy link
Contributor

markmc commented Jul 3, 2024

  • Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.

Every instance of using this appears to be a no-op since it filters a subset of what filter_value is already filtering?

A discussion on this point continues in a resolved comment

@mergify mergify bot added ci-failure and removed ci-failure labels Jul 3, 2024
@mergify mergify bot removed the ci-failure label Jul 3, 2024
@markmc
Copy link
Contributor

markmc commented Jul 3, 2024

  • Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.

Every instance of using this appears to be a no-op since it filters a subset of what filter_value is already filtering?

A discussion on this point continues in a resolved comment

I gather the point is that filter_value is actually required for downstream custom flows (see instructlab/dev-docs#109). That's an entirely different situation!

Please go ahead and add filter_value for the custom flow use case, but do not use it in the current flows where it is not required and just adds complexity

filter_column,
filter_value,
operation,
valid_values,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that would help a lot here is adding some documentation of these parameters that explains how they're all to be used. That would, I hope, help some of the discussions we're having trying to understand the changes. Can you add a docstring to this method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#76 Sounds good, lets address this in a follow up

Copy link
Contributor

@markmc markmc Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, one thing that helps is ... filter_column, filter_value, and operation go together

filter_column and valid_values go together too

so, it's like:

  • filter on filter_column,
  • first (and optionally) by checking that filter_column contains one of valid_values
  • second by comparing the value in the column matches filter_value using the given operation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the confusion is "why do we need the second bullet again?" right?

is it that actually you really only need one or the other?

or what's the example for when you need them both together?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or put another way, two types of filtering?

could they be collapsed into one, where it's always a set, and you might just have only one element in that set?

Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just making my review explicit while the active discussion continues in comment threads

Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shivchander
looks good to me 🚢
Tested this using the vllm endpoint and works fine 💯

Approving this but can we please merge this after #77

russellb added a commit to russellb/sdg that referenced this pull request Jul 4, 2024
The previous code in this block did filtering assuming that all
samples had a value that was correct for the type. For example, when
filtering on an integer value, it assumed every row had a valid
integer, where it may instead have garbage.

This change introduces a new helper, _convert_dtype(), which properly
handles this condition. When the conversion fails on a `ValueError`
exception, it treats it as `None` instead of allowing the exception to
be raised up to the caller.

The fix was authored by Shiv in PR instructlab#72. I only pulled it out into a
standalone commit.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: shiv <[email protected]>
@shivchander
Copy link
Member Author

Closing this PR in favor of #78 and will create a new one to handle the valid values as its own Block

@shivchander shivchander closed this Jul 4, 2024
russellb added a commit to russellb/sdg that referenced this pull request Jul 7, 2024
This block previously only accepted a single value to filter on. This
update makes it handle a list, as well. In that case, it will ensure
that the filter matches one of the values in the list.

This is an updated implementation of the feature originally proposed
in instructlab#72.

Signed-off-by: Russell Bryant <[email protected]>
oindrillac pushed a commit to oindrillac/sdg that referenced this pull request Jul 8, 2024
The previous code in this block did filtering assuming that all
samples had a value that was correct for the type. For example, when
filtering on an integer value, it assumed every row had a valid
integer, where it may instead have garbage.

This change introduces a new helper, _convert_dtype(), which properly
handles this condition. When the conversion fails on a `ValueError`
exception, it treats it as `None` instead of allowing the exception to
be raised up to the caller.

The fix was authored by Shiv in PR instructlab#72. I only pulled it out into a
standalone commit.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: shiv <[email protected]>
oindrillac pushed a commit to oindrillac/sdg that referenced this pull request Jul 8, 2024
This block previously only accepted a single value to filter on. This
update makes it handle a list, as well. In that case, it will ensure
that the filter matches one of the values in the list.

This is an updated implementation of the feature originally proposed
in instructlab#72.

Signed-off-by: Russell Bryant <[email protected]>
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024
naming: Add style guidance for Merlinite and Granite
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants