-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hotfix for Filterblocks #72
Conversation
Signed-off-by: shiv <[email protected]>
Signed-off-by: shiv <[email protected]>
Signed-off-by: shiv <[email protected]>
Signed-off-by: shiv <[email protected]>
Signed-off-by: shiv <[email protected]>
@@ -194,6 +194,7 @@ def get_flow(self) -> list: | |||
"block_name": "filter_faithfulness", | |||
"filter_column": "judgment", | |||
"filter_value": "YES", | |||
"valid_values": ["YES", "NO"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filter currently only accepts judgement=YES
... so adding is a no-op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We saw that the judgement was sometimes getting populated with values besides "YES" and "NO" and that was breaking the code. Hence it was important to add a set of valid values or expected values from the evaluation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change for introducing valid_values is not shortsighted for the current scope and flows. This is useful in making the filter blocks more generic. For instance if you need to filter the outputs from a block using the in
operation to check if the generated output belongs in a list of strings.
As a general note and practice, language model output are not deterministic, restricting with these explicit valid_values makes it more generic and robust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, if this is an "we don't need it now, but we think we'll need it later" argument - I'll refer again to the YAGNI principle :)
But my "we don't need it now" conclusion might be a misunderstanding. Here's pseudo-code showing my understanding:
def filter_sample(sample):
value = sample["judgment"]
if value not in ["YES", "NO"]:
return False
if value != "YES":
return False
return True
Am I missing something?
(If I'm understanding correctly, I think it's important we avoid adding additional unnecessary complexity like this, because it's confusing enough already)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @markmc -- that pseudocode matches my understanding, and so I have the same "can you help me understand why we need it?" question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will close this PR and reintroduce the retaining valid values as its own block (since we still need it). Thanks for isolating the bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@russellb it's a bugfix as Aakanksha already explained. Without having a valid_values filter, the code was breaking. I don't know the specifics but clearly it's offered as a bug fix, and Aakanksha tested it independently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, @mairin, but I've looked at this long enough that that just isn't correct. I fully believe there may be a use case for it and assume further discussion will help clarify that, but at least as used in the existing pipelines within this PR, the filter part is not fixing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will close this PR and reintroduce the retaining valid values as its own block (since we still need it).
Sounds good. It looks like you can keep it one block, but just change the value the current block takes to be a set instead of a single value. Then both use cases can be served by the same block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shivchander here's what I had in mind to just augment the existing block -- 6c1f3ee
though note I've only tested via that unit test so far
Thanks for the explanations - please include these in the commit messages though, so they show up in
Every instance of using this appears to be a no-op since it filters a subset of what
We shouldn't be suppressing errors IMO - it will make things really hard to debug |
…ror + minor linting The FilterByValueBlock class now handles ValueError exceptions when converting data types. If a ValueError occurs, the block logs an error message and fills the column with None to be filtered later. Signed-off-by: shiv <[email protected]>
A discussion on this point continues in a resolved comment |
Signed-off-by: shiv <[email protected]>
Signed-off-by: shiv <[email protected]>
I gather the point is that Please go ahead and add |
filter_column, | ||
filter_value, | ||
operation, | ||
valid_values, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that would help a lot here is adding some documentation of these parameters that explains how they're all to be used. That would, I hope, help some of the discussions we're having trying to understand the changes. Can you add a docstring to this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#76 Sounds good, lets address this in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, one thing that helps is ... filter_column
, filter_value
, and operation
go together
filter_column
and valid_values
go together too
so, it's like:
- filter on filter_column,
- first (and optionally) by checking that filter_column contains one of valid_values
- second by comparing the value in the column matches filter_value using the given operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the confusion is "why do we need the second bullet again?" right?
is it that actually you really only need one or the other?
or what's the example for when you need them both together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... or put another way, two types of filtering?
could they be collapsed into one, where it's always a set, and you might just have only one element in that set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just making my review explicit while the active discussion continues in comment threads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shivchander
looks good to me 🚢
Tested this using the vllm endpoint and works fine 💯
Approving this but can we please merge this after #77
The previous code in this block did filtering assuming that all samples had a value that was correct for the type. For example, when filtering on an integer value, it assumed every row had a valid integer, where it may instead have garbage. This change introduces a new helper, _convert_dtype(), which properly handles this condition. When the conversion fails on a `ValueError` exception, it treats it as `None` instead of allowing the exception to be raised up to the caller. The fix was authored by Shiv in PR instructlab#72. I only pulled it out into a standalone commit. Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: shiv <[email protected]>
Closing this PR in favor of #78 and will create a new one to handle the valid values as its own Block |
This block previously only accepted a single value to filter on. This update makes it handle a list, as well. In that case, it will ensure that the filter matches one of the values in the list. This is an updated implementation of the feature originally proposed in instructlab#72. Signed-off-by: Russell Bryant <[email protected]>
The previous code in this block did filtering assuming that all samples had a value that was correct for the type. For example, when filtering on an integer value, it assumed every row had a valid integer, where it may instead have garbage. This change introduces a new helper, _convert_dtype(), which properly handles this condition. When the conversion fails on a `ValueError` exception, it treats it as `None` instead of allowing the exception to be raised up to the caller. The fix was authored by Shiv in PR instructlab#72. I only pulled it out into a standalone commit. Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: shiv <[email protected]>
This block previously only accepted a single value to filter on. This update makes it handle a list, as well. In that case, it will ensure that the filter matches one of the values in the list. This is an updated implementation of the feature originally proposed in instructlab#72. Signed-off-by: Russell Bryant <[email protected]>
naming: Add style guidance for Merlinite and Granite
Major changes:
valid_values
in theFilterByValueBlock
class to specify a list of valid values for filtering._fiter_invalid_values
to filter out samples with invalid values based on thevalid_values
list._convert_dtype
to handle potentialValueError
during data type conversion, setting the value toNone
if an error occurs.generate
method to include a call to_fiter_invalid_values
after the data type conversion (if applicable), ensuring that only valid samples are processed further.Minor changes:
pipeline.py
- added input and output logging and some new lines to make it more readable