Hotfix for Filterblocks #72

shivchander · 2024-07-03T04:05:06Z

Major changes:

Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.
Added a private method _fiter_invalid_values to filter out samples with invalid values based on the valid_values list.
Added a private method _convert_dtype to handle potential ValueError during data type conversion, setting the value to None if an error occurs.
Modified the generate method to include a call to _fiter_invalid_values after the data type conversion (if applicable), ensuring that only valid samples are processed further.

Minor changes:

Slightly better logging added to pipeline.py - added input and output logging and some new lines to make it more readable
Updated test script imports to use instructlab.sdg instead of importing from src
Fixed default flows to use ints and not floats as expected in the template

Signed-off-by: shiv <[email protected]>

src/instructlab/sdg/filterblock.py

src/instructlab/sdg/default_flows.py

markmc · 2024-07-03T12:42:18Z

src/instructlab/sdg/default_flows.py

@@ -194,6 +194,7 @@ def get_flow(self) -> list:
                    "block_name": "filter_faithfulness",
                    "filter_column": "judgment",
                    "filter_value": "YES",
+                    "valid_values": ["YES", "NO"],


The filter currently only accepts judgment=YES ... so adding valid_values is a no-op?

We saw that the judgement was sometimes getting populated with values besides "YES" and "NO" and that was breaking the code. Hence it was important to add a set of valid values or expected values from the evaluation.

Introducing valid_values is not shortsighted or limited to the current scope and flows. It is useful for making the filter blocks more generic. For instance, you may need to filter the outputs from a block using the in operation to check whether the generated output belongs to a list of strings.

As a general note and practice, language model outputs are not deterministic, so restricting them with explicit valid_values makes the filtering more generic and robust.

Ok, if this is a "we don't need it now, but we think we'll need it later" argument - I'll refer again to the YAGNI principle :)

But my "we don't need it now" conclusion might be a misunderstanding. Here's pseudo-code showing my understanding:

    def filter_sample(sample):
        value = sample["judgment"]
        if value not in ["YES", "NO"]:
            return False
        if value != "YES":
            return False
        return True

Am I missing something?

(If I'm understanding correctly, I think it's important we avoid adding additional unnecessary complexity like this, because it's confusing enough already)
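
The claim in the pseudocode can be checked mechanically. The following is a minimal sketch, assuming samples are plain dicts with a "judgment" key; the helper names are illustrative and are not part of instructlab.sdg. For an equality filter on "YES", running a valid-values check first does not change which samples survive.

```python
def filter_with_valid_values(sample, filter_value="YES", valid_values=("YES", "NO")):
    """Two-stage check, mirroring the pseudocode in the review comment."""
    value = sample["judgment"]
    if value not in valid_values:
        return False
    return value == filter_value


def filter_by_value_only(sample, filter_value="YES"):
    """Single-stage check: equality against filter_value alone."""
    return sample["judgment"] == filter_value


# Hypothetical judge outputs, including values outside YES/NO.
samples = [{"judgment": v} for v in ["YES", "NO", "MAYBE", "", "yes", None]]

kept_two_stage = [s for s in samples if filter_with_valid_values(s)]
kept_one_stage = [s for s in samples if filter_by_value_only(s)]

# Both variants keep exactly the samples whose judgment is "YES".
assert kept_two_stage == kept_one_stage
```

The equivalence relies on the operation being equality and on filter_value itself being one of the valid values. With a different operation (for example "not equal to NO"), the two variants would diverge, because "MAYBE" would pass the single-stage check but fail the valid-values check.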

I'm with @markmc -- that pseudocode matches my understanding, and so I have the same "can you help me understand why we need it?" question

I will close this PR and reintroduce the retaining valid values as its own block (since we still need it). Thanks for isolating the bug fix

@russellb it's a bugfix as Aakanksha already explained. Without having a valid_values filter, the code was breaking. I don't know the specifics but clearly it's offered as a bug fix, and Aakanksha tested it independently.

I'm sorry, @mairin, but I've looked at this long enough that that just isn't correct. I fully believe there may be a use case for it and assume further discussion will help clarify that, but at least as used in the existing pipelines within this PR, the filter part is not fixing anything.

@shivchander

I will close this PR and reintroduce the retaining valid values as its own block (since we still need it).

Sounds good. It looks like you can keep it one block, but just change the value the current block takes to be a set instead of a single value. Then both use cases can be served by the same block.

@shivchander here's what I had in mind to just augment the existing block -- 6c1f3ee

though note I've only tested via that unit test so far

src/instructlab/sdg/default_flows.py

markmc · 2024-07-03T12:55:42Z

Thanks for the explanations - please include these in the commit messages though, so they show up in git log and git blame

Major changes:

Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.

Every instance of using this appears to be a no-op since it filters a subset of what filter_value is already filtering?

Added a private method _convert_dtype to handle potential ValueError during data type conversion, setting the value to None if an error occurs.

We shouldn't be suppressing errors IMO - it will make things really hard to debug

…ror + minor linting The FilterByValueBlock class now handles ValueError exceptions when converting data types. If a ValueError occurs, the block logs an error message and fills the column with None to be filtered later. Signed-off-by: shiv <[email protected]>

markmc · 2024-07-03T17:28:44Z

Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.

Every instance of using this appears to be a no-op since it filters a subset of what filter_value is already filtering?

A discussion on this point continues in a resolved comment

Signed-off-by: shiv <[email protected]>

markmc · 2024-07-03T18:58:18Z

Introduced a new parameter valid_values in the FilterByValueBlock class to specify a list of valid values for filtering.

Every instance of using this appears to be a no-op since it filters a subset of what filter_value is already filtering?

A discussion on this point continues in a resolved comment

I gather the point is that filter_value is actually required for downstream custom flows (see instructlab/dev-docs#109). That's an entirely different situation!

Please go ahead and add filter_value for the custom flow use case, but do not use it in the current flows where it is not required and just adds complexity

russellb · 2024-07-03T19:00:12Z

src/instructlab/sdg/filterblock.py

+        filter_column,
+        filter_value,
+        operation,
+        valid_values,


Something that would help a lot here is adding some documentation of these parameters that explains how they're all to be used. That would, I hope, help some of the discussions we're having trying to understand the changes. Can you add a docstring to this method?

#76 Sounds good, let's address this in a follow-up

Yeah, one thing that helps is ... filter_column, filter_value, and operation go together

filter_column and valid_values go together too

so, it's like:

filter on filter_column,

first (and optionally) by checking that filter_column contains one of valid_values

second by checking that the value in the column matches filter_value using the given operation

and the confusion is "why do we need the second bullet again?" right?

is it that actually you really only need one or the other?

or what's the example for when you need them both together?

... or put another way, two types of filtering?

could they be collapsed into one, where it's always a set, and you might just have only one element in that set?

russellb

just making my review explicit while the active discussion continues in comment threads

aakankshaduggal

Thanks @shivchander
looks good to me 🚢
Tested this using the vllm endpoint and works fine 💯

Approving this, but can we please merge this after #77?

The previous code in this block did filtering assuming that all samples had a value that was correct for the type. For example, when filtering on an integer value, it assumed every row had a valid integer, where it may instead have garbage. This change introduces a new helper, _convert_dtype(), which properly handles this condition. When the conversion fails on a `ValueError` exception, it treats it as `None` instead of allowing the exception to be raised up to the caller. The fix was authored by Shiv in PR instructlab#72. I only pulled it out into a standalone commit. Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: shiv <[email protected]>
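
The behaviour described in this commit message can be sketched as follows. This is a minimal illustration on plain dicts, not the code from the commit: the function name `convert_dtype`, the "score" column, and the log message are assumptions made for the example.

```python
import logging

logger = logging.getLogger(__name__)


def convert_dtype(value, dtype):
    """Return dtype(value), or None if the conversion raises ValueError."""
    try:
        return dtype(value)
    except ValueError as err:
        # Log the failure so that dropped rows can still be traced later.
        logger.error("Could not convert %r to %s: %s", value, dtype.__name__, err)
        return None


# Hypothetical rows: one of them holds garbage instead of an integer.
rows = [{"score": "2"}, {"score": "garbage"}, {"score": "1"}]
converted = [{**row, "score": convert_dtype(row["score"], int)} for row in rows]

# A value of None never equals an integer filter value,
# so an equality filter drops the row that failed conversion.
kept = [row for row in converted if row["score"] == 2]
```

Two caveats apply to this sketch. Only `ValueError` is caught, so a value such as `None` passed to `int()` still raises `TypeError`. Also, ordering comparisons like `>=` against `None` raise `TypeError` in Python 3, so a filter using such an operation would need to skip `None` values explicitly; how the real block handles that case is not shown in this thread.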

shivchander · 2024-07-04T00:33:52Z

Closing this PR in favor of #78 and will create a new one to handle the valid values as its own Block

This block previously only accepted a single value to filter on. This update makes it handle a list, as well. In that case, it will ensure that the filter matches one of the values in the list. This is an updated implementation of the feature originally proposed in instructlab#72. Signed-off-by: Russell Bryant <[email protected]>
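
One way to read "handle a list" is to normalize a single value into a one-element list and accept a sample if the operation holds for any entry. The sketch below assumes that reading; `matches` is a hypothetical helper on plain values, not the block's actual API.

```python
import operator


def matches(value, filter_value, operation=operator.eq):
    """True if `value` satisfies `operation` against any accepted filter value."""
    # Normalize a single value to a one-element list so both cases share one path.
    candidates = filter_value if isinstance(filter_value, list) else [filter_value]
    return any(operation(value, candidate) for candidate in candidates)


# Hypothetical judge outputs.
samples = [{"judgment": v} for v in ["YES", "NO", "MAYBE"]]

# Single value: behaves like the original single-value filter.
only_yes = [s for s in samples if matches(s["judgment"], "YES")]

# List of values: keeps any sample matching one of the entries.
yes_or_no = [s for s in samples if matches(s["judgment"], ["YES", "NO"])]
```

With this shape, the "retain only valid values" use case and the "filter on one value" use case are served by the same code path, which is the collapse into a single set of accepted values suggested in the review thread.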

shivchander added 4 commits July 2, 2024 23:00

🚑 Making filterblock more robust by introducing valid values

6406d5e

Signed-off-by: shiv <[email protected]>

🔊 slightly better logging

507e325

Signed-off-by: shiv <[email protected]>

🔧 updating grounded synth skills default flow

bada90e

Signed-off-by: shiv <[email protected]>

✅ updating test script imports

d5fc59d

Signed-off-by: shiv <[email protected]>

shivchander requested review from aakankshaduggal and oindrillac July 3, 2024 04:05

mergify bot added the ci-failure label Jul 3, 2024

🚨 fixing linter issues

1af5b74

Signed-off-by: shiv <[email protected]>

mergify bot added ci-failure and removed ci-failure labels Jul 3, 2024

markmc requested changes Jul 3, 2024

mergify bot added ci-failure and removed ci-failure labels Jul 3, 2024

🚨 fixing linter issues + removing unnecessary import in test scripts

7d33095

Signed-off-by: shiv <[email protected]>

mergify bot added ci-failure and removed ci-failure labels Jul 3, 2024

🚨 fix linter warnings

ed7d74b

Signed-off-by: shiv <[email protected]>

mergify bot removed the ci-failure label Jul 3, 2024

russellb reviewed Jul 3, 2024

russellb requested changes Jul 3, 2024

aakankshaduggal approved these changes Jul 3, 2024

russellb mentioned this pull request Jul 4, 2024

Handle type conversion errors in FilterByValueBlock #78

Merged

shivchander closed this Jul 4, 2024

jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024

Merge pull request instructlab#72 from russellb/model-namews

43e679f

naming: Add style guidance for Merlinite and Granite
