Hotfix for Filterblocks #72

Closed
wants to merge 8 commits
7 changes: 4 additions & 3 deletions scripts/test_freeform_skills.py
@@ -3,9 +3,10 @@
from openai import OpenAI

# First Party
-from src.instructlab.sdg import SDG
-from src.instructlab.sdg.default_flows import SynthSkillsFlow
-from src.instructlab.sdg.pipeline import Pipeline
+from instructlab.sdg import SDG
+from instructlab.sdg.default_flows import SynthSkillsFlow
+from instructlab.sdg.pipeline import Pipeline


# for vLLM endpoints, the api_key remains "EMPTY"
openai_api_key = "EMPTY"
7 changes: 4 additions & 3 deletions scripts/test_grounded_skills.py
@@ -3,9 +3,10 @@
from openai import OpenAI

# First Party
-from src.instructlab.sdg import SDG
-from src.instructlab.sdg.default_flows import SynthGroundedSkillsFlow
-from src.instructlab.sdg.pipeline import Pipeline
+from instructlab.sdg import SDG
+from instructlab.sdg.default_flows import SynthGroundedSkillsFlow
+from instructlab.sdg.pipeline import Pipeline


# for vLLM endpoints, the api_key remains "EMPTY"
openai_api_key = "EMPTY"
9 changes: 5 additions & 4 deletions scripts/test_knowledge.py
@@ -6,11 +6,12 @@
from openai import OpenAI

# First Party
-from src.instructlab.sdg import SDG
-from src.instructlab.sdg.default_flows import MMLUBenchFlow, SynthKnowledgeFlow
-from src.instructlab.sdg.pipeline import Pipeline
+from instructlab.sdg import SDG
+from instructlab.sdg.default_flows import MMLUBenchFlow, SynthKnowledgeFlow
+from instructlab.sdg.pipeline import Pipeline

+# Please don't add your vLLM endpoint key here

# for vLLM endpoints, the api_key remains "EMPTY"
openai_api_key = "EMPTY"
openai_api_base = "Add model endpoint here"

30 changes: 19 additions & 11 deletions src/instructlab/sdg/default_flows.py
@@ -194,6 +194,7 @@ def get_flow(self) -> list:
"block_name": "filter_faithfulness",
"filter_column": "judgment",
"filter_value": "YES",
+"valid_values": ["YES", "NO"],
Contributor

The filter currently only accepts judgment=YES ... so adding this is a no-op?

Member

We saw that the judgment was sometimes populated with values besides "YES" and "NO", and that was breaking the code. Hence it was important to add a set of valid (expected) values from the evaluation.
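
For anyone following along, here is a minimal standalone sketch (not code from this PR) of how an unexpected value can break the flow, based on the convert_dtype handling visible in the filterblock.py diff below; the sample data is made up:

    # Scored samples as a pipeline might see them; the last value is free-form
    # model output rather than one of the expected values.
    samples = [{"score": "2"}, {"score": "1"}, {"score": "I would rate this a 2"}]

    cleaned = []
    for s in samples:
        # The previous generate() applied convert_dtype directly, so a value like
        # "I would rate this a 2" raised ValueError and aborted the whole map.
        try:
            value = float(s["score"])
        except ValueError:
            value = None  # what the new _convert_dtype does instead of crashing
        # The valid_values guard then drops anything outside the expected set.
        if value in [0, 1, 2]:
            cleaned.append({"score": value})

    print(cleaned)  # [{'score': 2.0}, {'score': 1.0}]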

Member Author

Introducing valid_values is not a change scoped narrowly to the current flows; it is useful for making the filter blocks more generic. For instance, you may need to filter the outputs of a block with the in operation, checking whether a generated output belongs to a list of strings.

As a general note and practice, language model outputs are not deterministic; restricting them to an explicit set of valid_values makes the block more generic and robust.
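
A tiny sketch of that more generic "in" style use (illustrative only; the allowed list and the sample are made up):

    import operator

    # Keep a sample only when the generated value is one of an allowed set of
    # strings, instead of comparing against a single value with operator.eq.
    allowed = ["YES", "NO"]         # hypothetical expected values
    sample = {"judgment": "MAYBE"}  # hypothetical model output

    # operator.contains(container, item) is the functional form of `item in container`
    keep = operator.contains(allowed, sample["judgment"])
    print(keep)  # False, so the sample would be filtered out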

Contributor

Ok, if this is a "we don't need it now, but we think we'll need it later" argument - I'll refer again to the YAGNI principle :)

But my "we don't need it now" conclusion might be a misunderstanding. Here's pseudo-code showing my understanding:

    def filter_sample(sample):
        value = sample["judgment"]
        if value not in ["YES", "NO"]:
            return False
        if value != "YES":
            return False
        return True

Am I missing something?

(If I'm understanding correctly, I think it's important we avoid adding additional unnecessary complexity like this, because it's confusing enough already)

Member

I'm with @markmc -- that pseudocode matches my understanding, and so I have the same "can you help me understand why we need it?" question

Member Author

I will close this PR and reintroduce retaining valid values as its own block (since we still need it). Thanks for isolating the bug fix.

Member

@russellb it's a bugfix, as Aakanksha already explained. Without a valid_values filter, the code was breaking. I don't know the specifics, but it's clearly offered as a bug fix, and Aakanksha tested it independently.

Member

I'm sorry, @mairin, but I've looked at this long enough to say that just isn't correct. I fully believe there may be a use case for it and assume further discussion will help clarify that, but at least as used in the existing pipelines within this PR, the filter part is not fixing anything.

Member

@shivchander

I will close this PR and reintroduce retaining valid values as its own block (since we still need it).

Sounds good. It looks like you can keep it as one block and just change the value the current block takes to be a set instead of a single value. Then both use cases can be served by the same block.
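
Roughly what that could look like, as a sketch under the assumption that filter_value may be either a single value or a collection (this is not code from the PR, and the helper name is made up):

    import operator

    # Same call shape as FilterByValueBlock uses: operation(column_value, filter_value)
    def keep(sample, filter_column, filter_value, operation):
        return operation(sample[filter_column], filter_value)

    # Equality against a single value, as the block works today:
    print(keep({"judgment": "YES"}, "judgment", "YES", operator.eq))  # True

    # Membership against a set, covering the valid-values use case with the same block:
    print(keep({"judgment": "NO"}, "judgment", {"YES", "NO"}, lambda v, s: v in s))  # True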

Member

@shivchander here's what I had in mind to just augment the existing block -- 6c1f3ee

though note I've only tested via that unit test so far
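
For readers without the commit open, the general shape of such an augmentation might look like the following; this is purely illustrative, the helper name is made up, and the authoritative version is whatever 6c1f3ee actually does:

    # Idea: normalize filter_value to a list internally, so a single value and a
    # set of acceptable values go through the same code path.
    def _normalize_filter_values(filter_value):
        if isinstance(filter_value, (list, tuple, set)):
            return list(filter_value)
        return [filter_value]

    print(_normalize_filter_values("YES"))          # ['YES']
    print(_normalize_filter_values(["YES", "NO"]))  # ['YES', 'NO']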

"operation": operator.eq,
"batch_kwargs": {
"num_procs": 8,
@@ -226,9 +227,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_relevancy",
"filter_column": "score",
-"filter_value": 2.0,
+"filter_value": 2,
+"valid_values": [0, 1, 2],
"operation": operator.eq,
-"convert_dtype": float,
+"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -260,7 +262,8 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_verify_question",
"filter_column": "rating",
-"filter_value": 1.0,
+"filter_value": 1,
+"valid_values": [0, 1],
"operation": operator.eq,
"convert_dtype": float,
"batch_kwargs": {
@@ -312,9 +315,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_questions",
"filter_column": "score",
-"filter_value": 1.0,
+"filter_value": 1,
+"valid_values": [0, 1],
"operation": operator.eq,
-"convert_dtype": float,
+"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -356,9 +360,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_qa_pair",
"filter_column": "score",
-"filter_value": 2.0,
+"filter_value": 2,
+"valid_values": [1, 2, 3],
"operation": operator.ge,
-"convert_dtype": float,
+"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -406,6 +411,7 @@ def get_flow(self) -> list:
"output_cols": ["question"],
"batch_kwargs": {
"num_procs": 8,
+"num_samples": 3,
"batched": self.batched,
},
},
@@ -432,9 +438,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_grounded_questions",
"filter_column": "score",
-"filter_value": 1.0,
+"filter_value": 1,
+"valid_values": [0, 1],
"operation": operator.eq,
-"convert_dtype": float,
+"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -476,9 +483,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_grounded_qa_pair",
"filter_column": "score",
-"filter_value": 2.0,
+"filter_value": 2,
+"valid_values": [1, 2, 3],
"operation": operator.ge,
-"convert_dtype": float,
+"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
24 changes: 19 additions & 5 deletions src/instructlab/sdg/filterblock.py
@@ -11,25 +11,39 @@

class FilterByValueBlock(Block):
    def __init__(
-        self, filter_column, filter_value, operation, convert_dtype=None, **batch_kwargs
+        self, filter_column, filter_value, operation, valid_values, convert_dtype=None, **batch_kwargs
    ) -> None:
        super().__init__(block_name=self.__class__.__name__)
        self.value = filter_value
        self.column_name = filter_column
        self.operation = operation
+        self.valid_values = valid_values
        self.convert_dtype = convert_dtype
        self.num_procs = batch_kwargs.get("num_procs", 1)

+    def _filter_invalid_values(self, samples):
+        samples = samples.filter(
+            lambda x: x[self.column_name] in self.valid_values,
+            num_proc=self.num_procs,
+        )
+        return samples
+
+    def _convert_dtype(self, sample):
+        try:
+            sample[self.column_name] = self.convert_dtype(sample[self.column_name])
+        except ValueError:
+            sample[self.column_name] = None
+        return sample
+
    def generate(self, samples) -> Dataset:
        if self.convert_dtype:
            samples = samples.map(
-                lambda x: {
-                    **x,
-                    self.column_name: self.convert_dtype(x[self.column_name]),
-                },
+                self._convert_dtype,
                num_proc=self.num_procs,
            )

+        samples = self._filter_invalid_values(samples)
+
        return samples.filter(
            lambda x: self.operation(x[self.column_name], self.value),
            num_proc=self.num_procs,
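
To make the new behavior concrete, here is a usage sketch of the block as it appears in this diff. It is illustrative only: the import path is inferred from the file location, the sample data is made up, and this is not a test shipped with the PR.

    import operator

    from datasets import Dataset

    # Import path inferred from src/instructlab/sdg/filterblock.py in this diff.
    from instructlab.sdg.filterblock import FilterByValueBlock

    ds = Dataset.from_list([{"score": "2"}, {"score": "1"}, {"score": "not sure"}])

    block = FilterByValueBlock(
        filter_column="score",
        filter_value=2,
        operation=operator.eq,
        valid_values=[0, 1, 2],
        convert_dtype=int,
    )

    # "not sure" becomes None via _convert_dtype and is then dropped by the
    # valid_values filter; "1" converts fine but fails the eq(value, 2) check.
    filtered = block.generate(ds)
    print(filtered["score"])  # [2]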
4 changes: 3 additions & 1 deletion src/instructlab/sdg/pipeline.py
@@ -46,7 +46,7 @@
block = block_type(**block_config)

logger.info("Running block: %s", block_config["block_name"])
-logger.info(dataset)
+logger.info("Input: %s", dataset)

dataset = block.generate(dataset, **gen_kwargs)

@@ -56,5 +56,7 @@

if drop_duplicates_cols:
dataset = self._drop_duplicates(dataset, cols=drop_duplicates_cols)

(GitHub Actions lint annotation on line 59 of src/instructlab/sdg/pipeline.py: C0303 trailing whitespace)
+logger.info("Output: %s\n\n", dataset)

return dataset