Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate Dataset.map() multiprocessing failure #123

Closed
russellb opened this issue Jul 12, 2024 · 6 comments
Closed

Investigate Dataset.map() multiprocessing failure #123

russellb opened this issue Jul 12, 2024 · 6 comments

Comments

@russellb
Copy link
Member

This issue is called out in one of the commits in PR #117

The second issue is specific to map():

ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").

It appears the the datasets, only in the case of num_proc>1,
when we hit the "error converting dtype" case and set the column
to None, it ends up being still considered a string column rather
than the new type.

This second issue deserves further investigation and may require
a fix to the datasets library.

The related code in filterblock.py as of that PR is:

def _map_dtype(samples, column, dtype, num_proc=1):
    def convert_column(sample):
        try:
            sample[column] = dtype(sample[column])
        except ValueError as e:
            logger.error(
                "Error converting dtype: %s, filling with None to be filtered later", e
            )
            sample[column] = None
        return sample

    # FIXME: it appears multiprocessing map has issues with
    # None columns. If we pass num_proc>1 here and the error
    # case is triggered above, we get:
    #   ValueError: The features can't be aligned ...
    # because the column is still considered a string not
    # the new dtype.
    num_proc = 1

    return samples.map(convert_column, num_proc=num_proc)

We need to investigate this error more deeply to figure out the best fix

@markmc
Copy link
Contributor

markmc commented Jul 12, 2024

For reference, the type conversion failure handling case was introduced in #78

@markmc
Copy link
Contributor

markmc commented Jul 12, 2024

A tiny bit of searching led me to stuff like huggingface/datasets#3195 and huggingface/datasets#3676 - giving the impression that None is supposed to be handled, but it wouldn't surprise anyone if there were still latent bugs in None handling

@markmc
Copy link
Contributor

markmc commented Jul 12, 2024

From @russellb

Maybe we need a default value we set that's native to whatever we're trying to convert to. Otherwise right now we have a mix of things converted to a numeric type (in the case of the commit message) and None isn't valid for those.

Knowing what value we can use in this error case doesn't seem simple. It seems like it needs to be provided as configuration to the FilterBlock.

I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

I also considered adding an "invalid_value" field, but adding something like that to the format to work around a bug? Ugh.

block_type: FilterByValueBlock
   block_config:
     block_name: filter_questions
     filter_column: score
     filter_value: 1.0
     invalid_value: 0
     operation: eq
     convert_dtype: float

@russellb
Copy link
Member Author

This change looks related: https://github.com/aakankshaduggal/sdg/pull/7/files

It's just dropping the samples converted to None completely

@markmc
Copy link
Contributor

markmc commented Jul 15, 2024

This change looks related: https://github.com/aakankshaduggal/sdg/pull/7/files

It's just dropping the samples converted to None completely

From #110 I'm guessing it is to avoid:

TypeError: '>=' not supported between instances of 'NoneType' and 'float'

i.e. what we default to on error needs to be a valid value for any of the supported operators

I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

So, no ... it's any value of that type which will cause the operator to return False

@markmc
Copy link
Contributor

markmc commented Jul 30, 2024

Pretty sure this is resolved by #143

@markmc markmc closed this as completed Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants