Investigate Dataset.map() multiprocessing failure #123

russellb · 2024-07-12T01:25:33Z

This issue is called out in one of the commits in PR #117

The second issue is specific to map():
ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
It appears the the datasets, only in the case of num_proc>1,
when we hit the "error converting dtype" case and set the column
to None, it ends up being still considered a string column rather
than the new type.

This second issue deserves further investigation and may require
a fix to the datasets library.

The related code in filterblock.py as of that PR is:

def _map_dtype(samples, column, dtype, num_proc=1):
    def convert_column(sample):
        try:
            sample[column] = dtype(sample[column])
        except ValueError as e:
            logger.error(
                "Error converting dtype: %s, filling with None to be filtered later", e
            )
            sample[column] = None
        return sample

    # FIXME: it appears multiprocessing map has issues with
    # None columns. If we pass num_proc>1 here and the error
    # case is triggered above, we get:
    #   ValueError: The features can't be aligned ...
    # because the column is still considered a string not
    # the new dtype.
    num_proc = 1

    return samples.map(convert_column, num_proc=num_proc)

We need to investigate this error more deeply to figure out the best fix

The text was updated successfully, but these errors were encountered:

markmc · 2024-07-12T07:24:11Z

For reference, the type conversion failure handling case was introduced in #78

markmc · 2024-07-12T07:29:03Z

A tiny bit of searching led me to stuff like huggingface/datasets#3195 and huggingface/datasets#3676 - giving the impression that None is supposed to be handled, but it wouldn't surprise anyone if there were still latent bugs in None handling

markmc · 2024-07-12T07:35:40Z

From @russellb

Maybe we need a default value we set that's native to whatever we're trying to convert to. Otherwise right now we have a mix of things converted to a numeric type (in the case of the commit message) and None isn't valid for those.

Knowing what value we can use in this error case doesn't seem simple. It seems like it needs to be provided as configuration to the FilterBlock.

I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

I also considered adding an "invalid_value" field, but adding something like that to the format to work around a bug? Ugh.

block_type: FilterByValueBlock
   block_config:
     block_name: filter_questions
     filter_column: score
     filter_value: 1.0
     invalid_value: 0
     operation: eq
     convert_dtype: float

russellb · 2024-07-13T07:44:02Z

This change looks related: https://github.com/aakankshaduggal/sdg/pull/7/files

It's just dropping the samples converted to None completely

markmc · 2024-07-15T10:32:49Z

This change looks related: https://github.com/aakankshaduggal/sdg/pull/7/files

It's just dropping the samples converted to None completely

From #110 I'm guessing it is to avoid:

TypeError: '>=' not supported between instances of 'NoneType' and 'float'

i.e. what we default to on error needs to be a valid value for any of the supported operators

I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

So, no ... it's any value of that type which will cause the operator to return False

markmc · 2024-07-30T08:05:45Z

Pretty sure this is resolved by #143

russellb mentioned this issue Jul 12, 2024

Add a PipelineContext class #117

Closed

This was referenced Jul 15, 2024

Improve FilterByValueBlock invalid value error handling #133

Closed

filterblock: add default_value for use with convert_dtype #143

Merged

markmc closed this as completed Jul 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate Dataset.map() multiprocessing failure #123

Investigate Dataset.map() multiprocessing failure #123

russellb commented Jul 12, 2024

markmc commented Jul 12, 2024

markmc commented Jul 12, 2024

markmc commented Jul 12, 2024

russellb commented Jul 13, 2024

markmc commented Jul 15, 2024

markmc commented Jul 30, 2024

Investigate Dataset.map() multiprocessing failure #123

Investigate Dataset.map() multiprocessing failure #123

Comments

russellb commented Jul 12, 2024

markmc commented Jul 12, 2024

markmc commented Jul 12, 2024

markmc commented Jul 12, 2024

russellb commented Jul 13, 2024

markmc commented Jul 15, 2024

markmc commented Jul 30, 2024