fairseq integration: h5py error when using multiple GPUs #3

Closed
noe opened this issue Sep 10, 2019 · 2 comments


noe (Owner) commented Sep 10, 2019

When integrating an Hdf5RecordReader into a custom fairseq dataset implementation, the following error pops up as soon as multiple GPUs are used:

Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "stringsource", line 5, in h5py.h5f.__pyx_unpickle_FileID
  File "h5py/_objects.pyx", line 178, in h5py._objects.ObjectID.__cinit__
TypeError: __cinit__() takes exactly 1 positional argument (0 given)

The same code runs perfectly fine when only one GPU is used.

noe (Owner, Author) commented Sep 10, 2019

When using multiple GPUs, fairseq uses PyTorch's multiprocessing to speed up batch creation and processing (i.e. I/O).

The inter-process communication relies on pickle to exchange Python objects. One of the objects exchanged is the dataset itself, which means the dataset must be picklable and unpicklable. However, if you use Hdf5RecordReader, you probably store it as a member variable of the dataset, and Hdf5RecordReader cannot be pickled because it keeps h5py objects internally (see this issue in h5py).
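
For illustration, here is a minimal sketch of the underlying limitation (the file name data.h5 is just a placeholder, and the except clause deliberately catches everything): round-tripping an open h5py handle through pickle fails, which is exactly what the traceback above shows happening inside the spawned worker.

import pickle
import h5py

with h5py.File("data.h5", "r") as f:   # placeholder file name
    try:
        # Fails either when dumping or when loading, depending on the h5py version
        pickle.loads(pickle.dumps(f))
    except Exception as e:
        print("h5py objects do not survive pickling:", type(e).__name__, e)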

The solution is to instruct pickle not to serialize the Hdf5RecordReader. One way to do so is to override __getstate__ and __setstate__, as described in the pickle documentation.

That is, you need to implement a __getstate__ that removes the reader and a __setstate__ that re-creates it. This is a simplified example of what it could look like:

from typing import List

from fairseq.data import FairseqDataset

# Hdf5RecordReader is provided by this project; import it from wherever it is defined.


class MyCustomDataset(FairseqDataset):
    """
    Dataset that reads records with fields through an Hdf5RecordReader.
    """

    def __init__(self, data_files: List[str]):
        self.data_files = data_files
        self.reader = Hdf5RecordReader(data_files)

    def __getstate__(self):
        state = self.__dict__.copy()
        # Don't pickle the reader (see https://github.com/h5py/h5py/issues/1092)
        del state["reader"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Add reader back since it doesn't exist in the pickle
        self.reader = Hdf5RecordReader(self.data_files)

    ...
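
As a quick sanity check (the file name train.h5 below is hypothetical), such a dataset now survives a pickle round trip, which is what fairseq's worker processes need; the reader is simply rebuilt on the receiving side:

import pickle

dataset = MyCustomDataset(["train.h5"])          # hypothetical data file
restored = pickle.loads(pickle.dumps(dataset))   # what the worker processes effectively do

assert restored.data_files == dataset.data_files
assert restored.reader is not dataset.reader     # re-created by __setstate__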

noe closed this as completed Sep 10, 2019
noe pinned this issue May 15, 2020
noe (Owner, Author) commented May 15, 2020

There is a related problem when reading the same HDF5 file from multiple threads within the same process. The issue is due to the compilation flags of the underlying native HDF5 library. You can see the details in this Stack Overflow question.

As Soumith Chintala advised here, the solution is to have PyTorch use separate processes instead of threads for its data loaders. For this, we should add these lines before anything else is executed:

import torch.multiprocessing as mp
mp.set_start_method('spawn')

If the main module is not under your control and you cannot change it (e.g. fairseq), I suggest using Python's sitecustomize mechanism (see the Python docs): Python imports a module called sitecustomize at startup if it finds one on the PYTHONPATH. Therefore, to make spawn the default multiprocessing start method, you just need to create a file called sitecustomize.py in your source directory with the two Python lines above.
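
For reference, a minimal sitecustomize.py could look like the sketch below; the try/except guard is my own addition for the case where the start method has already been set elsewhere, not part of the original two lines.

# sitecustomize.py -- place it on the PYTHONPATH so Python imports it at startup
import torch.multiprocessing as mp

try:
    mp.set_start_method('spawn')
except RuntimeError:
    # The start method may already have been set elsewhere; leave it as it is.
    pass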
