fairseq integration: h5py error when using multiple GPUs #3

Closed
noe opened this issue Sep 10, 2019 · 2 comments


noe (Owner) commented Sep 10, 2019

When integrating an Hdf5RecordReader into a custom fairseq dataset implementation, the following error pops up as soon as multiple GPUs are used:

Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "stringsource", line 5, in h5py.h5f.__pyx_unpickle_FileID
  File "h5py/_objects.pyx", line 178, in h5py._objects.ObjectID.__cinit__
TypeError: __cinit__() takes exactly 1 positional argument (0 given)

The same code runs perfectly fine when only one GPU is used.

noe (Owner, Author) commented Sep 10, 2019

When using multiple GPUs, fairseq uses PyTorch's multiprocessing to speed up batch creation and processing (i.e. I/O).

The inter-process communication relies on pickle to exchange Python objects. One of the objects exchanged is the dataset itself, which means the dataset must be picklable and unpicklable. However, if you use Hdf5RecordReader, you probably store it as a member variable of the dataset, and Hdf5RecordReader cannot be pickled because it keeps h5py objects internally (see this issue in h5py).
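
For illustration, here is a minimal sketch of the underlying limitation (the file name data.h5 is just a placeholder, and the except clause deliberately catches everything): round-tripping an open h5py handle through pickle fails, which is exactly what the traceback above shows happening inside the spawned worker.

import pickle
import h5py

with h5py.File("data.h5", "r") as f:   # placeholder file name
    try:
        # Fails either when dumping or when loading, depending on the h5py version
        pickle.loads(pickle.dumps(f))
    except Exception as e:
        print("h5py objects do not survive pickling:", type(e).__name__, e)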

The solution is to instruct pickle not to serialize the Hdf5RecordReader. One way to do so is to override __getstate__ and __setstate__, as described in the pickle documentation.

That is, you need to implement a __getstate__ that removes the reader and a __setstate__ that re-creates it. This is a simplified example of what it could look like:

from typing import List

from fairseq.data import FairseqDataset

# Hdf5RecordReader is provided by this project; import it from wherever it is defined.


class MyCustomDataset(FairseqDataset):
    """
    Dataset that reads records with fields through an Hdf5RecordReader.
    """

    def __init__(self, data_files: List[str]):
        self.data_files = data_files
        self.reader = Hdf5RecordReader(data_files)

    def __getstate__(self):
        state = self.__dict__.copy()
        # Don't pickle the reader (see https://github.com/h5py/h5py/issues/1092)
        del state["reader"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Add reader back since it doesn't exist in the pickle
        self.reader = Hdf5RecordReader(self.data_files)

    ...
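
As a quick sanity check (the file name train.h5 below is hypothetical), such a dataset now survives a pickle round trip, which is what fairseq's worker processes need; the reader is simply rebuilt on the receiving side:

import pickle

dataset = MyCustomDataset(["train.h5"])          # hypothetical data file
restored = pickle.loads(pickle.dumps(dataset))   # what the worker processes effectively do

assert restored.data_files == dataset.data_files
assert restored.reader is not dataset.reader     # re-created by __setstate__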

noe closed this as completed Sep 10, 2019
noe pinned this issue May 15, 2020
noe (Owner, Author) commented May 15, 2020

There is a related problem when reading the same HDF5 file from multiple threads within the same process. The issue is due to the compilation flags of the underlying native HDF5 library. You can see the details in this Stack Overflow question.

As Soumith Chintala advised here, the solution is to have PyTorch use separate processes instead of threads for its data loaders. For this, we should add these lines before anything else is executed:

import torch.multiprocessing as mp
mp.set_start_method('spawn')

If the main module is not under your control and you cannot change it (e.g. fairseq), I suggest using Python's sitecustomize mechanism (see the Python docs): Python imports a module called sitecustomize at startup if it finds one on the PYTHONPATH. Therefore, to make spawn the default multiprocessing start method, you just need to create a file called sitecustomize.py in your source directory with the two Python lines above.
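
For reference, a minimal sitecustomize.py could look like the sketch below; the try/except guard is my own addition for the case where the start method has already been set elsewhere, not part of the original two lines.

# sitecustomize.py -- place it on the PYTHONPATH so Python imports it at startup
import torch.multiprocessing as mp

try:
    mp.set_start_method('spawn')
except RuntimeError:
    # The start method may already have been set elsewhere; leave it as it is.
    pass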
