Reset DataLoader workers instead of creating new ones (#35795)

Summary: This PR needs discussion as it changes the behavior of `DataLoader`. It can be closed if it's not considered good practice. Currently, the `DataLoader` creates a new `_BaseDataLoaderIter` object every epoch. In the case of the multiprocess DataLoader, the worker processes are re-created every epoch, and each one makes a copy of the original `Dataset` object. If users want to cache data or do some tracking on their datasets, all of that state is wiped out every epoch. Note that this doesn't happen when the number of workers is 0, which makes the multiprocess and serial data loaders behave inconsistently. This PR keeps the `_BaseDataLoaderIter` object alive and just resets it between epochs, so the workers remain active and so do their own `Dataset` objects. People seem to file issues about this often.

Pull Request resolved: pytorch/pytorch#35795
Reviewed By: ailzhang
Differential Revision: D23426612
Pulled By: VitalyFedyunin
fbshipit-source-id: e16950036bae35548cd0cfa78faa06b6c232a2ea
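
The difference described in the summary can be sketched with a small script. This is only an illustration, not code from the PR: `CachingDataset` and `cache_hits_per_epoch` are hypothetical names, and the sketch assumes a PyTorch build that includes the `persistent_workers` argument added by this change.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CachingDataset(Dataset):
    """Hypothetical dataset that remembers which indices it has served."""

    def __init__(self, size=8):
        self.size = size
        # This set lives in each worker's private copy of the dataset.
        self.seen = set()

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        hit = index in self.seen
        self.seen.add(index)
        # 1 marks a cache hit, 0 marks a miss.
        return torch.tensor(int(hit))


def cache_hits_per_epoch(persistent_workers, epochs=2):
    """Return the number of cache hits observed in each epoch."""
    loader = DataLoader(
        CachingDataset(),
        batch_size=4,
        num_workers=1,  # a single worker, so every index goes to the same copy
        persistent_workers=persistent_workers,
    )
    hits = []
    for _ in range(epochs):
        hits.append(sum(int(batch.sum()) for batch in loader))
    return hits


if __name__ == "__main__":
    print("workers re-created each epoch:", cache_hits_per_epoch(False))
    print("workers kept alive:          ", cache_hits_per_epoch(True))
```

With `persistent_workers=False`, the worker starts each epoch from a fresh copy of the dataset, so no epoch should see any hits. With `persistent_workers=True`, the worker and its `seen` set survive, so the second epoch should hit on every index.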
uwsampl · Sep 1, 2020 · 5472426 · 5472426
1 parent db6bd9d
commit 5472426

Showing 4 changed files with 301 additions and 129 deletions.
diff --git a/docs/source/data.rst b/docs/source/data.rst
@@ -22,7 +22,8 @@ These options are configured by the constructor arguments of a
     DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
                batch_sampler=None, num_workers=0, collate_fn=None,
                pin_memory=False, drop_last=False, timeout=0,
-               worker_init_fn=None, *, prefetch_factor=2)
+               worker_init_fn=None, *, prefetch_factor=2,
+               persistent_workers=False)
 
 The sections below describe in details the effects and usages of these options.