Deterministic, parallel data iteration #68

lorenzoh · 2022-03-18T08:25:31Z

The parallel eachobs implementation is not deterministic: observations are returned as soon as they are loaded, so they may arrive out of order. This is fast, and fine for use cases like training, where data is shuffled anyway.

An option for deterministic iteration would be helpful in many other use cases, though.

This could be implemented as a wrapper around an existing iterator that does the following:

- Instead of iterating over data with the wrapped iterator, iterate over (1:nobs(data), data) to preserve ordering information.
- Collect returned observations, stripping the index.
- Return an observation only if all previous (by index) observations have been returned.
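
To make the steps above concrete, here is a minimal, language-neutral sketch in Python. It is not MLUtils.jl code: the name `in_order` and the `start` parameter are hypothetical, and the list of `(index, observation)` pairs stands in for whatever a parallel loader would emit. Indices start at 1 to mirror `1:nobs(data)`.

```python
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def in_order(indexed: Iterable[Tuple[int, T]], start: int = 1) -> Iterator[T]:
    """Yield observations in index order from (index, obs) pairs that may
    arrive in any order. Hypothetical helper, for illustration only."""
    pending: Dict[int, T] = {}  # observations that arrived before their turn
    next_index = start          # index of the next observation to release
    for index, obs in indexed:
        pending[index] = obs
        # Release every observation whose predecessors have all been returned.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


# Simulated out-of-order arrivals from a parallel loader (assumed data).
arrivals = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
ordered = list(in_order(arrivals))
print(ordered)  # ['a', 'b', 'c', 'd', 'e']
```

In this sketch, `pending` holds every observation that arrives ahead of a slower one, so the extra memory depends on how far out of order the loader runs. If the first observation is the last to arrive, everything else is buffered before anything is returned.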

I am unsure how much this would affect performance and memory usage, or how it would interact with buffersize. Are there alternative approaches to this implementation?


darsnack · 2022-03-18T13:16:59Z

I believe FFCV has the notion of a traversal order which we might want to look into. Apparently the quasi-random variant increases performance too, so there may be a third option between random and deterministic that we want to include here.

lorenzoh · 2022-03-18T18:11:13Z

Do you know what they are doing there specifically? I thought that was just pre-shuffling and then storing into contiguous memory, but from that page it seems that the quasi-random order is only relevant when the data is not in memory.

lorenzoh added the enhancement (New feature or request) label Mar 18, 2022

lorenzoh mentioned this issue Mar 18, 2022

Reproductivity problem with multi-threading lorenzoh/DataLoaders.jl#32

Open

lorenzoh mentioned this issue Apr 30, 2022

Add parallel and shuffle support to eachobs and DataLoader #82

Merged

lorenzoh mentioned this issue May 28, 2022

Archiving DataLoaders.jl #90

Closed

