Deterministic, parallel data iteration #68

Open
lorenzoh opened this issue Mar 18, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@lorenzoh
Contributor

The parallel eachobs implementation is not deterministic: observations are returned as soon as they are loaded, so they may come back out of order. This is very performant, and fine for use cases like training, where data should be shuffled anyway.

An option for deterministic iteration would be helpful in many use cases, though.

This could be implemented as a wrapper around an existing iterator that does the following:

  • instead of iterating over data with the wrapped iterator, iterate over (1:nobs(data), data) to preserve ordering information
  • collect returned observations, stripping the index
  • return an observation only if all previous (by index) observations have been returned
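The steps above could be sketched roughly as follows. This is a minimal, language-agnostic illustration in Python, not the eachobs implementation: the name `in_order` is hypothetical, and the out-of-order arrival of `(index, observation)` pairs from parallel workers is simulated with a plain list.

```python
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def in_order(indexed: Iterable[Tuple[int, T]], start: int = 1) -> Iterator[T]:
    """Yield observations in index order from (index, obs) pairs that may
    arrive in any order. Hypothetical helper, for illustration only."""
    pending: Dict[int, T] = {}  # observations that arrived early, keyed by index
    next_index = start          # index of the next observation to release
    for index, obs in indexed:
        pending[index] = obs
        # Release every observation whose predecessors have all been released,
        # stripping the index on the way out.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


# Simulated out-of-order arrival (1-based indices, mirroring 1:nobs(data)).
arrivals = [(2, "b"), (1, "a"), (4, "d"), (3, "c")]
ordered = list(in_order(arrivals))
```

Here `ordered` ends up in index order regardless of arrival order. The wrapper does not care how the underlying iterator loads data; it only needs the index to travel with each observation.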

I am unsure how much this will affect performance and memory usage, and how it interacts with buffersize. Are there alternative approaches to this implementation?
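To make the memory question a bit more concrete, here is a small simulation of how many observations such a reordering wrapper would hold at once for a given arrival order. It is only a sketch in Python under simplifying assumptions: `peak_pending` is a hypothetical helper, the arrival orders are made up, and it says nothing about how buffersize behaves in the real implementation.

```python
def peak_pending(arrival_order, start=1):
    """Return the largest number of observations held at once (including the
    one that just arrived) when releasing strictly in index order."""
    pending = set()     # indices that have arrived but not yet been released
    next_index = start  # next index allowed to be released
    peak = 0
    for index in arrival_order:
        pending.add(index)
        peak = max(peak, len(pending))
        # Release everything that is now contiguous with what was released before.
        while next_index in pending:
            pending.remove(next_index)
            next_index += 1
    return peak


# Arrival already in order: nothing has to wait.
in_order_peak = peak_pending([1, 2, 3, 4, 5])
# First observation arrives last: everything else has to be held back.
worst_case_peak = peak_pending([2, 3, 4, 5, 1])
```

In this toy model a single slow observation forces every later one to be held until it arrives, so the number of buffered observations is bounded only by how far ahead the workers are allowed to run.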

@darsnack
Member

I believe FFCV has the notion of a traversal order, which we might want to look into. Apparently the quasi-random variant increases performance too, so there may be a third option between random and deterministic that we want to include here.

@lorenzoh
Contributor Author

lorenzoh commented Mar 18, 2022

Do you know what they are doing there specifically? I thought that was just pre-shuffling and then storing into contiguous memory, but from that page it seems that the quasi-random order is only relevant when the data is not in memory.
