Terminator for Parallel Map Workflows #329
pavlis started this conversation in Design & Development
I haven't directly encountered this problem yet, but I think there may be an issue with processing large bag/rdd containers of data that are never reduced. A typical example might be a large waveform collection like the USArray test data we've been using, which has 2.5 million waveforms. Consider this generic workflow in pseudocode with a dask dialect:
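Something along these lines, where `step1` and `step2` are placeholders for arbitrary processing functions; only the overall shape matters, and in particular the final line:

```
1) data = read_distributed_data(db, ...)   # build a bag of waveforms
2) data = data.map(step1, ...)             # placeholder processing step
3) data = data.map(step2, ...)             # placeholder processing step
4) data = data.map(db.save_data, ...)      # save each result to the database
5) data = data.compute()                   # trigger the lazy computation
```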
Now I think that if a person wrote a workflow that way, line 5 would cause a memory fault for a large data set, because our `db.save_data` method returns a (slightly modified) copy of what it saved. If the compute method for the bag (dask dialect) is called as above, it will generate a copy of the entire data set, which I think it will try to hold in memory. I'm not sure, but I think a solution to avoid the memory problem is to just replace the last line with no assignment (`5) data.compute()`), but I am not sure of that either. I'm pretty sure the version with the assignment would cause a memory problem. Can anyone confirm or deny that?

Now, if my understanding is correct, we could call this a user problem that can be solved by documentation, but the point of this post is to suggest an alternative.
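Before getting to that, here is the failure mode as I understand it, spelled out (a sketch only; whether dask really gathers the full result this way is exactly what I'm asking about):

```python
# compute() on a dask bag gathers the bag's contents back to the
# client as an ordinary Python list.  With the assignment, that list
# holds a copy of every object db.save_data returned -- roughly 2.5
# million waveforms for the USArray data set:
data = data.compute()

# Without the assignment the computation (and the saves done by the
# map in line 4) still runs, but nothing is bound to a name on the
# client:
data.compute()
```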
The idea I had was that we should define a generic class we could call `Terminator`. The concept of a terminator is a reducer that can handle the expected output of a workflow and produce a summary useful to the user. A simple example might be one that counts the number of live and dead objects at the end of the workflow. That particular example could probably best be implemented with a preceding lambda that tests whether each bag/rdd element is live or dead, returning True or False, and then a reduce that just counts (there are probably built-ins for this). I wonder if it would be possible to do this in an OOP way with an abstract base class `BasicTerminator` that has one required method we might call `reduce`. The idea is that an instance could be used in a reduce operation at the end of a workflow, with its `reduce` method being the function passed to the reduce. A rough sketch of what I mean is at the end of this post.

Thoughts on this idea? Do we need it? What would be the best way to implement it?
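Here is that rough sketch. Everything in it is hypothetical: `BasicTerminator`, `LiveDeadCounter`, and the wiring through dask's `fold` are just one way it might look, and I'm assuming data objects carry a boolean `live` attribute the way our seismic data objects do.

```python
from abc import ABC, abstractmethod


class BasicTerminator(ABC):
    """Abstract base class for workflow terminators.

    A terminator is a reducer that collapses a bag/rdd to a small
    summary so the full data set is never gathered into client memory.
    """

    @abstractmethod
    def initial(self):
        """Return an empty summary to start the reduction from."""

    @abstractmethod
    def reduce(self, summary, d):
        """Fold one bag/rdd element d into a running summary."""

    @abstractmethod
    def combine(self, s1, s2):
        """Merge two partial summaries (parallel reducers need this)."""


class LiveDeadCounter(BasicTerminator):
    """Example terminator: count live and dead objects."""

    def initial(self):
        return {"live": 0, "dead": 0}

    def reduce(self, summary, d):
        key = "live" if d.live else "dead"
        # Return a new dict rather than mutating, so partitions that
        # share the same initial value cannot interfere with each other.
        return {**summary, key: summary[key] + 1}

    def combine(self, s1, s2):
        return {k: s1[k] + s2[k] for k in s1}


# Usage at the end of the workflow above: data is the bag from line 4,
# and compute() now only moves a two-entry dict, not the waveforms.
t = LiveDeadCounter()
summary = data.fold(t.reduce, combine=t.combine, initial=t.initial()).compute()
```

For comparison, the lambda-plus-built-in version of the same count would be a one-liner using dask's `frequencies`: `data.map(lambda d: d.live).frequencies().compute()`. The OOP version buys us a reusable place to put more elaborate summaries.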