Terminator for Parallel Map Workflows #329
pavlis started this conversation in Design & Development
I haven't directly encountered this problem yet, but I think there may be an issue with processing large bag/rdd containers of data that are never reduced. A typical example might be a large waveform collection like the USArray test data we've been using, which has 2.5 million waveforms. Consider this generic workflow in pseudocode with a dask dialect:
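Something along these lines, where `step1` and `step2` are placeholders for arbitrary processing functions; only the overall shape matters, and in particular the final line:

```
1) data = read_distributed_data(db, ...)   # build a bag of waveforms
2) data = data.map(step1, ...)             # placeholder processing step
3) data = data.map(step2, ...)             # placeholder processing step
4) data = data.map(db.save_data, ...)      # save each result to the database
5) data = data.compute()                   # trigger the lazy computation
```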
Now I think that if a person wrote a workflow that way, line 5 would cause a memory fault for a large data set, because our `db.save_data` method returns a (slightly modified) copy of what it saved. If the compute method for the bag (dask dialect) is called as above, it will generate a copy of the entire data set, which I think it will try to hold in memory. I'm not sure, but I think a solution to avoid the memory problem is to just replace the last line with no assignment (`5) data.compute()`), but I am not sure of that either. I'm pretty sure the version with the assignment would cause a memory problem. Can anyone confirm or deny that?

Now, if my understanding is correct, we could call this a user problem that can be solved by documentation, but the point of this post is to suggest an alternative.
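Before getting to that, here is the failure mode as I understand it, spelled out (a sketch only; whether dask really gathers the full result this way is exactly what I'm asking about):

```python
# compute() on a dask bag gathers the bag's contents back to the
# client as an ordinary Python list.  With the assignment, that list
# holds a copy of every object db.save_data returned -- roughly 2.5
# million waveforms for the USArray data set:
data = data.compute()

# Without the assignment the computation (and the saves done by the
# map in line 4) still runs, but nothing is bound to a name on the
# client:
data.compute()
```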
The idea I had was that we should define a generic class we could call `Terminator`. The concept of a terminator is a reducer that can handle the expected output of a workflow and produce a summary useful to the user. A simple example might be one that counts the number of live and dead objects at the end of the workflow. That particular example could probably best be implemented with a preceding lambda that tests whether each bag/rdd element is live or dead, returning True or False, and then a reduce that just counts (there are probably built-ins for this). I wonder if it would be possible to do this in an OOP way with an abstract base class `BasicTerminator` that has one required method we might call `reduce`. The idea is that an instance could be used in a reduce operation at the end of a workflow, with its `reduce` method being the function passed to the reduce. A rough sketch of what I mean is at the end of this post.

Thoughts on this idea? Do we need it? What would be the best way to implement it?
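Here is that rough sketch. Everything in it is hypothetical: `BasicTerminator`, `LiveDeadCounter`, and the wiring through dask's `fold` are just one way it might look, and I'm assuming data objects carry a boolean `live` attribute the way our seismic data objects do.

```python
from abc import ABC, abstractmethod


class BasicTerminator(ABC):
    """Abstract base class for workflow terminators.

    A terminator is a reducer that collapses a bag/rdd to a small
    summary so the full data set is never gathered into client memory.
    """

    @abstractmethod
    def initial(self):
        """Return an empty summary to start the reduction from."""

    @abstractmethod
    def reduce(self, summary, d):
        """Fold one bag/rdd element d into a running summary."""

    @abstractmethod
    def combine(self, s1, s2):
        """Merge two partial summaries (parallel reducers need this)."""


class LiveDeadCounter(BasicTerminator):
    """Example terminator: count live and dead objects."""

    def initial(self):
        return {"live": 0, "dead": 0}

    def reduce(self, summary, d):
        key = "live" if d.live else "dead"
        # Return a new dict rather than mutating, so partitions that
        # share the same initial value cannot interfere with each other.
        return {**summary, key: summary[key] + 1}

    def combine(self, s1, s2):
        return {k: s1[k] + s2[k] for k in s1}


# Usage at the end of the workflow above: data is the bag from line 4,
# and compute() now only moves a two-entry dict, not the waveforms.
t = LiveDeadCounter()
summary = data.fold(t.reduce, combine=t.combine, initial=t.initial()).compute()
```

For comparison, the lambda-plus-built-in version of the same count would be a one-liner using dask's `frequencies`: `data.map(lambda d: d.live).frequencies().compute()`. The OOP version buys us a reusable place to put more elaborate summaries.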