Accumulators that != reduce #285
Replies: 3 comments
-
Correction to above. Could have edited that box, I suppose, but preserving this to emphasize this point is probably important. Correction is that the generic issue above breaks spark and dask because the binary operator is not associative. In normal arithmetic, addition is associative, for example, because c=b+a=a+b. When a and b are not simple types that cannot be converted, as in the examples above, that rule is broken. We can cast this as a different abstract rule. We need a generalization of c+=a where c and a can be different types but for which the c+=a operation makes sense (could literally mean it is defined). For example 1 above c is an ensemble object and a is one of our atomic data types (TimeSeries or Seismogram). For example 2, c is the grid object into which data are accumulated and "a" is the seismogram like data migrated from time to depth. |
Beta Was this translation helpful? Give feedback.
-
This discussion has been inactive for months and I want to bring it to the front of our discussion. In working on updates to the User Manual I realized the functionality discussed in this page defines some essential tools we need to have in MsPASS.
We need to come to closure on both of these issues. Propose we start in our next weekly meeting and preserve our decision here for the record. |
Beta Was this translation helpful? Give feedback.
-
For the record in our call yesterday I agreed to add to this discussion to design the api for problem 2 in the box above: standard methods for parallel loading of a data set of ensembles driven by a list of something small enough to always fit in memory. The type example would be common source gathers defined at startup by a list of source_ids. I think first this problem can be made more generic in the same way we have implemented editing and normalization: using a base class and a standard api for a class that defines the query generator. Here is an initial suggestion of the basic content of the base class:
This design allows a generic algorithm to load based on this pseudocode using a dask construct:
The concrete implementations of this base class I think we should produce immediately are:
Looking forward to feedback on this design. Also think about what other implementations might be helpful. In writing this I realzed item 4 above (the time interval query) is a huge hole for a large class of algorithms. |
Beta Was this translation helpful? Give feedback.
-
I am becoming convinced we have a general class of problems for which we need a generic solution in MsPASS. That problem is one that acts like a reduce operator, in some sense, but for which I am pretty sure neither dask nor spark can handle. I have run across two examples in recent weeks:
First, is my understanding of this correct? Maybe there is some other function the dask or spark api that can handle this, but I've not found it.
The solution I used in pwmig is not ideal, but is a workaround. It is also a dask specific solution because I don't think spark has the equivalent of delayed. Rather than show the pwmig code for mspass it might be more informative to show a rough implementation of part of problem 1 above (take a bag of TimeSeries objects and produce a TimeSeriesEnsemble.)
We could use that approach to build a generic bag to ensemble function for dask. I don't know how to do the equivalent in spark. That particular functionality probably should be included in MsPASS.
Is there a more generic solution to this problem?
Beta Was this translation helpful? Give feedback.
All reactions