Need for join operator #462
-
Short update. On further reading, I think all "join" operations are a mismatch to the need described above. Both dask and spark treat join in the relational database sense, and the operations make sense only when doing relational-database-type operations. Various forms of join could still be useful for algorithms that use a dataframe as an intermediary. Hoping for creative solutions from someone else in the group.
-
Had not heard anything from any of you, but I had an idea that worked. I haven't checked out the equivalent for pyspark yet, but a more careful reading of the dask bag map operator documentation here showed a key phrase: map allows two bags. The example there uses the stock add function, but it led me to write the following test implementation of exactly what was discussed above:
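(Sketched here in stripped-down form: plain python dicts stand in for TimeSeries objects, and `post_arrival` is a hypothetical helper rather than a MsPASS function.)

```python
import dask.bag as db

def post_arrival(datum, arrival):
    # In the real workflow this would post the pick time to the datum's
    # Metadata; here a plain dict stands in for a TimeSeries.
    datum["arrival_time"] = arrival
    return datum

data = [{"sta": "AAK"}, {"sta": "ABKT"}, {"sta": "AFI"}]
picks = [10.0, 12.5, 9.75]

# Both bags must have the same length and partitioning so the elements
# pair up one-to-one.
b_data = db.from_sequence(data, npartitions=2)
b_picks = db.from_sequence(picks, npartitions=2)

# map accepts additional bags after the function; each element of b_picks
# is handed to post_arrival along with the corresponding element of b_data.
merged = b_data.map(post_arrival, b_picks)
print(merged.compute())
```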
It produces this output:
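For the stand-in data in the sketch above, that looks like:

```
[{'sta': 'AAK', 'arrival_time': 10.0}, {'sta': 'ABKT', 'arrival_time': 12.5}, {'sta': 'AFI', 'arrival_time': 9.75}]
```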
Hence, the unambiguous solution here is to use the bag module's map function in this form. This will get me past the original problem that created this post. We should, however, continue a discussion of this generic problem of merging parallel data sets like this. I wonder if we need to create a generic function for mspass to simplify this process for users?
-
Oh, this is cool. I was not aware of the capability to operate on two bags. This seems to be something unique to dask, but I did figure out the equivalent in pyspark by asking ChatGPT. Below is the example given by ChatGPT:
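(Reconstructed here in the same spirit rather than quoted verbatim; the suggested pattern is simply zip followed by a regular map.)

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Two RDDs with the same length and partitioning; zip pairs them element by
# element, then an ordinary map combines each pair.
rdd1 = sc.parallelize([1, 2, 3, 4], 2)
rdd2 = sc.parallelize([10, 20, 30, 40], 2)

merged = rdd1.zip(rdd2).map(lambda pair: pair[0] + pair[1])
print(merged.collect())  # [11, 22, 33, 44]
```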
I have not tested it, but from reading the documentation of the zip method here, I think it is correct. Basically, we will need to combine zip with a regular map to achieve the same thing, but it is doable. Back to your question, I think it would be helpful to provide such a thing either as a function or at least as a documented example. I guess the question is how we can make this generic. It seems to me that the need for a "merge" could be pretty different depending on the data and the workflow. Still, I think adding a new metadata entry is a pretty common need.
-
I modified the little test program above that I wrote for dask to run under spark, using the zip idea described above. After the usual hacking I got the following to work (btw, the example from the documentation runs without any issues in the mspass container):
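(Again a stripped-down sketch: dicts stand in for the data objects and `post_arrival` is a hypothetical helper.)

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def post_arrival(pair):
    # zip hands us (datum, arrival) tuples; unpack and post the pick time.
    datum, arrival = pair
    datum["arrival_time"] = arrival
    return datum

data = [{"sta": "AAK"}, {"sta": "ABKT"}, {"sta": "AFI"}]
picks = [10.0, 12.5, 9.75]

# zip requires the two RDDs to have the same number of partitions and the
# same number of elements per partition; parallelizing two equal-length
# lists the same way guarantees that.
rdd_data = sc.parallelize(data, 2)
rdd_picks = sc.parallelize(picks, 2)

merged = rdd_data.zip(rdd_picks).map(post_arrival)
print(merged.collect())
```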
It produces this output:
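For the stand-in data, that is the same list the dask version produced:

```
[{'sta': 'AAK', 'arrival_time': 10.0}, {'sta': 'ABKT', 'arrival_time': 12.5}, {'sta': 'AFI', 'arrival_time': 9.75}]
```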
So, if you concur with my plan for mods to ...
-
That's great! Yeah, I think the design is pretty good. One minor point is that we may also want a dedicated function to do the merge outside of the read step. This seems to be a common need, not just at the read step.
-
Both dask and spark have a "join" operator that can be applied to a bag or an RDD, respectively. The documentation for a spark join is here and the documentation for dask is here.
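For reference, the toy examples from those documentation pages look roughly like this; note that both are keyed, relational-style joins:

```python
import dask.bag as db
from pyspark import SparkContext

# dask: Bag.join joins a bag against a plain iterable on a computed key
# (here, the first letter of each string).
people = db.from_sequence(["Alice", "Bob", "Charlie"])
fruit = ["Apple", "Apricot", "Banana"]
print(list(people.join(fruit, lambda name: name[0])))
# [('Apple', 'Alice'), ('Apricot', 'Alice'), ('Banana', 'Bob')]

# spark: RDD.join works on (key, value) pair RDDs and matches on the key.
sc = SparkContext.getOrCreate()
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
print(sorted(x.join(y).collect()))
# [('a', (1, 2)), ('a', (1, 3))]
```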
There are two applications I can see for this operator:
I think the former is less important as we have a solid, focused approach that probably matches mspass better than this generic operator. The latter is a useful way, I think, to handle a class of problems where there is an external data set with one entry per datum that needs to be merged into the data object. The example that caused me to post this discussion is the following skeleton of an example I was working on for a new section of the user manual on how to handle continuous data with MsPASS:
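(In rough outline the skeleton looks like the sketch below. Everything here is a placeholder: `query_generator` and `make_segments` are stubs, made-up dicts stand in for the pick table and the ensembles, and the read/write calls are left as comments since the query-list form of `read_distributed_data` is still on the pending branch.)

```python
import dask.bag as db

# Made-up stand-in for the pick table (e.g. the ANF pick database); one
# entry per desired output segment.
picks = [
    {"sta": "AAK", "arrival_time": 1000.0},
    {"sta": "ABKT", "arrival_time": 1250.5},
]

def query_generator(pick):
    # TODO: build the MongoDB query dict selecting the continuous-data
    # documents that span this pick time.
    return {"sta": pick["sta"]}

def make_segments(ensemble, arrival_time):
    # TODO: cut a fixed window around arrival_time from each ensemble member
    # and post the arrival time to each datum's Metadata.
    return ensemble

queries = [query_generator(p) for p in picks]
arrivals = [p["arrival_time"] for p in picks]

# With the pending Database revision this would be
#   ensembles = read_distributed_data(db_handle, queries, ...)
# returning a bag of TimeSeriesEnsemble objects, one per query.  Plain dicts
# stand in for the ensembles in this sketch.
ensembles = db.from_sequence([{"query": q} for q in queries])

# PROBLEM STEP: merge the arrival times into the workflow so the map call to
# make_segments can see the time for its ensemble.  A bag join looked like
# the tool for this, but it turns out to be a relational-style join, which is
# the mismatch this thread is about.
#   segments = ensembles.map(make_segments, <arrival time for this ensemble>)

# The windowed segments would then be saved with write_distributed_data(...).
```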
I know there is a lot there, and as the TODOs say it is not complete, but the idea of the example is a parallel job to extract fixed time windows of data from a large table of "picks" like the Earthscope database of picks from the Array Network Facility. The same functionality would be common for a long, long list of workflows if the new Earthscope cloud system allows us to efficiently access their waveform archive. To help you along, the algorithm has these steps, which I'll describe in simple prose to complement the code:
1. `query_generator`: the function defined above would exploit the new functionality of the pending branch for revising Database. The new version of `read_distributed_data` allows an input of a list of python dict containers defining MongoDB queries. Each dict in this implementation generates a query used to load an ensemble. Hence, the output of `read_distributed_data` above would be a bag of `TimeSeriesEnsemble` objects.
2. `join` method: the idea here was to allow a way to merge the arrival times into the workflow so that the map call that follows, to the function `make_segments`, could access the time, allowing it to cut out the desired segment and post the arrival time somewhere to each datum.
3. `write_distributed_data`.

There are numerous things in the above that are broken or incomplete, so don't pick on details.
There are some issues I came across that caused me to write this post rather than just finalize an implementation of the above example.
The main design issue this brings up is this question: how should one implement the algorithm above, and do we need to add something to MsPASS to support it? I reiterate that this type of algorithm is important for the seismology community. I am 100% sure of that statement.
A completely different way to do this process may be to write a map function that takes a python array/tuple or a DataFrame as an argument. If bag had a method to do the equivalent of an index position (i.e. report that a datum is the ith item in the bag), it would be trivial to get the ith component of the array/tuple and do something with it. If any of you know how to do that, educate me and the community by responding below.
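(One way this might work, sketched purely as a possibility with a made-up arrival table: as far as I know there is no positional lookup on a bag element, but the index can be attached to each element when the bag is built, and the table itself broadcast to every call of the map function.)

```python
import dask.bag as db

arrivals = (10.0, 12.5, 9.75)  # external table, one entry per datum
data = [{"sta": "AAK"}, {"sta": "ABKT"}, {"sta": "AFI"}]

def attach(indexed_datum, table):
    # Each bag element carries its own index, so the matching table entry
    # can be looked up directly.
    i, d = indexed_datum
    d["arrival_time"] = table[i]
    return d

# Build the bag from (index, datum) pairs; non-bag kwargs like table are
# broadcast to every call.
b = db.from_sequence(list(enumerate(data)), npartitions=2)
print(b.map(attach, table=arrivals).compute())
```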