import_table design #257
-
BTW, here is the tutorial referenced above.
-
First, I've already implemented this and it turned out to be surprisingly easy IF we only use dask for a scheduler. Dask implements most pandas methods with a parallel API, so pandas calls in serial translate immediately into dask calls with the same signature.

What this brings up is a topic that perhaps should be in a different thread, but I'll put it here for now. As I learned more about dataframes I had an idea we should discuss: we might consider an alternative data path for setting up a workflow that is more like the relational approach commonly used for Antelope processing with their Datascope database (a relational db, for those to whom that name is foreign). In Datascope the standard paradigm to drive a workflow is to form a working view by some sequence of one or more of the basic relational database operations: join, sort, select, and group. The whole idea, from a relational perspective, is to build a table of attributes that drives the workflow; the workflow ultimately reduces to an outer loop over either single tuples in the table or groups of tuples defined by a group operation. In the MsPASS perspective single tuples/groups would map to a single atomic data object: tuples map to TimeSeries or Seismogram objects and groups map to ensembles.

The idea this raises is that it might provide an alternative method to provide inputs to a reader. Right now we use the paradigm that one document in a wf collection maps to one atomic object. The approach I am thinking about could be a smoother path for importing data already managed by a relational database. We could either interact directly with a relational database server or simulate it with MongoDB and dataframes. I would not advise the former as a starting point, but for MongoDB I think it would largely reduce to building something like a css3.0 schema definition that would implement at least some of the more common tables like wfdisc, site, sitechan, origin, event, arrival, and assoc. With the import_table function I just wrote it would be very easy to directly import Antelope tables like site, sitechan, etc. and store them in collections with the same name. I already know it would not be especially hard to build a dataframe from the relational equivalent of a join sequence like wfdisc->sitechan->site->assoc->arrival->origin (easy in concept, I should say, but dealing with all the potential errors is not so easy). Then all we'd need is a map function to map the dataframe into a set of MsPASS objects (TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble). Once that map function ran, a workflow could proceed exactly like one that originated from a read_distributed_data function call. Looking at the dask documentation the reader step may require some additional gyrations, but I think the concept is valid. Opinions?
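To make the dataframe side of that concrete, here is a minimal sketch, not working MsPASS code: it assumes the CSS3.0 tables have already been pulled in with import_table and loaded as dask dataframes, and it uses only the nominal CSS3.0 key attributes. A real implementation would also need the time-interval matching conditions Datascope applies, plus the error handling noted above.

```python
import dask.dataframe as dd  # only the dask dataframe API is assumed here


def build_working_view(wfdisc, sitechan, site, assoc, arrival, origin):
    """Approximate the Datascope join chain
    wfdisc->sitechan->site->assoc->arrival->origin with dataframe merges.

    Each argument is a dask (or pandas) dataframe holding one CSS3.0 table.
    The join keys below are the nominal natural keys only; the time-window
    conditions a real CSS3.0 join requires are intentionally omitted.
    """
    view = wfdisc.merge(sitechan, on=["sta", "chan"], how="inner")
    view = view.merge(site, on="sta", how="inner")
    view = view.merge(arrival, on=["sta", "chan"], how="inner")
    view = view.merge(assoc, on="arid", how="inner")
    view = view.merge(origin, on="orid", how="inner")
    return view


# The outer loop of a workflow would then map tuples or groups of the view to
# atomic MsPASS objects, e.g. grouping by origin id to form event ensembles:
#   ensembles = build_working_view(...).groupby("orid")
```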
-
Here is a new wrinkle on this problem. Hope you guys have a cleaner solution than I do. I have a little code segment to convert the rows of a dataframe to a dict that can be passed to insert_one with MongoDB, with a print statement for debugging. It is crashing on the insert_one call, and the debug output shows why: pandas is interacting with numpy, and numpy is changing the type of integers and floats to types in its internal namespace. I can do a crude conversion by testing types, but that seems crazy. There seem to be a bunch of dataframe converters that may be the solution, but the point is this is a very ugly collision for any use of dataframes with MongoDB. Any ideas to make this cleaner would be appreciated.
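For reference, a minimal sketch of the kind of conversion involved and one way to clean it up; the dataframe and collection names are placeholders. pymongo's BSON encoder rejects numpy scalar types (np.int64, np.float64, etc.), so each value can be converted back to a native Python type with `.item()` before calling insert_one:

```python
import numpy as np


def dataframe_to_docs(df):
    """Convert each row of a pandas dataframe to a dict insert_one can accept."""
    docs = []
    for _, row in df.iterrows():
        doc = {}
        for key, value in row.items():
            # pandas hands back numpy scalars; .item() converts them to
            # plain Python int/float so the BSON encoder does not reject them
            if isinstance(value, np.generic):
                value = value.item()
            doc[key] = value
        docs.append(doc)
    return docs


# usage, with col a pymongo collection handle:
#   for doc in dataframe_to_docs(df):
#       col.insert_one(doc)
```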
-
@Yangzhengtang since you are working on turning my prototype into a solid component of MsPASS, I have a suggestion I realized this morning would be a helpful addition to the API. It came up when trying to read data defined by documents saved to wf_Seismogram. The reader aborted because it requires the attribute with the key 'storage_mode'. The solution was easy using the dataframe insert method, noting that the 1 passed as the first argument is a bit arbitrary and came from an example; it is the column position where the data are inserted, and when saving to MongoDB that order is irrelevant anyway. We could just let users solve this problem as I did, but it suggests to me that import_table needs an argument to allow setting one or more constant columns in a dataframe.

I propose the default of None means do nothing. If the value is not None it should be a python dictionary. If the content is a single value it can be passed as above to define a constant value for the entire column of data. As I read the documentation the dict could also contain a list of values to be set, but in that case the list must be the same length as the number of tuples in the table. The main use for this, I suspect, would be the form I used: setting a constant for a single table.
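A sketch of both pieces, with placeholder names: the one-line workaround with DataFrame.insert (the value 'gridfs' is only an illustration), and the proposed constant-column behavior wrapped as a helper so the intent is concrete (the argument name attributes_to_add is hypothetical):

```python
def add_constant_columns(df, attributes_to_add=None):
    """Sketch of the proposed import_table option.

    df is a pandas dataframe.  attributes_to_add is a dict mapping column name
    to value.  A scalar value defines a constant column; a list is also
    allowed but must have the same length as the number of tuples (rows).
    """
    if attributes_to_add is not None:
        for key, value in attributes_to_add.items():
            df[key] = value
    return df


# The workaround described above, with an arbitrary column position of 1:
#   df.insert(1, "storage_mode", "gridfs")
# which, apart from column position, is equivalent to:
#   add_constant_columns(df, {"storage_mode": "gridfs"})
```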
-
I am proposing the following function prototype as a core MsPASS function. Note this is only a design docstring, but it describes what I have in mind, and I'm 100% sure it is feasible. This should be a very useful function for importing a large class of auxiliary data into MsPASS, but more importantly it can act as an intermediary with relational database tables or programs like Excel. Anyway, here is my proposed API for this process, for discussion.
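A rough sketch of what that prototype might look like, assembled only from points made elsewhere in this thread; the argument names and defaults are assumptions, not the docstring actually being proposed:

```python
def import_table(db, filename, collection, attribute_names=None,
                 parallel=False, attributes_to_add=None):
    """Design sketch: import a text table into a MongoDB collection.

    :param db: MsPASS Database handle (MongoDB database object).
    :param filename: path to the table file (e.g. an Antelope/Datascope table
        or a csv export from a program like Excel).
    :param collection: name of the collection to receive one document per tuple.
    :param attribute_names: optional list of column names used when the file
        has no header line.
    :param parallel: if True return a dask dataframe instead of a pandas one.
    :param attributes_to_add: optional dict of constant columns to set on the
        imported dataframe (see the discussion above).
    :return: the dataframe that was written to MongoDB.
    """
    ...  # design docstring only; implementation intentionally omitted
```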