import_table design #257
-
BTW, here is the tutorial referenced above.
-
First, I've already implemented this and it turned out to be surprisingly easy IF we only use dask for a scheduler. Dask implements most pandas methods with a parallel API, so pandas calls in serial translate immediately into dask calls with the same signature.

What this brings up is a topic that perhaps should be in a different thread, but I'll put it here for now. As I learned more about dataframes I had an idea we should discuss: we might consider an alternative data path for setting up a workflow that is more like the relational approach commonly used for Antelope processing with their Datascope database (a relational db, for those to whom that name is foreign). In Datascope the standard paradigm to drive a workflow is to form a working view by some sequence of one or more of the basic relational database operations: join, sort, select, and group. The whole idea, from a relational perspective, is to build a table of attributes that drives the workflow; the workflow ultimately reduces to an outer loop over either single tuples in the table or groups of tuples defined by a group operation. In the MsPASS perspective single tuples/groups would map to a single atomic data object: tuples map to TimeSeries or Seismogram objects and groups map to ensembles.

The idea this raises is that it might provide an alternative method to provide inputs to a reader. Right now we use the paradigm that one document in a wf collection maps to one atomic object. The approach I am thinking about could be a smoother path for importing data already managed by a relational database. We could either interact directly with a relational database server or simulate it with MongoDB and dataframes. I would not advise the former as a starting point, but for MongoDB I think it would largely reduce to building something like a css3.0 schema definition that would implement at least some of the more common tables like wfdisc, site, sitechan, origin, event, arrival, and assoc. With the import_table function I just wrote it would be very easy to directly import Antelope tables like site, sitechan, etc. and store them in collections with the same name. I already know it would not be especially hard to build a dataframe from the relational equivalent of a join sequence like wfdisc->sitechan->site->assoc->arrival->origin (easy in concept, I should say, but dealing with all the potential errors is not so easy). Then all we'd need is a map function to map the dataframe into a set of MsPASS objects (TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble). Once that map function ran, a workflow could proceed exactly like one that originated from a read_distributed_data function call. Looking at the dask documentation the reader step may require some additional gyrations, but I think the concept is valid. Opinions?
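To make the dataframe side of that concrete, here is a minimal sketch, not working MsPASS code: it assumes the CSS3.0 tables have already been pulled in with import_table and loaded as dask dataframes, and it uses only the nominal CSS3.0 key attributes. A real implementation would also need the time-interval matching conditions Datascope applies, plus the error handling noted above.

```python
import dask.dataframe as dd  # only the dask dataframe API is assumed here


def build_working_view(wfdisc, sitechan, site, assoc, arrival, origin):
    """Approximate the Datascope join chain
    wfdisc->sitechan->site->assoc->arrival->origin with dataframe merges.

    Each argument is a dask (or pandas) dataframe holding one CSS3.0 table.
    The join keys below are the nominal natural keys only; the time-window
    conditions a real CSS3.0 join requires are intentionally omitted.
    """
    view = wfdisc.merge(sitechan, on=["sta", "chan"], how="inner")
    view = view.merge(site, on="sta", how="inner")
    view = view.merge(arrival, on=["sta", "chan"], how="inner")
    view = view.merge(assoc, on="arid", how="inner")
    view = view.merge(origin, on="orid", how="inner")
    return view


# The outer loop of a workflow would then map tuples or groups of the view to
# atomic MsPASS objects, e.g. grouping by origin id to form event ensembles:
#   ensembles = build_working_view(...).groupby("orid")
```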
-
Here is a new wrinkle on this problem. Hope you guys have a cleaner solution than I do. I have a little code segment to convert the rows of a dataframe to a dict that can be passed to insert_one with MongoDB, with a print statement for debugging. It is crashing on the insert_one call, and the debug output shows why: pandas is interacting with numpy, and numpy is changing the type of integers and floats to types in its internal namespace. I can do a crude conversion by testing types, but that seems crazy. There seem to be a bunch of dataframe converters that may be the solution, but the point is this is a very ugly collision for any use of dataframes with MongoDB. Any ideas to make this cleaner would be appreciated.
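For reference, a minimal sketch of the kind of conversion involved and one way to clean it up; the dataframe and collection names are placeholders. pymongo's BSON encoder rejects numpy scalar types (np.int64, np.float64, etc.), so each value can be converted back to a native Python type with `.item()` before calling insert_one:

```python
import numpy as np


def dataframe_to_docs(df):
    """Convert each row of a pandas dataframe to a dict insert_one can accept."""
    docs = []
    for _, row in df.iterrows():
        doc = {}
        for key, value in row.items():
            # pandas hands back numpy scalars; .item() converts them to
            # plain Python int/float so the BSON encoder does not reject them
            if isinstance(value, np.generic):
                value = value.item()
            doc[key] = value
        docs.append(doc)
    return docs


# usage, with col a pymongo collection handle:
#   for doc in dataframe_to_docs(df):
#       col.insert_one(doc)
```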
-
@Yangzhengtang since you are working on turning my prototype into a solid component of MsPASS, I have a suggestion I realized this morning would be a helpful addition to the API. It came up when trying to read data defined by documents saved to wf_Seismogram. The reader aborted because it requires the attribute with the key 'storage_mode'. The solution was easy using the dataframe insert method, noting that the 1 passed as the first argument is a bit arbitrary and came from an example; it is the column position where the data are inserted, and when saving to MongoDB that order is irrelevant anyway. We could just let users solve this problem as I did, but it suggests to me that import_table needs an argument to allow setting one or more constant columns in a dataframe.

I propose the default of None means do nothing. If the value is not None it should be a python dictionary. If the content is a single value it can be passed as above to define a constant value for the entire column of data. As I read the documentation the dict could also contain a list of values to be set, but in that case the list must be the same length as the number of tuples in the table. The main use for this, I suspect, would be the form I used: setting a constant for a single table.
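A sketch of both pieces, with placeholder names: the one-line workaround with DataFrame.insert (the value 'gridfs' is only an illustration), and the proposed constant-column behavior wrapped as a helper so the intent is concrete (the argument name attributes_to_add is hypothetical):

```python
def add_constant_columns(df, attributes_to_add=None):
    """Sketch of the proposed import_table option.

    df is a pandas dataframe.  attributes_to_add is a dict mapping column name
    to value.  A scalar value defines a constant column; a list is also
    allowed but must have the same length as the number of tuples (rows).
    """
    if attributes_to_add is not None:
        for key, value in attributes_to_add.items():
            df[key] = value
    return df


# The workaround described above, with an arbitrary column position of 1:
#   df.insert(1, "storage_mode", "gridfs")
# which, apart from column position, is equivalent to:
#   add_constant_columns(df, {"storage_mode": "gridfs"})
```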
-
I am proposing the following function prototype as a core MsPASS function. Note this is only a design docstring, but it describes what I have in mind, and I'm 100% sure it is feasible. This should be a very useful function for importing a large class of auxiliary data into MsPASS, but more importantly it can act as an intermediary with relational database tables or programs like Excel. Anyway, here is my proposed API for this process, for discussion.
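A rough sketch of what that prototype might look like, assembled only from points made elsewhere in this thread; the argument names and defaults are assumptions, not the docstring actually being proposed:

```python
def import_table(db, filename, collection, attribute_names=None,
                 parallel=False, attributes_to_add=None):
    """Design sketch: import a text table into a MongoDB collection.

    :param db: MsPASS Database handle (MongoDB database object).
    :param filename: path to the table file (e.g. an Antelope/Datascope table
        or a csv export from a program like Excel).
    :param collection: name of the collection to receive one document per tuple.
    :param attribute_names: optional list of column names used when the file
        has no header line.
    :param parallel: if True return a dask dataframe instead of a pandas one.
    :param attributes_to_add: optional dict of constant columns to set on the
        imported dataframe (see the discussion above).
    :return: the dataframe that was written to MongoDB.
    """
    ...  # design docstring only; implementation intentionally omitted
```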