raw data (miniseed and seed) schema #108
Replies: 1 comment
Yes, I agree with that. Data import is a whole new level of complexity that is beyond the scope of mspass at the current stage. I guess we need to make clear that the data provenance feature we provide is really for data processing done within mspass. For external data, we don't necessarily have to retain all the history. Presumably, the problem will largely go away once IRIS is moved to the Cloud (well, there are still other data sources that are subject to the same problem). btw, I have a minor comment regarding using the |
We need to settle some issues about how to handle raw miniseed and other format data. I am opening this discussion page to provide a place to preserve our discussion on this topic.
First, earlier we recognized that we want to build this package on the import model for handling data. That is, MsPASS uses its own internal object format to provide atomic containers for two basic types that we call TimeSeries and Seismogram data. Interwoven in this discussion are the related topics of "Ensembles" that are logically grouped atomic objects. I want to remind us that we need to support multiple external formats that can be imported into mspass. The immediate discussion here is about miniseed data because that is the most standard format and the one we have for our large test data set with USArray. For the record, I want to remind us that there is a clean and simple interface we should be able to exploit with obspy's reader for a range of other formats. That function is described here. Our existing import for miniseed, in fact, is based on obspy's reader; note the long list of other formats it supports. The KEY POINT is that we need to design a generic import schema and possibly generic import functions that simplify creating import modules for other formats.
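To make the idea of generic import functions concrete, here is a minimal sketch of one possible design: a registry that maps a format name to a reader function, so adding a new format means writing a single reader. All names here (the registry, `register_reader`, `import_data`, the placeholder miniseed reader) are hypothetical, not existing MsPASS code.

```python
# Hypothetical sketch of a generic import dispatcher.  Each format gets a
# small reader function registered under its name; import_data dispatches
# on the format string.  Names and structure are assumptions, not MsPASS API.

readers = {}

def register_reader(fmt):
    """Decorator that records a reader function for a format name."""
    def decorator(fn):
        readers[fmt.lower()] = fn
        return fn
    return decorator

@register_reader("miniseed")
def read_miniseed(path):
    # In MsPASS this would wrap obspy's reader and convert each Trace
    # to a TimeSeries; here we just return placeholder metadata dicts.
    return [{"format": "miniseed", "path": path}]

def import_data(path, fmt):
    """Dispatch to the registered reader, failing loudly on unknown formats."""
    try:
        reader = readers[fmt.lower()]
    except KeyError:
        raise ValueError("no reader registered for format '%s'" % fmt)
    return reader(path)
```

A new format (SAC, SEG-Y, ...) would then only need one more `@register_reader` function that maps that format's attributes onto our schema keys.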
Our prototype for this type of thing is the ensembles.py file found here. A reminder that the current implementation is limited to ensembles because of the specific kind of data we are handling for our initial tests: effectively USArray shot gathers for receiver function processing. We need to think more broadly about how to build a more generic reader than this prototype. For now, however, let me describe a few things about the prototype.
We may want to generalize this function for other formats to discard debris that would be baggage to carry through a long processing chain.
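As a concrete illustration of discarding such debris, a generalized reader could filter each object's Metadata down to a whitelist of schema keys before writing. The key set below is an assumption for illustration, not our actual schema:

```python
# Hypothetical whitelist of schema keys worth carrying through a long
# processing chain; anything else is format-specific debris and baggage.
SCHEMA_KEYS = {"net", "sta", "chan", "loc", "starttime", "endtime",
               "sampling_rate", "npts"}

def strip_debris(metadata):
    """Return a copy of metadata containing only whitelisted schema keys."""
    return {k: v for k, v in metadata.items() if k in SCHEMA_KEYS}
```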
5. Imported data, to match our framework, will need a uuid to define the "origin" of the waveform. If IRIS web services did not create an impossible bottleneck, this could be a URL defining the web service request. One strong possibility we should discuss is that such a URL is what should be stored in the database, but for now I had to create an id since I'm reading these data from a file and had no way to reconstruct the web service request. The function in ensemble.py with this signature:
does this by this sequence of code in that function:
i.e. it uses our ProcessingHistory method newid to get a unique id string and then saves it to MongoDB with the key "seed_file_id" (key subject to change as the comment says). This will work, but it is not at all ideal because if one destroyed the collection where this index is stored there would be no way to recover anything about the id's relation to the original data. Further, the file information stored elsewhere is volatile and could become equally meaningless. The URL for web services is the only long-term solution to this dilemma I can think of now, but it is worth a discussion. Other imports will have similar issues, only potentially far worse since a URL may not exist. My recommendation for now is to punt this down the road and say import collections like wf_miniseed used in ensemble.py are scratch collections used for format conversions. Then we only need to make sure we preserve enough information in processing collections like wf_Seismogram to uniquely identify the source. For SEED data that means keeping net, sta, chan, and loc codes and the absolute time interval. For other formats this is an open question.
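The sequence described above amounts to something like the following sketch. Here uuid4 stands in for ProcessingHistory's newid method and a plain dict stands in for the MongoDB document, so this is illustrative only; the extra source-identifying keys are the SEED attributes recommended above.

```python
import uuid

def assign_origin_id(doc):
    """Generate a unique id string and record it under the key
    'seed_file_id' (key subject to change, as noted above).  In MsPASS
    the id would come from ProcessingHistory.newid(); uuid4 is a
    stand-in for this sketch."""
    origin_id = str(uuid.uuid4())
    doc["seed_file_id"] = origin_id
    return origin_id

# The id alone is meaningless if the collection holding it is destroyed,
# so the document should also keep enough to identify the source:
# net/sta/chan/loc plus the absolute time interval (epoch times here).
doc = {"net": "TA", "sta": "109C", "chan": "BHZ", "loc": "",
       "starttime": 1262304000.0, "endtime": 1262307600.0}
oid = assign_origin_id(doc)
```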
A key point to close this lengthy introduction: I think for now we should not impose any restrictions on import collections like wf_miniseed defined in ensemble.py. We need only require that readers like ensemble.py write Metadata fields with keys defined in our schema for any output to processing collections like wf_Seismogram and wf_TimeSeries.
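A reader could enforce that requirement with a small check before writing to a processing collection. The required-key set below mirrors the SEED case discussed above (net, sta, chan, loc plus the absolute time interval) and is an assumption for illustration, not the actual MsPASS schema:

```python
# Hypothetical required keys for output to processing collections like
# wf_TimeSeries and wf_Seismogram; the set is an assumed schema fragment.
REQUIRED_KEYS = {"net", "sta", "chan", "loc", "starttime", "endtime"}

def missing_schema_keys(metadata):
    """Return a sorted list of required keys absent from metadata.
    An empty list means the record is safe to write."""
    return sorted(REQUIRED_KEYS - metadata.keys())
```

A writer would call this and refuse (or log and skip) any record for which the returned list is non-empty.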