raw data (miniseed and seed) schema #108
Replies: 1 comment
Yes, I agree with that. Data import is a whole new level of complexity that is beyond the scope of mspass at the current stage. I guess we need to make clear that the data provenance feature we provide is really for data processing done within mspass. For external data, we don't necessarily have to retain all the history. Presumably, the problem will largely go away once IRIS is moved to the Cloud (well, there are still other data sources that are subject to the same problem). btw, I have a minor comment regarding using the |
We need to settle some issues about how to handle raw miniseed and other format data. I am opening this discussion page to provide a place to preserve our discussion on this topic.
First, earlier we recognized that we want to build this package on the import model for handling data. That is, MsPASS uses its own internal object format to provide atomic containers for two basic types that we call TimeSeries and Seismogram data. Interwoven in this discussion are the related topics of "Ensembles" that are logically grouped atomic objects. I want to remind us that we need to support multiple external formats that can be imported into mspass. The immediate discussion here is about miniseed data because that is the most standard format and the one we have for our large test data set with USArray. For the record, I want to remind us that there is a clean and simple interface we should be able to exploit with obspy's reader for a range of other formats. That function is described here. Our existing import for miniseed, in fact, is based on obspy's reader; note the long list of other formats it supports. The KEY POINT is that we need to design a generic import schema and possibly generic import functions that simplify creating import modules for other formats.
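To make the idea of generic import functions concrete, here is a minimal sketch of one possible design: a registry that maps a format name to a reader function, so adding a new format means writing a single reader. All names here (the registry, `register_reader`, `import_data`, the placeholder miniseed reader) are hypothetical, not existing MsPASS code.

```python
# Hypothetical sketch of a generic import dispatcher.  Each format gets a
# small reader function registered under its name; import_data dispatches
# on the format string.  Names and structure are assumptions, not MsPASS API.

readers = {}

def register_reader(fmt):
    """Decorator that records a reader function for a format name."""
    def decorator(fn):
        readers[fmt.lower()] = fn
        return fn
    return decorator

@register_reader("miniseed")
def read_miniseed(path):
    # In MsPASS this would wrap obspy's reader and convert each Trace
    # to a TimeSeries; here we just return placeholder metadata dicts.
    return [{"format": "miniseed", "path": path}]

def import_data(path, fmt):
    """Dispatch to the registered reader, failing loudly on unknown formats."""
    try:
        reader = readers[fmt.lower()]
    except KeyError:
        raise ValueError("no reader registered for format '%s'" % fmt)
    return reader(path)
```

A new format (SAC, SEG-Y, ...) would then only need one more `@register_reader` function that maps that format's attributes onto our schema keys.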
Our prototype for this type of thing is the ensembles.py file found here. A reminder that the current implementation is limited to ensembles because of the specific kind of data we are handling for our initial tests: effectively USArray shot gathers for receiver function processing. We need to think more broadly about how to build a more generic reader than this prototype. For now, however, let me describe a few things about the prototype.
We may want to generalize this function for other formats to discard debris that would be baggage to carry through a long processing chain.
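As a concrete illustration of discarding such debris, a generalized reader could filter each object's Metadata down to a whitelist of schema keys before writing. The key set below is an assumption for illustration, not our actual schema:

```python
# Hypothetical whitelist of schema keys worth carrying through a long
# processing chain; anything else is format-specific debris and baggage.
SCHEMA_KEYS = {"net", "sta", "chan", "loc", "starttime", "endtime",
               "sampling_rate", "npts"}

def strip_debris(metadata):
    """Return a copy of metadata containing only whitelisted schema keys."""
    return {k: v for k, v in metadata.items() if k in SCHEMA_KEYS}
```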
5. Imported data, to match our framework, will need a uuid to define the "origin" of the waveform. If IRIS web services did not create an impossible bottleneck, this could be a URL defining the web service request. One strong possibility we should discuss is that such a URL is what should be stored in the database, but for now I had to create an id since I'm reading these data from a file and had no way to reconstruct the web service request. The function in ensemble.py with this signature:
does this by this sequence of code in that function:
i.e. it uses our ProcessingHistory method newid to get a unique id string and then saves it to MongoDB with the key "seed_file_id" (key subject to change as the comment says). This will work, but it is not at all ideal because if one destroyed the collection where this index is stored there would be no way to recover anything about the id's relation to the original data. Further, the file information stored elsewhere is volatile and could become equally meaningless. The URL for web services is the only long-term solution to this dilemma I can think of now, but it is worth a discussion. Other imports will have similar issues, only potentially far worse since a URL may not exist. My recommendation for now is to punt this down the road and say import collections like wf_miniseed used in ensemble.py are scratch collections used for format conversions. Then we only need to make sure we preserve enough information in processing collections like wf_Seismogram to uniquely identify the source. For SEED data that means keeping net, sta, chan, and loc codes and the absolute time interval. For other formats this is an open question.
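The sequence described above amounts to something like the following sketch. Here uuid4 stands in for ProcessingHistory's newid method and a plain dict stands in for the MongoDB document, so this is illustrative only; the extra source-identifying keys are the SEED attributes recommended above.

```python
import uuid

def assign_origin_id(doc):
    """Generate a unique id string and record it under the key
    'seed_file_id' (key subject to change, as noted above).  In MsPASS
    the id would come from ProcessingHistory.newid(); uuid4 is a
    stand-in for this sketch."""
    origin_id = str(uuid.uuid4())
    doc["seed_file_id"] = origin_id
    return origin_id

# The id alone is meaningless if the collection holding it is destroyed,
# so the document should also keep enough to identify the source:
# net/sta/chan/loc plus the absolute time interval (epoch times here).
doc = {"net": "TA", "sta": "109C", "chan": "BHZ", "loc": "",
       "starttime": 1262304000.0, "endtime": 1262307600.0}
oid = assign_origin_id(doc)
```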
A key point to close this lengthy introduction: I think for now we should not impose any restrictions on import collections like wf_miniseed defined in ensemble.py. We need only require that readers like ensemble.py write Metadata fields with keys defined in our schema for any output to processing collections like wf_Seismogram and wf_TimeSeries.
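A reader could enforce that requirement with a small check before writing to a processing collection. The required-key set below mirrors the SEED case discussed above (net, sta, chan, loc plus the absolute time interval) and is an assumption for illustration, not the actual MsPASS schema:

```python
# Hypothetical required keys for output to processing collections like
# wf_TimeSeries and wf_Seismogram; the set is an assumed schema fragment.
REQUIRED_KEYS = {"net", "sta", "chan", "loc", "starttime", "endtime"}

def missing_schema_keys(metadata):
    """Return a sorted list of required keys absent from metadata.
    An empty list means the record is safe to write."""
    return sorted(REQUIRED_KEYS - metadata.keys())
```

A writer would call this and refuse (or log and skip) any record for which the returned list is non-empty.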