obspy response recovery is questionable #129
Replies: 4 comments
-
A different point arises if we dig into this. The site and channel related code needs to use the schema to remove hard-coded names like 'serialized_channel_data'. I am not the one to do that, for multiple reasons.
-
These inventory and catalog related database methods were created before we had the database API design. Now looking at it, maybe we should consider moving some of the methods into the preprocessing module? I don't think we have any tests for this group of methods right now; they are not called in the new database API.
-
It is a good idea, I think, to move all of those into mspasspy.preprocessing.seed, since all of them are preprocessing steps to build a valid site, channel, and/or source collection. The site and channel functions definitely belong in the seed directory. We may want a different directory and module structure for source; handling source data is a different problem than handling station data. No matter where the functions end up in the module structure, I think we do need to modify them at the same time to use the new schema structure and reduce the hard-coded names.
-
This is not exactly the same issue as the title of this discussion section, but it is closely enough related that I will put it here. In writing the getting started jupyter notebook I discovered a weird feature of obspy. I considered reporting this on the obspy issues page, but it isn't really their problem; I think it is a problem with stationxml that creates an inconsistency in the abomination they call an Inventory. While I'm being philosophical, I want to assert this is a lesson to all readers who write code in modern OOP languages like python: think about data structures carefully and don't make them more complicated than necessary. As I think I mentioned earlier, when working on this it became clear to me that Inventory is just a python image of what the FDSN stationxml format can define. The problem is that stationxml has to allow for a large range of complexity, which creates the potential for multiple tree structures describing the same data. That is what I am pretty sure is happening here.

To get to the point, the summary is this: data read from stationxml files stored by obspy's mass downloader are parsed into a different tree structure than data downloaded directly with web services. For the record, here are some specifics. In my tutorial I used this incantation to call web services directly:
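The call was of this general form; the client name, network code, channel pattern, and time window below are placeholders for illustration, not the exact query that returned the 446 stations:

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# Placeholder query: the network, channel, and time window are NOT the
# exact parameters used in the tutorial, just the same kind of call.
client = Client("IRIS")
inv = client.get_stations(
    network="TA",
    channel="BH?",
    starttime=UTCDateTime("2011-01-01"),
    endtime=UTCDateTime("2012-01-01"),
    level="response",
)
print(type(inv))  # obspy.core.inventory.inventory.Inventory
```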
That returns stationxml data for 446 stations. You should be able to run the commands and get the same result. If you run this little set of lines:
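Something along these lines, assuming the Inventory from the call above is in a variable named inv; the print format here is my own, not the lines from the notebook:

```python
# Summarize the Inventory tree: for data pulled directly from web
# services the stations are grouped under a small number of Network
# objects.
for net in inv:
    print(net.code, "has", len(net.stations), "stations")
print("total Network objects:", len(inv.networks))
```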
you should get something like this:
The weirdness is that if you read from a set of files like those I pushed to stampede2 for our test data, you get something quite different. Here is the read line I used (you should be able to readily adapt this to get a similar result by changing the path to whatever is appropriate for stampede2):
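In spirit it was a one-liner like this, with a placeholder path; read_inventory is handed a wildcard so it parses every matching file into a single Inventory:

```python
from obspy import read_inventory

# Placeholder path: point the wildcard at the directory holding the
# single-station stationxml files (e.g. the stampede2 test data area).
inv_files = read_inventory("/path/to/stationxml_files/*.xml")
```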
Now if you run a comparable little iterator loop like this one:
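Comparable to the loop above, just pointed at the file-based Inventory (again, the print format is illustrative):

```python
# Same summary loop, but on the Inventory assembled from many
# single-station files.  Each file becomes its own Network entry, so
# the list is long and each entry reports only one station.
for net in inv_files:
    print(net.code, "has", len(net.stations), "stations")
print("total Network objects:", len(inv_files.networks))
```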
You will get a long list of lines like these:
Why this happens is that the read_inventory function is parsing a large set of files, and each of those files has data for one station and one station only. In contrast, the call to client.get_stations gets all the data in one large file. Thus we get two inventory objects with comparable data stored in completely different tree structures. I think I can fix this rather quickly. I always thought the structure of what I got back from read_inventory on files was weird, and now I know why. If this works as I think it will, we can handle this completely under the hood. I'm writing this long comment to preserve this knowledge, however, because it could come back to bite us sometime in the future if there is some other weird permutation of the stationxml files we haven't seen yet. Also, I emphasize this is a lesson in "keep it simple, stupid".
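For what it is worth, one possible way to flatten this under the hood is to merge Network objects that share a code after reading. This is only a sketch of the idea, not necessarily the fix that will actually go in, and it ignores the complication of conflicting metadata epochs for the same network code:

```python
import copy

from obspy.core.inventory import Inventory


def merge_networks(inv):
    """Collapse an Inventory built from many single-station files
    (one Network object per file) into one Network object per code.
    Sketch only: conflicting epochs for the same code are not handled."""
    merged = {}
    for net in inv:
        if net.code not in merged:
            merged[net.code] = copy.deepcopy(net)
        else:
            merged[net.code].stations.extend(net.stations)
    return Inventory(networks=list(merged.values()), source=inv.source)
```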
-
Yet another potential problem revealed in revising the documentation. While discussing the api for handling site and channel I dug into the way we implemented pickling the channel data. Here is the specific block of code that I am uncertain is correct:
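The literal block is not reproduced here; in essence it does something like the following, where the function name and document dict are my own illustration:

```python
import pickle


def add_serialized_channel(doc, chan):
    """Illustration only, not the literal block from the repository:
    pickle the obspy Channel object and store the raw bytes in the
    channel document under the serialized_channel_data key."""
    doc["serialized_channel_data"] = pickle.dumps(chan)
    return doc
```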
What concerns me is not the structure of that code, but what is saved. As I recall, and the code more or less tells me that recollection is correct, we are serializing what obspy calls a "Channel" object/class. It is described here.
What concerns me when I look at that page is that there is no method in Channel to retrieve the obspy Response object. That is weird, because they have a plot method, which obviously uses that data, suggesting the class actually does hold the response data in some form and the api just hides it.
I think we need to design a simple test that retrieves one of the 'serialized_channel_data' attributes, runs loads on the result to restore it, and verifies it contains response data in some form. There are thousands of stationxml files in the raw data for the tutorial that could be used to design the test. This is not a time-critical test, but it is one we need to do before our initial release, because for some parts of our community response data are critical.
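A sketch of the kind of test meant here, assuming pymongo access to a channel collection; the database name, host, and helper name are illustrative, and whether the response attribute survives the pickle round trip is exactly what the test has to establish:

```python
import pickle

from pymongo import MongoClient


def check_channel_response(db_name="mspass", host="localhost", port=27017):
    """Pull one channel document, unpickle serialized_channel_data, and
    report whether the restored obspy Channel carries a Response object.
    Sketch only: collection and key names assume the current schema."""
    db = MongoClient(host, port)[db_name]
    doc = db.channel.find_one({"serialized_channel_data": {"$exists": True}})
    chan = pickle.loads(doc["serialized_channel_data"])
    resp = chan.response  # obspy stores the Response here as an attribute
    print("response restored:", resp is not None)
    if resp is not None:
        print("number of stages:", len(resp.response_stages))
    return resp
```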