database verify program #152
-
I think that's all the key tests we need. Actually, in my DB refining implementation, I changed the YAML file and added a 'constraint' key to every key in the wf collections schema, which is what we discussed last week (the week before the snowstorm in TX). For the issue of generating thousands of harmless messages, I think we need to define which problems are classified as harmful versus harmless. I think the type mismatches in test 2 and test 3 are harmful because, without the correct data type or required attributes, other functions might not work. For the last point, I agree with you that we should handle those undefined attributes carefully when reading/saving data.
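To make that concrete, here is one hypothetical way a verify test could use the proposed 'constraint' field to split detected problems into harmful and harmless. The constraint values and the function itself are illustrative assumptions, not anything already in the schema or the prototype:

```python
# Hypothetical illustration: use a per-attribute 'constraint' value from the
# schema to decide whether a detected problem is harmful or harmless.
def classify_problem(key, problem_kind, schema_entry):
    """Return 'harmful' or 'harmless' for a detected problem.

    schema_entry is the schema record for key, assumed to carry the
    proposed 'constraint' value (e.g. 'required', 'normal', 'optional').
    """
    constraint = schema_entry.get("constraint", "optional")
    if problem_kind in ("missing", "type_mismatch") and constraint == "required":
        # missing or wrong-type required data can break downstream functions
        return "harmful"
    # undefined or optional relics are only worth reporting on request
    return "harmless"
```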
-
I have revised the prototype to make it a lot more solid and am ready to push it to the repository. The problem is this is a new kind of thing we haven't had before, and I need some guidance on how you want it structured. That is, what I have is a Python script set up as a command line program run from the unix shell. It has args. For the record, with the version I have right now this is what --help yields:
I used the standard argparse builtin to construct the command line interface - pretty slick compared to parsing args in a C program. The issues are:
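(For reference, a minimal argparse-driven entry point for a program like this might look like the sketch below; the program name, options, and test names are placeholders, not the actual interface.)

```python
# Sketch of an argparse-based verify CLI; test names, options, and defaults
# here are illustrative placeholders, not the actual interface.
import argparse


def main():
    parser = argparse.ArgumentParser(
        prog="dbverify",
        description="Run consistency tests on a MsPASS MongoDB database.",
    )
    parser.add_argument("dbname", help="name of the database to scan")
    parser.add_argument(
        "-t", "--test",
        choices=["check_required", "check_types", "check_undefined"],
        default="check_undefined",
        help="which verify test to run",
    )
    parser.add_argument(
        "-c", "--collection", default="wf_TimeSeries",
        help="collection to scan",
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="report harmless issues as well as harmful ones",
    )
    args = parser.parse_args()
    # dispatch to the selected test function here
    print(f"running {args.test} on {args.dbname}.{args.collection}")


if __name__ == "__main__":
    main()
```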
-
As I was writing the man page material I came across something I need to discuss with you guys. One of the test functions I have developed is named "check_required". The concept is that the user should be able to run that test to verify a set of attributes is defined, and of the right type, in one or more collections. Each collection has a set of data that should be defined as required. I can think of three ways to do this and I would like to get your opinions before I set this in cement. As an old guy my views may not match yours:
I initially thought 2 would be the right solution, but I am pretty certain it is not. Here is a perfectly good example. Our current site (and channel) collections contain three attributes I would count among the "required" attributes: lat, lon, and elev. A problem is that if one wanted to use that same structure for active source data there is a disconnect. Most active source packages use local cartesian coordinates instead of lat-lon, so the base metadata loaded would have x, y, and elev data. Hence, for such data lat and lon would be a problem - solvable, but an unnecessary constraint in my view. So, I think it is 1 or 3. I recommend we start with 1 because it is simple and consistent with the rest of the command line interface. Down the road, if it proves too ponderous, we can easily add an option to load a general set of defaults through a pf or yaml file; i.e. we could add 3 later and not impact any job scripts people used earlier. 3 is mostly a convenience. What do you think? P.S. Another model might be using the schema definition to define the default and using -r only to override the schema. That might be the best of all.
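To make option 1 concrete, here is a rough sketch of what a check_required test might look like when the required keys (and expected types) are supplied by the caller, e.g. from a -r argument or a schema-derived default. The function name, signature, and the database/collection names are assumptions, not the existing prototype:

```python
# Rough sketch of a check_required test; pymongo-style access is assumed and
# the key/type map is supplied by the caller (from a -r argument or a schema
# default), not hardcoded in the program.
from pymongo import MongoClient


def check_required(db, collection, required):
    """Scan a collection for documents missing required keys or holding
    values of the wrong type.

    required maps key name -> expected Python type,
    e.g. {"lat": float, "lon": float, "elev": float}.
    Returns a list of (document _id, problem description) tuples.
    """
    problems = []
    for doc in db[collection].find():
        for key, expected_type in required.items():
            if key not in doc:
                problems.append((doc["_id"], f"missing required key {key}"))
            elif not isinstance(doc[key], expected_type):
                problems.append(
                    (doc["_id"],
                     f"{key} has type {type(doc[key]).__name__}, "
                     f"expected {expected_type.__name__}")
                )
    return problems


if __name__ == "__main__":
    client = MongoClient()        # assumes a local MongoDB instance
    db = client["mspass_demo"]    # hypothetical database name
    required = {"lat": float, "lon": float, "elev": float}
    for _id, msg in check_required(db, "site", required):
        print(_id, msg)
```

With that shape, switching between option 1, option 3, or the schema-default model only changes where the required dict comes from, not the test itself.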
-
I have written a prototype, standalone database verify program, and before I go any further I want to get some feedback.
First, what are the tests that a verify program should run? I have implemented the following:
Are there other key tests you can think of that we need?
The other issue is how to make a tool like this useful. We must not make the mistake Antelope makes with the program they call "dbverify". It is absurd because it usually generates tens of thousands or more messages and 99% of them are harmless. The point is we have to flag real problems and reduce extraneous messages about problems that are mostly harmless.
I have a case in point for my prototype. If the reader were run on data I have loaded from the getting started tutorial, it would fail if it were being "pedantic", but the issues present in the database are completely harmless. The program flags them as undefined attributes. They are present because the process of reading the data through obspy and miniseed leaves a number of relic attributes in the wf_TimeSeries collection. What they are is not the point. The issue is that it is very, very easy to put undefined attributes in the database with MongoDB, and we need to: (a) always handle such attributes cleanly on reads and writes, and/or (b) have good tools for the user to "clean" problems detected by a more final version of this dbverify prototype I just wrote.
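For illustration, a detection pass for those relic attributes could look like the sketch below; schema_keys stands in for whatever set of keys the real schema defines for the collection, and nothing here is the actual API:

```python
# Sketch of an "undefined attribute" scan: flag any document keys that are
# not defined in the collection schema.  schema_keys is a stand-in for
# however the real schema would be queried.
def find_undefined_attributes(db, collection, schema_keys):
    """Return a dict mapping each undefined key to the number of documents
    that contain it, so harmless relics can be summarized in one line
    instead of one message per document."""
    counts = {}
    for doc in db[collection].find():
        for key in doc:
            if key != "_id" and key not in schema_keys:
                counts[key] = counts.get(key, 0) + 1
    return counts
```

Summarizing by key rather than emitting one message per document is one simple way to keep the output from ballooning into the tens of thousands of messages described above.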
There are some important usability issues for a program like this too. It is clear to me that this, along with some set of cleaning procedures, will be an essential tool for our users. I think more discussion of this design is an essential part of our database API design.