database verify program #152
-
I think that's all the key tests we need. Actually, in my DB refining implementation, I changed the YAML file and added a 'constraint' key to every key in the wf collections schema, which is what we discussed last week (the week before the snowstorm in TX). For the issue of generating thousands of harmless messages, I think we need to define which problems are classified as harmful versus harmless. I think the type mismatches in test 2 and test 3 are harmful because, without the correct data type or required attributes, other functions might not work. For the last point, I agree with you that we should handle those undefined attributes carefully when reading/saving data.
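To make that concrete, here is one hypothetical way a verify test could use the proposed 'constraint' field to split detected problems into harmful and harmless. The constraint values and the function itself are illustrative assumptions, not anything already in the schema or the prototype:

```python
# Hypothetical illustration: use a per-attribute 'constraint' value from the
# schema to decide whether a detected problem is harmful or harmless.
def classify_problem(key, problem_kind, schema_entry):
    """Return 'harmful' or 'harmless' for a detected problem.

    schema_entry is the schema record for key, assumed to carry the
    proposed 'constraint' value (e.g. 'required', 'normal', 'optional').
    """
    constraint = schema_entry.get("constraint", "optional")
    if problem_kind in ("missing", "type_mismatch") and constraint == "required":
        # missing or wrong-type required data can break downstream functions
        return "harmful"
    # undefined or optional relics are only worth reporting on request
    return "harmless"
```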
-
I have revised the prototype to make it a lot more solid and am ready to push it to the repository. The problem is this is a new kind of thing we haven't had before, and I need some guidance on how you want it structured. That is, what I have is a Python script set up as a command line program run from the unix shell. It has args. For the record, with the version I have right now this is what --help yields:
I used the standard argparse builtin to construct the command line interface - pretty slick compared to parsing args in a C program. The issues are:
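(For reference, a minimal argparse-driven entry point for a program like this might look like the sketch below; the program name, options, and test names are placeholders, not the actual interface.)

```python
# Sketch of an argparse-based verify CLI; test names, options, and defaults
# here are illustrative placeholders, not the actual interface.
import argparse


def main():
    parser = argparse.ArgumentParser(
        prog="dbverify",
        description="Run consistency tests on a MsPASS MongoDB database.",
    )
    parser.add_argument("dbname", help="name of the database to scan")
    parser.add_argument(
        "-t", "--test",
        choices=["check_required", "check_types", "check_undefined"],
        default="check_undefined",
        help="which verify test to run",
    )
    parser.add_argument(
        "-c", "--collection", default="wf_TimeSeries",
        help="collection to scan",
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="report harmless issues as well as harmful ones",
    )
    args = parser.parse_args()
    # dispatch to the selected test function here
    print(f"running {args.test} on {args.dbname}.{args.collection}")


if __name__ == "__main__":
    main()
```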
-
As I was writing the man page material I came across something I need to discuss with you guys. One of the test functions I have developed is named "check_required". The concept is that the user should be able to run that test to verify a set of attributes is defined, and of the right type, in one or more collections. Each collection has a set of data that should be defined as required. I can think of three ways to do this and I would like to get your opinions before I set this in cement. As an old guy my views may not match yours:
I initially thought 2 would be the right solution, but I am pretty certain it is not. Here is a perfectly good example. Our current site (and channel) collections contain three attributes I would count among the "required" attributes: lat, lon, and elev. A problem is that if one wanted to use that same structure for active source data there is a disconnect. Most active source packages use local cartesian coordinates instead of lat-lon, so the base metadata loaded would have x, y, and elev data. Hence, for such data lat and lon would be a problem - solvable, but an unnecessary constraint in my view. So, I think it is 1 or 3. I recommend we start with 1 because it is simple and consistent with the rest of the command line interface. Down the road, if it proves too ponderous, we can easily add an option to load a general set of defaults through a pf or yaml file; i.e. we could add 3 later and not impact any job scripts people used earlier. 3 is mostly a convenience. What do you think? P.S. Another model might be using the schema definition to define the default and using -r only to override the schema. That might be the best of all.
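To make option 1 concrete, here is a rough sketch of what a check_required test might look like when the required keys (and expected types) are supplied by the caller, e.g. from a -r argument or a schema-derived default. The function name, signature, and the database/collection names are assumptions, not the existing prototype:

```python
# Rough sketch of a check_required test; pymongo-style access is assumed and
# the key/type map is supplied by the caller (from a -r argument or a schema
# default), not hardcoded in the program.
from pymongo import MongoClient


def check_required(db, collection, required):
    """Scan a collection for documents missing required keys or holding
    values of the wrong type.

    required maps key name -> expected Python type,
    e.g. {"lat": float, "lon": float, "elev": float}.
    Returns a list of (document _id, problem description) tuples.
    """
    problems = []
    for doc in db[collection].find():
        for key, expected_type in required.items():
            if key not in doc:
                problems.append((doc["_id"], f"missing required key {key}"))
            elif not isinstance(doc[key], expected_type):
                problems.append(
                    (doc["_id"],
                     f"{key} has type {type(doc[key]).__name__}, "
                     f"expected {expected_type.__name__}")
                )
    return problems


if __name__ == "__main__":
    client = MongoClient()        # assumes a local MongoDB instance
    db = client["mspass_demo"]    # hypothetical database name
    required = {"lat": float, "lon": float, "elev": float}
    for _id, msg in check_required(db, "site", required):
        print(_id, msg)
```

With that shape, switching between option 1, option 3, or the schema-default model only changes where the required dict comes from, not the test itself.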
-
I have written a prototype, standalone database verify program, and before I go any further I want to get some feedback.
First, what are the tests that a verify program should run? I have implemented the following:
Are there other key tests you can think of that we need?
The other issue is how to make a tool like this useful. We must not make the mistake Antelope makes with the program they call "dbverify". It is absurd because it usually generates tens of thousands or more messages and 99% of them are harmless. The point is we have to flag real problems and reduce extraneous messages about problems that are mostly harmless.
I have a case in point for my prototype. If the reader were run on data I have loaded from the getting started tutorial, it would fail if it were being "pedantic", but the issues present in the database are completely harmless. The program flags them as undefined attributes. They are present because the process of reading the data through obspy and miniseed leaves a number of relic attributes in the wf_TimeSeries collection. What they are is not the point. The issue is that it is very, very easy to put undefined attributes in the database with MongoDB, and we need to: (a) always handle such attributes cleanly on reads and writes, and/or (b) have good tools for the user to "clean" problems detected by a more final version of this dbverify prototype I just wrote.
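For illustration, a detection pass for those relic attributes could look like the sketch below; schema_keys stands in for whatever set of keys the real schema defines for the collection, and nothing here is the actual API:

```python
# Sketch of an "undefined attribute" scan: flag any document keys that are
# not defined in the collection schema.  schema_keys is a stand-in for
# however the real schema would be queried.
def find_undefined_attributes(db, collection, schema_keys):
    """Return a dict mapping each undefined key to the number of documents
    that contain it, so harmless relics can be summarized in one line
    instead of one message per document."""
    counts = {}
    for doc in db[collection].find():
        for key in doc:
            if key != "_id" and key not in schema_keys:
                counts[key] = counts.get(key, 0) + 1
    return counts
```

Summarizing by key rather than emitting one message per document is one simple way to keep the output from ballooning into the tens of thousands of messages described above.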
There are some important usability issues for a program like this too. It is clear to me that this, along with some set of cleaning procedures, will be an essential tool for our users. I think more discussion of this design is an essential part of our database API design.