revision for handling dead data #443

pavlis · 2023-07-06T10:53:18Z

pavlis
Jul 6, 2023
Collaborator

As you know I've been working on cleaning up some redundancy and inefficiencies in the Database class. I encountered some issues in how we currently handle data marked dead. I came across two things I don't think we handle well:

1. The only sign left over from a dead datum is a tombstone subdocument left in elog. One could find dead data that way easily with MongoDB, but it strikes me now as a bit of a collision of concepts to mix dead records into elog. Indeed, all dead data should have an elog entry, but the real-world analogy to the current situation is having graves scattered randomly around the landscape instead of localized in one place.
2. We now have a mechanism to kill full ensembles. There is an ambiguity in how that should be recorded, because an ensemble kill would leave a record in the ensemble elog, not in the members. How to record that has not been addressed. I'm reworking the code for saving ensembles and am not quite sure we even handle this cleanly at present. The point is, we need to handle dead ensembles cleanly and make it clear when data are killed at the ensemble level.

I have two proposals to handle each of these issues:

1. I think all tombstones and bodies need to be buried in a special collection. We can and should give it the descriptive name cemetery. The documents in cemetery are the record of the dead. They need only link back to the parent waveform that was killed during processing.
2. We should handle ensembles using documents with a different structure saved in the cemetery. Ensembles marked dead should leave a different kind of record than ensembles with some members killed. The latter should look just like a datum killed when running atomic operations. Ensemble kills should leave a more complex document with subdocuments containing tombstone records for members.
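To make the two record types concrete, here is a minimal sketch of what the documents might look like, written as plain Python dicts. Every field name here (`type`, `parent_id`, `elog`, `members`) and both helper functions are hypothetical placeholders for illustration, not an agreed schema.

```python
def atomic_tombstone(parent_id, elog_entries):
    """Record for one killed datum: a link to the parent plus its elog."""
    return {
        "type": "atomic",          # hypothetical discriminator field
        "parent_id": parent_id,    # link back to the parent waveform
        "elog": list(elog_entries),
    }


def ensemble_tombstone(ensemble_elog, members):
    """Record for a whole-ensemble kill.

    `members` is an iterable of (parent_id, elog_entries) pairs; each one
    becomes a tombstone subdocument inside the ensemble record.
    """
    return {
        "type": "ensemble",
        "elog": list(ensemble_elog),   # why the ensemble was killed
        "members": [atomic_tombstone(pid, log) for pid, log in members],
    }
```

A member killed inside a live ensemble would use `atomic_tombstone`, so it looks the same as a datum killed in an atomic operation; only a kill at the ensemble level produces the nested form.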

Before I go too far down this road I think we need to agree on whether or not this is a prudent choice. The key benefit is putting records of the dead in an easier, more logical organization. Fixing the more esoteric ensemble issue is of secondary importance in my view.

wangyinz · 2023-07-07T15:14:49Z

wangyinz
Jul 7, 2023
Maintainer

I think that design sounds good. I think it makes sense to explicitly manage the killed data. Note that some would argue that it is overkill to manage dead data in a database, but I think it makes sense if we want to preserve the history of the whole workflow. That said, I think in a lot of cases we don't really care about the kills. We should probably implement it as optional, so that the killed data is only managed when the user explicitly asks for it. What do you think?

2 replies

pavlis Jul 7, 2023
Collaborator Author

Indeed it is overkill to manage dead data, but it is also true that it is a necessary evil for reproducibility. A key distinction, actually, between earthquake data and reflection data is that kills are considered outliers in reflection processing. With earthquake data a large fraction of data will often be killed. The other issue in reflection processing is that dead data are always carried along as baggage until they can be discarded in something like a stack. The reason is that the matrix model of multichannel data sometimes makes that essential to avoid confusing the system.

That said, you give me reason to continue with an approach I'd already started. That is, I think we can and should handle dead data with the Undertaker class. It only needs some minor changes to work with atomic data (the current version was written for ensembles only). The Undertaker class has a cremate method that does what you suggest: it leaves little to nothing behind. For ensembles, it returns a copy with all the dead data removed. I think for atomic data cremate should return a default constructed version of the same type as the data it handled, to allow its use in a bag/rdd. The alternative is to return None, but that could cause other problems. A default constructed object will be dead by definition, so it should not cause downstream problems if left in a bag/rdd.

We also have a (new) option in Undertaker: a mummify method. For atomic data, the mummy returned is a copy of the parent after a call to set_npts(0), which clears the sample data but does not alter Metadata, elog, or history. cremate clears all of those.

I think new writers need to have an option saying how the writer should handle dead data. It seems to me the options are: bury (save to db), cremate, and mummify.
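As a rough illustration of such a writer option, the sketch below validates the choice and routes dead data to a matching handler. The names `DeadPolicy`, `resolve_policy`, `write`, and the `on_dead` argument are all hypothetical, and data are represented as dicts with a `live` key purely to keep the example self-contained.

```python
from enum import Enum


class DeadPolicy(Enum):
    """Hypothetical names for the three choices listed above."""
    BURY = "bury"        # save a record of the kill to the database
    CREMATE = "cremate"  # leave little to nothing behind
    MUMMIFY = "mummify"  # keep Metadata, elog, and history; drop samples


def resolve_policy(value):
    """Accept a DeadPolicy or its string value; reject anything else."""
    if isinstance(value, DeadPolicy):
        return value
    try:
        return DeadPolicy(str(value).lower())
    except ValueError:
        valid = ", ".join(p.value for p in DeadPolicy)
        raise ValueError(f"on_dead must be one of: {valid}") from None


def write(datum, save_live, handlers, on_dead="bury"):
    """Toy writer: live data go to save_live, dead data to one handler.

    `handlers` maps each DeadPolicy to a callable taking the dead datum.
    """
    policy = resolve_policy(on_dead)
    if datum["live"]:
        return save_live(datum)
    return handlers[policy](datum)
```

Validating the option up front means a misspelled policy fails immediately, rather than partway through a long bag/rdd run.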

pavlis Jul 9, 2023
Collaborator Author

I have a prototype for the new Undertaker class that implements bury, cremate, and mummify for any mspass data object. It also retains the method with a name that is the best programming joke ever: bring_out_your_dead, although it is only appropriate for ensembles. This class should be viewed as the way to regularize handling of dead data. The methods have the following behavior:

bury - for all data types, stores elog and (optionally) history data to a specified collection, which now defaults to "cemetery". It does essentially the same thing for dead data as the save_data method of the current Database class. A distinction is that atomic data are "mummified" (see below) before being returned, and ensembles have the dead members removed completely. Atomic data should perhaps be returned as a default constructed copy, but the overhead of a mummy is pretty small.

mummify - does little more than call the set_npts method to set all atomic components to 0 length. For ensembles, that leaves dead components with 0 length in the return, but all other containers are unchanged (well, npts in the Metadata container will change).

cremate - reduces all dead data to ashes. The ashes for atomic data are the default constructed version of the data type. For ensembles, it returns an empty, default constructed ensemble when the ensemble is marked dead. For the more common situation with dead members, it returns a copy of the ensemble with the dead members removed.

bring_out_your_dead - is as before. Returns a pair of ensembles: one with all the live data and the other with all the dead data. Note I'm using this in the reworking of Database when saving ensembles. Makes the logic handling live and dead data differently much clearer than the current situation where that feature is deeply buried inside the save_data method.
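The behaviors above can be sketched with toy stand-in types. This is not the prototype code: `Datum`, `Ensemble`, and `ToyUndertaker` are hypothetical, the cemetery is a plain list rather than a MongoDB collection, and two details are assumptions (both ensembles returned by bring_out_your_dead keep the parent's live flag, and bury does not treat the whole-ensemble kill case).

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Datum:
    """Toy atomic datum; a default constructed one is dead and empty."""
    live: bool = False
    samples: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    elog: list = field(default_factory=list)


@dataclass
class Ensemble:
    """Toy ensemble: a live flag, member data, and its own elog."""
    live: bool = False
    member: list = field(default_factory=list)
    elog: list = field(default_factory=list)


class ToyUndertaker:
    def __init__(self):
        self.cemetery = []  # stands in for the "cemetery" collection

    def bring_out_your_dead(self, ens):
        """Return (living, dead) ensembles split on each member's live flag."""
        living = [d for d in ens.member if d.live]
        dead = [d for d in ens.member if not d.live]
        return (Ensemble(ens.live, living, list(ens.elog)),
                Ensemble(ens.live, dead, list(ens.elog)))

    def mummify(self, d):
        """Zero the samples of dead data; keep metadata and elog."""
        if isinstance(d, Ensemble):
            return Ensemble(d.live, [self.mummify(m) for m in d.member],
                            list(d.elog))
        if d.live:
            return d
        mummy = deepcopy(d)
        mummy.samples = []
        mummy.metadata["npts"] = 0  # mimics the effect of set_npts(0)
        return mummy

    def cremate(self, d):
        """Reduce dead data to a default constructed object."""
        if isinstance(d, Ensemble):
            if not d.live:
                return Ensemble()
            return self.bring_out_your_dead(d)[0]
        return d if d.live else Datum()

    def bury(self, d):
        """Record dead data in the cemetery, then return what survives."""
        if isinstance(d, Ensemble):
            living, dead = self.bring_out_your_dead(d)
            for m in dead.member:
                self.cemetery.append({"elog": list(m.elog)})
            return living
        if d.live:
            return d
        self.cemetery.append({"elog": list(d.elog)})
        return self.mummify(d)
```

Note that cremate on a dead atomic datum returns an object that is itself dead, which is what keeps it harmless if left in a bag/rdd.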
