
gridfs read/write robustness issue #570

Open
pavlis opened this issue Oct 23, 2024 · 2 comments
Comments

@pavlis (Collaborator) commented Oct 23, 2024

Something I have not yet figured out is causing the following exception to be thrown while reading Seismogram objects stored in gridfs:

ValueError: Database._read_data_from_gridfs: Size mismatch in sample data. Number of points in gridfs file = 12001 but expected 36003

That error comes from the method that raises the ValueError. For reference, here is the current _read_data_from_gridfs code with the docstring stripped:

def _read_data_from_gridfs(self, mspass_object, gridfs_id):
    gfsh = gridfs.GridFS(self)
    fh = gfsh.get(file_id=gridfs_id)
    if isinstance(mspass_object, (TimeSeries, Seismogram)):
        if not mspass_object.is_defined("npts"):
            raise KeyError(
                "Database._read_data_from_gridfs:  Required key npts is not defined"
            )
        else:
            if mspass_object.npts != mspass_object["npts"]:
                message = "Database._read_data_from_gridfs: "
                message += "Metadata value for npts is {} but datum attribute npts is {}\n".format(
                    mspass_object["npts"], mspass_object.npts
                )
                message += "Illegal inconsistency that should not happen and is a fatal error"
                raise ValueError(message)
        npts = mspass_object.npts
    if isinstance(mspass_object, TimeSeries):
        # fh.seek(16)
        float_array = array("d")
        float_array.frombytes(fh.read(mspass_object.get("npts") * 8))
        mspass_object.data = DoubleVector(float_array)
    elif isinstance(mspass_object, Seismogram):
        np_arr = np.frombuffer(fh.read(npts * 8 * 3))
        file_size = fh.tell()
        if file_size != npts * 8 * 3:
            # Note we can only detect the cases where given npts is larger than
            # the number of points in the file
            emess = (
                "Database._read_data_from_gridfs: Size mismatch in sample data. Number of points in gridfs file = %d but expected %d"
                % (file_size / 8, (3 * mspass_object["npts"]))
            )
            raise ValueError(emess)
        # v1 did a transpose on write that this reversed - unnecessary
        # np_arr = np_arr.reshape(npts, 3).transpose()
        np_arr = np_arr.reshape(3, npts)
        mspass_object.data = dmatrix(np_arr)
    else:
        raise TypeError("only TimeSeries and Seismogram are supported")
    if mspass_object.npts > 0:
        mspass_object.set_live()
    else:
        mspass_object.kill()
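To make the failure mode concrete, here is a minimal, self-contained sketch of the Seismogram branch's size check. It uses io.BytesIO in place of the gridfs file handle, and read_seismogram_samples is a hypothetical helper written for illustration, not MsPASS code. A file holding only one channel's worth of samples (12001 doubles) reproduces the 12001-vs-36003 mismatch reported above:

```python
import io

import numpy as np


def read_seismogram_samples(fh, npts):
    # Hypothetical helper mimicking the Seismogram branch above:
    # read 3*npts doubles and verify the byte count actually consumed
    # matches before reshaping to a (3, npts) matrix.
    np_arr = np.frombuffer(fh.read(npts * 8 * 3))
    file_size = fh.tell()  # bytes actually read from the handle
    if file_size != npts * 8 * 3:
        raise ValueError(
            "Size mismatch in sample data. "
            "Number of points in gridfs file = %d but expected %d"
            % (file_size / 8, 3 * npts)
        )
    return np_arr.reshape(3, npts)


# A buffer holding only 12001 doubles, while metadata claims a
# 3-component Seismogram with npts = 12001 (expected 3 * 12001 = 36003).
npts = 12001
short_file = io.BytesIO(np.zeros(npts, dtype=np.float64).tobytes())
try:
    read_seismogram_samples(short_file, npts)
except ValueError as err:
    print(err)  # ... gridfs file = 12001 but expected 36003
```

Note that, as the inline comment in the real code says, only a too-short file is detectable this way: a file longer than 3*npts doubles still satisfies the read and the check passes.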

That provides the context for this error message. I am not at all sure what in my dataset is causing this error to be thrown. The real issue, though, isn't what is causing that error. I suspect it is something anomalous caused by an unrelated problem I'm chasing in the bundle algorithm, which I suspect is creating the flawed document this method uses to try to read these data. That is, this is a rare problem that may not indicate a flaw in the writer.

THE ISSUE is that I think it is a mistake for this method to throw ValueError exceptions in both of the cases where that can happen in this code. Because the function has a mspass_object passed to it, the proper solution in this case is to kill the datum and post an error message, not throw an exception. The reader will be more robust if we make that change.
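The kill-and-post pattern being proposed can be sketched as below. This is a hedged illustration only: FakeDatum and handle_size_mismatch are hypothetical stand-ins, since the real change would operate on TimeSeries/Seismogram objects and post to their ErrorLogger (elog) before calling kill():

```python
class FakeDatum:
    # Hypothetical stand-in for an atomic MsPASS datum.  The real
    # objects carry an ErrorLogger as .elog; a plain list is used
    # here so the sketch is self-contained.
    def __init__(self):
        self.alive = True
        self.elog = []

    def kill(self):
        self.alive = False


def handle_size_mismatch(d, file_npts, expected_npts):
    # Instead of raising ValueError, post the diagnostic to the
    # datum's error log and mark it dead, so the reader can return
    # the killed datum for downstream triage.
    if file_npts != expected_npts:
        d.elog.append(
            "Database._read_data_from_gridfs: size mismatch: "
            "file has %d points but expected %d" % (file_npts, expected_npts)
        )
        d.kill()
    return d


d = handle_size_mismatch(FakeDatum(), 12001, 36003)
print(d.alive)  # False
```

The point of the design is that a single bad document then kills one datum instead of aborting a whole parallel read job.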

@wangyinz (Member) commented:

Yeah, this could probably just be a kill. I think it was originally treated as an error because _read_data_from_gridfs is one of the underlying methods called by the read_data methods. I think the newly implemented read_data can correctly handle the kill here.

@pavlis (Collaborator, Author) commented Oct 24, 2024

Ok, I'll make those changes. I'll also see if, while debugging the original problem, I can figure out why that error was created, but that will be a challenge.

pavlis added a commit that referenced this issue Oct 28, 2024
wangyinz pushed a commit that referenced this issue Dec 25, 2024
* Initial check-in of revised colored noise regularized deconvolution algorithm

* Fix a hole in the C++ api for all atomic seismic data objects.  There was a setter for the protected t0shift
attribute but no getter.  This adds the getter called get_t0shift.

* add time_axis method to BasicTimeSeries to simply matplotlib graphics

* Change method named deconvolve to process to mesh with RFdeconProcessor.py python interface

* Fix bug from missing call to kill result with an invalid error

* Checkpoint this version with scaffolding after a bug fix

* Fix bug in failing to handle all zero arrays

* Fix bug in handling duplicates in attributes_to_load list and load_if_defined list

* Bug fixes with debug scaffolding - commit to save state but needs more work

* Save a version before a significant change

* Save this version before an experimental improvement in case I have to back it out

* Fix rare bug in pad mode created by subsample t0 ambiguity

* Fix handling in pad mode of off by one issue from t0 subsample ambiguity - this is closely linked to previous commit of a C++ file

* Add several new functions and fix a number of bugs - working version processes all test data successfully

* Add sort_ensemble function

* Partially tested version of new CNRDecon module - not yet working but preserve this version

* Preserve debug version with print scaffolding

* Version passing initial (incomplete) tests

* Minor bug fix in array method found in initial, incomplete test

* Temporary test file for interactive use and not with pytest

* format with black

* Untested fix for bug in handling undefined loc code in some situations.  Also fixes a few other blemishes

* Patch to address issue #570

* Add new implementation of EstimateBandwidth - note this version has plot scaffolding and should not be used for production

* Change handling of read errors to address github issue #570

* Add a TODO comment that may indicate a future problem

* Update docstrings

* Fix typo before push

* minor changes to save before removing scaffolding

* Convert debug scaffolding graphics to a new function useful for validating broadband_snr_QC behavior

* Format with black

* Break out function for robust filtering with estimated bandwidth

* Implement the new filter function - accidentally overwrote first attempt cached incorrectly as previous commit

* Fix bug failing to set ensemble live before returning

* Fix handling of high f corner to silence filter warnings - prevent high corner exceeding 80 percent Nyquist

* Several bug fixes and add arrival demean function

* Fix error in computation in beam_coherence function

* Fix pytest function for snr to match new algorithm

* Fix errors in new tests for time_axis method

* Update tests for some changes in error handling

* Fix bugs detected in updated pytest run

* Fix tests handling errors that use elog instead of throwing exceptions

* Disable pickle tests for now.  Needs work with pybind11 code to get that to work

* Clean up formatting with black

* Remove elog related to updating "sampling_rate" in resample.py

* 🎨 Format Python code with psf/black

* feat: make seisbench dependency optional  (#578)

* updated config files

* fix github test failure

* Resolve conflicts with master branch

* Fixes to get boost to compile on macos - may be a problem

* Add boost serialization code to all deconvolution operators - needed as step one for parallelization

* Add C++ test program for deconvolution operator serialization

* Force Boost_INCLUDE_DIR - seems necessary for macos but should be redundant for linux

* Force SDK 14.5 for Macos - this may need to be disabled in the future

* Add pybind11 code to allow pickle of decon operators

* Minor fix to run in github environment

* Add start of test script for CNRDeconEngine and enable pickle for RFdecon tests

* Fix typo

* Fix bug in handling power spectrum dt when auto setting fft size

* Add code to normalize FIR impulse response outputs and assorted fixes found in testing

* Add c++ tests for ShapingWavelet

* Add feature to post QCmetrics to RF output

* Near final form of pytest scripts with scaffolding commented out

* Partial implementation of tests for new CNR algorithm

* Add master pf files for decon algorithms

* Revert to older version to keep gtree test script from failing

* Fix path for pf file for this test

* Mostly complete test for CNRRFDecon function

* Change name of overloaded method for 3C data

* Change test pf file to match test name - previous used RFdeconProcessor.pf which was confusing

* Fix error in handling time shifts for actual_output method

* Additional changes for timing problem

* Changes to fix problems found with pytest

* Expand testing of CNRDecon module

* Remove test file unintentionally included earlier

* Update yaml cmake to newer version

* Change to use newer yaml version

* Fix error in tag - download 0.8.0

* Revert to 0.7.0 - 0.8.0 fails with cmake

* Revert to yaml-cpp-0.6.3 - all later releases fail from a cmake configuration error

* Fix path for AntelopePf constructors to use current directory

* Disable decon_serialization test run for now

* Add error handlers in C++ code to make CNRDeconEngine more robust

* Fix CNRDecon problems found with pytest

* enable test_md and test_decon_serialization

* enable test_md and test_decon_serialization with cmake file - should have been in earlier commit

* Regularize the use of data files and add comment to Cmake file for maintenance

* Add debug print statements to AntelopePf to sort out pytest failures on github

* Add lines to clear PFPATH env value that caused later tests to mysteriously fail

* Revert to pre 3.9 method to merge two dictionaries - 3.8 build was breaking because of this one line

* Remove print scaffolding used to solve PFPATH issue

* Format with black

* Disable CMAKE_OSX_SYSROOT explicit setting

---------

Co-authored-by: Cloud1e <[email protected]>
Co-authored-by: Jinxin Ma <[email protected]>