diff --git a/docs/source/_static/figures/MapDAGFigure.png b/docs/source/_static/figures/MapDAGFigure.png
new file mode 100644
index 000000000..a40ef0fc3
Binary files /dev/null and b/docs/source/_static/figures/MapDAGFigure.png differ
diff --git a/docs/source/_static/figures/MapProcessing.gif b/docs/source/_static/figures/MapProcessing.gif
new file mode 100644
index 000000000..bebaeb7d5
Binary files /dev/null and b/docs/source/_static/figures/MapProcessing.gif differ
diff --git a/docs/source/_static/figures/ReduceFigure.png b/docs/source/_static/figures/ReduceFigure.png
new file mode 100644
index 000000000..55f6670c8
Binary files /dev/null and b/docs/source/_static/figures/ReduceFigure.png differ
diff --git a/docs/source/_static/figures/graphics/img_example.png b/docs/source/_static/figures/graphics/img_example.png
new file mode 100644
index 000000000..abebf51c4
Binary files /dev/null and b/docs/source/_static/figures/graphics/img_example.png differ
diff --git a/docs/source/_static/figures/graphics/wt_example.png b/docs/source/_static/figures/graphics/wt_example.png
new file mode 100644
index 000000000..6b1dbcec3
Binary files /dev/null and b/docs/source/_static/figures/graphics/wt_example.png differ
diff --git a/docs/source/_static/figures/graphics/wtva_example.png b/docs/source/_static/figures/graphics/wtva_example.png
new file mode 100644
index 000000000..f3a385503
Binary files /dev/null and b/docs/source/_static/figures/graphics/wtva_example.png differ
diff --git a/docs/source/_static/figures/graphics/wtvaimg_example.png b/docs/source/_static/figures/graphics/wtvaimg_example.png
new file mode 100644
index 000000000..4b630bf15
Binary files /dev/null and b/docs/source/_static/figures/graphics/wtvaimg_example.png differ
diff --git a/docs/source/cxx_api/index.rst b/docs/source/cxx_api/index.rst
index 21bf32a24..ef1154293 100644
--- a/docs/source/cxx_api/index.rst
+++ b/docs/source/cxx_api/index.rst
@@ -5,6 +5,7 @@ The MsPASS C++ API's key components are the following classes.
.. toctree::
+ ../cxx_api/mspass
../cxx_api/mspass.utility.Metadata
../cxx_api/mspass.seismic.TimeSeries
../cxx_api/mspass.seismic.Seismogram
diff --git a/docs/source/cxx_api/mspass.rst b/docs/source/cxx_api/mspass.rst
index ebe06bdfa..82a407992 100644
--- a/docs/source/cxx_api/mspass.rst
+++ b/docs/source/cxx_api/mspass.rst
@@ -7,11 +7,12 @@ MsPASS C++ API
-
+
.. .. doxygennamespace:: mspass
.. :members:
diff --git a/docs/source/getting_started/run_mspass_with_docker.rst b/docs/source/getting_started/run_mspass_with_docker.rst
index 077c9cf53..04c08485b 100644
--- a/docs/source/getting_started/run_mspass_with_docker.rst
+++ b/docs/source/getting_started/run_mspass_with_docker.rst
@@ -6,15 +6,18 @@ Run MsPASS with Docker
Prerequisites
-------------
-Docker is required for users to run MsPASS on desktop systems.
-It is the piece of software you will use to run and manage
+Docker is required in normal use to run MsPASS on desktop systems.
+The alternative is a more complicated installation of the components
+built from source as described on
+`this wiki page `__.
+Docker is the piece of software you will use to run and manage
any containers on your desktop system.
Docker is well-supported on all current desktop operating systems and
has simple install procedures described in detail in the
product's documentation found `here `__
The software can currently be downloaded at no cost, but you must have
-administrative priveleges to install the software.
+administrative privileges to install the software.
The remainder of this page assumes you have successfully installed
docker. For Windows or Apple user's it may be convenient to launch the
"docker desktop" as an alternative to command line tools.
@@ -85,8 +88,8 @@ to save your results to your local system. Without the
``--mount`` incantation any results
you produce in a run will disappear when the container exits.
-An useful, alternative way to launch docker on a linux or MacOS system
-is use the shell ``cd`` command in the terminal you are using to make
+A useful, alternative way to launch docker on a linux or MacOS system
+is to use the shell ``cd`` command in the terminal you are using to make
your project directory the "current directory". Then you can
cut-and-paste the following variation of the above into that terminal
window and */home* in the container will be mapped to your
@@ -159,6 +162,26 @@ you are doing before you alter any files with bash commands in this
terminal. A more standard use is to run common monitoring commands like
``top`` to monitor memory and cpu usage by the container.
+If you are using dask on a desktop, we have found many algorithms perform
+badly because of a subtle issue with python and threads. That is, by
+default dask uses a "thread pool" for workers with the number of threads
+equal to the number of cores defined for the docker container.
+Threading with python is subject to poor performance because of
+something called the Global Interpreter Lock (GIL), which prevents
+multithreaded python functions from actually running in parallel with dask. The solution
+is to tell dask to run each worker task as a "process", not a thread.
+(Note pyspark does this by default.) A way to do that with dask is to
+launch docker with the following variant of the command above:
+
+.. code-block::
+
+ docker run -p 8888:8888 -e MSPASS_WORKER_ARG="--nworkers 4 --nthreads 1" --mount src=`pwd`,target=/home,type=bind mspass/mspass
+
+where the value after `--nworkers` should be the number of worker tasks
+you want to have the container run. Normally that would be the number of
+cores defined for the container, which by default is less than the number of
+cores for the machine running docker.
+
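+If you run MsPASS outside the container (e.g. a source install on a
+workstation), a comparable configuration can be made directly with
+dask.distributed. The following is a minimal sketch and is not
+MsPASS-specific; inside the container the MSPASS_WORKER_ARG method
+above is the recommended approach.
+
+.. code-block:: python
+
+   # create a local dask cluster that uses processes, not threads,
+   # to sidestep the GIL limitation described above
+   from dask.distributed import Client
+
+   client = Client(n_workers=4, threads_per_worker=1, processes=True)
+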
Finally, to exit close any notebook windows and the Jupyter notebook
home page. You will usually need to type a `ctrl-C` in the terminal
window you used to launch mpass via docker.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 1f259c4b2..36ea8b3e8 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -6,19 +6,19 @@
MsPASS Documentation
====================
-The Massive Parallel Analysis System for Seismologists
-is an open source framework for seismic data processing
+The Massive Parallel Analysis System for Seismologists
+is an open source framework for seismic data processing
and management. It has three core components:
-* A scalable parallel processing framework based on a
+* A scalable parallel processing framework based on a
dataflow computation model.
* A NoSQL database system centered on document store.
* A container-based virtualization environment.
-The system builds on the `ObsPy `_
-toolkit, with extension built on a rewrite of the
-`SEISPP `_
-package.
+The system builds on the `ObsPy `_
+toolkit, with extension built on a rewrite of the
+`SEISPP `_
+package.
.. .. mdinclude:: ../../README.md
@@ -26,31 +26,67 @@ package.
:maxdepth: 1
:caption: Getting Started
+ getting_started/quick_start
getting_started/run_mspass_with_docker
getting_started/deploy_mspass_with_docker_compose
getting_started/deploy_mspass_on_HPC
getting_started/getting_started_overview
-
+
.. toctree::
:maxdepth: 1
- :caption: User Manual
+ :caption: Introduction
user_manual/introduction
- user_manual/data_object_design_concepts
- user_manual/time_standard_constraints
- user_manual/obspy_interface
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Data Management
+
user_manual/database_concepts
user_manual/CRUD_operations
+ user_manual/mongodb_and_mspass
+ user_manual/normalization
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Seismic Data Objects
+
+ user_manual/data_object_design_concepts
+ user_manual/numpy_scipy_interface
+ user_manual/obspy_interface
+ user_manual/time_standard_constraints
+ user_manual/processing_history_concepts
+ user_manual/continuous_data
+ user_manual/schema_choices
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Data Processing
+
+ user_manual/algorithms
user_manual/importing_data
user_manual/handling_errors
user_manual/data_editing
user_manual/header_math
user_manual/graphics
- user_manual/processing_history_concepts
- user_manual/parallel_processing
- user_manual/normalization
+ user_manual/signal_to_noise
user_manual/adapting_algorithms
+
+.. toctree::
+ :maxdepth: 1
+ :caption: System Tuning
+
+ user_manual/parallel_processing
+ user_manual/memory_management
user_manual/io
+ user_manual/parallel_io
+
+.. toctree::
+ :maxdepth: 2
+ :caption: FAQ
+
+ user_manual/FAQ
+ user_manual/development_strategies
.. toctree::
@@ -60,4 +96,3 @@ package.
python_api/index
cxx_api/index
mspass_schema/mspass_schema
-
diff --git a/docs/source/user_manual/CRUD_operations.rst b/docs/source/user_manual/CRUD_operations.rst
index 9f0ee67a4..bfbc6ee89 100644
--- a/docs/source/user_manual/CRUD_operations.rst
+++ b/docs/source/user_manual/CRUD_operations.rst
@@ -48,7 +48,7 @@ alternative construct:
from mspasspy.db.client import Client
from mspasspy.db.database import Database
dbclient = Client()
- db = Database(dbclient, 'database_name', db_schema='wf_Seismogram')
+ db=Database(dbclient, 'database_name', db_schema='wf_Seismogram')
If your workflow requires reading both TimeSeries and Seismogram
data, best practice (i.e. it isn't required but a good idea)
@@ -63,7 +63,7 @@ the synonymous word "save". Here we list all save methods with a brief
description of each method. Consult the docstring pages for detailed
and most up to date usage:
-1. :py:meth:`save_data ` is probably the most common method you will use. The
+1. :code:`save_data` is probably the most common method you will use. The
first argument is one of the atomic objects defined in MsPASS
(Seismogram or TimeSeries) that you wish to save. Options are
described in the docstring. Here is an example usage:
@@ -84,7 +84,7 @@ and most up to date usage:
normalized collections (:code:`source`, :code:`channel`, and/or :code:`site`) with no
safety checks. We discuss additional common options in a later section.
-2. :py:meth:`save_ensemble_data ` is similar to :code:`save_data` except the first argument
+2. :code:`save_ensemble_data` is similar to :code:`save_data` except the first argument
is an Ensemble object. There are currently two of them: (1) TimeSeriesEnsemble
and (2) SeismogramEnsemble. As discussed in the section
:ref:`data_object_design_concepts` an Ensemble
@@ -101,11 +101,7 @@ and most up to date usage:
are copied verbatim to each member. If previous values existed in any
of the members they will be silently replaced by the ensemble groups version.
- :py:meth:`save_ensemble_data_binary_file `
- is an optimized version of save_ensemble_data. It saves all objects of the
- ensemble into one file, and only opens the file once.
-
-3. :py:meth:`save_catalog ` should be viewed mostly as a convenience method to build
+3. :code:`save_catalog` should be viewed mostly as a convenience method to build
the :code:`source` collection from QUAKEML data downloaded from FDSN data
centers via obspy's web services functions. :code:`save_catalog` can be
thought of as a converter that translates the contents of a QUAKEML
@@ -137,7 +133,7 @@ and most up to date usage:
This particular example pulls 11 large aftershocks of the 2011 Tohoku
Earthquake.
-4. :py:meth:`save_inventory ` is similar in concept to :code:`save_catalog`, but instead of
+4. :code:`save_inventory` is similar in concept to :code:`save_catalog`, but instead of
translating data for source information it translates information to
MsPASS for station metadata. The station information problem is slightly
more complicated than the source problem because of an implementation
@@ -199,31 +195,6 @@ and most up to date usage:
collection that has invalid documents you will need to write a custom function to override that
behaviour or rebuild the collection as needed with web services.
-5. :code:`write_distributed_data` is a parallel equivalent of :code:`save_data` and :code:`save_ensemble_data`.
- MsPASS supports two parallel frameworks called SPARK and DASK.
- Both abstract the concept of the parallel data set in
- a container they call an RDD and Bag respectively. Both are best thought
- of as a handle to the entire data set that can be passed between
- processing functions. The function can be thought of as writing the entire data set
- from a parallel container to storage. The input is SPARK RDD or DASK BAG of objects (TimeSeries or Seismogram), and the
- output is a dataframe of metadata. From the container, it will firstly write to files distributedly
- using SPARK or DASK, and then write to the database sequentially. The two parts are done in two
- functions: :code:`write_files`, and :code:`write_to_db`. It returns a dataframe of metadata for
- each object in the original container. The return value can be used as input for :code:`read_distributed_data`
- function.
-
- Note that the objects should be written to different files, otherwise it may overwrite each other.
- dir and dfile should be stored in each object.
-
- :code:`write_files` is the writer for writing the object to storage. Input is an object (TimeSeries/Seismogram),
- output is the metadata of the original object with some more parameters added. This is
- the reverse of :code:`read_files`.
-
- :code:`write_to_db` is to save a list of atomic data objects (TimeSeries or Seismogram)
- to be managed with MongoDB. It will write to the doc and to the database for every metadata of the
- target mspass object. Then return a dataframe of the metadata for target mspass objects.
- The function is the reverse of :code:`read_to_dataframe`.
-
Read
~~~~~~~
@@ -233,7 +204,7 @@ and Seismogram. There are also convenience functions for reading ensembles.
As with the save operators we discuss here the key methods, but refer the
reader to the sphinx documentation for full usage.
-1. :py:meth:`read_data ` is the core method for reading atomic data. The method has
+1. :code:`read_data` is the core method for reading atomic data. The method has
one required argument. That argument is an ObjectID for the document used
to define the read operation OR a MongoDB document (python dict) that
contains the ObjectID. The ObjectID is guaranteed to provide a
@@ -244,10 +215,10 @@ reader to the sphinx documentation for full usage.
.. code-block:: python
- query = {...Some MongoDB query dict entry...}
- cursor = db.wf_TimeSeries.find(query) # Changed to wf_Seismogram for 3D data
- for doc in cursor:
- d = db.read_data(doc) # Add option collection='wf_Seismogram' for 3C reads
+ query={...Some MongoDB query dict entry...}
+ cursor=db.wf_TimeSeries.find(query) # Changed to wf_Seismogram for 3D data
+ for doc in cursor:
+ d=db.read_data(doc) # Add option collection='wf_Seismogram' for 3C reads
By default :code:`read_data` will use the waveform collection defined
in the schema defined for the handle. The default for the standard
@@ -312,7 +283,7 @@ reader to the sphinx documentation for full usage.
3. The "pedantic" mode is mainly of use for data export where a
type mismatch could produce invalid data required by another package.
-2. A closely related function to :code:`read_data` is :py:meth:`read_ensemble_data `. Like
+2. A closely related function to :code:`read_data` is :code:`read_ensemble_data`. Like
:code:`save_ensemble_data` it is mostly a loop to assemble an ensemble of
atomic data using a sequence of calls to :code:`read_data`. The sequence of
what to read is defined by arg 0. That arg must be one of two things:
@@ -337,17 +308,10 @@ reader to the sphinx documentation for full usage.
cursor = db.wf_TimeSeries.find(query)
ens = db.read_ensemble_data(cursoe)
- :py:meth:`read_ensemble_data_group `
- is an optimized version of :code:`save_ensemble_data`. It groups the files firstly to avoid
- duplicate open for the same file. Open and close the file only when the dir or dfile change.
- When multiple objects store in the same file, this function will group the files first
- and collect their foffs in that file. Then open the file once, and sequentially read the data
- according to the foffs. This function only supports reading from binary files.
-
3. A workflow that needs to read and process a large data sets in
a parallel environment should use
the parallel equivalent of :code:`read_data` and :code:`read_ensemble_data` called
- :py:meth:`read_distributed_data `. MsPASS supports two parallel frameworks called
+ :code:`read_distributed_data`. MsPASS supports two parallel frameworks called
SPARK and DASK. Both abstract the concept of the parallel data set in
a container they call an RDD and Bag respectively. Both are best thought
of as a handle to the entire data set that can be passed between
@@ -385,25 +349,6 @@ reader to the sphinx documentation for full usage.
If you are using DASK instead of SPARK you would add the optional
argument :code:`format='dask'`.
- :code:`read_distributed_data` divide the process of reading into two parts:
- reading from database and reading from file, where reading from database is
- done in sequence, and reading from file is done with DASK or SPARK. The two parts
- are done in two functions: :code:`read_to_dataframe`, and :code:`read_files`.
- The division is to avoid using database in DASK or SPARK to improve efficiency.
-
- The input can also be a dataframe, which stores the information of the metadata.
- It will read from file/gridfs according to the metadata and construct the objects.
-
- :code:`read_to_dataframe` firstly construct a list of objects using cursor.
- Then for each object, constrcut the metadata and add to the list. Finally it will
- convert the list to a dataframe.
-
- :code:`read_files` is the reader for constructing the object from storage. Firstly construct the object,
- either TimeSeries or Seismogram, then read the stored data from a file or in gridfs and
- loads it into the mspasspy object. It will also load history in metadata. If the object is
- marked dead, it will not read and return an empty object with history. The logic of reading
- is same as :code:`Database.read_data`.
-
Update
~~~~~~
@@ -475,7 +420,12 @@ In MsPASS we adopt these rules to keep delete operations under control.
We trust rules 1 and 2 require no further comment. Rule 3, however,
needs some clarification to understand how we handle deletes.
A good starting point is to look at the signature of the simple core delete
-method of the Database class: :py:meth:`delete_data `
+method of the Database class:
+
+.. code-block:: python
+
+ def delete_data(self,id,remove_unreferenced_files=False,
+       clear_history=True,clear_elog=False):
As with the read methods id is the ObjectID of the wf collection document
that references the data to be deleted.
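+
+A minimal sketch of a typical use is below; the query value is hypothetical:
+
+.. code-block:: python
+
+   # delete every wf_TimeSeries datum matching a query and remove any
+   # files left unreferenced after the documents are deleted
+   query = {"data_tag": "scratch_run"}
+   for doc in db.wf_TimeSeries.find(query):
+       db.delete_data(doc["_id"], remove_unreferenced_files=True)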
@@ -681,12 +631,12 @@ list of elog messages:
# This needs to be checked for correctness - done while off the grid
query = {'$def' : 'tombstone'}
- cursor = db.elog.find(query)
+ cursor=db.elog.find(query)
for doc in cursor:
- wfmd = doc['tombstone']
+ wfmd=doc['tombstone']
print('Error log contents for this Seismogram marked dead:',
- wfmd['net'], wfmd['sta'], UTCDateTime(wfmd['startime']))
- err = doc['logdata']
+          wfmd['net'],wfmd['sta'],UTCDateTime(wfmd['startime']))
+ err=doc['logdata']
for e in err:
print(e.message)
diff --git a/docs/source/user_manual/FAQ.rst b/docs/source/user_manual/FAQ.rst
new file mode 100644
index 000000000..cd65d9e0e
--- /dev/null
+++ b/docs/source/user_manual/FAQ.rst
@@ -0,0 +1,15 @@
+.. _FAQ:
+
+Frequently Asked Questions (FAQ)
+=====================================
+
+This page collects links to pages on pragmatic
+topics that can, we hope, make MsPASS more approachable. The topic of each
+page is used as the hyperlink text. Click on topics of interest to learn
+more.
+
+:ref:`How do I develop a new workflow from scratch? `
+
+:ref:`How does MsPASS handle continuous data? `
+
+:ref:`What database schema should I use? `
diff --git a/docs/source/user_manual/algorithms.rst b/docs/source/user_manual/algorithms.rst
new file mode 100644
index 000000000..f708e2dff
--- /dev/null
+++ b/docs/source/user_manual/algorithms.rst
@@ -0,0 +1,329 @@
+.. _algorithms:
+
+Algorithms
+===============================
+Overview
+-----------
+This page is a quick reference to algorithms available in MsPASS.
+The page is organized as a set of tables with hyperlinks to the docstrings
+for each algorithm function.
+
+Processing Functions
+--------------------------
+These algorithms have a function form to allow their use in parallel map
+operations. They are functions that take one or more seismic data
+objects as inputs and return an output that
+may or may not be the same type as the input. Some are wrappers
+for obspy functions, some are written in C++ with python wrappers, and
+some are pure C++ functions with python bindings. Note the name field in the
+table below is a hyperlink to the docstring for that function.
+A short usage sketch follows the table.
+
+.. list-table:: Processing Functions
+ :widths: 10 70 10 10
+ :header-rows: 1
+
+ * - Name
+ - Synopsis
+ - Inputs
+ - Outputs
+ * - :py:func:`correlate`
+ - Cross-correlate pair of signals
+ - TimeSeries
+ - TimeSeries
+ * - :py:func:`correlate_stream_template`
+ - obspy correlation of many with a template
+    - TimeSeriesEnsemble(arg0), TimeSeries(arg1)
+ - TimeSeriesEnsemble
+ * - :py:func:`correlate_template`
+ - obspy single channel correlation against a template
+ - TimeSeries
+ - TimeSeries
+ * - :py:func:`detrend`
+ - obspy detrend algorithm
+ - TimeSeries
+ - TimeSeries
+ * - :py:func:`filter`
+ - obspy time-invariant filter function
+ - All
+ - Same as input
+ * - :py:func:`interpolate`
+ - obspy function to resample using interpolation
+ - TimeSeries
+ - TimeSeries
+ * - :py:func:`WindowData`
+ - Cut a smaller window from a larger segment
+ - All
+ - Same as input
+ * - :py:func:`merge`
+ - Combine a set of segments into one
+ - TimeSeriesEnsemble (member vector)
+ - TimeSeries
+ * - :py:func:`scale`
+ - scale data to specified level with a specified amplitude metric
+ - Any
+ - Same as input
+ * - :py:func:`snr`
+ - Compute one of several possible time-domain signal-to-noise ratio metrics
+ - TimeSeries Seismogram
+ - float
+ * - :py:func:`broadband_snr_QC`
+ - Computes a series of snr-based amplitude metrics and posts them to output Metadata
+ - TimeSeries or Seismogram
+ - Same as input (Metadata altered)
+ * - :py:func:`arrival_snr`
+    - Computes a series of snr-based amplitude metrics relative to an externally determined arrival time
+ - TimeSeries
+ - TimeSeries (Metadata altered)
+ * - :py:func:`FD_snr_estimator`
+ - Computes a set of broadband snr estimates in the frequency domain
+ - TimeSeries
+ - [dict,ErrorLogger]
+ * - :py:func:`ArrivalTimeReference`
+ - Shifts time 0 to arrival time defined by a Metadata key
+ - Seismogram or SeismogramEnsemble
+ - Same as input
+ * - :py:func:`agc`
+ - Applies an automatic gain control operator with a specified duration
+ - Seismogram
+ - TimeSeries of gains, Seismogram input altered in place
+ * - :py:func:`repair_overlaps`
+ - Repair overlapping data segments
+ - list of TimeSeries
+ - TimeSeries
+ * - :py:func:`seed_ensemble_sort`
+ - Sorts an ensemble into net:sta:chan:loc order
+ - TimeSeriesEnsemble
+ - TimeSeriesEnsemble
+ * - :py:func:`splice_segments`
+ - Splice a list of TimeSeries objects into a continuous single TimeSeries
+ - list of TimeSeries
+ - TimeSeries
+ * - :py:func:`bundle`
+ - Bundle a TimeSeriesEnsemble into a SeismogramEnsemble
+ - TimeSeriesEnsemble
+ - SeismogramEnsemble
+ * - :py:func:`BundleSEEDGroup`
+ - Bundle a list of TimeSeries into a Seismogram object
+ - list of TimeSeries
+ - Seismogram
+ * - :py:func:`ExtractComponent`
+ - Extract one component from a Seismogram or SeismogramEnsemble
+ - Seismogram or SeismogramEnsemble
+ - TimeSeries or TimeSeriesEnsemble
+ * - :py:func:`ator`
+ - Change from UTC to relative time standard
+ - any
+ - same as input
+ * - :py:func:`rtoa`
+ - Change from relative to UTC time standard
+ - any
+ - same as input
+ * - :py:func:`rotate`
+ - Generic coordinate rotation
+ - Seismogram or SeismogramEnsemble
+ - same as input
+ * - :py:func:`rotate_to_standard`
+ - Restore data to cardinal directions
+ - Seismogram or SeismogramEnsemble
+ - same as input
+ * - :py:func:`transform`
+ - Apply a general transformation matrix to 3C data
+ - Seismogram or SeismogramEnsemble
+ - same as input
+ * - :py:func:`linear_taper`
+ - Apply a one-sided, linear taper
+ - any
+ - same as input
+ * - :py:func:`cosine_taper`
+ - Apply a one-sided, cosine taper
+ - any
+ - same as input
+ * - :py:func:`vector_taper`
+ - Apply a taper defined by a vector of samples
+ - any
+ - same as input
+
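+The sketch below illustrates the usage pattern these functions share:
+each takes a seismic data object as the first argument and returns the
+processed result, so the same call works in a serial loop or as the
+function passed to a parallel map operator. The ``db`` handle, the query,
+and the window times are assumed to be defined elsewhere, and the import
+paths should be verified against the docstrings linked in the table.
+
+.. code-block:: python
+
+    from mspasspy.algorithms.signals import detrend
+    from mspasspy.algorithms.window import WindowData, scale
+
+    # serial form:  read -> process -> save
+    for doc in db.wf_TimeSeries.find(query):
+        d = db.read_data(doc)
+        d = detrend(d, type="linear")
+        d = WindowData(d, tstart, tend)
+        d = scale(d)
+        db.save_data(d)
+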
+Nonstandard Processing Functions
+-----------------------------------
+The next table is similar to above, but the inputs or outputs are not seismic
+data objects. They are used internally by some functions and can have
+utility for writing custom functions that use them inside another algorithm.
+
+.. list-table:: Nonstandard Processing Functions
+ :widths: 10 70 10 10
+ :header-rows: 1
+
+ * - Name
+ - Synopsis
+ - Inputs
+ - Outputs
+  * - :py:func:`BandwidthStatistics`
+ - Compute statistical summary of snr in a passband returned by EstimateBandwidth
+ - BandwidthData object
+ - Metadata
+ * - :py:func:`EstimateBandwidth`
+    - Estimate signal bandwidth from power spectra of signal and noise windows
+ - PowerSpectra of signal and noise windows
+    - BandwidthData - input for BandwidthStatistics
+ * - :py:func:`MADAmplitude`
+ - Calculate amplitude with the MAD metric
+ - TimeSeries or Seismogram
+ - double
+ * - :py:func:`PeakAmplitude`
+ - Calculate amplitude with the peak absolute value
+ - TimeSeries or Seismogram
+ - double
+ * - :py:func:`PercAmplitude`
+ - Calculate amplitude at a specified percentage level
+ - TimeSeries or Seismogram
+ - double
+ * - :py:func:`RMSAmplitude`
+ - Calculate amplitude with the RMS metric
+ - TimeSeries or Seismogram
+ - double
+
+Processing Objects
+-------------------------------------
+The classes in this table are "processing objects", meaning they implement
+processing with a C++ or python class that has a method that runs
+an algorithm on seismic data. A short usage sketch follows the table.
+
+.. list-table:: Processing Objects
+ :widths: 10 70 10 10 10
+ :header-rows: 1
+
+ * - Class Name
+ - Synopsis
+ - Processing method
+ - Inputs
+ - Outputs
+ * - :py:class:`Add`
+ - Add a constant to a metadata value
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Add2`
+ - Add two metadata values
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`ChangeKey`
+ - Change key associated with a value
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Divide`
+ - Divide a metadata value by a constant
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Divide2`
+ - Divide one metadata value by another
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`IntegerDivide`
+ - Apply integer divide operator to a metadata value
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`IntegerDivide2`
+ - Apply integer divide operator to two metadata values
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Mod`
+ - Change a key to value mod constant
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Mod2`
+ - Set a field as mod division of a pair of values
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Multiply`
+ - Change a key to value times constant
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Multiply2`
+    - Set a field as the product of a pair of values
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`SetValue`
+ - Set a field to a constant
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Subtract`
+ - Change a key to value minus a constant
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`Subtract2`
+ - Set a field as difference of a pair of values
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`erase_metadata`
+ - Clear all defined values of a specified key
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataOperatorChain`
+ - Apply a chain of metadata calculators
+ - apply
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`FiringSquad`
+ - Apply a series of kill operators
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataDefined`
+ - Kill if a key-value pair is defined
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataUndefined`
+ - Kill if a key-value pair is not defined
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+  * - :py:class:`MetadataEQ`
+ - Kill a datum if a value is equal to constant
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+  * - :py:class:`MetadataNE`
+ - Kill a datum if a value is not equal to constant
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataGE`
+ - Kill if a value is greater than or equal to a constant
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataGT`
+ - Kill if a value is greater than a constant
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataLE`
+ - Kill if a value is less than or equal to a constant
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataLT`
+ - Kill if a value is less than a constant
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
+ * - :py:class:`MetadataInterval`
+ - Kill for value relative to an interval
+ - kill_if_true
+ - Any seismic data object
+ - Edited version of input
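+
+The sketch below shows the common usage pattern: construct the object once,
+then call its processing method (``apply`` or ``kill_if_true``) on each datum.
+The import path and the constructor arguments shown are assumptions for
+illustration only; check the data editing section of this manual and the
+class docstrings for the exact signatures.
+
+.. code-block:: python
+
+    # db and query are assumed to be defined earlier
+    from mspasspy.algorithms.edit import SetValue, MetadataUndefined
+
+    # constructor argument order is an assumption - check the docstrings
+    setter = SetValue("units", "nm/s")       # hypothetical key and value
+    killer = MetadataUndefined("site_id")    # kill any datum missing site_id
+
+    for doc in db.wf_TimeSeries.find(query):
+        d = db.read_data(doc)
+        d = setter.apply(d)
+        d = killer.kill_if_true(d)
+        db.save_data(d)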
diff --git a/docs/source/user_manual/continuous_data.rst b/docs/source/user_manual/continuous_data.rst
new file mode 100644
index 000000000..358b057ce
--- /dev/null
+++ b/docs/source/user_manual/continuous_data.rst
@@ -0,0 +1,441 @@
+.. _continuous_data:
+
+Continuous Data Handling with MsPASS
+==========================================
+Concepts
+~~~~~~~~~~~
+Seismologists universally use the term "continuous data" to mean
+long time series records with no significant gaps. The term is
+actually potentially confusing because the data are not really continuous
+at all but sampled at a fixed sample interval, nor are they of infinite duration
+which might be implied if you were discussing Fourier transforms on functions.
+In practice, "continuous data" in seismology means a series of
+windowed segments that can be merged to form a longer window of data.
+The maximum length possible is the duration of data recording.
+The process of gluing (merging) multiple segments is, more or less,
+the inverse of cutting a shorter window out of a longer window of data.
+Gluing/merging data algorithms have to deal with some different issues
+than windowing.
+
+Some important issues about common practice and the reality of real
+data are:
+
+#. There are two common choices for how the data are blocked:
+ (1) day volumes, and (2) raw digitizer files of irregular length
+ created when the digitizer does a memory dump (All current
+ generation digitizers write to an internal memory buffer that is
+ dumped when the memory use exceeds a high water mark.) In either case
+   there is some explicit or implicit convention (e.g. a file naming convention)
+ that provides a hint of the order of the segments.
+
+#. A large fraction of data contain various types of "data gaps".
+ Gaps occur for a long list of reasons that are mostly unimportant
+ when analyzing such data. What is important is that data gaps
+ span a range of time scales from a single sample to years.
+
+#. A less-common problem is a data overlap. An overlap occurs when
+ two segments you need to merge have conflicting time stamps.
+ To make this clear it is helpful to review two MsPASS concepts in
+ the TimeSeries and Seismogram data objects. Let *d1* and *d2* be
+ two :code:`TimeSeries` objects that are successive segments we
+ expect to merge with *d2* being the segment following *d1* in time.
+ In MsPASS we use the attribute *t0* (alternatively the method
+ *starttime*) for the time of sample 0. We also use the method
+ *endtime* to get the computed time of the last data sample. An
+ overlap is present between these two segments when
+ *d2.t0 < d1.endtime()*.
+ We know of three ways overlaps can
+ occur: (1) timing problems with the instrument that recorded the
+ data, (2) hardware or software issues in recording instrument that
+ cause packets to be duplicated in memory before they are dumped, and
+ (3) blunders in data management where duplicate files are indexed
+ and defined in a database wf collection (wfdisc table in Antelope).
+
+MsPASS has some capabilities for merging data within the realities of
+real data noted above. These are discussed in the section below.
+MsPASS does not, however, substitute for the nitty-gritty work network
+and experiment operators have to do in cleaning field data for an archive.
+We consider that a problem already solved by Earthscope,
+the USGS, and global network operators in systems they
+use for creating the modern data archive of the FDSN. Custom
+experimental data may need to utilize Earthscope tools to fix problems not
+covered by MsPASS.
+
+Gap Processing
+~~~~~~~~~~~~~~~~~~
+Internally MsPASS handles data gaps with a subclass of the
+:code:`TimeSeries` called :code:`TimeSeriesWGaps`. That extension of
+:code:`TimeSeries` is written in C++ and is documented
+`here `__.
+Like :code:`TimeSeries` this class has python bindings created
+with pybind11. All the methods described in the C++ documentation
+page have python bindings. There are methods for defining gaps,
+zeroing data in defined gaps, and deleting gaps.
+See the doxygen pages linked above for details.
+The python functions that currently deal with data gaps have a second
+strategy for handling this problem best described in the context
+of those functions.
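+
+A minimal sketch of direct use of this class is below. The constructor and
+method names follow the C++ documentation linked above but should be
+verified against the doxygen pages and the pybind11 docstrings.
+
+.. code-block:: python
+
+    from mspasspy.ccore.seismic import TimeSeriesWGaps
+    from mspasspy.ccore.algorithms.basic import TimeWindow
+
+    # ts, gap_start, and gap_end are assumed to be defined earlier
+    dwg = TimeSeriesWGaps(ts)   # copy an existing TimeSeries to the gap-aware subclass
+    dwg.add_gap(TimeWindow(gap_start, gap_end))   # define a gap
+    if dwg.has_gap():
+        dwg.zero_gaps()   # zero all samples inside defined gaps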
+
+
+Merging Data Segments
+~~~~~~~~~~~~~~~~~~~~~~~~
+There are currently two different methods in MsPASS to handle merging
+continuous data segments: (1) a special, implicit option of the
+:py:meth:`mspasspy.db.database.Database.read_data` method of the
+:py:class:`mspasspy.db.database.Database` class, and (2) the
+processing function :py:func:`mspasspy.algorithms.window.merge`.
+In addition, there is a special reader function called
+:py:func:`mspasspy.db.ensembles.TimeIntervalReader` that can be used
+to read fixed time windows of data. That function uses :code:`merge`
+to do gap and overlap repair.
+
+read_data merge algorithm
+---------------------------
+This approach is only relevant if you have raw miniseed files you
+plan to read to initiate your processing sequence. The miniseed format
+uses a packet structure with each packet normally defining a single
+channel (Note the standard allows multiplexed data but none of us have
+ever encountered such data). The order of the packets is used by
+all readers we know of to determine whether a sequence of packets forms a single
+waveform. If the station codes ("net", "sta", "chan", and "loc" attributes
+in all MsPASS schemas) change in a sequence of packets readers
+universally assume that is the end of a given segment. How readers handle
+a second issue is, however, variable. Each miniseed packet has a time
+tag that is comparable to the `t0` attribute of a :class:`TimeSeries` object
+and an end time field equivalent to the output of the :class:`TimeSeries`
+endtime method. If the `t0` value of a packet is greater than some
+fractional tolerance of 1 sample more than the endtime of the previous
+packet, a reader will invoke a gap handler. A reader's gap handler
+commonly has options for what to do with different kinds of "gaps", but
+for this section our definition is defined by the way obspy
+handles this problem with their :class:`Stream` merge method described
+`here `__.
+That particular algorithm is invoked when reading miniseed data
+if and only if the index for a block of data, created by running the mspass
+function :py:meth:`mspasspy.db.database.Database.index_mseed_file`, was
+built with the optional argument `segment_time_tears` set False.
+(Note the default is True.) If you need to use this approach, you will
+need to also take care in defining the values of the following arguments
+that are passed to obspy's merge function for gap handling:
+`merge_method`, `merge_fill_value`, and `merge_interpolation_samples`.
+Those three arguments are passed directly to obspy merge arguments with
+a variant of the same names: `method`, `fill_value`, and `interpolation_samples`.
+
+Note that users who have previously used obspy for this functionality
+may want to consider an alternative: write a custom function that calls
+obspy's merge directly rather than relying on its implicit use in read_data.
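+
+A minimal sketch of this approach is below; the file name is hypothetical
+and the read loop is schematic:
+
+.. code-block:: python
+
+    # build the wf_miniseed index without segmenting at time tears so the
+    # obspy merge algorithm is applied when the data are read
+    db.index_mseed_file("dayfile.msd", dir="/path/to/data",
+                        segment_time_tears=False)
+    for doc in db.wf_miniseed.find({"sta": "AAK"}):
+        # the merge_method, merge_fill_value, and merge_interpolation_samples
+        # arguments noted above control gap handling on read; see the
+        # docstrings for the defaults
+        d = db.read_data(doc, collection="wf_miniseed")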
+
+MsPASS merge function
+-----------------------
+MsPASS has a native version of a function with a capability similar to
+the obspy merge function noted above. The MsPASS function adds some additional
+features and, although not verified by formal testing,
+is likely much faster than the obspy version due to fundamental differences
+in the implementation.
+The docstring for :py:func:`mspasspy.algorithms.window.merge` describes more
+details but some key features of this function are:
+
+- Like obspy's function of the same name its purpose is to glue/merge
+ a set of waveform components into a single, continuous time series.
+ A key difference is that the obspy function requires an obspy
+ Stream object as input while the MsPASS function uses the "member"
+ container of a :class:`TimeSeriesEnsemble` object as input.
+- It provides for an optional windowing of the merged result. That approach
+ is useful, for example, for carving events out from a local archive of
+ continuous waveform data in a single step. This feature is useful for
+ reducing the memory footprint of a parallel job.
+- Gaps are flagged and posted with a Metadata approach. Obspy has a set of
+ options for gap handling that are inseparable from the function.
+ Any detected gaps in the
+ MsPASS merge function are posted to the Metadata component of the
+ :class:`TimeSeries` it returns accessible with the key "gaps".
+ The content of the "gaps" attribute is a list of one or more
+  python dictionaries with the keywords "starttime" and "endtime"
+  defining the epoch time range of each gap in the returned datum
+  (see the sketch after this list).
+  The function also has an optional "zero_gaps" argument. When set True
+  (default is False) any gaps are explicitly set to zeros. By default
+  the values should be treated as undefined, although in practice they
+  are likely zeros.
+- Overlap handling is controlled by another boolean parameter
+ with the name "fix_overlaps". When set True the function will
+ check for overlapping data and attempt to repair overlaps only if
+ the sample numerical data match within machine tolerance.
+ The default behavior is to mark the return dead if any overlap is
+ detected. Obspy uses a less dogmatic algorithm driven by an optional
+ function argument called "interpolation_samples". As noted above it has
+ been our experience that, in general, overlapping data always indicate
+ a data quality problem that invalidates the data when the samples
+ do not match. If you need
+ the obspy functionality use the
+ :py:func:`mspasspy.util.converter.TimeSeriesEnsemble2Stream` and the
+  inverse :py:func:`mspasspy.util.converter.Stream2TimeSeriesEnsemble`
+  to create the obspy input and then restore the returned data to
+  the MsPASS internal data structures.
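+
+The sketch below shows how the gap information described above can be
+inspected after a merge. Here ``ens`` is assumed to be a time-sorted
+:class:`TimeSeriesEnsemble` of segments for a single channel:
+
+.. code-block:: python
+
+    from mspasspy.algorithms.window import merge
+
+    d = merge(ens.member, fix_overlaps=True, zero_gaps=True)
+    if d.live and d.is_defined("gaps"):
+        for g in d["gaps"]:
+            print("gap from ", g["starttime"], " to ", g["endtime"])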
+
+TimeIntervalReader
+-----------------------
+A second MsPASS tool for working with continuous data is a function
+with the descriptive name :py:func:`mspasspy.db.ensembles.TimeIntervalReader`.
+It is designed to do the high-level task of cutting a fixed time
+interval of data from one or more channels of a continuous data archive.
+This function is built on top of the lower-level
+:py:func:`mspasspy.algorithms.window.merge` but is best thought of as
+an alternative reader to create ensembles cut from a continuous data archive.
+For that reason the required arguments are a database handle and the
+time interval of data to be extracted from the archive. Gap and overlap
+handling is handled by :code:`merge`.
+
+Examples
+------------
+*Example 1: Create a single waveform in a defined time window
+from continuous data archive.*
+This script will create a longer :class:`TimeSeries` object from a set of day files
+for the BHZ channel of GSN station AAK. Ranges are constant for a simple
+illustration:
+
+.. code-block:: python
+
+    # code above would define database handle db
+    from mspasspy.algorithms.window import merge
+    from obspy import UTCDateTime
+    from bson import json_util
+    net = "II"
+    sta = "AAK"
+    chan = "BHZ"
+    loc = "00"  # STS-1 sensor at AAK
+    # example time interval - change to the range you actually need
+    output_stime = UTCDateTime("2012-01-01T00:00:00.0")
+    output_etime = UTCDateTime("2012-01-03T00:00:00.0")
+    # this is a MongoDB query to retrieve all segments with data in the
+    # desired time range of output_stime to output_etime.  The stored
+    # times are epoch times so we use the timestamp attribute.
+    query = {
+        "$and": [
+            {"sta": {"$eq": sta}},
+            {"net": {"$eq": net}},
+            {"chan": {"$eq": chan}},
+            {"loc": {"$eq": loc}},
+            {"starttime": {"$lte": output_etime.timestamp}},
+            {"endtime": {"$gte": output_stime.timestamp}},
+        ]
+    }
+    cursor = db.wf_miniseed.find(query).sort("starttime", 1)
+    tsens = db.read_data(cursor, collection="wf_miniseed")
+    if tsens.live:
+        merged_data = merge(tsens.member, output_stime.timestamp, output_etime.timestamp)
+        if merged_data.live:
+            print("Output is ok and has ", merged_data.npts, " data samples")
+        else:
+            print("Data have problems - gaps or overlaps caused the datum to be killed")
+    else:
+        print("The following query yielded no data:")
+        print(json_util.dumps(query, indent=2))
+
+*Example 2: parallel read from continuous archive* This example is a workflow
+to build a dataset of waveforms
+segmented around a set of previously measured P wave arrival times from
+an archive of continuous data. The example is not complete as it
+requires implementing a custom function that below is given the symbolic
+name "arrival2list". From that list we create a dask bag and use it
+to drive a parallel read with `read_distributed_data` that passes a
+series of ensembles to a function defined at the top that runs `merge`.
+The example is made up, but is a prototype for building an event-based
+data set of all waveforms with P wave times picked by the
+Earthscope Array Network Facility (ANF) available online
+from Earthscope.
+
+.. code-block:: python
+
+    from mspasspy.db.client import DBClient
+    # parallel reader/writer and processing functions used below;
+    # verify the import paths against the docstrings for your MsPASS version
+    from mspasspy.io.distributed import read_distributed_data, write_distributed_data
+    from mspasspy.algorithms.window import merge, WindowData
+    from mspasspy.ccore.seismic import TimeSeriesEnsemble
+    import dask.bag as dbg
+ dbclient=DBClient()
+ # we need two database handles. One for the continuous data (dbc)
+ # and one to save the segments (dbo).
+ dbc = dbclient.get_database("TA2010") # TA continuous data from 2010
+ dbo = dbclient.get_database("Pdata2010") # arrivals from ANF picks
+
+    def query_generator(doc, stwin, etwin):
+ """
+ Generates a MongoDB query to run against wf_miniseed for waveform
+ segments containing any of the time interval time+stwin<=t<=time+etwin.
+ Returns a python dict that is used by read_distributed_data to
+ generate a dask bag of ensembles. Note this is an illustrative example
+ and makes no sanity checks on inputs for simplicity.
+
+ The input is the same python dict later loaded with the data using
+ the container_to_merge argument of read_distributed_data.
+ """
+ net = doc["net"]
+ sta = doc["sta"]
+ time = doc["arrival_time"]
+        stime = time + stwin
+        etime = time + etwin
+        query = {
+            "$and": [
+                { "sta" : {"$eq" : sta}},
+                { "net" : {"$eq" : net}},
+                { "starttime" : {"$lte" : etime}},
+                { "endtime" : {"$gte" : stime}}
+            ]
+        }
+ return query
+
+ def make_segments(ensemble,stwin,etwin):
+ """
+ Function used in parallel map operator to create the main output of
+ this example workflow. The input is assumed to be a time-sorted ensemble
+ with all data overlapping with the time window defined by
+ stwin <= t-arrival_time <= etwin
+        where t is the time of a data sample, i.e. stwin and etwin are times relative
+ to the arrival time. The input ensemble is assumed to normally
+ contain multiple channels. The algorithm works through all channels it
+ finds. For each group if the number of segments is 1 it simply uses
+ the WindowData function. If multiple segments are present it calls the
+ MsPASS merge function with fix_overlaps set True and with the time
+ window requested. That will return a single waveform segment
+ when possible. If the merge fails that segment will be posted but
+ marked dead.
+
+ :param ensemble: input ensemble for a single station normally containing
+ multiple channels.
+ :param stwin: output window relative start time
+        :param etwin: output window relative end time
+ """
+ # handle dead (empty) ensembles cleanly returning a default constructed
+ # datum dead by definition
+ if ensemble.dead():
+ return TimeSeriesEnsemble()
+ ensout=TimeSeriesEnsemble()
+ net = ensemble["net"]
+
+ sta = ensemble["sta"]
+ time = ensemble["arrival_time"]
+ # the ensemble will usually contain multiple channels. We have to
+ # handle each independently
+ chanset = set()
+ for d in ensemble.member:
+ chan = d["chan"]
+            if d.is_defined("loc"):
+                loc = d["loc"]
+            else:
+                loc = None
+            chanset.add((chan, loc))
+ for chan,loc in chanset:
+ enstmp=TimeSeriesEnsemble()
+ for d in ensemble.member:
+ if d["chan"] == chan:
+                    if loc:
+                        if d.is_defined("loc") and d["loc"] == loc:
+                            enstmp.member.append(d)
+                    else:
+                        enstmp.member.append(d)
+            # enstmp now has only members matching chan and loc - now we can run merge
+ # if needed.
+ if len(enstmp.member)>1:
+ d = merge(enstmp.member,time+stwin,time+etwin,fix_overlaps=True)
+ ensout.member.append(d)
+ else:
+ # above logic means this only happens if there is only one segment
+ # in that case we can just use WindowData
+                d = WindowData(enstmp.member[0], time+stwin, time+etwin)
+ ensout.member.append(d)
+ return ensout
+
+ # This undefined function would read the arrival time data
+ # stored in some external form and return a list of python dict
+ # with the keys 'net', 'sta', and 'arrival_time' defined.
+ arrival_list = arrival2list(args)
+    sort_clause = [("chan", 1), ("starttime", 1)]
+ # This creates a bag from arrival_list that we can pass to the
+ # reader for loading with the container_to_merge argument
+ arrival_bag = dbg.from_sequence(arrival_list)
+ window_start_time = -100.0 # time of window start relative to arrival
+ window_end_time = 300.0 # time of window end relative to arrival
+ mybag = dbg.from_sequence(arrival_list)
+ mybag = mybag.map(query_generator,window_start_time,window_end_time)
+ qlist=mybag.compute()
+ # qlist now is a list of python dict defining queries. These are
+ # passed to the parallel reader to create a bag of ensemble objects.
+ mybag = read_distributed_data(qlist,
+ dbc,
+ collection="wf_miniseed",
+ sort_clause=sort_clause,
+ container_to_merge=arrival_bag,
+ )
+    mybag = mybag.map(make_segments, window_start_time, window_end_time)
+ # note the output of this function, with default here, is a list of
+ # objectids of the saved waveforms
+ out_ids = write_distributed_data(mybag,dbo,collection="wf_TimeSeries")
+
+The above example is complicated a bit as it is an example of a parallel
+job. The parallel IO features of this example are important as this
+example could run very slowly as a serial job driven by the millions of picks
+that exist for the problem it simulates - an Earthscope TA
+continuous data archive being accessed
+to assemble a data set of several million waveform segments built from the
+ANF catalog. It may be helpful to expand on the main steps of this algorithm:
+
+1. The first step assumes the existence of an undefined function with
+ the name `arrival2list`. For the prototype example given it could
+ be driven by the CSS3.0 tables created by the Earthscope
+ Array Network Facility (ANF). That data can currently be found
+ `here `__. The actual implementation
+ would need to select what picks to use and pull out a restricted set of
+   attributes from the CSS3.0 tables, creating a large list of python
+   dictionaries each containing 'net', 'sta', and 'arrival_time' values.
+ Note that step can be done in a couple of lines with the pandas
+ module but is omitted as that is not a unique solution. (e.g. one
+ could also accomplish the same thing with a MongoDB database 'arrival'
+ collection with suitable content.)
+2. The `from_sequence` method of dask bag creates a bag from a list.
+ In this case it becomes a bag of python dict containers.
+ The map call that follows
+ using the custom function defined earlier in the code box creates
+ a bag of python dictionaries that define queries to MongoDB. What
+ the queries are designed to do is described in the docstring for that
+ function.
+3. We call the compute method to actually create the list of queries
+ that will drive the reader. That approach assumes the size of that
+ container is not overwhelming, which is likely a good assumption since
+ the individual dict containers are of the order of 100 bytes each.
+4. The call to `read_distributed_data` defines the main parallel workflow.
+   In this mode it reads a (large) series of ensembles driven by the
+   input query list. This usage creates an implicit parallel reader.
+ Each instance creates a `TimeSeriesEnsemble` with all channels
+ for a particular station that have waveforms that intersect with the
+ desired output time segment around the specified arrival time.
+ An important feature exploited in the reader here is that implemented
+ with the argument `container_to_merge`. The docstrings give details
+ but the main functionality it provides is a way to do a one-to-one
+   mapping of a bag of metadata onto the ensembles the reader creates. That feature
+ adds a major efficiency for large data sets compared to the alternative of
+ millions of MongoDB queries that one might consider to solve that problem.
+ This example also requires the `sort_clause` argument to assure the
+ queries return data in an order consistent with the requirements of the
+ `make_segments` function that does all main work here.
+5. The map call following `read_distributed_data` calls the function defined
+   earlier that handles the slice and dice operation. How that is done is
+   best gleaned from the docstring comments.
+6. This example calls the parallel writer, `write_distributed_data`, to
+ save the results.
+
+*Example 3: Application of TimeIntervalReader.*
+This example assumes we have a list of shot times from something like an
+onshore-offshore experiment using airguns or a set of land shots with
+known shot times. The script is serial, but is readily converted to
+a parallel form using standard concepts described elsewhere in this user's manual.
+
+.. code-block:: python
+
+    from mspasspy.db.client import DBClient
+    from mspasspy.db.ensembles import TimeIntervalReader
+    dbclient = DBClient()
+    db = dbclient.get_database("my_continuous_dataset")
+    wstime = 0.0
+    wetime = 50.0  # cut 50 s listen windows
+    # shottimes.txt is assumed to contain one epoch shot time per line
+    with open("shottimes.txt") as fd:
+        lines = fd.readlines()
+    for line in lines:
+        t = float(line)
+        tslist = TimeIntervalReader(db, t + wstime, t + wetime, fix_overlaps=True)
+        for ts in tslist:
+            db.save_data(ts)  # defaults to the wf_TimeSeries collection
diff --git a/docs/source/user_manual/database_concepts.rst b/docs/source/user_manual/database_concepts.rst
index e13bdf30e..40074b707 100644
--- a/docs/source/user_manual/database_concepts.rst
+++ b/docs/source/user_manual/database_concepts.rst
@@ -209,7 +209,7 @@ schema CSS3.0 used, for example, in Antelope. The concepts involved are:
distinctions are a bit subtle and better left to the more detailed
discussion below.
* *Source related Metadata.* Any event driven processing needs information
- about seismic sources that are aassociated with the signals to be
+ about seismic sources that are associated with the signals to be
analyzed. That data is stored in this collection.
A common feature of all "normalized" collection data is that they define a
@@ -248,7 +248,8 @@ has millions of documents. Although experience showed that expectation was
true, we also found there are situations where embedded database operations
can be a bottleneck in a workflow. For that reason we developed a set of
normalization classes in python that cache tables of attributes needed for
-normalization. That idea is described below in the section :ref:`NEW SUBSECTION TO BE ADDE DTO THIS FILE`
+normalization. That idea is described below in the
+:ref:`normalization` section.
Waveform Processing
~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/source/user_manual/development_strategies.rst b/docs/source/user_manual/development_strategies.rst
new file mode 100644
index 000000000..d86cdbc5f
--- /dev/null
+++ b/docs/source/user_manual/development_strategies.rst
@@ -0,0 +1,188 @@
+.. _development_strategies:
+
+How do I develop a new workflow from scratch?
+==================================================
+*Answer by Prof. Gary L. Pavlis*
+
+The answer to this question depends on the complexity of the workflow you
+need to develop. We discuss this here in terms of two endmembers: (1) a
+workflow that can be reduced to a single python script with one or two simple
+adapter functions, and (2) a complex program that requires adapting algorithms
+from an external package. This page is designed in a progression from these
+endmembers.
+
+For all cases it is important to first recognize a fundamental
+starting point, at least in the present IT environment I suspect most
+seismologists work in. That is, you should expect:
+
+1. You absolutely will need an interactive framework to develop any
+ workflow. For most people, that means a local desktop or laptop
+ with docker installed. It is possible to use only a web browser
+   with connections to a cluster, but that is guaranteed to be
+   a lot more difficult to do. The fact that containerization makes
+   the transition from a desktop to a cluster much simpler makes the
+   model of using a desktop for initial development the only sensible
+ norm.
+2. Every single example I have ever created to run MsPASS was
+ best created by first writing the workflow as a serial process
+   with a single outer loop over the dataset elements. Using the
+   guidelines in :ref:`parallel_processing`, convert to a parallel
+   equivalent only after you get the syntax and variables all defined
+   with the serial job (a minimal sketch of that conversion follows this list).
+3. Most workflows reduce to one or more sections that have the
+ generic structure: read data -> process data -> save result.
+ To debug a parallel workflow on your desktop use a subset of
+ your data copied to your workstation and run the test on
+ the smaller subset that defines the "read data" section.
+ That approach provides a simple way to validate the workflow has
+ no obvious errors like python syntax errors and usage errors.
+ It is also a very good idea to use your desktop's system monitor
+ to watch the workflow's cpu and memory usage while running the
+ test data set. You may need to tune the memory use of the workflow
+ based on concepts described in :ref:`memory_management` section
+ of this manual.
+4. If you are running on a (now ancient) computer with only one core
+ you will not be able to test the parallel version on your desktop/laptop.
+ If your machine has only two cores, keep your test data set to
+ the minimum possible because you may bring the machine to its knees
+   while you run the test. In general, my advice for a desktop/laptop
+   parallel test would be to configure docker to use only two cores.
+   I have found it is easy to bring a desktop/laptop to its knees if
+ you use all or a large fraction of the available cores. Hence, the
+ corollary to keep in mind is you shouldn't expect to do much else
+ like web surfing or email
+ on your desktop/laptop while running your test data set. You can expect
+ sluggish performance if parallelization is working as it should.
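+
+The sketch below illustrates the serial-first strategy of item 2.  It is
+schematic only:  ``db``, ``query``, and ``myfunction`` are assumed to be
+defined, and the parallel form glosses over the reader and writer details
+covered in :ref:`parallel_processing`.
+
+.. code-block:: python
+
+   # serial prototype:  a single outer loop over the dataset
+   for doc in db.wf_TimeSeries.find(query):
+       d = db.read_data(doc)
+       d = myfunction(d)
+       db.save_data(d)
+
+   # parallel equivalent sketched with a dask bag and map operators
+   import dask.bag as dbg
+   doclist = list(db.wf_TimeSeries.find(query))
+   mybag = dbg.from_sequence(doclist)
+   mybag = mybag.map(db.read_data)
+   mybag = mybag.map(myfunction)
+   mybag = mybag.map(db.save_data)
+   mybag.compute()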
+
+Simple workflows
+~~~~~~~~~~~~~~~~~~~~
+
+Simple workflows can usually be developed using only the standard MsPASS
+container and the jupyter lab interface. There are a number of approaches
+discussed in common tutorials found by searching with the keywords
+"jupyter debug" in this situation.
+
+#. Inserting simple print statements to display the value of suspected variables.
+#. Inserting matplotlib graphics to visualize data at an intermediate stage of
+   processing (a minimal sketch follows this list).
+#. If you have a cell that throws a mysterious exception that is not self-explanatory
+   the `%debug` magic command can be useful (see the sketch after this list).
+   To use this feature add a new code cell after the cell with problems, put
+   that command in the cell, and push the jupyter run button. You will get
+   an ipython prompt you can use to inspect the variables defined where the
+   error occurred.
+#. Use the jupyter lab debugger.
+#. Enable the pdb debugger.
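+
+The fragment below is a minimal sketch of the `%debug` approach. The failing
+cell content is hypothetical; the point is only to show where the magic
+command goes.
+
+.. code-block:: python
+
+    # cell that fails with a mysterious exception (hypothetical example)
+    d = WindowData(d, -200.0, 200.0)
+
+    # new cell added immediately below the failing one; running it opens an
+    # ipython debugger prompt at the point where the exception was raised
+    %debug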
+
+In addition to these technical approaches to code debugging, it is worth
+making a completely different point: use the jupyter notebook
+structure to document your work. Use `Markdown` cells to at least provide
+notes to yourself about what you are trying to do with the code and
+any key background. In addition, as always, use comments within the code
+to clarify any section that uses a less than obvious trick or depends
+upon some obscure feature that you exploited in writing that algorithm.
+
+For simple workflows the recently developed debugger for the jupyter lab
+will usually be sufficient. If an error occurs outside your notebook,
+however, the best strategy at present is to use pdb, which is included
+in the standard mspass container. There are numerous
+tutorials on using pdb to debug a jupyter notebook. Here are a
+couple we found useful:
+
+- `This `__
+ concise but perhaps cryptic introduction for jupyter lab and pdb.
+- `This `__
+ good overview that discusses a range of strategies for jupyter notebooks.
+
+As always for a topic like this, a web search will give you a long
+list of resources of variable quality.
+
+Complex developments
+~~~~~~~~~~~~~~~~~~~~~~~~~
+Because MsPASS uses python as the job control language, developing
+a workflow differs little from programming. Advanced users may decide
+they need to add features not in the framework that are too extensive
+to implement as code blocks in the jupyter format. Alternatively,
+python has a wide range of open-source toolkits that you may need
+to add to facilitate your custom research code. Here we provide
+suggestions on basic strategies.
+
+The first point is that advanced development is best done on a desktop
+where interactive access is the norm. Jumping immediately to an HPC
+cluster is destined to create frustration. Some recommended steps to
+create a development environment for MsPASS on a desktop are the
+following:
+
+#. Install an IDE of your choice.
+ If you are doing extensive python development you are likely already
+ familiar with one of the standard Integrated Development Environments
+ like `pycharm `__ or
+ `spyder `__, but there are
+ a number of others. IDEs dramatically improve most people's ability
+ to test and debug python applications compared to a simple editor
+ and pdb. If you aren't already using one, choose one and use it.
+#. Install a local copy of MsPASS. Unfortunately, there are currently
+ no simple ways we know of to
+ run any of these IDEs with an instance of mspass running via
+ docker or singularity. For this reason you will likely find it necessary
+ to install a local copy of mspass on your development desktop.
+ The process for doing that is described in a wiki page on github
+ found `here `__.
+#. The wiki page referenced above describes how to install dask and/or spark.
+ We have found many projects do not require the parallel framework and
+ can be reduced to writing and testing an appropriate set of functions
+ and/or python classes that implement the extension you need. Once a
+   function to perform a task is written and thoroughly tested it can
+ be plugged into the framework as just another python function used in
+ a map or reduce operator.
+#. Design a local test of your application that can be run in serial
+   form on a small test data set. Move to a massive HPC or cloud system
+   only when needed and when you are confident your application is as well
+   tested as realistically possible.
+#. Here we assume the tool you are developing can be placed in one or
+ more files that can serve as a standard python module; meaning something
+ you can "import" with the right path to the file. If you don't know how
+ to build a python module file there are huge numbers of internet
+ resources easily found with a web search.
+#. To test a python function with the mspass container, copy your python
+ code to a directory you mount with the appropriate docker or singularity run
+ incantation. The simplest way to do that is to just put your python
+ script in the same directory as your notebook. In that case, the
+ notebook code need only include a simple `import`. e.g. if you have
+ your code saved in a file `mymodule.py` and you want to use a function
+ in that module called `myfunction`, in your notebook you would just
+   enter this simple, fairly standard command:
+
+ .. code-block:: python
+
+ from mymodule import myfunction
+
+ If `mymodule` is located in a different directory use the
+ docker "--mount" option or apptainer/singularity "-B" options to
+ "bind" that directory to the container. For example, suppose we have
+ module `mymodule.py` stored in a directory called `/home/myname/python`.
+ With docker this could be mounted on the standard container
+ with the following incantation:
+
+ .. code-block:: bash
+
+ docker run --mount src=/home/myname/python,target=/mnt,type=bind -p 8888:8888 mspass/mspass
+
+ To make that module accessible with the same import command as above you
+ would need to change the python search path. For this example, you could
+   use this incantation:
+
+ .. code-block:: python
+
+ import sys
+ sys.path.append('/mnt')
+
+#. Once you are finished testing you can do one of two things to make
+   it a more durable feature. (a) Assimilate
+   your module into mspass and submit
+   your code as a pull request to the github site for mspass. If accepted it
+   becomes part of mspass. (b) Build a custom docker container that
+   adds your software as an extension of the mspass container. The docker
+   documentation and the examples in the top level directory for the MsPASS
+   source code tree should get you started. It is beyond the scope of this
+   document to give details of that process.
diff --git a/docs/source/user_manual/importing_data.rst b/docs/source/user_manual/importing_data.rst
index 2ac27988a..fe04880fa 100644
--- a/docs/source/user_manual/importing_data.rst
+++ b/docs/source/user_manual/importing_data.rst
@@ -10,7 +10,7 @@ want to process. The focus of our initial development has been
tools to assemble data acquired from `FDSN (Federated Digital
Seismic Network) `__
data centers. Decades ago FDSN adopted the
-`Standard for the Exchange of Earthquake Data (SEED) `__
+`Standard for the Exchange of Earthquake Data (SEED)`__
format. Since then a subset of SEED, commonly called miniseed,
has become the universal tool for distributing earthquake data through
all FDSN data centers. Miniseed is compressed data format with
@@ -22,9 +22,9 @@ by ftp transfers of files to web services. FDSN web services
distribute three fundamentally different data types:
(1) single channel, sample data delivered as images of miniseed files,
(2) station metadata delivered with a standard format called
-`StationXML `__, and
+`StationXML`__, and
(3) earthquake source data delivered through a standard format called
-`QuakeML `__.
+`QuakeML`__.
Fortunately for our development efforts a well-developed solution for
acquiring all three data types from FDSN sources was already available in
obspy. The sections below assume you can consult obspy's
@@ -50,16 +50,20 @@ the FDSN confederation of data centers is to use obspy.
Obspy has two different python functions that can be used to
fetch miniseed data from one or more FDSN data centers.
-#. The simplest tool obspy provides is their :py:meth:`get_waveform `
- method of the fdsn web service client.
+#. The simplest tool obspy provides is their :code:`get_waveform`
+ method of the fdsn web service client. The documentation for that
+ function can be found
+ `here`__.
:code:`get_waveform` retrieves a relatively small number of waveform
at a time using station codes with wildcards and a time range.
It is suitable only for small datasets or as a custom agent
to run for weeks acquiring data driven by some large list.
#. For most MsPASS users the obspy tool
- of choice is :py:mod:`mass_downloader `.
- An overview of the concepts of :code:`mass_downloader` are given
- `here `__.
+ of choice is :code:`bulk_download`.
+ An overview of the concepts of :code:`bulk_download` are given
+ `here`__
+ and the detailed docstring for the function can be found
+ `here`__.
The obspy tools are well documented and relatively easy to use users
are referred to the above pages and examples to build a workflow to
@@ -76,12 +80,12 @@ give a few warnings.
caller has to be set up, and then the data transmitted via the internet.
The built-in, multiple delays make that process too slow for large
data requests.
-#. The authors of obspy developed their :code:`mass_downloader` function
+#. The authors of obspy developed their :code:`bulk_download` function
because their early experience, like ours, showed :code:`get_waveform`
was not feasible as a tool to download large data sets.
- :code:`mass_downloader` makes it feasible to download large data sets.
+ :code:`bulk_download` makes it feasible to download large data sets.
Be warned that feasible, however, does not mean quick. In our experience
- :code:`mass_downloader` can sustain a transfer rate around a few Mb/s.
+ :code:`bulk_download` can sustain a transfer rate around a few Mb/s.
That is outstanding performance for internet data transfer,
but keep in mind 1 Tb of data at that rate
will require of the order of one week to download.
@@ -90,7 +94,7 @@ give a few warnings.
where a large request will have missing data that we know are present
at the data center. The reason is that web service is, by design,
a one way request with no conversation between the client and server
- to guarantee success. The authors of obspy's :code:`mass_downloader`
+ to guarantee success. The authors of obspy's :code:`bulk_download`
method seem to have done some tricks to reduce this problem but
it is not clear to us if it has been completely solved.
@@ -103,7 +107,7 @@ transitioning to a cloud file service model as the next generation of
data access. It is a current development effort of MsPASS to
provide a simple reader for cloud systems. Examples of
using our implementation for S3 on AWS
-can be found `here `__.
+can be found `here`__.
As the word "prototype" implies that api
is likely to change. Check the docstring api for more recent changes if
you are interested in this capability.
@@ -113,7 +117,7 @@ Assembling Receiver Metadata
Receiver metadata from FDSN sources is easily obtained and assimilated
into MsPASS using a combination of obspy functions and import functions
-that are part of the :py:class:`Database ` class (MongoDB handle) of MsPASS.
+that are part of the :code:`Database` class (MongoDB handle) of MsPASS.
FDSN receiver metadata is obtainable through one of two fundamentally different
approaches: (1) the older "dataless SEED" file format, and (2) the
@@ -131,13 +135,15 @@ We recommend users utilize obspy to assemble receiver metadata from FDSN
data centers. There are two different tools obspy provides. We have found
both are usually necessary to assemble a complete suite of receiver metadata.
-#. The fdsn client has a method called :py:meth:`get_stations ` that is
+#. The fdsn client has a method called :code:`get_stations` that is
directly comparable to :code:`get_waveform`. It uses web
services to download metadata for one or more channels of data. It has
a set of search parameters used to define what is to be retrieved that
is usually comprehensive enough to fetch what you need in no more than
a few calls. Be warned it retrieves the results into memory into a
- custom obspy data object they call an :py:class:`Inventory `.
+ custom obspy data object they call an :code:`Inventory`. The docstring
+ for an :code:`Inventory` object can be found
+ `here`__.
An :code:`Inventory` object can be viewed as more or less a
StationXML format file translated into a python data structure with a
few added decorations (e.g. plotting). We return to this point
@@ -145,14 +151,16 @@ both are usually necessary to assemble a complete suite of receiver metadata.
#. When you use the :code:`mass_downloader` you have the option of
having that function download the station metadata and save the
actual StationXML data files retrieved from web services. The resulting
- files can then be read with their :py:func:`read_inventory ` method.
+ files can then be read with their :code:`read_inventory` method
+ described `here`__.
Both of the approaches above can be used to create an obspy
:code:`Inventory` object in python: for (1) that is the return of the
function while for (2) it is the output of a call to :code:`read_inventory`
with :code:`format="STATIONXML"`. We use the obspy :code:`Inventory` object
as an intermediary for storing receiver metadata in a MongoDB database.
-The :code:`Database` class has a method we call :py:mod:`save_inventory `.
+The :code:`Database` class has a method we call :code:`save_inventory`
+described `here`__.
That method translates an :code:`Inventory` object into documents stored in
what we call the :code:`channel` and :code:`site` collections. As noted
many other places in our documentation :code:`channel` contains receiver
@@ -165,12 +173,12 @@ following code framgment extracted from our tutorials:
from mspasspy.db.database import Database
from mspasspy.db.client import DBClient
- dbclient = DBClient()
- db = Database(dbclient, 'getting_started')
- inv = client.get_stations(network='TA', starttime=starttime, endtime=endtime,
- format='xml', channel='BH?', level='response')
- ret = db.save_inventory(inv, verbose=False)
- print('save_inventory returned values=', ret)
+ dbclient=DBClient()
+ db=Database(dbclient,'getting_started')
+ inv=client.get_stations(network='TA',starttime=starttime,endtime=endtime,
+ format='xml',channel='BH?',level='response')
+ ret=db.save_inventory(inv,verbose=False)
+ print('save_inventory returned values=',ret)
As noted above an :code:`Inventory` object
is more or less an image of a StationXML file. StationXML is complete, but
@@ -246,8 +254,10 @@ MongoDB source documents.
Like the receiver problem, obspy has two comparable functions for
retrieving source metadata.
-#. :py:meth:`get_events ` is an obspy function that is very similar to the
+#. :code:`get_events` is an obspy function that is very similar to the
receiver equivalent :code:`get_stations` noted above.
+ Their documentation on this function can be found
+ `here`__.
Like the receiver equivalent it has search criteria to yield a set of source data
based on some spatial, time, magnitude, and/or other criteria.
In addition, like :code:`get_stations`, :code:`get_events` returns the
@@ -260,7 +270,7 @@ retrieving source metadata.
#. If you use the obspy :code:`mass_downloader` driven by source
queries (see example titled "Earthquake Data" on the
- :py:mod:`mass_downloader ` page)
+ mass_downloader page found `here`__)
that function will create QuakeML data files defining the unique source data for
all the waveforms downloaded with each call to that function.
@@ -271,7 +281,7 @@ intermediary for the import. :code:`get_events` returns the
obspy :code:`Catalog` class directly while the output QuakeML files from
the :code:`mass_downloader` are easily created by calling the
obspy function :code:`read_events` described
-`here `__.
+`here`__.
A :code:`Catalog` instance can then be saved to a MongoDB source collection
using the :code:`Database` method called :code:`save_catalog`.
The following is a fragment of a workflow doing this with the output of
@@ -279,7 +289,6 @@ The following is a fragment of a workflow doing this with the output of
.. code-block:: python
- FIXME
paste in portion of 2012 usarray workflow
An alternative for which we provide limited support is importing catalog
@@ -309,21 +318,23 @@ to be processed:
from obspy import read
from mspasspy.db.database import Database
from mspasspy.db.client import DBClient
- dbclient = DBClient()
- db = Database(dbclient, 'mydatabasename')
+ dbclient=DBClient()
+ db=Database(dbclient,'mydatabasename')
...
for fname in filelist:
- st = obspy.read(fname, format="SOMEFORMAT")
+ st = obspy.read(fname,format="SOMEFORMAT")
d = converterfunction(st)
db.save_data(d)
where :code:`SOMEFORMAT` is a keyword from the list of obspy
-supported formats in :py:func:`read `
+supported formats found
+`here`__
and :code:`converterfunction` is a format-specific python function you would need to
write. The function :code:`converterfunction` needs to handle
the idiosyncrasies of how obspy handles that format and convert the stream
:code:`st` to a TimeSeriesEnsemble using the MsPASS converter function
-:py:mod:`Stream2TimeSeriesEnsemble `.
+:code:`Stream2TimeSeriesEnsemble` documented
+`here`__.
That is a necessary evil because as the authors of the obspy write in
their documentation some formats have concepts incompatible with
obspy's design. Although we cannot provide unambiguous proof we have
@@ -352,11 +363,12 @@ TimeSeriesEnsemble:
will require developing a function like that we call :code:`converterfunction`
above.
-The MsPASS :py:mod:`schema ` module
+The MsPASS schema class
+(see `this page`__ for details)
has tools we designed to aid conversion of Metadata
(i.e. item 2 above) from external representations
-(format) of data to MsPASS. In particular, the :py:mod:`apply_aliases ` and
-the inverse :py:mod:`clear_aliases ` were designed to simplify the mapping for
+(format) of data to MsPASS. In particular, the :code:`apply_aliases` and
+the inverse :code:`clear_aliases` were designed to simplify the mapping for
key-value pairs in one namespace to another. To utilize this feature for
a given format you can either create a yaml file defining the aliases or
hard code the aliases into a python dict set as key:alias.
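+
+The fragment below is a hypothetical illustration of the dict form. The
+specific keys shown assume a SAC-style import and would need to match your
+format and the MsPASS schema you are using.
+
+.. code-block:: python
+
+    # keys are MsPASS schema attribute names, values are the aliases used
+    # by the external format (here SAC-style header names)
+    aliases = {
+        "source_lat": "evla",
+        "source_lon": "evlo",
+        "source_depth": "evdp",
+    }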
diff --git a/docs/source/user_manual/io.rst b/docs/source/user_manual/io.rst
index ce7be0025..2175e9d2e 100644
--- a/docs/source/user_manual/io.rst
+++ b/docs/source/user_manual/io.rst
@@ -5,83 +5,455 @@ I/O in MsPASS
Overview
-----------------------
+Input-output (IO) performance can easily be the main throttle
+on the speed of a workflow in MsPASS. The reason is a corollary of an
+old saying about the definition of a "supercomputer": a supercomputer is
+a computer that turns all compute-bound jobs into
+io-bound jobs. The scalability of MsPASS, which means it runs the
+same with one worker or a thousand (just faster), means the more processors
+you throw at a job the more likely it is to become IO bound.
+For that reason the focus of this section is the parallel readers and writers.
+However, most of the key concepts are equally applicable to serial
+jobs.
-We want to use parallel containers to load data for efficiency, and the process of reading
-data includes reading from the file system and reading from the database. To further improve
-efficiency, we want to avoid database related operations and only perform file system read and write
-operations in the parallel containers. This module describes distributed I/O operations in
-MsPASS without using Database. The functions here are not methods of Database.
+Most seismologists have, at best, only a vague notion of how IO
+works in a modern computer system. The reason is that IO is one of
+the earliest examples of how computer languages abstract a complex
+process into a simple command. The earliest versions of FORTRAN had
+the READ and WRITE commands that are a type example.
+IO has become increasingly complex as computers have evolved.
+We thus first give a brief overview of key concepts that impose
+limits on IO performance on different kinds of hardware. That
+discussion could extend to book length, but we review only the basics
+at a high level that we hope can help users understand what factors
+might limit the performance of their processing job. We then review the
+API for parallel processing and discuss a few practical guidelines for
+avoiding IO bottlenecks.
+
+.. note::
+ This section could rapidly become seriously out-of-date. Current
+ efforts by Earthscope and other FDSN data centers to move their
+ services to cloud computer systems promise to completely change
+ some elements of this section.
+
+IO Concepts
+---------------
+*Concept 1: IO is transfer of data to and from main memory*. The
+most basic thing to recall and understand is that any time a computer
+algorithm does IO it means data has to be transferred from something to
+the computer's main memory (RAM). That process can take from a few clock
+cycles to millions (or many orders of magnitude more) of clock cycles depending on
+what the "something" is.
+
+*Concept 2: Buffering*. For more than 50 years all but a tiny fraction
+of computer processing jobs used buffered IO.
+In buffered IO a program does not read directly from any device but always
+reads from a memory buffer. Other elements of the system manage the flow of
+data in and out of the buffer. All MsPASS IO uses stock, buffered IO
+functions.
+
+*Concept 3: Blocking*. When an IO statement is executed in any computer language,
+including python, the normal behavior is for the program to be put in a wait state
+until the code that does the actual operation completes. That waiting is
+what is meant by *blocking IO*. With normal blocking IO your program will
+always wait until the IO operation returns before continuing execution. All
+MsPASS IO is currently blocking IO.
+
+*Concept 4: communication channels*. Today most IO is abstracted as
+taking place via a communication channel that can be manipulated by
+an appropriate API. That can be explicit like the FILE* handle in C
+or more abstract like the generic read handle in obspy's
+`read `__ function.
+In all cases the idea is the handle provides a mechanism to
+move data to/from some external location from/to the memory space of your program.
+
+*Concept 5: open/close*. No IO is possible unless the interface that
+defines a communication channel is initialized and made valid. That
+is traditionally done through some form of "open". Whatever the syntax,
+an "open" means creation of a handle that defines a particular instance of
+a communication channel. A "close" destroys the connection and releases
+the resources allocated to create a communication channel.
+
+With the above it may be helpful to summarize the generic sequence of
+steps that happen when reading the data for a seismic data object,
+which is really just a generic read (a minimal sketch follows the list).
+
+#. Some version of "open" is used to create the handle to fetch the data.
+#. One or more distinct "read" operations are requested through the
+ handle. Each distinct read will block until the operation is completed.
+   How long a delay that represents in your program depends on
+   a long list of variables. With files, reads may also involve calls to
+   "seek" to tell the handle to locate a specific piece of a file.
+   That operation can be very fast if nothing is required (for example, rewinding a file
+   that is already positioned at the beginning), but may also be slow
+   if a new buffer has to be loaded with the section of the file
+   containing the data requested.
+#. The handle is closed and destroyed when no longer needed by the algorithm.
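+
+The fragment below is a minimal python sketch of that sequence using stock,
+buffered IO. The file name and byte offsets are hypothetical.
+
+.. code-block:: python
+
+    # "open" creates the communication channel (the file handle)
+    with open("waveforms.bin", "rb") as fh:
+        # "seek" positions the handle at the section of the file we want
+        fh.seek(4096)
+        # blocking read; the program waits here until the data are in memory
+        raw = fh.read(8192)
+    # leaving the "with" block closes the handle and releases its resources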
+
+Writes are similar, except data is pushed to a buffer from your program
+instead of loaded into your program's memory space.
+(Note some programs read and write to the same communication channel,
+but there is currently no such example in MsPASS except communications
+with the MongoDB server, which is the next item.)
+For both reads and writes delays are inevitable in all three of the
+above generic steps.
+
+*Concept 6: database server communications*. Interaction with MongoDB
+is a special case of IO in MsPASS. Database communication is
+notorious for creating performance problems if handled
+carelessly for two reasons: (1) Each interaction (also sometimes
+called a transaction) with the MongoDB server creates an action/reaction delay.
+For reads that means the caller sends a request to the MongoDB server
+(itself an IO operation), the server has to process the request, and then
+return the result to the caller (a second IO operation). In a write the
+data to be saved is packaged and sent to the server, handled by the server,
+and a response acknowledging success (or failure) is returned. All
+pymongo read/write calls block until the circuit above completes, creating
+built-in delays. (2) The communication channel for MongoDB is
+always a TCP/IP network connection. The speed of that communication
+relative to disk transfer speeds
+can be an issue on some systems. A larger issue, however, is that when multiple
+processes are pushing large amounts of data the pipe can fill and limit
+throughput. A type example of this is gridfs read/write discussed below.
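+
+The fragment below is a small pymongo illustration of that action/reaction
+pattern. The collection names and query are hypothetical and assume a
+database handle, db, created as in other examples in this manual.
+
+.. code-block:: python
+
+    # round trip 1: the query is sent to the server and the caller blocks
+    # until the matching document (or None) is returned
+    doc = db.wf_TimeSeries.find_one({"sta": "AAK"})
+    # round trip 2: the document is sent to the server and the caller blocks
+    # until the server acknowledges the insert
+    result = db.mynotes.insert_one({"note": "example transaction"})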
+
+The above are all generic ideas, and are only a part of the complexity
+of modern IO systems. They are sufficient, however, to summarize the
+factors that cause IO bottlenecks on most seismic processing jobs, in general,
+and MsPASS in particular. In the next section we describe how the
+current versions of the parallel readers and writers of MsPASS are
+designed to improve IO performance on clusters. We also appeal to the
+concepts above in the final section that discusses known bottlenecks in
+IO performance to avoid.
+
+
+Parallel IO in MsPASS
+-------------------------
+We use parallel containers in MsPASS to provide a mechanism to handle
+data sets that cannot fit in cluster memory. As stated numerous times
+in this User's Manual we use what is called a "bag" in dask and
+"RDD" (Resiliant Distributed Data) in pyspark for this purpose.
+To be truly parallel creation of a bag/RDD must be parallelized.
+Similarly, saving the result of a computation defined as one of them
+should also be parallelized. If they were not parallelized the
+IO stage(s) would nearly always be the throttle controlling the
+run time. In this section we discuss our current implementation
+to help the user understand current limitations of that processing.
Read
~~~~~~~
-Read is the proccess of loading the entire data set into a parallel container (RDD for SPARK
-implementations or BAG for a DASK implementations).
-
-1. :py:meth:`read_distributed_data ` is the core method for reading data.
-
-The input includes the metadata information stored in the dataframe. Depending on the storage mode,
-the metadata should include parameters such as file path, gridfs id, etc. This function will
-construct the objects from storage using DASK or SPARK, and loading complete mspass objects into
-the parallel container. A block of example code should make this clearer:
-
- .. code-block:: python
-
- from mspasspy.db.client import Client
- from mspasspy.db.database import Database
- from mspasspy.io.distributed import read_distributed_data
- dbclient = Client()
- # This is the name used to acccess the database of interest assumed
- # to contain data loaded previously. Name used would change for user
- dbname = 'distributed_data_example' # name of db set in MongoDB - example
- db = Database(dbclient,dbname)
- # This example reads all the data currently stored in this database
- cursor = db.wf_TimeSeries.find({})
- df = read_to_dataframe(dbclient, cursor) # store to dataframe
- rdd0 = read_distributed_data(
- df,
- cursor=None,
- format="spark",
- mode="promiscuous",
- spark_context=spark_context,
- )
-
- The output of the read is the SPARK RDD that we assign the symbol rdd0.
- If you are using DASK instead of SPARK you would add the optional
- argument :code:`format='dask'`.
+For both dask and pyspark the parallel read method is a function called
+:py:func:`read_distributed_data `.
+The docstring and the :ref:`CRUD operations`
+section of this manual have more details on
+the nuts-and-bolts of running this function, but from the perspective of
+IO performance it may be helpful to discuss how that function
+is parallelized for the different options of arg0, which defines the primary
+input method to the function (an example follows the list below).
+
+#. *DataFrame input*. When the input is a DataFrame each worker
+ is assigned a partition of the received DataFrame to read.
+ As a result there will be one reader for each worker defined for
+ the job. Readers work through the rows of the DataFrame that
+ define the partition to output a bag/RDD of data objects with the same partitioning
+ as the DataFrame. The scheduler assigns additional partitions to
+ workers as they complete each one as usual.
+ Note normally the output bag/RDD never exists
+ as a complete entity but defines only an intermediary stage of
+ the data passed down a pipeline to other processing algorithms.
+ Note that the database is accessed in this mode only if
+ normalization during reading is requested
+ (see :ref:`Normalization`) or if any data
+ are stored with gridfs.
+#. *Database input*. Database input is, in many respects, a variant of
+ DataFrame input with an added serial processing delay. That is,
+ when the input is a Database handle it applies an optional
+ query (default is the entire wf collection specified)
+ and then attempts to load all wf documents. To allow for very large
+   data sets where the documents alone would overflow the available
+   memory space, the function has a :code:`scratchfile` option
+ that stages the results to a scratch file before creating the
+ bag/RDD. Users should recognize that a large delay is built into this
+ process by the time required to construct the intermediate list of
+ documents, particularly if a scratchfile is used. Once the
+ bag/RDD of documents is created the reader parallelization works
+ the same as for a DataFrame with minor differences that have little
+ to no impact on performance.
+#. *List of queries*. Parallel ensemble reads are driven by a list of
+ MongoDB queries. Since pymongo uses a python dict as the query
+ format, that means the function is driven by a list of python dict
+ containers. The read operation in that situation is more elaborate,
+ although not necessarily more time consuming, than the atomic,
+ parallel, read algorithms described above. A bag/RDD is created
+ from the list of dict
+ containers and partitioned as usual. The scheduler normally assigns
+ partitions sequentially to workers as with atomic data.
+   The parallelism is based on one
+   query per worker, with each worker processing the sequence of
+   queries that make up each partition it is assigned.
+ For each bag component the algorithm first issues
+ a query to MongoDB using that dict content.
+ The query is then used to drive the
+ :py:meth:`mspasspy.db.database.Database.read_data` method with
+ a cursor input. That creates a new ensemble object that is passed
+ into the processing pipeline. This algorithm has no serial
+ steps, but can be bottlenecked if a large number of workers are
+ defined that often run queries simultaneously. Otherwise performance
+ is normally limited by the speed of the IO operations embedded in
+ the document definitions for the ensemble members. As with atomic
+ operations when gridfs storage is used, throughput can be limited by the MongoDB
+ connection shared by readers.
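+
+The fragment below is a minimal dask-oriented sketch of an atomic read driven
+by a database handle and cursor. The database name and query are hypothetical;
+the call mirrors the examples used elsewhere in this manual.
+
+.. code-block:: python
+
+    from mspasspy.db.client import DBClient
+    from mspasspy.db.database import Database
+    from mspasspy.io.distributed import read_distributed_data
+
+    dbclient = DBClient()
+    db = Database(dbclient, "mydatabase")
+    # define the data set with a MongoDB query (here all wf_TimeSeries documents)
+    cursor = db.wf_TimeSeries.find({})
+    # create a dask bag of TimeSeries objects; evaluation is lazy until compute
+    mybag = read_distributed_data(db, cursor, npartitions=100)
+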
Write
-~~~~~~~
-Write is the proccess of saving the entire data set into the file system.
-
-1. :py:meth:`write_distributed_data ` is the core method for writing data.
-
-The input includes SPARK RDD or DASK BAG of objects (TimeSeries or Seismogram), and the output is a dataframe of metadata.
-From the container, it will firstly write to files distributedly using SPARK or DASK, and then write to the database
-sequentially. It returns a dataframe of metadata for each object in the original container. The return value
-can be used as input for :code:`read_distributed_data` function.
-
-Note that the objects should be written to different files, otherwise it may overwrite each other.
-dir and dfile should be stored in each object.
-
-A block of example code should make this clearer:
-
- .. code-block:: python
-
- import dask.bag as daskbag
- ts_list = [ts1, ts2, ts3] # given TimeSeries list
- list_ = daskbag.from_sequence(ts_list)
- # write to file and get a dataframe
- df = write_distributed_data(
- list_,
- db,
- mode="cautious",
- storage_mode="file",
- format="dask",
- )
- # read from the dataframe
- obj_list = read_distributed_data(df, format="dask").compute()
-
-
- The output of the write is the dataframe containing the metadata information that
- can be used in :code:`read_distributed_data` function.
+~~~~~~~~~~
+As with reading, parallel writes in MsPASS should all go through
+a common function called
+:py:func:`write_distributed_data `.
+Parallelism in this function is similar to the reader in the sense that
+when run on a bag/RDD each worker will have one writer thread/process
+that does the output processing. The way data are handled by writers,
+however, is more complex.
+
+The first detail to understand about the parallel writer is it splits
+up the task into two distinctly different operations: (1) saving
+the sample data and (2) saving the Metadata as a set of MongoDB documents.
+Each instance of the writer handles these two operations in that order.
+The first calls a generic, private method in
+:py:class:`Database` called
+:py:meth:`_save_sample_data`.
+The speed of that operation is completely controlled by the "storage_mode"
+chosen and the communication path the choice implies.
+
+Saving the Metadata for each seismic object is different as the
+performance is mainly limited by MongoDB and the way the writers
+create the wf collection documents. Atomic level insertion
+(i.e. one wf document at a time) is
+the absolute slowest way to handle this operation as each insert
+(a "collection" method of MongoDB)
+blocks until the server responds that it has completed the
+operation. To reduce such delays
+:py:func:`write_distributed_data `
+always tries to insert multiple documents in each interaction with
+the MongoDB server. The method is different for atomic data and
+ensembles (a usage sketch follows the list below):
+
+#. For atomic data the algorithm uses the :code:`map_partitions` method
+   defined for a dask bag (:code:`mapPartitions` for a pyspark RDD). A partition is normally
+   made up of multiple data components. The writer then uses the
+   :code:`insert_many` method for each set of wf documents extracted
+   from each partition. Unless the partitions are huge each call to
+   :code:`insert_many` takes only slightly longer than a call to
+   :code:`insert_one`, which is the operation for one document.
+   Hence the database IO time for a workflow is reduced by a factor of
+   approximately the number of components per partition.
+#. For ensembles the ensemble itself provides a logical grouping.
+   That is, instead of using :code:`insert_many` by partition we
+   use the set of documents extracted from the members of each ensemble.
+   In this case the database write time is
+   reduced by a factor of approximately the average number of ensemble members.
+
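+The fragment below is a minimal dask-oriented sketch of the parallel writer.
+The argument choices are assumptions based on examples elsewhere in this
+manual; consult the function's docstring for the full set of options.
+
+.. code-block:: python
+
+    from mspasspy.io.distributed import write_distributed_data
+
+    # mybag is assumed to be a dask bag of TimeSeries or Seismogram objects
+    # produced by earlier processing steps; db is a Database handle
+    result = write_distributed_data(
+        mybag,
+        db,
+        storage_mode="file",
+        cremate=False,
+    )
+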
+A final detail about writing is the handling of "dead" data.
+The overall behavior is controlled by a boolean argument to
+:py:func:`write_distributed_data `
+called :code:`cremate`. When set True dead data are "cremated" using the
+:py:meth:`cremate` method of
+the :py:class:`Undertaker` class.
+As the imagery of the name implies little to nothing is left of dead
+data that is cremated. The default for the cremate parameter is False.
+In that situation dead data are "buried", which means a record of
+why they were killed is saved in a special collection called
+:code:`cemetery`.
+See the section
+:ref:`CRUD operations` for more about these
+concepts. For this discussion a key point is that lots of dead data
+with the default setting of :code:`cremate=False` can impose a
+bottleneck because a call to :code:`insert_one` will happen with
+each dead datum and each writer instance will block until that
+transaction is completed.
+
+Known IO Bottlenecks to avoid
+---------------------------------
+Excessive file open/close
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+At this time most workflows in MsPASS are driven by data stored in
+a set of files. For example, a common starting point is a collection of
+miniseed files downloaded from one or more FDSN data centers.
+A key point to recognize is that a file open or close is not a zero
+cost operation. On modern systems the time is generally of the order of
+milliseconds, but keep in mind a millisecond today is typically a
+million cpu clock cycles. The time spent doing an open/close is
+always wasted wall time when your job is likely not using all available CPU cycles.
+An unfortunate legacy of the dominance of SAC in the seismology community for
+decades is the prevalent use of one file per datum. Obspy has,
+unfortunately, only strengthened this archaic idea that causes
+inevitable performance issues. The time to open and read a single SAC
+file is irrelevant when you are working interactively because a millisecond
+is not a perceptible delay to a human, but for the cpu of your computer
+it is forever. Unless your data set is tiny, never ever use one file
+per datum with MsPASS as it is almost guaranteed to be a bottleneck in
+performance.
+
+The easiest way to avoid excessive open/close in processing is to
+structure your algorithm into ensembles (a "gather" in reflection processing jargon)
+whenever possible. Both the miniseed and native binary readers for
+ensembles use an internal algorithm to minimize the number of file
+open and close calls. They do that by basically loading all data
+from common files, as defined by the `dir` and `dfile` attributes
+stored in the database, as a group when assembling an ensemble.
+A serial sketch of the ensemble approach is shown below.
+For workflows driven by atomic data there is currently no way to avoid
+an open/close for each read.
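+
+The sketch below assumes a wf_miniseed collection in which each document has
+a source_id attribute; the grouping key and argument choices are assumptions
+you would adapt to your own data.
+
+.. code-block:: python
+
+    # one ensemble (gather) per source; each ensemble is read with a single
+    # call so files shared by its members are opened and closed only once
+    for srcid in db.wf_miniseed.distinct("source_id"):
+        cursor = db.wf_miniseed.find({"source_id": srcid})
+        ensemble = db.read_data(cursor, collection="wf_miniseed")
+        # process and save the ensemble here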
+
+Lustre file-systems and many small files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The SAC model of one file per datum will really get you in trouble if
+you try to process a large dataset with many workers on a large HPC cluster.
+The fundamental reason is that HPC clusters are (currently at least)
+tuned to handling large simulation jobs. Most simulation programs
+read small values of initial input data and then run a huge calculation
+for long periods of time. That means the systems are not tuned to
+handle large numbers of file open/close transactions.
+
+At the time of this writing the type example of this issue is when
+a job has to interact with a `Lustre file system `__.
+Lustre is a heavily
+used package today that HPC clusters use to set up huge virtual
+file systems. For example, if a petabyte scale file system is defined on
+the cluster you are using, it is probably running Lustre or some
+ancestor of Lustre. Lustre is a parallel file system that allows
+many processes to be doing reads or writes simultaneously.
+A problem arises because Lustre handles two concepts we noted above
+fundamentally differently: (1) open/close functions are implemented
+via a database server while (2) reads and writes are implemented through
+a "mover" that handles moving data to and from memory to the
+designated storage location on a large disk array. What that means is
+that open/close tends to be relatively slow while reads and writes to
+an open file are about as fast as technologically possible today.
+There is also a state dependency in file opens. Lustre keeps a cache
+of files it knows about so it is faster to reopen the same file shortly
+after closing it than to open a completely different file.
+
+The SAC model is particularly problematic for large data sets if they
+are staged to a Lustre file system. Consider, for example, a relatively
+modest data set today of around a million waveforms. Suppose we staged
+that data to a Lustre file system. First, you will find just copying it
+there and handling it will be a problem because those commands also have
+to do lots of file open/close operations. If you then try to run a
+workflow based on atomic data, each element of the bag/RDD that would
+be needed to initiate a parallel workflow will require an open, read,
+and close sequence. If every file has a unique file name that means
+a database transaction with the Lustre "Metadata server" is required for
+each datum. There are known examples (not by us, but similar legacy codes)
+where a large parallel workflow has overwhelmed a Lustre Metadata server
+and effectively stopped an entire cluster. MsPASS has the potential
+to do the same if used naively in that environment.
+
+We know of three solutions to address this issue at present with
+different tradeoffs.
+
+#. Use the ensemble strategy noted above to structure data into
+ logical groupings to reduce the number of open/close operations.
+   If there is no logical grouping, consider just using ensembles
+   defined by distinct files.
+#. If your workflow absolutely has to be done with atomic data
+   you should still aim to assemble the data into larger files that
+   are logical for your data. As a minimum that will reduce
+   the load on the Lustre metadata server. Although untested, we
+   suspect the performance of a parallel job using atomic data can be improved
+   by sorting the list of documents retrieved from MongoDB by the
+   combined key of dir::dfile (a sketch follows this list). Lustre will cache files it knows about
+   after an initial open so secondary opens of the same file
+   may be faster.
+#. Use gridfs to store your data. As noted in a related section below that
+   approach has different performance issues, but at least you won't be a bad
+   citizen and crash the Lustre file system.
+
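+The fragment below sketches the sort suggested in item 2. It assumes a
+wf_miniseed collection with dir and dfile attributes and a database handle,
+db, as in other examples in this manual.
+
+.. code-block:: python
+
+    import pymongo
+
+    # sort so documents referencing the same file are adjacent; successive
+    # reads then tend to reopen a file Lustre has already cached
+    cursor = db.wf_miniseed.find({}).sort(
+        [("dir", pymongo.ASCENDING), ("dfile", pymongo.ASCENDING)]
+    )
+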
+.. note::
+ Developments are in progress to improve IO performance on HPC
+ and cloud systems using newer software systems that address some of these
+ issues. The above list of workarounds may change if those developments
+ are successful.
+
+Single Communication Channel (gridfs)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Recall one of the storage options in MsPASS is MongoDB's gridfs system.
+Gridfs seems to be implemented in a manner conceptually similar to Lustre.
+That is, the equivalent of a file name is handled separately
+from the container that holds the actual data. The first is held in a
+(normally smaller) collection called "fs.files" and the second in a
+collection called "fs.chunks".
+
+With that review it is easy to state the fundamental performance problem
+inherent in using gridfs as the prime storage method: all the data
+needed to construct all data objects has to pass through a single
+communication channel - the network connection to the MongoDB server.
+The result is not unlike what would happen if a freeway with one lane
+for each cpu in a job was forced to pass over a one-lane bridge.
+This situation is even worse because in most systems the network
+connection with the MongoDB server is much slower than file IO through
+the main system bus.
+
+There is a known solution to this problem, but none of us have
+attempted to implement it with MsPASS for an actual processing job.
+That solution is what MongoDB
+calls `sharding `__.
+Sharding can be used to set up multiple MongoDB servers to
+increase the number of communication channels and reduce collisions
+between workers demanding data simultaneously. It adds a great
+deal of complexity, however, and is known to only be effective if the
+read patterns are well defined in advance. It also is inevitably
+inferior to transfer rates for disk arrays or SSD storage
+because the pipe is a single network connection as opposed to
+the faster connection typical of disk hardware.
+
+Web-service URL reads
+~~~~~~~~~~~~~~~~~~~~~~~~
+At the present time the most common way to assemble waveform
+data for seismology research is to download data from one or more FDSN
+data centers via a standardized protocol called
+`web services `__.
+MsPASS has a capability of defining a waveform segment with a URL,
+which makes it theoretically possible to define an entire data set
+as residing remotely and accessible through the web service interface.
+That, however, is presently a really bad idea for two reasons:
+
+#. The rate that data can be pulled by that mechanism is far too
+ slow to be useful as a starting point for any data processing
+ workflow other than a tiny data set.
+
+#. Web services have no mechanism to guarantee success of a query.
+ When one tries to retrieve millions of waveforms by this mechanism
+ some loss is nearly guaranteed.
+
+For these reasons the only recommended way to utilize FDSN data at this time
+is to stage the data to a local disk array, validate the completeness of
+what was received, and do appropriate QC before doing any large scale
+analysis.
+
+.. note::
+ At the time of this writing Earthscope is in the middle of a major
+ development effort to move their data services to a cloud system.
+ When that happens the model of using a URL to define each waveform
+ segment and running MsPASS on the same cloud system hosting the
+   data is expected to become a standard way to assemble a research-ready
+   data set.
+
+Using a slow disk for your database
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A hardware innovation that has profoundly improved IO performance in the
+past decade or so is the growth of solid-state disks (SSDs). SSDs have throughput
+rates orders of magnitude higher than standard magnetic disks. At present,
+however, they are more expensive and most systems have a mix of SSDs and
+magnetic disks. Large disk arrays use many magnetic disks to create
+virtual file systems that are some fraction of the total capacity of the
+entire set of disks. What that means is that for large data sets
+it is almost always necessary to store the waveform data on magnetic
+disks or a disk array of magnetic disks. We have found, however, that
+IO performance is generally improved if the files managed by MongoDB
+during processing are staged or stored permanently on an SSD.
+Database server performance is especially affected because
+SSDs are particularly superior for the random access IO that is the norm for
+database transactions. The reasons are deep in the weeds of the differences
+in the hardware, but the same factors have driven the universal switch from
+magnetic to SSD system disks for laptop and desktop systems in the past
+decade: operating system performance is drastically improved because
+many processes competing for IO channels cause read/write patterns to
+be widely scattered over the file system.
diff --git a/docs/source/user_manual/memory_management.rst b/docs/source/user_manual/memory_management.rst
new file mode 100644
index 000000000..f08ef169d
--- /dev/null
+++ b/docs/source/user_manual/memory_management.rst
@@ -0,0 +1,801 @@
+.. _memory_management:
+
+Memory Management
+======================
+*Gary L. Pavlis and Ian (Yinzhi) Wang*
+
+Overview
+~~~~~~~~~~~
+
+Dask and Spark, which are used in MsPASS to exploit parallel processing,
+are generic packages to implement the map-reduce paradigm of parallel processing.
+As generic frameworks Dask and Spark are flexible but not as efficient and
+bombproof as a typical seismic reflection processing package. One of the
+biggest weaknesses we have found with these packages is that default memory management
+can lead to poor performance or, in the worst case, memory faults.
+The biggest problem seems to be when the size of
+the data objects considered atomic (components of the generalized list called
+an RDD by Spark and a bag by Dask) is a significant fraction of any node's
+memory.
+
+In this document we discuss pitfalls we are aware of in
+memory management with MsPASS. This document is focused mainly on dask
+because we have more experience using it.
+The main dask document on dask memory management is found
+`here `__.
+The comparable page for spark is `here `__.
+As for most heavily used packages like this there are also numerous pages
+on the topic that will quickly bury you with a web search. We produced this
+document to help you wade through the confusion by focusing on the practical
+experience with MsPASS. When we are talking specifically about dask or
+spark we will say so. We will use the term `scheduler` to refer to generic
+concepts common to both.
+
+*WARNING:* This is an evolving topic the authors of MsPASS are continuing
+to evaluate. Treat any conclusions in this document with skepticism.
+
+
+
+DAGs, tasks, and the map-reduce model of processing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To understand the challenge a scheduler faces in processing a workflow,
+and how that interacts with memory management, we need to digress a bit
+and review the way a scheduler operates.
+Dask and spark both abstract processing as a sequence of "tasks".
+A "task" in this context is more-or-less the operation defined in
+a particular call to a "map" or "reduce" operator.
+(If you are unfamiliar with this concept see section :ref:`parallel_processing`
+and/or a web search for a tutorial on the topic.) Each "task"
+can be visualized as a black box that takes one or more inputs and
+emits an output. Since in a programming language that is the same
+thing conceptually as a "function call" map and reduce operators always
+have a function object as one of the arguments. A "map" operator
+has one main input (arguments to the function
+are auxiliary data) and a single output (the function return).
+Reduce operators have many inputs of a common
+data type and one output. Note, however, the restriction that
+they are always considered two at a time. For most MsPASS operators the inputs and
+outputs are seismic data objects. The only common exception is constructs
+used to provide inputs to readers such as when `read_distributed_data`
+is used to load ensembles.
+
+Dask and spark both abstract the job they are asked to handle
+as a graph structure called a "Directed Acyclic Graph" (DAG).
+The nodes of the graph are individual "tasks". It has been our experience
+that most workflows end up being only a sequence of map operators.
+In that case, the DAG reduces to a forest of saplings (sticks with no
+branches) with the first task being a read operation and the last being
+a write (save) operation.
+
+To clarify that abstract idea consider a concrete example. The
+code block below is a dask implementation of a simple job with
+7 steps: query the database to define the dataset, read data,
+detrend, bandpass filter, window around the P arrival time,
+compute signal-to-noise estimates, and
+save the result. Here is a sketch of the workflow omitting the complexities of
+imports and initializations:
+
+.. code-block:: python
+
+ # assume database handle, db, and query are defined earlier
+ cursor = db.wf_miniseed.find(query)
+ # Npartitions assumed defined in initializations - see text below for
+ # what this defines and why it is important for memory management
+ mybag = read_distributed_data(db,cursor,npartitions=Npartitions)
+ mybag = mybag.map(detrend)
+ mybag = mybag.map(filter,type="bandpass",freqmin=0.01,freqmax=1.5)
+ mybag = mybag.map(WindowData,-200.0,200.0)
+ # assume nwin and swin defined initialization
+ mybag = mybag.map(arrival_snr,noise_window=nwin,signal_window=swin)
+ mybag = mybag.map(db.save_data,data_tag="step1",return_data=False)
+ mybag.compute()
+
+This simple workflow consists only of map operators and is terminated by
+a save followed by a compute to initiate the (lazy) computation of the bag.
+The DAG for this workflow is illustrated below
+
+.. _DAG_figure:
+
+.. figure:: ../_static/figures/MapDAGFigure.png
+ :width: 600px
+ :align: center
+
+   Block diagram illustration of the DAG defined by the example workflow.
+   The example shows how work would be partitioned with five processors
+   and five partitions. The labels at the top are function names
+   matching those used in the python code above. Each box denotes an
+   instance of that function run on one processor (worker). Data flow
+   is from left to right. Data enter each pipeline from a reader on the
+   left hand side (`read_distributed_data` but here given the simpler name
+   "reader") and exit in the save operation. For this simple case where the
+   number of processors matches the number of partitions each processor would
+   be assigned 1/5th of the data. Termination of the workflow with a database
+   save (`save_data`) makes each pipeline largely independent of the others and
+   can improve performance as no processor has to wait for another except in
+   competition for attention from the database server.
+
+Now remember that a bag (RDD in spark) can be conceptualized as a
+container that is a list of data objects that doesn't necessarily fit
+into cluster memory, let alone
+a single node. Both dask and spark divide the container into
+`partitions` illustrated in the figure above. The partition size can be
+as small as one data object or some larger integer less than or equal to the
+number of data components. Think of a partition as the size of a bite
+the system uses to eat the elephant (the whole data set). That basic
+understanding should help you immediately realize that the partition
+size relative to system memory is a critical tuning component to optimize
+performance and make a workflow feasible. Setting the number of partitions
+too large can overwhelm the scheduler requiring it to handle a potentially
+massive DAG. The reasons, as you can see in the figure, is that the DAG
+size for a pure map workflow like this scales by the number of
+partitions (`Npartitions` in the python code above). The effort required to
+do the bookeeping for a million partitions differs dramatically from
+that required to do a few hundred.
+On the other hand, because the memory
+use for the processing scales with the memory required to load each partition
+small numbers of partitions used with a huge dataset can, at best,
+degrade performance and at worse crash the workflow from a memory fault.
+Using our cutsy eating an elephant analogy, the issue can be stated this way.
+If you eat an elephant one atom at a time and then try to
+reassemble the elephant the problem is overwhelming. On the other hand, if
+you cut the elephant up into chunks that are too big to handle, you can't
+do the job at all. The right size is something you can pick up and chew.
+
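+A rough sizing sketch is shown below, using the symbols defined in the
+example workflow above. The rule of thumb and the numbers are assumptions
+for illustration only; the right values depend on your data object sizes and
+the memory available to each worker.
+
+.. code-block:: python
+
+    # choose the number of partitions so each partition is a manageable "bite"
+    Ndata = db.wf_miniseed.count_documents(query)  # number of atomic data objects
+    objects_per_partition = 20   # tune to object size and worker memory
+    Npartitions = max(1, Ndata // objects_per_partition)
+    mybag = read_distributed_data(db, cursor, npartitions=Npartitions)
+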
+The next concept to understand is how the scheduler
+needs to move data to workers and between processing steps.
+The figure below illustrates how that might work for
+the same situation illustrated above but with only two workers (processors).
+As the animation shows, the scheduler would assign the data
+in each partition to one of the two workers. From what we have observed
+the normal pattern for a job like our simple chain of map operators
+in this example is the following. The data for each partition are loaded
+by each worker, which in this example means each worker issues a series of
+interactions with MongoDB to construct the input seismic data objects.
+Once that data is loaded in memory, the series of processing steps are
+applied sequentially by the worker. On the final step, the result returned
+by the final map call, which in this case is
+the output of the `save_data` method of `Database`, is returned to
+the scheduler node running the master python script (the one shown above
+for this example).
+
+.. _TwoProcessorAnimation_figure:
+
+.. figure:: ../_static/figures/MapProcessing.gif
+ :width: 600px
+ :align: center
+
+ Animated gif illustrating data flow for the same five partition
+ data set as illustrated above with only two processors. The animation
+ illustrates how the scheduler would assign each processor a data partition.
+ Each worker sequentially processes one data object at a time as illustrated
+ by the moving arrow. When a worker (processor) finishes a partition the
+ scheduler assigns it another until the data set is exhausted. This
+ example illustrates an inevitable discretization issue that can degrade
+   throughput. Because 5 is not a multiple of 2, this example requires
+   three passes to process the data set and the last pass uses only one of the
+   workers.
+
+
+There are some important complexities the figure above glosses over
+that are worth mentioning:
+
+- Both dask and spark are generic schedulers. Nothing in the algorithms
+ described in the documentation guarantees the processing works at all
+ like the figure above shows. That figure shows what we've observed
+ happen using real-time monitoring tools. A working hypothesis is that
+ the schedulers recognize the geometry as a pipeline structure that
+ can be optimized easily by running each datum through the same
+ worker to avoid serialization.
+- The scheduler abstracts the concept of what a `worker` is. In MsPASS
+ jobs are run in a containerized environment and the cluster configuration
+ defines how many workers are run for each physical node assigned to a job.
+ A potential confusion to point out is that we refer to containers running on a compute
+ node as part of a virtual cluster (see :ref:`parallel_processing` section)
+ as a "worker node". To dask and spark a "worker" is a process/thread
+ the scheduler can send work to. That means a single "worker node"
+ will normally be running multiple dask/spark "workers".
+  There are complexities in how each "worker" interacts with the thread pools
+  and/or spawned processes
+  that dask or spark can be set up to launch. This is a
+  topic the authors had not fully resolved at the time this section
+  was written. It matters because optimal performance can be achieved
+  by defining sufficient worker threads to do computing as fast as possible,
+  but defining too many workers can create unintentional memory bloat issues.
+  There are also more subtle issues with threading related to the python GIL
+  (Global Interpreter Lock). Pure python functions work badly with worker
+  threads because all threads share a common GIL.
+  The defaults, however, are clear. For dask each worker container
+  (running in a node by itself) will use a thread pool with the number of
+  worker threads equal to the number of CPUs assigned to the container.
+  That is normally the number of cores on that physical node. Pyspark,
+  in contrast, seems to always use one process per worker and its default
+  is thus less subject to the GIL locking problem.
+- Both dask and spark have tunable features for memory management and the
+  way scheduling is handled. In dask they are optional arguments to the
+  constructor for the dask client object. For spark they are defined in
+  the "context". See the documentation for the appropriate scheduler
+  if you need to do heavy performance tuning; a minimal dask sketch
+  follows this list.
+- An important parameter for memory management we have not been able to
+  ascertain from the documentation or real-time monitoring of jobs is how
+  much data dask or spark buffer in a pipeline. The animation above assumes
+  the simplest case to understand, where the answer is one.
+  That is, each pipeline processes one datum at a time:
+  we read one datum, process it through the chain of functions in the
+  map operators, and then save the result. That is an oversimplification.
+  Both schedulers handle some fraction (or all) of a partition as the
+  chunk of data moving through the pipeline. That makes actual memory use
+  difficult to predict with much precision. The formulas we suggest
+  below are best thought of as upper bounds. That is, the largest memory
+  use will happen if an entire partition is moved through the pipeline
+  as a unit, which is the opposite extreme from the animation above.
+  In that case all the data from a partition are first
+  read into memory, the whole partition is run through the processing
+  pipeline, and the results are then saved. Real operation seems to fall
+  between these extremes, with some fraction of each partition being
+  at different stages of the pipeline as data move through the
+  processing chain.
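+
+To make the dask side of that concrete, here is a minimal, hedged sketch of
+setting such options when creating a client for a local (desktop) scheduler.
+The numbers are arbitrary illustrations, not recommendations; in a
+containerized virtual cluster these values are normally set by the launch
+configuration instead.
+
+.. code-block:: python
+
+    from dask.distributed import Client
+
+    # Minimal sketch (illustrative values only).  With no address argument
+    # Client creates a LocalCluster and these keyword arguments control it.
+    client = Client(
+        n_workers=4,            # number of dask worker processes
+        threads_per_worker=2,   # thread pool size for each worker
+        memory_limit="4GB",     # per-worker memory budget before spilling
+    )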
+
+Memory Complexities
+~~~~~~~~~~~~~~~~~~~~~~~
+In the modern world of computing the concept of what "computer memory"
+means is muddy. The reason is that all computers for decades have
+extended the idea of memory hierarchy from the now ancient concepts of
+a memory cache and virtual memory. Schedulers like dask and spark are fundamentally
+designed to provide functionality in a distributed memory cluster of
+computers that define all modern HPC and cloud systems. Keep in
+mind that in such systems there are two very different definitions of
+system memory: (1) memory available to each worker node, and (2) the
+aggregate memory pool of all worker nodes assigned to a job. Dask and spark
+abstract the cluster and attempt to run a workflow within the physical
+limits of both node and total memory pools. If they are asked to do
+something impossible, like unintentionally asking the system to fit an
+entire data set in cluster memory, we have seen them fail and abort.
+Even worse, when prototyping a workflow on a desktop outside
+of the containerized environment we have seen
+dask crash the system by overwhelming memory.
+(Note this never seems to happen running in a container, which is
+a type example of why containers are the norm for cloud systems.)
+How to avoid this
+in MsPASS is a work in progress, but it is a possibility all users should be
+aware of when working on huge data sets. We think the worst problems have been eliminated
+by fixing an issue with earlier versions of the C++ code, which were
+not properly set up to tell dask, at least, how much memory was being
+consumed. All memory management depends on data objects being
+able to properly report their size and on mechanisms for dask or
+spark to clear memory stored in the data objects when no longer needed.
+If either is not working properly, catastrophic failure is likely
+to eventually occur as a workflow is scaled up.
+
+At this point it is worth noting a special issue about memory
+management on a desktop system. Most users will likely want to
+prototype any python scripts on a desktop before porting the code to
+a large cluster. Furthermore, many research applications
+don't require a giant cluster to be feasible but can profit from
+multicore processing on a current generation desktop. On a desktop
+"cluster memory" and "system memory" are the same thing. There are
+a few corollaries to that statement that are worth pointing out
+as warnings for desktop use:
+
+- Beware running large memory apps if you are running MsPASS jobs
+ on the same system. Many common tools today are prone to extreme
+ memory bloat. It is standard practice today to keep commonly used
+ apps memory resident to provide instantaneous use by "clicking on"
+ an app to make it active. If you plan to process data with MsPASS
+  on a desktop, plan to reduce memory-resident apps to a minimum.
+- Running MsPASS on a desktop can produce unexpected memory faults
+  from other processes that you may not realize are consuming memory.
+ If your job aborts with a memory fault, first try closing every other
+ application and running the job again with the system memory monitor
+ tool running simultaneously.
+- Running MsPASS under docker with normal memory parameters should never
+  actually crash your system. The reason is that docker, by default, only
+  allows the container to utilize some fraction of system memory.
+  Memory allocation failures are then relative to the container size, not
+  the system memory size.
+
+In working with very large data sets there is the added constraint of
+what file systems are available to store the starting waveform data,
+the final results of a calculation, and any intermediate results that
+need to be stored. File system i/o performance is wildly variable
+today, with different types of storage media and mass-store systems
+differing by many orders of magnitude in speed, throughput,
+and storage capacity. Thus, there is a
+different "memory" issue for storing original data, the
+MongoDB database, intermediate results, and final results. That is,
+however, a different topic that is mostly a side issue for the topic
+here of processing memory use. Dask and spark both assume auxiliary
+storage is always infinite and assume your job will handle any
+i/o errors gracefully or not so gracefully (i.e. aborting the job).
+Where file systems enter the memory issue
+is when the system has to do what
+both packages call `spilling`. A worker
+needs to "spill to disk" if the scheduler pushes data to it and
+there is no space to hold it. It is appropriate to think of
+"spilling" as a form of virtual memory management. The main difference is
+that what is "spilled" is not "pages" but data managed by the worker.
+Dask and spark both "spill" data to disk when memory use exceeds some
+high water mark defined by the worker's configuration. It should be
+obvious that the target for spilling should be the fastest file system
+available that can hold the maximum sized chunk of data that might be
+expected for that workflow. We discuss how to estimate worker
+memory requirements below.
+
+The final generic issue about memory management is a software
+issue that many seismologists may not recognize as an issue.
+That is, all modern computer languages (even modern FORTRAN) utilize
+dynamic memory allocation. In a language like C/C++ memory allocation
+is explicit in the code with calls to the `malloc` family of functions in
+C and `new` in C++. In object-oriented languages
+like python and java dynamic allocation is implicit. For instance,
+in python every time a new symbol is introduced and set to a "value"
+an object constructor is called that allocates the space for the data
+the object requires. A complication that arises
+in MsPASS is that we used a mixed-language
+solution for the framework. Part of that is implicit in assembling
+most python applications from open-source components. A large fraction
+of python packages use numpy or scipy, for which most of the code base is
+C/C++ and Fortran with python bindings. In MsPASS we used a similar
+approach for efficiency, with the core seismic data containers
+implemented in C++. The problem any mixed-language solution faces
+is the collision between the memory management concepts of the languages involved.
+In C/C++ memory management is the responsibility of the
+programmer. That is, every piece of data in a `C/C++` application
+that is dynamically allocated with a `malloc/new` statement has to be
+released somewhere else with a call to `free/delete`. Python, in contrast, uses
+what is universally called "garbage collection" to manage memory.
+(A web search will yield a huge list of sources explaining that concept.)
+What this creates in a mixed language solution like MsPASS is
+a potential misunderstanding between the two code bases. That is,
+python and C components need to manage their memory independently.
+If one side or the other releases memory before the other side is finished
+your workflow will almost certainly crash (often stated as "unpredictable").
+On the other hand, if one side holds onto data longer than necessary
+memory may fill and your workflow can abort from a memory fault.
+In MsPASS we use a package called `pybind11` to build the python
+bindings to our C/C++ code base. Pybind11 handles this problem
+through a feature called `return_value_policy` described
+`here `__.
+At the time this manual section was written we were actively working
+to get this setting right on all the C++ data objects, but be warned that
+residual problems may exist. If you experience memory bloat problems
+please report them to us so we can try to fix the issue as quickly as possible.
+
+bag/RDD Partitions and Pure Map Workflows
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+It has been our experience that most seismic data processing
+workflows can be reduced to a series of map-only operators.
+The example above is a case in point. For this class of workflow
+we have found memory use is relatively predictable and scales with
+the number of partitions defined for the bag/RDD. In this section
+we summarize what we know about memory use predictions for this
+important subset of possible workflows.
+
+We need to first define some symbols we use for formulas we
+develop below:
+
+- Let :math:`N` denote the number of data objects loaded into the
+  workflow's bag/RDD.
+- With seismic data the controlling factor for memory use is almost always
+ the number of samples in the data windows being handled by the workflow.
+ We will use :math:`N_s` to define the number of samples per atomic
+ data object. In MsPASS all sample data are stored as double data so the
+ number of bytes to store sample data for TimeSeries objects
+ is :math:`8 N_s` and the number
+ of bytes to store sample data for Seismogram objects
+ is :math:`24 N_s`.
+- All MsPASS atomic objects contain a generalized header discussed at
+ length elsewhere in this user's manual. Because we store such
+ data in a dictionary like container that is open-ended, it is
+ difficult to compute exact size measures of that component of a data
+ object. However, for most seismic data the size of this "header" is
+ small compared to the sample data. A fair estimate can be obtained
+ from the formula:
+ :math:`S_{header} = N_k N_{ks} + 8 ( N_{float} + N_{int} + N_{bool} ) + \bar{s} N_{string} + N_{other}`
+ where :math:`N_k` is the number of keys, :math:`N_{ks}` is an estimate of the
+ average (string) key size, :math:`N_{float}, N_{int}` and :math:`N_{bool}`
+ are the number of decimal, integer, and boolean attributes respectively.
+  In the term :math:`\bar{s} N_{string}`, :math:`\bar{s}` is an estimate of the average size
+  (in bytes) of string values and :math:`N_{string}` is the number of string-valued
+  attributes. Finally, :math:`N_{other}` is an estimate of the
+  size of other data types that might be stored in each object's Metadata
+  container (e.g. a serialized obspy response object).
+- Let :math:`S_{worker}` denote the available memory (in bytes) for processing in
+ each worker container. Note that size is always significantly less than
+ the total memory size of a single node. If one worker is allocated
+ to each node, the available work space is reduced by some fraction
+ defined when the container is launched (implicit in defaults) to
+ allow for use by the host operating system. Spark and dask also each
+ individually partition up memory for different uses. The fractions
+ involved are discussed in the documentation pages for
+ `Spark `__
+ and
+ `Dask `__.
+ Web searches will also yield many additional documents
+ that might be helpful. With dask, at least, you can also establish the
+ size of :math:`S_{worker}` with the graphical display of
+ worker memory in the
+ `dask dashboard diagnostics `__.
+- Let :math:`N_{partitions}` define the number of partitions defined for
+ the working bag/RDD.
+- Let :math:`N_{threads}` denote the number of threads in the thread pool
+ used by each worker. For a dedicated HPC node that is normally the
+ number of cores per node.
+
+From the above it is useful to define two derived quantities.
+An estimate of the nominal size of TimeSeries objects in a workflow
+is:
+
+.. math::
+
+ S_{ts} = 8 N_s + S_{header}
+
+and for Seismogram objects
+
+.. math::
+
+ S_{seis} = 24 N_s + S_{header}
+
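+To make these size formulas concrete, here is a minimal sketch of the
+arithmetic with purely hypothetical numbers; none of the values are
+recommendations.
+
+.. code-block:: python
+
+    # Hypothetical header estimate:
+    #   S_header = N_k*N_ks + 8*(N_float + N_int + N_bool) + sbar*N_string + N_other
+    N_k, N_ks = 30, 10                  # number of keys and mean key length (bytes)
+    N_float, N_int, N_bool = 15, 5, 2   # numeric and boolean attribute counts
+    sbar, N_string = 12, 8              # mean string value size and string count
+    N_other = 0
+    S_header = N_k*N_ks + 8*(N_float + N_int + N_bool) + sbar*N_string + N_other
+
+    # Nominal object sizes for a datum with N_s samples
+    N_s = 8000                          # e.g. 80 s windows at 100 sps
+    S_ts = 8*N_s + S_header             # TimeSeries estimate (bytes)
+    S_seis = 24*N_s + S_header          # Seismogram estimate (bytes)
+    print(S_header, S_ts, S_seis)
+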
+For pure map operator jobs we have found dask, at least, always reduces the
+workflow to a pipeline that moves data as illustrated in the animated gif
+figure above.
+
+The pipeline structure reduces memory use to a small, integer multiple, which we
+will call :math:`N_c` for number of copies, of the input object size. i.e. as
+data flows through the pipeline only 2 or 3 copies are held in memory at the
+same time. However, dask, at least, seems to try to push
+:math:`N_{threads}` objects through the pipeline simultaneously assigning
+one thread per pipeline. Pyspark seems to do the same thing
+but uses separate processes, by default, for each worker. That means that
+the multiplier is at least about 2. Actual usage can be dynamic if the
+size of the objects in the pipeline is variable, which commonly happens
+when one of the time windowing functions is used. In our experience workflows
+with variable sized objects have rapidly fluctuating memory use as
+data assigned to different workers moves through the pipeline.
+
+If we assume
+that model characterizes the memory use of a workflow it is useful
+to define the following nondimensional number:
+
+.. math::
+
+   K_{map} = \frac{\frac{S_{worker}}{N_{threads}}}{\frac{N S_d}{N_{partitions}}}
+   = \frac{S_{worker} N_{partitions}}{N_{threads} N S_d}
+
+where :math:`S_d` denotes the data object size for each component.
+In MsPASS :math:`S_d` is :math:`S_{ts}` for TimeSeries data and
+:math:`S_{seis}` for Seismogram data. In words, :math:`K_{map}`
+is the ratio of memory available per process to the chunk size
+implicit in the data partitioning.
+
+The same formula can be applied to ensembles, but the computation of
+:math:`S_d` requires a different formula given in the section below
+on ensembles. :math:`K_{map}` is best thought of as a nondimensional
+number that characterizes the memory requirements of a pure map
+workflow implemented as pipelines with :math:`N_{threads}` threads
+working on blocks of data with size defined by :math:`N S_d / N_{partitions}`.
+If the ratio is large
+spilling is unlikely. When the ratio is less than one spilling is
+likely. In the worst case, a job may fail completely with a memory
+fault when :math:`K_{map}` is small. As stated repeatedly in this section
+this issue is a work in progress at the time of this writing, but
+from our experience for a typical workflow you should aim to tune the
+workflow to have :math:`K_{map}` be of the order of 2 or more to avoid
+spilling. Workflows with highly variable memory requirements within
+a pipeline (e.g. anytime a smaller waveform segment is cut from a
+larger one) may allow smaller values of :math:`K_{map}`, particularly if
+the initial read is the throttle on throughput.
+
+The main way to control :math:`K_{map}` is to set :math:`N_{partitions}`
+when you create a bag/RDD. In MsPASS that is normally set by
+using the `number_partitions` optional argument in the `read_distributed_data`
+function. Any other approach requires advanced configuration options
+described in documentation for dask and spark.
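+
+As an illustration of how these numbers can guide that choice, here is a
+minimal sketch that estimates :math:`K_{map}` and computes a partition count
+targeting a value of about 2. All input values are hypothetical; the result
+would be passed as the partition count argument of `read_distributed_data`
+as described above.
+
+.. code-block:: python
+
+    import math
+
+    # Hypothetical cluster and dataset parameters (illustrative values only)
+    S_worker = 8 * 1024**3    # usable memory per worker container (bytes)
+    N_threads = 8             # worker threads per worker
+    N = 1_000_000             # number of atomic objects in the bag/RDD
+    S_d = 8*8000 + 500        # nominal object size (8*N_s + S_header)
+
+    def k_map(n_partitions):
+        # K_map = (S_worker * N_partitions) / (N_threads * N * S_d)
+        return (S_worker * n_partitions) / (N_threads * N * S_d)
+
+    # smallest partition count that keeps K_map at or above a target of ~2
+    target = 2.0
+    npartitions = math.ceil(target * N_threads * N * S_d / S_worker)
+    print(npartitions, k_map(npartitions))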
+
+Reduce Operations
+~~~~~~~~~~~~~~~~~~~
+The schedulers used in MsPASS are commonly described as ways to
+implement the "map-reduce paradigm". As noted above, our experience is
+that most seismic processing workflows are most effectively expressed
+as a chain of map operators applied to a bag/RDD. There are, however,
+two common algorithms that can be expressed as "reduce" operators:
+(1) one-pass stacking (i.e. an average that does not require an
+iterative loop such as an M-estimator), and (2) forming ensembles on the
+fly from a bag/RDD of atomic data. These two examples have fundamentally different
+memory constraints.
+
+A stacking algorithm that produces a smaller number of output signals
+than inputs, which is the norm, is less subject to memory issues.
+That is particularly true if the termination of the workflow saves
+the stacked data to a database. To be more concrete, here is
+a sketch of a simple stacking algorithm summing common source gathers
+aligned by a P wave arrival time defined in each object's Metadata
+container with the key "Ptime". The data are grouped for the
+reduce(fold) operation by the Metadata key `source_id`:
+
+.. code-block:: python
+
+    def ator_by_Ptime(d):
+        """
+        Small helper function needed for alignment by the Ptime key.
+        """
+        # A more robust version should test for validity - here assume data
+        # was preprocessed to be clean
+        t0 = d["Ptime"]
+        return d.ator(t0)
+    def keyfunc(d):
+        """
+        Used in foldby to define the grouping operation - here by source_id
+        """
+        return d["source_id"]
+
+    from mspasspy.reduce import stack
+    # assumes read_distributed_data and WindowData were imported earlier
+    # Assumes data were preprocessed to be clean and saved with this tag
+    query = {"data_tag" : "read_to_stack_data"}
+    cursor = db.wf_TimeSeries.find(query)
+    # assumes npartitions is set earlier in the code - see text for discussion
+    mybag = read_distributed_data(db, cursor, number_partitions=npartitions)
+    mybag = mybag.map(ator_by_Ptime)
+    mybag = mybag.map(WindowData, -5.0, 30.0)
+    # foldby is a dask method of bag - pyspark has a different function name
+    mybag = mybag.foldby(keyfunc, stack)
+    mybag = mybag.map(db.save_data, data_tag="stacked_data")
+    results = mybag.compute()
+
+The DAG for this workflow with 2 sources and 5 partitions looks like this:
+
+.. _Reduce_figure:
+
+.. figure:: ../_static/figures/ReduceFigure.png
+ :width: 600px
+ :align: center
+
+ Data flow diagram for example reduce operation to stack two
+ sources with a bag/rdd made up of atomic TimeSeries data with
+ five partitions. Dashed lines show concept of how partitions
+ divide up the container of data illustrated as a stack of black
+ rectangles. Each rectangle represents a single TimeSeries object.
+ The arrows illustrate data flow from left to right in the figure.
+ As illustrated earlier the reader loads each partition and then
+ makes that data available for processing by other tasks. Processing
+ tasks are illustrated by black rectangles with arrows joining them
+ to other tasks that illustrate the DAG for this workflow.
+ Note that in this case each partition has to be linked to each
+ instance of the stack task. This illustrates why the scheduler has
+ to keep all data in memory before running the stack process as the
+ foldby function has no way of knowing what data is in what partition.
+
+We emphasize the following lessons you should take away from
+the above:
+
+- The dask `foldby` method of bag combines two concepts that define the
+ "reduce" operation: (1) a function defining how data to be
+ stacked are grouped, and (2) a function telling dask how the data object
+ are to be combined. The first is the small function we created
+  called `keyfunc` that returns the value of `source_id`. The second
+  is the MsPASS `stack` function, which functions correctly as a
+  "reduce" operator (for more on that topic see the section titled :ref:`parallel_processing`).
+- In this workflow the left side of the graph is a
+ chain of two map operators like the earlier example in this section.
+  The difference here is that the set of pipelines terminates at foci
+  directed into the stack function. That illustrates graphically how
+ the `stack` function merges multiple inputs into a single output.
+ In this case, it does that by simply summing the inputs to produce
+ one output for each `source_id`. In terms of memory use this means
+ the final output should normally be much smaller than the inputs.
+- Our example above shows a best practice that would be normal use for
+ any stack algorithm. That is, the final operator is a call to the
+ `save_data` method of the database handle (`db` symbol in this example).
+ The default behavior of `save_data` is to return only the ObjectId
+ of the inserted waveform document.
+ As a result, on the last line when the `compute` method is
+ called, dask initiates the calculations and arranges to have
+ the output of `save_data` returned to the scheduler node. That approach
+ is useful to reduce memory use in the scheduler node and data traffic
+ as calling the output of the `compute` method is the content of the
+ bag converted to a python list. If the output is known to be small
+ one can change the options of `save_data` to return the outputs from stack.
+- Notice that the number of branches on the left side of the DAG is set
+ by the number of partitions in the bag, not the number of objects in the
+ bag. Dask and spark both do this, as noted above, to reduce the size of the
+ DAG the scheduler has to handle.
+- The biggest potential bottleneck in this workflow is the volume of
+ interprocess communication required between the workers running the
+ `ator_by_Ptime` and `WindowData` functions and the `stack` operator.
+ With a large number of sources a significant fraction of the `WindowData`
+ outputs may need to be moved to a different node running `stack`.
+- A related issue for a `foldby` operation is
+  memory requirements. The intermediate step, in this workflow,
+  of creating the bag of `stack` outputs should, ideally, fit in memory
+ to reduce the chances of significant "spilling" by workers assigned the
+ `stack` task. The reason is that the grouping function implicit in
+ the above workflow cannot know until the entire input bag is processed
+ where to send all the outputs of the map operations. The stack outputs
+ have to be held somewhere until the left side of the DAG completes.
+- A final memory issue is the requirement for handling the input.
+  As above, the critical, easily set option is the value assigned to the `npartitions`
+  parameter. We recommend computing the value of :math:`K_{map}` with
+  the formula above and setting up the run to assure
+  :math:`K_{map}>1`. Unless the average number of inputs to `stack` is
+  small, that should normally also guarantee the output of the `stack`
+  task would not spill.
+
+A second potential form of a "reduce" operation we know of in MsPASS is
+forming ensembles from a bag of atomic objects. A common example where
+this will arise is converting `TimeSeries` data to `Seismogram` objects.
+A `Seismogram`, by definition, is a bundle created by grouping a set of
+three related single-component `TimeSeries` objects. The MsPASS `bundle` function,
+in fact, requires an input of a `TimeSeriesEnsemble`. A workflow to do that
+process would be very similar to the example above using `stack`, but the
+`stack` function would need to be replaced by a specialized function that would
+assemble a `TimeSeriesEnsemble` from the outputs of the `WindowData` function.
+One could then follow that specialized function with a map operator
+that runs `bundle`. We have tried that, but found it is a really bad idea.
+Unless the entire data set is small enough to fit two copies of the data in
+memory that job can run for very long times from massive spilling or abort
+on a memory fault. We recommend an ensemble approach utilizing
+the database to run bundle as described in the next section.
+
+Utilizing Ensembles Effectively
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A large fraction of seismic workflows are properly cast into a framework
+of processing data grouped into ensembles. Ensemble-based processing,
+however, is prone to producing exceptional memory pressure. The reason
+is simply that the chunks of data the system needs to handle
+are larger.
+
+Let us first consider a workflow that consists only of a pipeline of
+map processes like the example above. The memory use can still be
+quantified by :math:`K_{map}` but use the following to compute the
+nominal data object size:
+
+.. math::
+
+ S_d = \bar{N}_{member} \bar {S}_d + S_{em}
+
+where :math:`\bar{N}_{member}` is the average number of ensemble
+members, :math:`\bar {S}_d` is the average member size, and
+:math:`S_{em}` is the nominal size of each ensemble Metadata container
+(normally a small factor anyway). Note :math:`\bar{S}_d` is the value
+:math:`S_{ts}` or :math:`S_{seis}` defined above for `TimeSeries` or `Seismogram` members of
+`TimeSeriesEnsemble` and `SeismogramEnsemble` objects respectively.
+An ensemble-based workflow that terminates in a stacking operation
+that reduces an ensemble to an atomic data object will have less
+memory pressure, but is still subject to the same memory pressure
+quantified by :math:`K_{map}`.
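+
+A minimal sketch of that arithmetic, continuing the hypothetical numbers
+used earlier:
+
+.. code-block:: python
+
+    # Hypothetical ensemble size estimate (illustrative values only)
+    Nbar_member = 400         # average number of ensemble members
+    Sbar_d = 8*8000 + 500     # average member size (S_ts or S_seis from above)
+    S_em = 1000               # nominal ensemble Metadata size (bytes)
+
+    S_d = Nbar_member * Sbar_d + S_em
+    print(S_d / 1024**2, "MB per ensemble")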
+
+There is an important class of ensemble processing we noted in the
+previous section: using the `bundle` function to create
+`SeismogramEnsemble` objects from an input `TimeSeriesEnsemble`
+container. Any data originating as miniseed data from an FDSN
+data center that needs to be handled as three-component data
+would need to pass through that process. The following is an
+abbreviated sketch of a standard workflow for that purpose for
+data logically organized by source:
+
+.. code-block:: python
+
+ #imports would normally be here - omitted for brevity
+ def make_source_id_queries(db):
+ """
+ Demonstrates how to generate a list of queries to use as
+ input for read_distributed_data to build a bag of ensembles.
+ """
+ srcidlist = db.wf_miniseed.distinct("source_id")
+ querylist = list()
+ for id in srcidlist:
+ query = {"source_id" : id}
+ querylist.append(query)
+ return querylist
+
+ # Initialization code for database handle (db) would normally be here
+ matcher = MiniseedMatcher(db)
+ querylist = make_source_id_queries(db) # defined above
+ number_partitions = len(querylist)
+ mybag = read_distributed_data(querylist,
+ db,
+ collection="wf_miniseed",
+ npartitions=number_partitions,
+ )
+ mybag = mybag.map(detrend) # not required but more efficiently done at this stage
+ mybag = mybag.map(normalize,matcher) # needed to define orientation attributes
+ mybag = mybag.map(bundle_seed_data)
+ mybag = mybag.map(db.save_data)
+ mybag.compute()
+
+
+This algorithm uses only map operators but can be very memory intensive if
+the ensembles are large. The reason is that the function `bundle_seed_data`
+by necessity has to hold two copies of the data in memory; it works through
+the input `TimeSeries` objects and assembles each appropriate group of three
+into a `Seismogram` object. The example shows the simplest approach
+to reduce memory use. We create the dask bag with the `read_distributed_data`
+function and pass it the optional parameter
+`npartitions` set so each ensemble is treated as a single partition.
+If the ensemble size is large (:math:`K_{map}<1`) three approaches can be considered
+to improve performance.
+
+#. A common approach is to download data over a longer window than actually
+ needed for a particular study. e.g. one might have an archive of
+ teleseismic event files with miniseed data segments of the order of
+ one hour in length. If the focus is only P waves, windowing
+ with `WindowData` as used in the earlier example could reduce the data
+ size by an order of magnitude.
+#. Although we've never tried this, it should be feasible to create a
+   sequence of MongoDB queries that would sort miniseed data appropriately
+   and group them into smaller bundles of the order of 3 that could be
+   scanned and "bundled" into atomic `Seismogram` objects with the
+   function :code:`BundleSeedGroup`. That workflow would be similar to
+   the one above, but the list of queries passed to `read_distributed_data`
+   would be more complex and usually much larger (a sketch of building
+   such a query list follows this list).
+#. If all else fails the workflow can be run as a serial job.
+ For small data sets that can be the best alternative. For very large
+ data sets the time required can be problematic and would only be
+ feasible if the workflow is designed to be restarted from where the
+ last run stopped. For example, the authors ran a benchmark on
+ a desktop system with
+   an extended USArray dataset containing all lower-48 broadband stations
+   operating in 2013. A job to do the process above alone would have required on the
+   order of weeks on a desktop machine
+   for one year of data. That is a feasible, but awkward, calculation
+   by any definition.
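+
+As a hedged sketch of the second approach, the following shows one way such a
+query list might be generated. The grouping here is only by net and sta codes;
+a production version would normally add time-window terms, and the result
+would be passed to `read_distributed_data` as in the example above.
+
+.. code-block:: python
+
+    # Minimal sketch - assumes db is a Database handle as in earlier examples.
+    # Build one query per (net, sta) pair so each group contains only the
+    # small set of channels recorded by one station.
+    querylist = []
+    for net in db.wf_miniseed.distinct("net"):
+        for sta in db.wf_miniseed.distinct("sta", {"net": net}):
+            querylist.append({"net": net, "sta": sta})
+    print(len(querylist), "queries generated")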
+
+There is one final type of ensemble processing worth noting.
+There are many examples where a logical organization is to
+read data as atomic data objects, apply some standard tasks like
+windowing and filtering, and then group the data and assemble them into
+ensembles for subsequent processing that requires the data to be grouped
+(e.g. a moment-tensor inversion requires data to be assembled into
+source-based ensembles.). The problem is that the grouping operation is
+a form of "reduce/fold" that is can be done efficiently only if the
+results fit in cluster memory. For that case most will likely find the
+approach using MongoDB discussed in the next section is superior
+because it is more readily scaled to arbitrarily large data sets.
+
+Using MongoDB to Manage Memory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Users should always keep in mind that the ultimate, scalable solution for
+memory management is the MongoDB database. If an algorithm applied to
+a dataset is memory intensive, one question to consider is whether there is a
+solution using MongoDB. The case immediately above is an example;
+running the lower-level :code:`BundleSeedGroup` could, in principle, be
+used to break the problem into smaller chunks. With the right incantation sent to
+MongoDB that algorithm is likely a good alternative way to create
+`Seismogram` objects from single station groups of `TimeSeries` objects.
+
+There are a number of other common algorithms that we know from experience
+can be handled most easily by utilizing MongoDB.
+
+#. Any algorithm that requires data to be sorted into a specific order
+ with one or more header (Metadata) attributes is best initiated with
+ MongoDB. There are ways to order a data set in the middle of a workflow,
+ but dask and spark documentation warn that can create a performance
+ issue. Further, assembling the atomic data into ensembles with
+ a function like dask foldby is subject to the memory issues discussed
+ above. Hence, in our experience using MongoDB is a more scalable approach.
+ MongoDB sorting, particularly if used with an appropriate
+ index, is a very efficient way to build a sorted and grouped data set. We should
+ note that ordered data ALWAYS require data to be grouped and
+ loaded into an ensemble container. The reason is that dask and spark
+ do not necessarily preserve order in a map operator. That is, the
+ data in an output bag may be shuffled relative to the input in a map
+ operation. Hence,
+ processing workflows cannot depend on order as is common practice in
+ all seismic reflection processing packages we are aware of.
+#. Dismantling ensembles into atomic components can only be done at present by
+ saving the ensemble data and then reading it back as atomic objects.
+#. As noted in many places in this user's manual MsPASS uses the idea of
+ a "live" attribute on the native data objects to flag a datum as bad.
+ Such data are carried along in a bag/RDD and consume space because
+ most functions that kill such data leave the data array intact.
+ If a lot of data have been killed, which is common in a major editing
+ step like the snr or edit module functions, memory pressure can often be
+ drastically reduced by removing the dead data. The cleanest way to do
+ that, and preserve the record of what was killed, is to do an intermediate
+ save of the data set and then recreate a new bag/RDD for subsequent
+   processing of the edited data by reading it back again before continuing
+   (a sketch of that save-and-reread pattern follows this list).
+   In our experience, it is generally useful
+   to treat this as a step in processing where the result needs to be reviewed
+   before continuing anyway. The jupyter notebook you create,
+   along with records in the database, will then
+   preserve your edits.
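+
+A minimal sketch of that save-and-reread pattern, following the conventions of
+the earlier examples (the `data_tag` value and partition count are hypothetical):
+
+.. code-block:: python
+
+    # Intermediate save of the edited data set under a distinct data_tag
+    mybag = mybag.map(db.save_data, data_tag="post_edit")
+    mybag.compute()
+
+    # Recreate a bag containing only the saved (edited) data and continue
+    query = {"data_tag": "post_edit"}
+    cursor = db.wf_TimeSeries.find(query)
+    mybag = read_distributed_data(db, cursor, number_partitions=npartitions)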
diff --git a/docs/source/user_manual/mongodb_and_mspass.rst b/docs/source/user_manual/mongodb_and_mspass.rst
new file mode 100644
index 000000000..a94637317
--- /dev/null
+++ b/docs/source/user_manual/mongodb_and_mspass.rst
@@ -0,0 +1,1091 @@
+.. _mongodb_and_mspass:
+
+Using MongoDB with MsPASS
+==============================
+*Prof. Gary L. Pavlis*
+------------------------
+Overview and Roadmap
+-----------------------
+There are a huge number of internet and printed resources on the
+Database Management system used in MsPASS called `MongoDB`.
+The reason is that MongoDB is one of the most heavily used open source
+packages in the modern software ecosystem. For that reason, in earlier
+versions of our User's Manual we simply punted the ball and told users
+to consult online sources. It became clear, however, that the
+firehose of information that approach creates is not for everyone.
+Hence, we created this section to reduce the firehose to a, hopefully,
+manageable stream of information.
+
+The first section, which is titled "Concepts", is introductory material.
+The material, however, is broken into subsections directed at people
+with specific backgrounds. If a subsection title matches you, start there. If
+you fit none of the descriptions, read them all. The sections after that
+are organized by the letters of the standard acronym used in
+many sources on database systems: CRUD (Create, Read, Update, and Delete).
+The final section covers an auxiliary issue of MongoDB indexes.
+Indexes are critical for read performance, but are not required.
+
+Concepts
+-------------
+MongoDB for RDBMS Users
+~~~~~~~~~~~~~~~~~~~~~~~~~
+Users who are familiar with Relational DataBase Management Systems (RDBMS)
+will find MongoDB has many similar concepts, but also is fundamentally
+different from any RDBMS. In an RDBMS the conceptual model of the data
+structure the system manages is a table (a "relation" in database theory
+is synonymous with "table"). That is, however, better thought of as an
+implementation detail for what the system is aiming to do.
+Computer science textbooks on database theory make heavy use of set theory
+and think of a database as an abstract set of "attributes" that define
+some property and have some relationship. Some kinds of data fit the
+tabular model well, but some do not fit the concept at all.
+Seismic data objects are a case in point; traditional "header" attributes
+like those in SAC or SEGY map well into the RDBMS model, while things like
+the sample data and response information are awkward, at best, to handle
+that way. If you are a user of a system like Antelope that uses CSS3.0
+you will know what I mean. The other fundamental weakness of the
+tabular model is handling the "null" problem. That is, if an attribute
+is optional the table has to define what constitutes a null value.
+MongoDB suffers neither of those issues, but has its own "implementation
+details" that we will discuss later.
+
+A relation (table) indexes a given piece of data by row and column;
+the column index is an attribute name, aka a key. The rows (tuples) have
+an implicit index defined by the "primary keys" of the relation.
+(They sometimes have alternate keys, but that is a detail.)
+The primary key always has some indexing scheme to speed lookup
+by that key. One way to think of MongoDB's data model is to
+think of it as a collection of binary blobs (a list) with one or more
+fast indexing methods that allow random retrieval of any of the
+things it stores by the index. It is like a relation with only
+tuples, where the tuples have no restrictions other than
+a requirement that they contain the
+indexing attribute. These open-ended tuples are called "documents".
+The contents of the documents are not, however, just a binary blob.
+The contents are REQUIRED to be indexable as key-value pairs
+(The same concept as a python dictionary discussed in the next section.).
+The "value" associated with the key has no restriction, but
+the approach makes no sense unless the "value" can be translated into
+something that matches the concept that key is supposed to reference.
+For example, a seismic station "name" is a unique tag seismologists
+use to describe a particular observatory/site. Earthquake seismologists
+nearly always use human-generated names stored as a string.
+Multichannel seismic data management systems (e.g. Seismic Unix)
+commonly use an integer index to define a given position because
+survey flags are commonly defined by sequential numbers defining positions
+along a line/profile. Conceptually, however, both approaches require the
+same thing. An index (station name or station index number) is used
+to define a particular geographic location. Different keywords (keys)
+are used to define particular concepts. For the case in point,
+in MsPASS we use
+"lat", "lon", and "elev" as keys to define an instruments geographic
+position. Fetching all three from a particular document will produce all
+the data required to define a unique position. The same thing is done
+in an RDBMS, but the table model is used to define the model of how
+to fetch a particular attribute by a similar key. e.g. in the CSS3.0
+schema the same three attributes have the same keys.
+That is, the CSS3.0 standard defines the keys "lat", "lon", and "elev"
+as latitude, longitude, and elevation.
+
+In MongoDB the equivalent of a table (relation) is called a "collection".
+A more technical definition of a "collection" is a set of "documents"
+accessible with a random access iterator. What that means, in practice,
+is that a collection acts like an RDBMS table in the sense that the
+contents can be sorted into any order or subsetted with a query.
+There is no direct equivalent, however, of "SELECT". Documents normally have to
+be eaten whole and chewed on a bit if you need to access only
+a limited set of attributes in each document. The thing that makes
+a relational database "relational" is not done well by MongoDB.
+That is, various forms of a "join" between two tables are a fundamental
+operation in most, if not all operational relational database systems.
+MongoDB has an equivalent in what it calls "normalization", but
+our experience is that it is much slower than comparable operations with
+an RDBMS. We discuss normalization at length in the
+related section of the User's Manual titled :ref:`Normalization`.
+The results of a query
+can also be "grouped", but require very different programming constructs than
+SQL.
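+
+As a small illustration of sorting and subsetting a collection with pymongo
+(again assuming `db` is a database handle):
+
+.. code-block:: python
+
+    import pymongo
+
+    # Subset wf_miniseed for one station and sort the result by start time
+    cursor = db.wf_miniseed.find({"sta": "PFO"}).sort(
+        [("starttime", pymongo.ASCENDING)]
+    )
+    for doc in cursor:
+        print(doc["chan"], doc["starttime"])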
+
+Perhaps the most important construct shared by all the RDBMS systems I
+know of and MongoDB is the concept of a "cursor". In an RDBMS a cursor
+is a forward iterator (i.e. it can only be incremented) that loops over the set of tuples returned by a query.
+In MongoDB it is more-or-less the same thing with different words.
+A MongoDB cursor is a forward iterator that can be used to work through
+a set of documents returned by a query. We will see numerous examples
+of using cursors in the MsPASS User's manual and in any source discussing
+MongoDB.
+
+MongoDB for Python programmers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+All python programmers will be familiar with the container
+commonly called a "dictionary" (dict). Other sources call the same concept
+an "associative array" (Antelope) or "map container" (C++ and Java).
+Why that is important for MongoDB is simple: the python bindings for
+MongoDB (`pymongo`) use a dict to structure the content of what
+MongoDB calls a "document" AND pymongo uses dict containers as the base of
+its query language. Two simple examples from the related tutorial notebook
+illustrate this.
+
+*Document output example:*
+
+.. code-block:: python
+
+    from bson import json_util
+    doc = db.wf_miniseed.find_one()
+    print(json_util.dumps(doc,indent=2))
+
+Produces:
+
+.. code-block:: python
+
+ Paste here
+
+*Query example:*
+
+.. code-block:: python
+
+    query = dict()
+    query['sta'] = 'AAK'
+    query['chan'] = 'BHZ'
+    query['loc'] = '00'
+    print("query content in pretty form")
+    print(json_util.dumps(query,indent=2))
+    doc = db.wf_miniseed.find_one(query)
+    print("output of query")
+    print(json_util.dumps(doc,indent=2))
+
+
+MongoDB for Pandas Users
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Most users who have had any significant experience with python will
+likely have encountered pandas. The name "pandas" is
+one of those strained acronyms. Multiple online sources indicate the
+name comes from "panel data", which is basically a stretch of a name for
+a table. That insight is fundamental as pandas can be thought of as
+little more than a python version of a spreadsheet. In addition, more
+elaborate features of the pandas API can be used to mimic much of
+an RDBMS's functionality.
+
+Since pandas are little more than an API for manipulating tables,
+linking pandas to MongoDB differs little from linking an RDBMS table
+to MongoDB. What I mean by that is perhaps best illustrated by
+an example. The `Antelope software `__
+used by many seismologists is a "flat-file" RDBMS. It stores tabular
+data in simple text files that can be viewed with standard unix tools.
+(Note most RDBMS systems hide data behind an API like MongoDB does and
+the data are stored in some binary set of files accessible only through
+a server.) Antelope uses the CSS3.0 schema. One way pandas can
+be used with MsPASS is to import CSS3.0 tables. With Antelope
+files that is easily done with the `read_fwf` function in pandas. The
+following illustrates an alternative way to create a `site` collection
+from an Antelope "site" table.
+
+.. code-block:: python
+
+    import pandas
+    from obspy import UTCDateTime
+    keys = ['sta','ondate','offdate','lat','lon','elev','staname','statype','refsta','dnorth','deast','lddate']
+    widths = [6,8,8,9,9,9,50,4,6,9,9,17] # need the antelope schema file to get these
+    df = pandas.read_fwf('demo.site',names=keys,widths=widths)
+    doclist = df.to_dict('records')
+    # release the DataFrame - it is no longer needed once doclist exists
+    del df
+    # This loop is needed to convert ondate and offdate to starttime and
+    # endtime used in MsPASS.
+    for doc in doclist:
+        # In CSS3.0 these are integers for year day. UTCDateTime
+        # converts correctly ONLY if it is first converted to a string
+        ondate = str(doc['ondate'])
+        offdate = str(doc['offdate'])
+        starttime = UTCDateTime(ondate).timestamp()
+        endtime = UTCDateTime(offdate).timestamp()
+        doc['starttime'] = starttime
+        doc['endtime'] = endtime
+    # this script assumes db is a MongoDB Database handle set earlier
+    db.site.insert_many(doclist)
+
+A few details worth noting about this example:
+
+- The list of keywords assigned to the symbol `keys` is needed because
+  Antelope table files do not have attribute names as the first line of
+  the file. The list used above uses CSS3.0 attribute names. The order
+  is significant as the names are tags on each column of data loaded
+  with `read_fwf`.
+
+- The `widths` symbol is set to a list of fixed field widths. They were
+  derived from the Antelope schema file.
+
+- The call to the pandas `to_dict` method converts the pandas table to
+ a list of python dictionaries.
+
+- The for loop after the call to `to_dict` is not strictly necessary.
+ It is used in this example to produce a "site collection" consistent
+ with the MsPASS namespace. This is an example of a disconnect in
+ concept between two database systems. CSS3.0 is an older standard and
+ the committee that developed it elected to store the "ondate" and "offdate"
+ fields as integers that specified time to the nearest day. The SEED
+ standard changed the equivalent to a time stamp normally specified as
+ a unix epoch time or a date string. Here we convert the time to a
+ unix epoch time through obspy's UTCDateTime class.
+
+- The last line is the only MongoDB component of this script. More examples
+ like this are seen below. A key point here is that `insert_many` can
+  handle any number of documents defined in doclist. It is, of course,
+  memory limited because the pandas DataFrame and `doclist` are held in memory. The
+  del call in the script demonstrates good practice to release potentially
+  large memory objects like `df` after they are no longer needed.
+
+The above example works for the special case of Antelope text-based
+database files. The pandas API, as experienced pandas users know,
+has a rich set of readers that can read nearly any imaginable
+tabular data format from files, sql servers, and online sources. These are documented
+`here `__ and include
+Excel, csv, and json formatted files, SQL servers, and jargon most of
+us have never seen. I have found that for research problems the fact that MongoDB
+documents are completely agnostic about content can be very helpful.
+For a new problem it is trivial to create a new collection and start putting
+things (documents) into it and have the data available by MongoDB queries.
+Readers should realize the schema we impose on seismic waveform collections
+exists to provide a standardized namespace for keys that allows the
+framework to be extended without breaking lower-level functionality.
+For the exploration stages of a research problem, having no schema
+constraints is a very useful feature of MongoDB. Importing data through
+pandas is a particularly simple way to import many forms of data
+you may acquire from internet sources today.
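+
+For example, here is a minimal sketch of importing an arbitrary csv file into
+a new, schema-free collection; the file and collection names are hypothetical:
+
+.. code-block:: python
+
+    import pandas
+
+    # Read any tabular file pandas understands and push it into a new
+    # collection.  The collection is created on the first insert.
+    df = pandas.read_csv("my_research_table.csv")
+    db.my_research_collection.insert_many(df.to_dict("records"))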
+
+A final key point about pandas is that both dask and pyspark
+have a parallel equivalent. Both refer to the
+equivalent of a pandas data structure as
+a `DataFrame`. A large fraction of the pandas API is available
+in the dask and pyspark DataFrame APIs. Experienced pandas users
+may find it helpful in handling large tabular data sets to develop
+applications with MsPASS that use the DataFrame API to manipulate
+the tabular data. With dask or pyspark most pandas operations
+can be parallelized.
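+
+As a minimal sketch, an in-memory pandas DataFrame like the one in the Antelope
+example above can be converted to a dask DataFrame and manipulated with largely
+the same API:
+
+.. code-block:: python
+
+    import pandas
+    import dask.dataframe as dd
+
+    # Convert a pandas DataFrame to a partitioned dask DataFrame.
+    # Operations like the mean below are then computed in parallel.
+    df = pandas.DataFrame({"sta": ["AAK", "PFO"], "elev": [1.68, 1.28]})
+    ddf = dd.from_pandas(df, npartitions=2)
+    print(ddf["elev"].mean().compute())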
+
+Queries (Read of CRUD)
+-----------------------
+Query language
+~~~~~~~~~~~~~~~~
+In my experience the single most important usage of a database like
+MongoDB in MsPASS research data processing is defining queries to
+select a subset of data holdings or to define groupings (ensembles)
+to be processed together. A barrier to usage, however, is that
+MongoDB uses a unique and rather strange query language that users
+familiar with a language like SQL will find foreign. Furthermore,
+the biggest weakness I've seen in the
+online sources I've found on MongoDB usage is a failure to
+address the fundamental syntax of the query language.
+All sources seem to think the best way to understand the
+language is from examples. Somewhat true, but many of us find it
+easier to remember a few basic rules than a long list of
+incantations. This section is an attempt to provide some
+simple rules that can, I hope, help you better understand the
+MongoDB query language. Here are what seem to me to be the
+fundamental rules:
+
+1. All queries use a python dictionary to contain the instructions.
+2. The key of a dictionary used for query normally refers to an attribute
+ in documents of the collection being queried. There is an exception
+ for the logical OR and logical AND operators (discussed below).
+3. The "value" of each key-value pair is normally itself a python
+ dictionary. The contents of the dictionary define a simple
+ language (Mongo Query Language) that resolves True for a match
+ and False if there is no match.
+4. The keys of the dict containers that are on the value side of
+ a query dict are normally operators. Operators are defined with
+ strings that begin with the "$" symbol.
+5. Simple queries are a single key-value pair with the value either
+ a constant or a dictionary with a single operator key. e.g.
+   to test for the "sta" attribute being the constant "AAK" the
+   query could be either `{"sta" : "AAK"}` or `{"sta" : {"$eq" : "AAK"}}`.
+   The form with a constant value works only for "$eq".
+6. Compound queries (e.g. time interval expressions) have a value
+ with multiple operator keys.
+
+In the examples below, refer back to these rules to help you remember
+these fundamentals.
+
+Query methods
+~~~~~~~~~~~~~~~~
+Querying (read) is again a "collection operation". That is, if we set
+the symbol `db` to a MsPASS or MongoDB `Database` object, the query
+functions are "methods" of a collection object. (see longer discussion
+above in the "Create" section) There are three standard methods for
+the "Read" part of CRUD. We will show examples of all three below.
+
+1. `find_one` returns a document that is the first document found matching
+ a query operator.
+2. `find` returns a MongoDB
+ `Cursor object `__
+ that can be used to iterate through query that returns many documents.
+3. `count_documents` is a utility function used to determine how many documents
+   match a particular query.
+
+Examples of the use of each of the three functions above:
+
+.. code-block:: python
+
+ query={'sta' : 'AAK'} # shorthand for {'sta' : {'$eq' : 'AAK'}}
+ doc = db.site.find_one(query)
+ print("First matching document in site collection for query=",query)
+ print(doc)
+ print("All documents in site collection for query=",query)
+ cursor = db.site.find(query)
+ for doc in cursor:
+ print(doc)
+ n_matches = db.site.count_documents(query)
+ print("Number of documents matching query=",query," is ",n_matches)
+
+`find` and `find_one` are the basic document-level fetching methods.
+The examples above show the most common, simple usage.
+Both, however, actually have three positional arguments with defaults
+you should be aware of.
+
+1. `arg0` defines the query operator. The default is an empty dictionary
+ that is interpreted as "all".
+2. `arg1` defines a "projection" operator. That means it is expected to
+ be a python dictionary defining what attributes are to be retrieved or
+ excluded from the returned value(s). For RDBMS users a "projection"
+ in MongoDB is like the SELECT clause in SQL. That idea is best
+ illustrated by examples below.
+3. `arg2` is an "options" operator. I have personally never found a
+   use for any of the listed options in the MongoDB documentation. I can't
+ even find an online example so "options" are clearly an example of
+ "an advanced feature" you can ignore until needed.
+
+Note also that a `find_one` returns only a single "document", which
+pymongo converts to a python dictionary. The `find` method, in
+contrast, returns a pymongo `Cursor` object. A `Cursor` is, in
+the jargon of data structures, a "forward iterator". That means it can
+only be traversed in one direction from the first to last document retrieved
+by the MongoDB server. There is a `rewind` method for the cursor object
+but it is of use largely for interactive debugging.
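+
+Before moving on, here is a small illustration of the projection argument noted
+above; the attribute choices are arbitrary:
+
+.. code-block:: python
+
+    # Return only the coordinate attributes from matching site documents.
+    # Setting "_id" to 0 suppresses the ObjectId otherwise always returned.
+    projection = {"lat": 1, "lon": 1, "elev": 1, "_id": 0}
+    cursor = db.site.find({"sta": "AAK"}, projection)
+    for doc in cursor:
+        print(doc)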
+
+We will next consider a series of increasingly complicated examples.
+
+Simple (single key) queries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Single key queries are always of the form:
+`{key : expression}` where `key` is the attribute that is to be tested
+by the query and `expression` is either: (1) another dictionary or
+(2) a single value. An example is the same one we used above.
+
+.. code-block:: python
+
+ query={'sta' : 'AAK'}
+ query2={'sta' : {'$eq' : 'AAK'}}
+
+`query` and `query2` are completely equivalent.
+Both are equality tests for the attribute with
+the key "sta" matching a particular, unique name "AAK".
+
+A similar inequality test for waveforms having `starttime` values
+after a particular date is the following:
+
+.. code-block:: python
+
+ from obspy import UTCDateTime
+ t_cutoff = UTCDateTime('2012-07-28T00:00:00.00')
+ # query here needs to convert to a unix epoch time (timestamp method)
+ # for numerical comparison to work
+ query = {'starttime' : {'$gt' : t_cutoff.timestamp()}}
+ cursor = db.wf_miniseed.find(query)
+
+MQL has a rich collection of operators.
+`This page `__
+of the MongoDB documentation has the complete list. A particularly useful
+one for most seismologists that is typically omitted from introductory
+tutorials is the
+`$regex `__
+operator. $regex can be used to apply a unix regular expression in
+a query operation. Most seismologists are familiar with the regular
+expression syntax from using the unix shell. The following, for
+example, could be used to select only PASSCAL temporary experiments
+from a site collection:
+
+.. code-block:: python
+
+ query={'net' : {'$regex' : 'X.'}}
+
+Note that works because of an FDSN convention that net codes starting
+with "X" are shorter term deployments. Regular expressions are a rich
+language for text-based filtering. See the link above or do a web
+search for more examples.
+
+Multiple key queries
+~~~~~~~~~~~~~~~~~~~~~~~
+A query to test the value of more than one attribute uses a dictionary
+with multiple keys. In most cases each key
+defines an attribute to be tested for matches by the query operation.
+The key can, however, also sometimes be an operator, in which case
+the dictionary would be called a "compound query" (see example below).
+For a normal example, the following can be used to find all documents for
+all channels for station with net code "II" and station code "PFO":
+
+.. code-block:: python
+
+    query = dict()
+    query['net'] = 'II'
+    query['sta'] = 'PFO'
+    cursor = db.channel.find(query)
+    for doc in cursor:
+        print(doc)
+
+I used explicit code to set the query dict container for instructional
+purposes. That form emphasizes that `query` is a python dictionary
+and the query uses 'net' and 'sta' attributes.
+Most online sources use the inline form for defining a python
+dictionary. That is, the following could be used to replace the query definition
+above:
+
+.. code-block:: python
+
+ query = {'net' : 'II', 'sta' : 'PFO'}
+
+For simple queries the inline form is generally easier. I have found,
+however, that for complex queries like the examples below the form using
+key-value setting pairs is less error prone. With complex inline expressions
+it is easy to get confused about which curly bracket belongs where.
+
+A final important point about multiple attribute queries is that
+there is an implied "AND" operation between the dictionary components.
+For example, the example query above could be stated in words as:
+`net attribute is 'II' AND sta attribute is 'PFO'`. A logical "OR"
+query requires a compound query (next section).
+
+Compound queries
+~~~~~~~~~~~~~~~~~~~
+Compound queries mean multiple conditions applied to one or more attributes.
+A type example is a very common one in seismology. That is, waveform
+data are always stored as segments with each segment having a start time
+(`starttime` in the stock MsPASS namespace) and ending time
+(`endtime` in the MsPASS namespace). We often want to extract
+a waveform segment with a particular time span from a database indexing
+an entire dataset. That dataset may be larger windows downloaded
+previously or an archive of continuous data commonly stored as day files.
+The problem is complicated by the fact that a requested time window
+may span the artificial gap at day boundaries in continuous data or
+data gaps that are marked in the set of wf documents as a break
+with a particular time window.
+
+With that long introduction, here is an example for a single channel
+request. In particular, this example reads a month of "LHZ"
+channel data for station "PFO" and loads the results into a
+`TimeSeriesEnsemble`:
+
+.. code-block:: python
+
+   from obspy import UTCDateTime
+   # Example to select the month of June of 2012
+   tsutc = UTCDateTime('2012-06-01T00:00:00.0')
+   teutc = UTCDateTime('2012-07-01T00:00:00.0')
+   # timestamp is an attribute (not a method) of UTCDateTime
+   ts = tsutc.timestamp
+   te = teutc.timestamp
+   query = dict()
+   query['net'] = 'II'
+   query['sta'] = 'PFO'
+   query['chan'] = 'LHZ'
+   query['loc'] = '00'
+   query['$and'] = [
+       {'starttime' : {'$lte' : te}},
+       {'endtime' : {'$gte' : ts}}
+   ]
+   cursor = db.wf_miniseed.find(query)
+   ens = db.read_data(cursor, collection='wf_miniseed')
+
+That is a complex query by any definition, but it illustrates several
+features of MQL, some of which are new and some of which were discussed earlier:
+
+1. The keys of this query dictionary include both attribute names
+   ('net', 'sta', 'chan', and 'loc') and an operator ('$and').
+2. The four attribute keys define implied == (equality) matches on the
+   seed channel code keywords. As noted above there is an implied logical
+   AND between the four seed station code matching components. (The "$and"
+   operator is a different construct.)
+3. Notice the subtle detail that the '$and' operator key is associated with
+ a python list (implied by the [] symbols) instead of a python dict
+ or simple value like all previous examples. The logical AND is
+ applied to all components of the list. This example has two components
+ but it could be as many as needed. The components of the list are
+ MQL dictionary expressions that resolve True or False.
+4. This example shows the application of a query to create a cursor
+ passed to the `read_data` method of `Database`. That is the standard
+ way in MsPASS to get a bundle of data we call a `TimeSeriesEnsemble`.
+ In this case, the ensemble will contain all waveform segments for the LHZ
+ channel of the IRIS-Ida station PFO (loc code 00) that have any samples
+ recorded in the month of June 2012.
+
+A final point for this section is another shorthand allowed in the MQL
+language. That is, the "$and" operator above is not actually required.
+The same query as above could, in fact, have been written as follows:
+
+.. code-block:: python
+
+   query = {
+       'net' : 'II',
+       'sta' : 'PFO',
+       'chan' : 'LHZ',
+       'loc' : '00',
+       'starttime' : {'$lte' : te},
+       'endtime' : {'$gte' : ts}
+   }
+
+In this case I used the inline syntax because it more clearly shows
+the point. That is, a query defined by a series of expressions has
+an implied "AND" logical operator between the separate expressions.
+For this example, you would say that in words as:
+net code is II AND sta code is PFO AND channel code is LHZ AND ...
+For that reason the use of the $and operator above is not actually
+required. Note, however, that if a query's logical expression involves
+an OR clause the list-of-expressions syntax is required. Here,
+for example, is a similar query to the one above with an OR clause.
+This query retrieves only horizontal components and handles the
+obnoxious channel code variation of E,N,Z naming versus 1,2,Z naming.
+It also drops the "loc" matching and would thus ignore the loc code
+and retrieve data from all sensors at PFO.
+
+.. code-block:: python
+
+   query = {
+       'net' : 'II',
+       'sta' : 'PFO',
+       '$or' : [
+           {'chan' : 'LHE'},
+           {'chan' : 'LHN'},
+           {'chan' : 'LH1'},
+           {'chan' : 'LH2'}
+       ],
+       'starttime' : {'$lte' : te},
+       'endtime' : {'$gte' : ts}
+   }
+
+Finally, the previous example also can be used to illustrate a
+clearer solution with the `$regex` operator. Most $or clauses I've
+encountered are easier to express with a regular expression.
+The above could thus be expressed equivalently with the following
+(the character class in the regular expression matches only the four
+horizontal channel codes):
+
+.. code-block:: python
+
+   query = {
+       'net' : 'II',
+       'sta' : 'PFO',
+       'chan' : {'$regex' : 'LH[EN12]'},
+       'starttime' : {'$lte' : te},
+       'endtime' : {'$gte' : ts}
+   }
+
+Geospatial queries
+~~~~~~~~~~~~~~~~~~~~~
+MongoDB has a fairly sophisticated geospatial querying feature.
+A first order thing you must realize about geospatial indexing is that
+to be useful two things are required:
+
+1. The attribute(s) you want to query should be structured into a
+ special data type called a
+ `GeoJSON object `__.
+   The only examples packaged that way by MsPASS are the coordinates of
+   seismic instruments stored in the "site" and "channel" collections
+   and source spatial coordinates defined in the standard "source" collection.
+   In all of those collections the "lat" and "lon" keys define the latitude
+   and longitude directly and the same values are copied into a GeoJSON
+   point object stored with the key `location` in "site" and "channel"
+   and `epicenter` in "source".
+   A limitation of MongoDB's geospatial query engine is that it is much like
+   ArcGIS and is tuned to coordinate-based queries. Adding a depth
+   constraint requires a compound query mixing a geospatial operator and a
+   range query over depth (see the example later in this section).
+2. All geospatial queries REQUIRE creating a
+ `geospatial index `__.
+ Most MsPASS users will ALWAYS want to use what MongoDB calls a
+ "2dsphere" index. Their "2d" index uses a map projection and is
+ designed only for local scale software apps at a city scale.
+ The "2d" index is not accurate for the scale of most seismology
+ research problems. An exception is that UTM coordinates may work
+ with a "2d" index, but I have no direct experience
+ with that approach. That could be useful with active source data
+   where the survey uses UTM coordinates.
+
+The most common usage for geospatial queries I know of in seismology is
+limiting the set of seismic instruments and/or sources based on a
+geographical area. MQL implements geospatial queries as
+a special type of operator; i.e., the geospatial operators are
+used in a query much like '$gt', '$eq', etc.
+
+Here is a simple example to retrieve documents from the site collection
+for all stations within 500 km of my home in Bloomington, Indiana.
+It is a minor variant of a similar example in the tutorial linked to
+this page.
+
+.. code-block:: python
+
+ query = {"location":{
+ '$nearSphere': {
+ '$geometry' : {
+ 'type' : 'Point',
+ 'coordinates' : [-86.5264, 39.1653]
+ },
+ '$maxDistance' : 500000.0,
+ }
+ }
+ }
+ cursor = db.site.find(query)
+ for doc in cursor:
+ print(doc)
+
+Note the complex, nested operators that characterize all MongoDB
+geospatial queries. I trust the verbose names make clear how this
+query works provided you realize the location of Bloomington is
+around 39 degrees latitude and the distance parameters have to
+be defined in meters. Note a few key details:
+
+1. MQL's geospatial query language is best done with
+ `geoJSON `__. This example defines a
+ geoJSON point
+ and a search radius. In all cases, the key at the top level of
+ the query is an MQL operator. In this case the operator is
+ "$nearSphere". Note the first character "$" that is universally
+ used to define a key as an operator in MQL. This example is
+ a typical geospatial query made up of a multi-level document/dictionary
+ with multiple operators at different levels.
+2. The distance specification is in meters and the geographical
+ coordinate data are in degrees. As far as I can tell that is the
+ norm for MongoDB. (Some older sources suggest some operators
+ once used radian units, but that seems to be the distant past.)
+3. Once constructed the query is used like any other dictionary
+ passed to find. This example doesn't use any projection to
+ keep the example simple, but it could have.
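+
+As noted earlier, adding a depth constraint requires mixing a geospatial
+operator with an ordinary range query. Here is a minimal sketch, assuming
+source documents contain the standard "epicenter" GeoJSON point and a
+"depth" attribute in kilometers, and that a 2dsphere index has been
+created on "epicenter":
+
+.. code-block:: python
+
+   query = {
+       'epicenter' : {
+           '$nearSphere' : {
+               '$geometry' : {
+                   'type' : 'Point',
+                   'coordinates' : [-86.5264, 39.1653]
+               },
+               '$maxDistance' : 2000000.0   # 2000 km expressed in meters
+           }
+       },
+       'depth' : {'$gte' : 100.0}   # only events deeper than 100 km
+   }
+   cursor = db.source.find(query)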
+
+The set of spatial query operators is documented in
+`this page `__
+of the MongoDB documentation. Most of the complexity is in the
+second level of attributes passed to the operator specified in geoJSON.
+That is, for spherical geometry, which I again stress is the only thing
+you should use, there are only three operators:
+(1) `nearSphere`, which I illustrated above; (2) `geoWithin`,
+used to search inside a specified geometric shape (e.g. a polygon,
+but it can be other things); and (3) `geoIntersects`, which
+"selects documents whose geospatial data intersects with a specified GeoJSON object ...".
+
+Although the spatial query operators are a powerful tool to allow
+geospatial queries comparable to some elements of a GIS system, there
+are some major caveats and warnings:
+
+1. It is quite clear that the geospatial features of MongoDB
+ have evolved significantly in recent years.
+ Why that matters is I find a lot of online sources
+ contain out-of-date information.
+ To make matters worse, MongoDB's documentation on the topic
+ does a poor job of describing this evolution
+ and older documentation has examples that I found wouldn't work.
+ That may change, but be warned you are likely in for some hacking.
+2. From what I can glean from fighting with this feature, the
+   current problem was created by an evolution of MongoDB that seems to
+   have begun around 2020. It appears the earliest attempts to add
+   geospatial queries to MongoDB used a "legacy" format to define
+   coordinates. e.g. a specific lon-lat point can be specified in "legacy"
+   format like this: `{ "coords" : [-102.7724, 33.969601]}`. The same
+ information defined in geoJSON is:
+
+ .. code-block:: python
+
+ { "coords" :
+ {
+ "type": "Point",
+ "coordinates": [
+ -102.7724,
+ 33.969601
+ ]
+ }
+ }
+
+ From my experience you should avoid the legacy format and only use
+ geoJSON specifications in MongoDB documents. To make that easier
+ there is a convenience function in the `mspasspy.db.database`
+ module called `geoJSON_doc`. It can be used to create the obscure
+ document structure MongoDB requires for simple lat,lon point data.
+3. A limitation of the (current) MongoDB implementation is that the
+   `count_documents` method does not seem to work for any valid geospatial
+   query I can construct. Internet chatter suggests that is the norm.
+   I have found that using `count_documents` to test a query and
+   report the size of a query return is a good way to debug complex
+   queries. Since all geospatial queries are complex by
+   almost any definition that is problematic. I find that to debug
+   a geospatial query it is helpful to isolate the query in a
+   jupyter notebook cell and run it until the query executes without an error.
+ The example code block immediately above is a good model.
+ Use the same structure, but remove the print loop until you get the
+ query to work.
+
+I would stress that in spite of these caveats, the geospatial query
+functions integrated into MongoDB are an important capability
+that can simplify a lot of research workflows. If your work requires
+any kind of geospatial grouping, it is worth investing the effort to
+understand MongoDB's geospatial operators and how we use them in MsPASS.
+
+Sorting
+~~~~~~~~~~
+There are many situations where an algorithm using input from
+a MongoDB query requires a list sorted by one or more keys.
+Defining a sort is straightforward but a little bit weird
+until you understand the logic. It will be easier to
+explain that with a simple example. Here is a query that returns
+a cursor to retrieve documents defining LHZ waveforms from
+station PFO (it uses a duplicate of one of the compound queries
+from above) but this time we sort the result by starttime
+(a type example of a sort requirement):
+
+.. code-block:: python
+
+   # ts and te are epoch times defining the time range as above
+   query = {
+       'net' : 'II',
+       'sta' : 'PFO',
+       'chan' : 'LHZ',
+       'loc' : '00',
+       'starttime' : {'$lte' : te},
+       'endtime' : {'$gte' : ts}
+   }
+ cursor = db.wf_miniseed.find(query).sort("starttime",1)
+ ens = db.read_data(cursor,collection='wf_miniseed')
+
+There are two key points this example illustrates:
+
+1. `sort` is defined as a "method" of the "Cursor" object returned by find.
+   That is more than a little
+   weird but a common construct in python, which is an object-oriented language.
+   Most of us can remember it better by just thinking of it as a clause
+   added after find and separated by the "." symbol. Because it is a method
+   of Cursor the sort clause could have been expressed as another statement
+   after the find like this: `cursor = cursor.sort("starttime", 1)`
+2. The sort expression for a single key can be thought of as calling a
+   function with two arguments. The first is the key to use for the
+   sort and the second defines the direction of the sort. Here I
+   used "1", which means sort into an ascending sequence. When the result is
+   passed to `read_data` it guarantees the waveforms in the
+   ensemble created by `read_data` will be in increasing starttime order.
+   You would use -1 if you wanted to sort in descending order.
+   (Note: some sources will use the verbose symbols `pymongo.ASCENDING`
+   instead of 1 and `pymongo.DESCENDING` instead of -1. For me 1 and -1
+   are a lot easier to remember; a short sketch of both forms follows this
+   list.) In typical python fashion the sort order defaults to 1,
+   i.e. in the sort call above we could have omitted the second argument.
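+
+Here is a minimal sketch of the two equivalent descending-sort forms
+mentioned in point 2 (reusing the `query` defined above):
+
+.. code-block:: python
+
+   import pymongo
+
+   # these two statements request exactly the same descending sort
+   cursor = db.wf_miniseed.find(query).sort("starttime", -1)
+   cursor = db.wf_miniseed.find(query).sort("starttime", pymongo.DESCENDING)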
+
+Sorting on multiple keys requires a slightly different syntax. Again, an
+example will make this clearer. This code segment prints a report for
+the entire contents of the channel collection sorted by seed channel code:
+
+.. code-block:: python
+
+ sort_clause = [
+ ("net",1),
+ ("sta",1),
+ ("chan",1),
+ ("starttime",1)
+ ]
+ cursor = db.channel.find()
+ cursor = cursor.sort(sort_clause)
+ print("net sta chan loc starttime")
+ for doc in cursor:
+ # conditional to handle common case with loc undefined
+ if 'loc' in doc:
+ print(doc['net'],doc['sta'],doc['chan'],doc['loc'],UTCDateTime(doc['starttime']))
+ else:
+ print(doc['net'],doc['sta'],doc['chan'],'UNDEFINED',UTCDateTime(doc['starttime']))
+
+The main thing to notice is that when using multiple keys for a sort they
+must be defined as a python list of (key, direction) tuples (two-element
+lists defined with [] will also work). That usage is potentially confusing
+for two reasons you should be aware of:
+
+1. Most examples you will see of a single key sort use just the key name
+   (ascending order is the default) or the two-argument version I used
+   above. Multiple key sorts require a completely different type for arg0:
+   a python list of tuples.
+2. Most examples you will find in a routine internet search with a phrase
+   like "mongodb sort with multiple keys" will show the syntax you can use
+   with the "mongo shell". The problem is that the mongo shell speaks
+   a different language (Javascript) and uses a syntax that looks like
+   an inline python dictionary definition, but it is not one.
+   That is, with
+   the mongo shell the sort above could be written as:
+   `{'net':1, 'sta':1, 'chan':1, 'starttime':1}`. That is not a python
+   dictionary, however, even though the syntax is nearly the same.
+   In Javascript that is an object whose key order is preserved and meaningful.
+   Historically, key order was not guaranteed to be preserved in a python
+   dictionary, which is why the pymongo API uses a list of tuples to
+   preserve order.
+
+Report generators
+~~~~~~~~~~~~~~~~~~~~
+One of the important applications of queries in MsPASS is to generate
+a human readable report on the content of a database that is to be used
+as input for processing. An option for experienced programmers familiar with the
+incantations of detailed formatting of text output is to create a
+custom formatting function to generate a report from a cursor input.
+For mere mortals, however, there are two much simpler options:
+
+1. For small numbers of documents the `json_util` package can be useful.
+2. Pandas are your friend for producing output visualized well with a table.
+
+The examples in this section show how to set up both. The related tutorial
+notebook for this section of the User's Manual provides hands-on examples
+and shows why raw output can be ugly.
+
+A typical example of `json_util` is that you might want to look at the gross
+structure of one or more documents created by running something like the
+MsPASS `index_mseed_file` method of `Database`. Something like the
+following can be useful to use in a python notebook run interactively
+to work out data problems:
+
+.. code-block:: python
+
+ from bson import json_util
+ doc = db.wf_miniseed.find_one()
+ print(json_util.dumps(doc,indent=2))
+
+The `indent=2` argument is essential to create an output that is
+more readable than what would otherwise be produced by the much
+simpler expression `print(doc)`.
+
+Many quality control reports are conveniently visualized with a well
+formatted table display. As noted above pandas are your friend in
+creating such a report. Here is an example that creates a report of all
+stations listed in the site collection with coordinates and the time
+range of recording. It is a variant of a code block in our
+MsPASS `mongodb_tutorial<>`__ (TODO: hyperlink to github )
+
+.. code-block:: python
+
+   import pandas
+   from obspy import UTCDateTime
+
+   cursor = db.site.find({})
+ doclist=[]
+ for doc in cursor:
+ # Not essential, but produces a more readable table with date strings
+ doc['starttime']=UTCDateTime(doc['starttime'])
+ doc['endtime']=UTCDateTime(doc['endtime'])
+ doclist.append(doc)
+ df = pandas.DataFrame.from_dict(doclist)
+ print(df)
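+
+If you want to preserve or share such a report, the same DataFrame can be
+written to a file with one more line; the file name here is an arbitrary
+illustration:
+
+.. code-block:: python
+
+   # write the same table to a CSV file that can be opened in a spreadsheet
+   df.to_csv("site_report.csv", index=False)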
+
+Saves (Create of CRUD)
+------------------------
+The first letter of CRUD refers to the create (save) operation. A save of some kind
+is usually the first thing one does in building a seismic dataset
+as there needs to be some way to populate the database collections.
+Our "getting_started" tutorial illustrates the most common
+workflow: populating the "site", "channel", "source", and (usually)
+"wf_miniseed" collections. This section focuses more on the general problem of
+loading some other data that doesn't match the standard MsPASS schema.
+An example, which we use for the hands-on supplement to this section
+in our notebook tutorials, is downloading the current CMT
+catalog and loading it into a nonstandard collection we call "CMT".
+In this manual we focus on the fundamentals of the pymongo API for
+saving documents.
+
+There are two methods of any MongoDB collection handle (e.g. `db.CMT`)
+that you can use to save "documents" in a MongoDB collection. They are:
+
+1. `insert_one`, as the name implies, is used to save one and only
+   one document (see the short example after this list). It is usually
+   run with one argument that is assumed
+   to be a python dictionary containing the name-value pairs that
+   define the document to be saved and subsequently managed by
+   MongoDB.
+2. `insert_many` is used to insert multiple documents. It expects
+ to receive a python list of dictionaries as arg0 each of which is
+ like input sent to `insert_one`.
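+
+For a concrete illustration, here is a minimal sketch of `insert_one`
+that saves a single, hand-built document to the nonstandard "CMT"
+collection used as the example above (the attribute values are made up
+for illustration):
+
+.. code-block:: python
+
+   # a hand-built document; the keys and values are arbitrary illustrations
+   doc = {"lat": 33.97, "lon": -102.77, "depth": 15.0, "time": 1338508800.0}
+   result = db.CMT.insert_one(doc)
+   # insert_one returns an InsertOneResult holding the new document's ObjectId
+   print("ObjectId of saved document:", result.inserted_id)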
+
+An important thing to realize is that `insert_many` is not at all
+the same thing as running a loop like this:
+
+.. code-block:: python
+
+ # Wrong way to do insert_many
+ for doc in doclist:
+ db.CMT.insert_one(doc)
+
+The reason is that the loop above does an independent transaction for
+each document and the loop blocks until the MongoDB server acknowledges
+success. `insert_many`, in contrast, does a bulk update. It automatically
+breaks up the list into chunk sizes it handles as one transaction.
+With a large save `insert_many` can be orders of magnitude faster
+than the same number of one-at-a-time transactions.
+
+Here is a partial example of a save operation from the related tutorial notebook:
+
+.. code-block:: python
+
+ doclist = ndk2docs(fname)
+ print("Number of CMT records in file=",fname,' is ',len(doclist))
+ r=db.CMT.insert_many(doclist)
+   n = db.CMT.count_documents({})
+ print("Number of documents in CMT collection is now ",n)
+
+
+Updates (U of CRUD)
+--------------------
+Updates have a similar API to the insert/create API. That is, there
+are again two different collection methods:
+
+1. `update_one` is used to replace all or some of the data in one document.
+2. `update_many` is used to replace all or some attributes of many documents
+ in a single transaction.
+
+There is, however, a special feature called a `bulk_write` that can be useful
+in some situations. I cover that more specialized function at the end of
+this section.
+
+Although they have options, both `update_one` and `update_many`
+are usually called with two arguments. *arg0* is an MQL matching query and
+*arg1* defines what is to be changed/added. The only real difference
+between `update_one` and `update_many` is that `update_one` will only
+change the first occurrence it finds if the match query is not unique.
+For that reason, `update_one` is most commonly used with an `ObjectId`
+match key. For example, this segment would be a (slow) way to add
+a cross-reference id linking wf_TimeSeries documents with
+documents from a channel collection
+created by loading an Antelope sitechan table. It uses the foreign
+key "chanid" in CSS3.0 to find a record and then uses `update_one` to
+set the MsPASS standard cross-reference name `channel_id` in wf_TimeSeries.
+
+.. code-block:: python
+
+   # some previous magic has been assumed to have set chanid in
+   # wf_TimeSeries (feasible with a CSS3.0 wfdisc table)
+   # This example is for illustration only and is not of direct use
+   cursor = db.wf_TimeSeries.find({})
+   for doc in cursor:
+       if 'chanid' in doc:
+           chandoc = db.channel.find_one({'chanid' : doc['chanid']})
+           # find_one failures return None so this is a test for a valid return
+           if chandoc:
+               cid = chandoc['_id']
+               wfid = doc['_id']
+               # note pymongo updates require an operator like $set in arg1
+               db.wf_TimeSeries.update_one({'_id' : wfid}, {'$set' : {'channel_id' : cid}})
+
+The `update_many` method is more commonly used with more complex queries
+to set a constant for the group of documents. The example below uses
+`update_many` to build source cross-reference ids for a dataset created
+by extracting waveforms using event origin times as the start times.
+We can then do a match (arg0) using a range test with a small tolerance
+around the origin time. This fragment, unlike the `update_one` example,
+is a useful prototype for a common organization of a dataset that originates
+from common source gathers:
+
+.. code-block:: python
+
+   # define a +- range relative to origin time for starttime
+   ot_range = 1.0
+   # loop over all source records to drive this process
+   cursor = db.source.find({})
+   for doc in cursor:
+       otime = doc['time']
+       ts = otime - ot_range
+       te = otime + ot_range
+       match_query = {'starttime' : {'$gte' : ts, '$lte' : te}}
+       srcid = doc['_id']   # ObjectId of this source document
+       # the $set operator is required in the update document
+       update_doc = {'$set' : {'source_id' : srcid}}
+       db.wf_TimeSeries.update_many(match_query, update_doc)
+
+Finally, a more advanced approach that is useful for large numbers of
+random updates with a large data set is the pymongo collection method
+called `bulk_write `__.
+An example of how to use this method can be found in the MsPASS function
+`bulk_normalize `__.
+Briefly, the idea is to manually build up blocks of atomic-level updates.
+That approach is necessary only in the case where there is no simple
+recipe for reducing the problem to a small number of matching operations
+like the `update_many` example above. That is, the `bulk_normalize` function
+does a unique match for each wf document keyed by an ObjectId. It makes
+sense in that context because it assumes the matching was done externally
+with one of the MsPASS matchers.
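+
+To make the idea concrete, here is a minimal sketch of a `bulk_write`
+update. The lists `wfid_list` and `srcid_list` are hypothetical; they are
+assumed to be parallel lists of wf_TimeSeries ObjectIds and the matching
+source ObjectIds produced by some external matching step:
+
+.. code-block:: python
+
+   from pymongo import UpdateOne
+
+   # build a list of atomic update operations and submit them as one
+   # bulk transaction
+   ops = []
+   for wfid, srcid in zip(wfid_list, srcid_list):
+       ops.append(UpdateOne({"_id": wfid}, {"$set": {"source_id": srcid}}))
+   if ops:
+       result = db.wf_TimeSeries.bulk_write(ops)
+       print("Number of documents modified:", result.modified_count)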
+
+Delete (D of CRUD)
+--------------------
+As noted elsewhere, we take the position that for most design uses
+of MsPASS delete operations should be rare. To reiterate, the reason
+is that MsPASS was designed with the idea that the database is used
+to manage an assembled data set. Appending more data is expected to
+be the norm and deleting data unusual. For most cases, for example,
+it is easier to rebuild something like a `wf_miniseed` collection
+than design a deletion operation to selectively remove some data.
+Nonetheless, that statement motivates the example we give below.
+
+The API for deletion has a strong parallel to insert and update. That is,
+there are two basic methods: `delete_one` deletes a single document and
+`delete_many` can be used to delete the set of documents matching a
+query defined with arg0.
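+
+As a minimal sketch of the first form, assuming `wfid` holds the ObjectId
+of a single document known to be bad:
+
+.. code-block:: python
+
+   # delete exactly one document matched by its unique ObjectId
+   db.wf_Seismogram.delete_one({'_id' : wfid})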
+
+As an example, suppose we have a large dataset assembled into Seismogram
+objects indexed with `wf_Seismogram` and we find station "XYZ" had
+something fundamentally wrong with it for the entire duration of the
+dataset. Assuming "XYZ" doesn't require a "net" qualifier to be unique,
+this code fragment could be used to delete all entries for "XYZ".
+It is complicated a bit by the fact that multiple site entries can
+occur for the same station due to time-dependent metadata:
+
+.. code-block:: python
+
+ cursor = db.site.find({'sta': 'XYZ'})
+ for doc in cursor:
+ sid = doc['_id']
+ query = {'site_id' : sid}
+ n = db.wf_Seismogram.count_documents(query)
+ print("Number of documents to be deleted for station XYZ =",n)
+ print("for time period ",UTCDateTime(doc['starttime']),UTCDateTime(doc['endtime']))
+ db.wf_Seismogram.delete_many(query)
+
+Indexes
+--------------
+An index is desirable in any database system to improve read performance.
+Without an index a query requires a linear search through the entire
+database to find matching records. As a result read and update performance on
+any database system can be improved by orders of magnitude with a properly
+constructed index. On the other hand, an index can slow write performance
+significantly. I have found that in data processing with MsPASS the main
+application of indexes is to normalizing collections. They fit the
+constraint above; that is, normalizing data is normally written once with
+occasional updates and is mostly used during read operations.
+
+A first point to recognize is that MongoDB ALWAYS defines an index on the
+magic attribute key "_id" for any collection. Any additional
+index needs to be created with the collection method called
+`create_index `__.
+The index can be defined for one or more keys and set to define an
+increasing or decreasing sequence. e.g. the following will create an
+index on the channel collection appropriate for miniseed metadata,
+which requires at least four keys to provide a unique match:
+
+.. code-block:: python
+
+   import pymongo
+
+   db.channel.create_index(
+ [
+ "net",
+ ("sta", pymongo.DESCENDING),
+ "chan",
+ "time"
+ ]
+ )
+
+Note arg0 is a list of attribute names and/or 2-element tuples.
+Tuples are needed only if the default ascending order is to be
+switched to descending. I did that for illustration above for "sta",
+but it wouldn't normally be necessary.
+
+There are two utility functions for managing indexes in a collection:
+
+1. `list_indexes` returns a `Cursor` that can be iterated
+ to show details of all indexes defined for a collection. e.g. the
+   above section that creates a special index for channel might be
+   followed by this fragment to verify it worked:
+
+.. code-block:: python
+
+   from bson import json_util
+
+   cursor = db.channel.list_indexes()
+   for doc in cursor:
+       print(json_util.dumps(doc, indent=2))
+
+2. There is a `drop_index` method that can be used to remove an index from
+   a collection. It uses the index name that is returned on creation
+   by `create_index` or can be obtained by running `list_indexes`
+   (a short sketch follows this list).
+
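+The following minimal sketch shows the round trip: `create_index` returns
+the index name as a string, which can later be handed to `drop_index`
+(the key list here is an arbitrary illustration):
+
+.. code-block:: python
+
+   # create a simple ascending index and capture its name
+   index_name = db.channel.create_index(
+       [("net", 1), ("sta", 1), ("chan", 1), ("time", 1)]
+   )
+   print("created index:", index_name)
+   # remove the same index by name
+   db.channel.drop_index(index_name)
+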
+A final point about indexes is the special case of "geospatial indexes"
+discussed above. A geospatial index is a very different thing from
+a "normal" index on one or more keys. Consult online sources if
+you need to learn more about that topic.
diff --git a/docs/source/user_manual/normalization.rst b/docs/source/user_manual/normalization.rst
index e8b5036a7..029e1aad9 100644
--- a/docs/source/user_manual/normalization.rst
+++ b/docs/source/user_manual/normalization.rst
@@ -20,15 +20,16 @@ a set of tables that are linked to the waveform index (wfdisc)
using a relational database "join". MongoDB is not relational
but handles the same issue by what they call the :code:`normalized`
versus the :code:`embedded` data model
-(MongoDB's documentation on this topic can be found `here `__).
+(MongoDB's documentation on this topic can be found
+`here `__).
Normalization is conceptually similar to a relational database join, but
is implemented in a different way that has implications on performance.
For small datasets these issues can be minor, but for very large
data sets we have found poorly designed normalization algorithms
-can be serious bottleneck to performance.
+can be a serious bottleneck to performance.
A key difference all users need to appreciate
-is that with a relational database a "join" is always a global operation done between all
+is that with a relational database, a "join" is always a global operation done between all
tuples in two relations (tables or table subsets). In MongoDB
normalization is an atomic operation made one document (recall a document
is analogous to a tuple) at a time. Because all database operations are
@@ -48,7 +49,7 @@ tables are easily loaded into a Dataframe with one line of python code
(:code:`read_csv`). That abstraction is possible because a MongoDB "collection"
is just an alternative way to represent a table (relation).
-Before proceeding it is important to give a pair of definitions we used repeatedly
+Before proceeding it is important to give a pair of definitions we use repeatedly
in the text below. We define the :code:`normalizing` collection/table as the
smaller collection/table holding the repetitious data we aim to cross-reference.
In addition, when we use the term :code:`target of normalization`
@@ -58,62 +59,104 @@ waveform index collections (wf_miniseed, wf_TimeSeries, or wf_Seismogram)
or, in the case of in-line normalization functions, the Metadata container of
one of the MsPaSS data objects.
-Data reader normalization method
+Normalization with readers
--------------------------------------
Overview
++++++++++++++
Almost all workflows begin with a set of initializations. In a
-production workflow using MsPASS that is normally followed immediately by one of
-two MongoDB constructs: (1) a query of one of the waveform collections
-that returns a MongoDB :code:`cursor` object, or (2) a function call that
-creates an RDD/bag of query strings. The first is used if the processing
-is limited to atomic level operations like filtering while the second is
-the norm for ensemble processing. In either case, :code:`normalization`
-is best done through the readers that particular workflow uses to create the
-working RDD/bag. In both cases a key argument to the read functions is
-:code:`normalize`. In all cases :code:`normalize` should contain a
-python list of strings defining mongodb collection names that should be
-used to "normalize" the data being read.
-
-Normalization during read operation has two important limitations
-users must recognize:
-
-#. The cross-reference method to link waveform documents to normalizing
- data is fixed. That is, in all cases the cross-reference is always
- based on the :code:`ObjectId` of the normalizing collection with a
- frozen naming convention. For example, if we want to normalize Seismogram data
- with the :code:`site` collection, all MsPASS readers expect to find the
- attribute :code:`site_id` in :code:`wf_Seismogram` documents that
- define the (unique) :code:`ObjectId` of a document in the :code:`site`
- collection. If a required id is not defined that datum is silently dropped.
-#. By default the readers load the entire contents of all documents in the normalizing
- collection. That can waste memory. For example, channel collection
- documents created from FDSN web service and saved with the
- :py:meth:`save_inventory ` method always
- contain serialized response data that may or may not be needed. If that
- is a concern, the easiest solution is to use the :code:`exclude`
- argument to pass a list of keys of attributes that should not be
- loaded. An alternative is to use the inline normalization
- functions described later in this section. They can provide more
- control on what is loaded and should normally be used as an
- argument to a map operator.
+production workflow using MsPASS initialization
+is normally followed immediately by one of
+two MongoDB constructs:
+
+#. For serial processing most workflows reduce to an outer loop
+ driven by a MongoDB :code:`Cursor` object. A
+ :code:`Cursor` is the output of the standard "find" method
+ for the handle to any MongoDB collection.
+#. Parallel workflows in MsPASS are driven by a call to the
+ :py:func:`read_distributed_data `
+ function.
+
+In either case, :code:`normalization` can always
+be accomplished in one of two ways:
+
+#. The two core readers in MsPASS
+   (:py:func:`read_data` for serial jobs and
+   :py:func:`read_distributed_data` for parallel workflows) both have a
+   :code:`normalize` argument that can
+   contain a list of one or more normalization operators. When applied this
+ way normalization is more-or-less treated as part of the process of
+   constructing the data objects that form the dataset.
+#. Normalization can be applied within a workflow as illustrated in
+ examples below.
+
+Both approaches utilize the concept of a :code:`normalization operator`
+we discuss in detail in this section. Readers familiar with relational
+database concepts may find it helpful to view a :code:`normalization operator`
+as equivalent to the operation used to define a database join.
+
+This section focuses on the first approach. The second is covered in
+a later section below. The most common operators for normalization while
+reading are those using a cross-referencing id key. We discuss those
+concepts first before showing examples of normalization during read.
+Note the next section on ids is equally relevant to normalization in
+a workflow, but we include it here because it is more central to
+normalization during reading.
+
Defining Cross-referencing IDs
++++++++++++++++++++++++++++++++++
-Because the readers use ObjectIds to provide the standard cross-reference
-method for normalization, MsPASS has functions for the common matching
-schemes. The simplest to use is :py:func:`normalize_mseed `.
+An "id" is a common concept in all database implementations.
+Relational database schemas like CSS3.0 have numerous integer ids
+used for join keys between tables. Integers were traditionally used as
+cross-reference keys as it is relatively easy to maintain uniqueness
+and computers do few operations faster than an integer equality test.
+MongoDB uses a custom data object they call an
+`ObjectId `__.
+Conceptually, however, an ObjectId is simply an alternative to the more
+traditional integer keys as a way to
+guarantee a unique key for a database document
+(equivalent to a tuple in relational database theory).
+Note this is in contrast to using integer keys, where the set of possible
+values is finite and some mechanism is required to ask the database
+server for a key to assure it is unique. ObjectIds can be generated by
+a workflow without any interaction with the MongoDB server.
+You can learn more about this aspect of ObjectIds
+`here `__.
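+
+A minimal sketch of that last point: the `bson` package bundled with
+pymongo can mint a new, unique ObjectId entirely on the client side:
+
+.. code-block:: python
+
+   from bson import ObjectId
+
+   # a new, globally unique id created without any server round trip
+   oid = ObjectId()
+   print(oid, oid.generation_time)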
+
+Using the ObjectId-based matchers provides the fastest normalization
+available in MsPASS. Currently the most common model for
+data processing is a collection of miniseed files downloaded from
+FDSN data services and/or a collection of files created from a
+field experiment. Once these files are indexed with
+the :py:meth:`index_mseed ` method
+they can be read directly to initiate a processing workflow.
+Such data can be normalized with the
+operator :py:class:`MiniseedMatcher `
+without using an Id, but in our experience that is not advised
+for two reasons. First, the complexity of SEED data makes it challenging
+to know if the :code:`channel` collection is complete. We have found
+many examples of incomplete or inaccurate station data downloaded
+from FDSN that cause some fraction of waveforms in a large dataset to not have any
+matching :code:`channel` entry. A second, more minor, issue is that
+the complexity of the algorithm used by
+:py:class:`MiniseedMatcher`
+makes it inevitably slower than the comparable Id-based algorithm
+:py:class:`ObjectIdMatcher `.
+We suggest that unless you are absolutely certain of the
+completeness of the :code:`channel` collection, you should use the
+Id-based method discussed here for doing normalization while reading.
+
+Because miniseed normalization is so fundamental to modern seismology data,
+we created a special python function called
+:py:func:`normalize_mseed `.
It is used for defining :code:`channel_id`
(optionally :code:`site_id`) matches in the :code:`wf_miniseed` collection.
-Use this function when your workflow is based on a set of miniseed files.
-The actual matching is done by the using the complicated SEED standard of the
-station name keys commonly called net, sta, chan, and loc codes and
-a time stamp inside a defined time interval. That complex match is, in fact,
-a case in point for why we use ObjectIds as the default cross-reference. The
-:py:func:`normalize_mseed `
+This function is implemented with the matcher called
+:py:class:`MiniseedMatcher ` mentioned earlier.
+The :py:func:`normalize_mseed `
function efficiently handles the lookup and
database updates by caching the index in memory and using a bulk update
method to speed update times. We strongly recommend use of this function
@@ -159,12 +202,18 @@ from running normalize_mseed in the example above:
.. code-block:: python
from mspasspy.client import Client
+   from mspasspy.db.normalize import MiniseedMatcher
dbclient = Client()
db = dbclient.get_database("mydatabase")
+ # channel is the default collection for this class
+ channel_matcher = MiniseedMatcher(db)
# loop over all wf_miniseed records
cursor = db.wf_miniseed.find({})
for doc in cursor:
- d = db.read_data(doc,normalize=["channel"])
+ d = db.read_data(doc,
+ normalize=[channel_matcher],
+ collection="wf_miniseed",
+ )
# processing functions here
# normally terminated with a save operation or a graphic display
@@ -178,43 +227,154 @@ The following does the same operation as above in parallel with dask
from mspasspy.client import Client
from mspasspy.db.database import read_distributed_data
+   from mspasspy.db.normalize import MiniseedMatcher
+
dbclient = Client()
db = dbclient.get_database("mydatabase")
+ channel_matcher = MiniseedMatcher(db)
# loop over all wf_miniseed records
cursor = db.wf_miniseed.find({})
- dataset = read_distributed_data(db,normalize=["channel"])
+ dataset = read_distributed_data(cursor,
+ normalize=[channel_matcher],
+ collection='wf_miniseed',
+ )
# porocessing steps as map operators follow
# normally terminate with a save
dataset.compute()
Reading ensembles with normalization is similar. The following is a
-serial job that reads ensembles and normalizes each ensemble with data from
-the source and channel collections. It assumes not only
-normalize_mseed has been run on the data but some version of bulk_normalize
-was used to set the source_id values for all documents in wf_miniseed.
+serial job that reads ensembles and normalizes the ensemble with data from
+the source and channel collections. It assumes source_id was defined
+previously.
+
+.. code-block:: python
+
+ from mspasspy.client import Client
+ from mspasspy.db.normalize import MiniseedMatcher, ObjectIdMatcher
+ dbclient = Client()
+ db = dbclient.get_database("mydatabase")
+ channel_matcher = MiniseedMatcher(db)
+ source_matcher = ObjectIdMatcher(db,
+ collection="source",
+ attributes_to_load=["lat","lon","depth","time","_id"],
+ )
+ # this assumes the returned list is not enormous
+ sourceid_list = db.wf_miniseed.distinct("source_id")
+ for srcid in sourceid_list:
+ cursor = db.wf_miniseed.find({"source_id" : srcid})
+ ensemble = db.read_data(cursor,
+ normalize=[channel_matcher],
+ normalize_ensemble=[source_matcher])
+ # processing functions for ensembles to follow here
+ # normally would be followed by a save
+
+Note that we used a different option to handle the `source` collection
+in this example. This is an example of creating a set of
+"common source gathers" (all data from a common source) so it is
+natural to post the source attributes to the ensemble's `Metadata`
+container instead of to each ensemble "member". Passing the
+`source_matcher` object in the `normalize_ensemble`
+argument accomplishes that. For ensembles, loading data into the members
+is the implied meaning of any operator passed via the `normalize` argument.
+
+.. note::
+   The normalize_ensemble feature was added in version 2 of MsPASS.
+ Older versions did not implement that extension.
+
+Normalization within a workflow
+----------------------------------
+Normalization within a workflow uses the same "Matcher" operators but
+is best done through a function call in a serial job or with a map
+operator in a parallel job. It is perhaps easiest to demonstrate how
+this is done by rewriting the examples above that do normalization during
+read, using the equivalent algorithm for normalization as a separate
+step within the workflow.
+
+First, the serial example:
+
+.. code-block:: python
+
+ from mspasspy.client import Client
+   from mspasspy.db.normalize import MiniseedMatcher, normalize
+ dbclient = Client()
+ db = dbclient.get_database("mydatabase")
+ # channel is the default collection for this class
+ channel_matcher = MiniseedMatcher(db)
+ # loop over all wf_miniseed records
+ cursor = db.wf_miniseed.find({})
+ for doc in cursor:
+ d = db.read_data(doc,collection="wf_miniseed")
+ d = normalize(d,channel_matcher)
+ # processing functions here
+ # normally terminated with a save operation or a graphic display
+
+Next, the parallel version of the job immediately above:
+
+.. code-block:: python
+
+ from mspasspy.client import Client
+ from mspasspy.db.database import read_distributed_data
+   from mspasspy.db.normalize import MiniseedMatcher, normalize
+
+ dbclient = Client()
+ db = dbclient.get_database("mydatabase")
+ channel_matcher = MiniseedMatcher(db)
+ # loop over all wf_miniseed records
+ cursor = db.wf_miniseed.find({})
+ dataset = read_distributed_data(cursor,collection="wf_miniseed")
+ dataset = dataset.map(normalize,channel_matcher)
+ # processing steps as map operators follow
+ # normally terminate with a save
+ dataset.compute()
+
+Finally, the example for reading ensembles:
.. code-block:: python
from mspasspy.client import Client
+ from mspasspy.db.normalize import MiniseedMatcher, ObjectIdMatcher, normalize
dbclient = Client()
db = dbclient.get_database("mydatabase")
+ channel_matcher = MiniseedMatcher(db)
+ source_matcher = ObjectIdMatcher(db,
+ collection="source",
+ attributes_to_load=["lat","lon","depth","time","_id"],
+ )
# this assumes the returned list is not enormous
sourceid_list = db.wf_miniseed.distinct("source_id")
for srcid in sourceid_list:
cursor = db.wf_miniseed.find({"source_id" : srcid})
- ensemble = db.read_ensemble_data(cursor,normalize=["channel","source"])
+ ensemble = db.read_ensemble_data(cursor, collection="wf_miniseed")
+ ensemble = normalize(ensemble,channel_matcher,apply_to_members=True)
+ ensemble = normalize(ensemble,source_matcher)
+
# processing functions for ensembles to follow here
# normally would be followed by a save
+
+Note that we had to set `apply_to_members` True to have the normalize
+function process all ensemble members. Normal behavior for that function
+with ensembles is to normalize the ensemble Metadata container, as is
+done with the `source_matcher` line. Both calls are necessary to match the
+normalization-during-read examples above, which this version was designed to
+reproduce by a different path.
+
+.. note::
+ The `apply_to_members` argument is a feature added in version 2 of MsPASS.
-Normalization within a Workflow
+Normalization Operators
-------------------------------
+Overview
+++++++++++++
+This section covers the available normalization operators in MsPASS.
+It focuses on design concepts and lists the available features.
+See the examples above and those following this section for more nuts-and-bolts
+details. The examples below all use the normalization within a workflow
+approach.
+
Concepts
++++++++++++++
-An alternative to normalization during a read operation is to match records
-in a normalizing collection/table on the fly and load desired attributes
-from that collection. We abstract that process to two concepts
+Normalization can be abstracted as two concepts
that need to be implemented to make a concrete normalization procedure:
#. We need to define an algorithm that provides a match of records in
@@ -226,10 +386,12 @@ that need to be implemented to make a concrete normalization procedure:
standard normalization operation requires the match be one-to-one.
We abstract both of these operations in a novel way in MsPASS
-described in the two sections below.
+through a standardized API we call a "matcher".
Matchers
+++++++++++++++
+Normalization requires a rule that defines how documents in
+the normalizing collection match documents in the target.
A match can be defined by
something as simple as a single key string match or it
can be some arbitrarily complex algorithm. For example,
@@ -242,7 +404,7 @@ outcome for each document/tuple/row. That is, the algorithm returns
True if there is a match and a False if the match fails.
In MsPASS we define this abstraction in an object-oriented perspective
using inheritance and an abstract base class that defines the
-core generic operation. You can read the docstrings of
+core generic operation. You can read the docstrings of
:py:class:`BasicMatcher `
for details.
Note that the API requires a concrete instance of this base class to
@@ -254,7 +416,9 @@ is the primary method for one-to-one matches.
Note we require even unique matchers to implement :py:meth:`find ` since one is
simply a special case of "many".
-The choice of those two names (:py:meth:`find ` and :py:meth:`find_one `) was not
+The choice of those two names
+(:py:meth:`find `
+and :py:meth:`find_one `) was not
arbitrary. They are the names used to implement the same concepts in MongoDB
as methods of their database handle object. In fact, as a convenience the
normalize module defines the intermediate class
@@ -300,7 +464,7 @@ collection is large and the matching algorithm can use an effective
MongoDB index, or (2) the dataset is small enough that the cost of the queries
is not overwhelming.
-When the normalizing collection is small we have found a much efficient way
+When the normalizing collection is small we have found a much faster way
to implement normalization is via a cacheing algorithm. That is, we
load all or part of a collection/table into a data area
(a python class :code:`self` attribute) "matcher" object
@@ -331,9 +495,10 @@ implemented as two intermediate classes used similarly to
to store it's internal cache. The Pandas library is robust and
has a complete set of logical constructs that can be used to construct
any query possible with something like SQL and more. Any custom,
- concrete implementations of :py:class:`BasicMatcher `
+ concrete implementations of
+ :py:class:`BasicMatcher `
that match the small normalizing collection assumption would be
- best advised to utilize this API.
+ best advised to utilize the pandas API.
These two intermediate-level classes have two features in common:
@@ -351,7 +516,7 @@ These two intermediate-level classes have two features in common:
These two classes differ mainly in what they require to make them
concrete. That is, both have abstract/virtual methods that are required
-to make a concrete implemntation.
+to make a concrete implementation.
:py:class:`DictionaryCacheMatcher `
requires implementation of
:py:meth:`cache_id `
@@ -362,7 +527,7 @@ different keys to access attributes stored in the database and
the equivalent keys used to access the same data in a workflow.
In addition, there is a type mismatch between a document/tuple/row
abstraction in a MongoDB document and the internal use by the matcher
-class family. That is, pymongo treats represents a "document" as a
+class family. That is, pymongo represents a "document" as a
python dictionary while the matchers require posting the same data to
the MsPASS Metadata container to work more efficiently with the C++
code base that defines data objects.
@@ -394,7 +559,7 @@ is a hyperlink to the docstring for the class:
* - :py:class:`OriginTimeMatcher `
- match data with start time defined by event origin time
-Noting currently all of these have database query versions that differ only
+Noting that currently all of these have database query versions that differ only
by have "DB" embedded in the class name
(e.g. the MongoDB version of :code:`EqualityMatcher` is :code:`EqualityDBMatcher`.)
@@ -420,8 +585,8 @@ idea is most clearly seen by a simple example.
attribute_list = ['_id','lat','lon','elev']
matcher = ObjectIdMatcher(db,collection="site",attributes_to_load=attribute_list)
# This says load the entire dataset presumed staged to MongoDB
- cursor = db.wf_miniseed.find({}) #handle to entire data set
- dataset = read_distributed_data(cursor) # dataset returned is a bag
+ cursor = db.wf_TimeSeries.find({}) #handle to entire data set
+ dataset = read_distributed_data(cursor,collection='wf_TimeSeries') # dataset returned is a bag
dataset = dataset.map(normalize,matcher)
# additional workflow elements and usually ending with a save would be here
dataset.compute()
@@ -430,10 +595,10 @@ This example loads receiver coordinate information from data that was assumed
previously loaded into MongoDB in the "site" collection. It assumes
matching can be done using the site collection ObjectId loaded with the
waveform data at read time with the key "site_id". i.e. this is an
-inline version of what could also be accomplished (more slowly) by
-calling :code:`read_distribute_data` with "site" in the normalize list.
+inline version of what could also be accomplished by
+calling :code:`read_distributed_data` with a matcher for site in the normalize list.
-Key things this example demonstrates in common to all in-line
+Key things this example demonstrates common to all in-line
normalization workflows are:
+ :code:`normalize` appears only as arg0 of a map operation (dask syntax -
@@ -498,7 +663,7 @@ Example 3: source normalization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example shows an example of how to insert source data into
-a parallel workflow. As above use the dask syntax for a map operator.
+a parallel workflow. As above we use the dask syntax for a map operator.
This example uses the matcher called :code:`OriginTimeMatcher`
which works only for waveform segments where the start time of the
signal is a constant offset from the event origin time.
@@ -554,33 +719,33 @@ id for the source collection :code:`source_id`.
.. code-block:: python
from mspasspy.client import Client
- from mspasspy.db.normalize import ObjectIdMatcher
- from mspasspy.db.database import read_ensemble_data
+ from mspasspy.db.normalize import ObjectIdMatcher,MiniseedMatcher
+ from mspasspy.io.distributed import read_distributed_data
- def read_common_source_gather(db,collection,srcid):
+ def srcidlist2querylist(srcidlist):
"""
- Function needed in map call to translate a single source id (srcid)
- to a query, run the query, and load the data linked to that source_id
+ Small function used to build query list from a list of source ids.
+    Uses a new feature of read_distributed_data from version 2 forward
+ that allows creation of a bag/rdd from a list of python dict containers
+ assumed to be valid MongoDB queries.
"""
- dbcol = db[collection]
- query = {"source_id" : srcid }
- # note with logic of this use we don't need to test for
- # no matches because distinct returns only not null source_id values dbcol
- cursor = dbcol.find(query)
- ensemble = db.read_ensemble(db, collection=collection)
- ensemble["source_id"] = srcid
- return ensemble
+ querylist=list()
+ for srcid in srcidlist:
+ query={'source_id' : srcid}
+ querylist.append(query)
+ return querylist
dbclient = Client()
db = dbclient.get_database("mydatabase")
+ channel_matcher=MiniseedMatcher(db)
# Here limit attributes to be loaded to source coordinates
attribute_list = ['_id,''lat','lon','depth','time']
source_matcher = ObjectIdMatcher(db,collection="source",
attributes_to_load=attribute_list,load_if_defined=["magnitude"])
# MongoDB incantation to find all unique source_id values
sourceid_list = db.wf_Seismogram.distinct("source_id")
- dataset = dask.bag.from_sequence(sourceid_list)
- dataset = dataset.map(lambda srcid : read_common_source_gather(db, "wf_Seismogram", srcid))
+ querylist=srcidlist2querylist(sourceid_list)
+ dataset = read_distributed_data(querylist,normalize=[channel_matcher])
# dataset here is a bag of SeismogramEnsembles. The next line applies
# normalize to the ensemble and loading the attributes into the ensemble's
# Metadata container.
@@ -622,7 +787,7 @@ We know of three solutions to that problem:
:py:class:`DictionaryCacheMatcher `,
and :py:class:`DataFrameCacheMatcher `).
One could also build directly on the base class, but we can think of no
- example where would be preferable to extending one of the intermediate
+ example where that would be preferable to extending one of the intermediate
classes. The remainder of this section focuses only on some hints for
extending one of the intermediate classes.
@@ -701,7 +866,7 @@ intermediate classes you should use to build your custom matcher are:
- The :py:class:`DatabaseMatcher `
requires implementing only one method called
:py:meth:`query_generator `.
- Tha method needs to create a python dictionary in pymongo syntax that is to
+ That method needs to create a python dictionary in pymongo syntax that is to
be applied to the normalizing collection. That query would normally be
constructed from one or more Metadata attributes in a data object but
time queries may also want to use the data start time and endtime available
@@ -717,7 +882,7 @@ intermediate classes you should use to build your custom matcher are:
The other method,
:py:meth:`db_make_cache_id `,
needs to do the same thing and create identical keys.
- The difference being that
+ The difference between the two is that
:py:meth:`db_make_cache_id `
is used as the data loader to create the dictionary-based cache while
:py:meth:`cache_id `
diff --git a/docs/source/user_manual/numpy_scipy_interface.rst b/docs/source/user_manual/numpy_scipy_interface.rst
new file mode 100644
index 000000000..1b552d3dd
--- /dev/null
+++ b/docs/source/user_manual/numpy_scipy_interface.rst
@@ -0,0 +1,323 @@
+.. _numpy_scipy_interface:
+
+Using numpy/scipy with MsPASS
+============================================
+Overview
+-------------
+The `numpy` and `scipy` packages are heavily utilized for scientific computing
+in a wide range of fields. One of the reasons for that is performance.
+For a number of reasons that are beside the point, python arrays
+have horrible performance. That is particularly true of linear algebra
+operations common in all signal processing algorithms.
+All numpy algorithms we've ever tested are orders of magnitude faster than the same
+algorithm implemented as pure python code with python arrays. Since most
+scipy algorithms are built on top of numpy, the same holds for scipy.
+Similarly, obspy uses numpy arrays exclusively. As a result, an early
+design constraint in MsPASS was to provide as clean an interface to
+numpy/scipy as possible. However, because we chose to implement our own
+core data objects in C++, that presents a disconnect that causes a few
+anomalies in the MsPASS API. This section of the User's Manual
+documents these issues and provides guidance for how to effectively mix
+numpy/scipy algorithms with `TimeSeries`, `Seismogram`, and ensemble objects.
+It does not address a prototype data type we call a `Gather` that can be used
+for uniform ensembles. The `Gather` object is a pure python class that uses
+multidimensional numpy arrays to store sample data and is not subject to
+the issues addressed in this section.
+
+TimeSeries Data
+-----------------
+Recall the mspass `TimeSeries` class is used to abstract the concept of
+a single channel of fixed sample rate seismic data spanning a given time period.
+As such we used the standard signal processing abstraction of storing
+the data as a vector of numbers.
+
+Every computing language that has any role in numerical computing defines
+a syntax for arrays. A vector in this context is a one-dimensional array.
+A problem with python's builtin array is that it isn't actually at all the
+same thing as an array in FORTRAN or C. Both FORTRAN and C define
+an array as a fixed block of memory of a specified length containing
+a sequence of numbers of the same type. (e.g. a C double array of length
+10 is a block of :math:`8*10` bytes.) Numpy performance is achieved by
+redefining an array to assume the data are in a contiguous memory block and
+using wrappers that allow low-level operations to actually be performed
+by compiled C or FORTRAN functions. We do the same thing in MsPASS;
+our data objects do not contain numpy array classes per se, but
+we use the same approach to allow low-level operations to be
+done in C++. Because we use the same concept (arrays are contiguous
+blocks of memory) the arrays can mostly be used interchangeably, but
+not always.
+
+So what is the difference? If you execute this code fragment in MsPASS:
+
+.. code-block:: python
+
+ import numpy as np
+ from mspasspy.ccore.seismic import TimeSeries
+ x_numpy = np.zeros(100)
+ x_mspass = TimeSeries(100)
+ print("Type of numpy object=",type(x_numpy))
+ print("Type of mspass data vector=",type(x_mspass.data))
+
+You should get this output:
+
+.. code-block:: python
+
+  Type of numpy object= <class 'numpy.ndarray'>
+  Type of mspass data vector= <class 'mspasspy.ccore.seismic.DoubleVector'>
+
+The point is that both define conceptually similar things: a
+vector of sample data. In this example
+these are stored with the symbols `x_numpy` and `x_mspass.data`.
+The print statements emphasize they are
+different data types. They are not, however, an apples and oranges
+kind of comparison but more of a Jonathan versus Gala apple comparison.
+Python is very cavalier about type. Many sources call this "duck typing"
+from the cliche "if it walks like a duck and quacks like a duck it must be a duck".
+Because of "duck typing" we can, in most cases, interchange the use of the
+two classes `ndarray` and `DoubleVector`.
+For example, you can nearly always send `DoubleVector` data to a
+numpy or scipy function and it will work just fine. Here are a few
+examples:
+
+.. code-block:: python
+
+ import numpy as np
+ import scipy
+ # Setting data to nonzero values - not essential but otherwise
+ # trivial zero vector will cause problems for the log example
+ for i in range(x_mspass.npts):
+ x_mspass.data[i]=float(i+1)
+ x = np.log10(x_mspass.data)
+ y = np.sqrt(x_mspass.data)
+ z=np.cos(x_mspass.data)
+ yh = scipy.signal.hilbert(x_mspass.data)
+
+That works because the `DoubleVector` is what numpy's documentation
+calls "array like" (a duck type concept). In fact, any numpy or scipy
+function that accepts an "array like" argument can accept the `data`
+member of a `TimeSeries` and work. In reality, `DoubleVector` is
+the python binding of a C++ "vector container" that would be declared as:
+
+.. code-block:: c++
+
+  std::vector<double> x;
+
+In MsPASS we use stock "bindings" from pybind11 that allow us to do
+this magic. The real magic is that in most cases no copying of the
+array data is required. (For C programmers the bindings use
+a pointer (address) not a copy.)
+
+This magic works for most, but not necessarily all, vector operators that
+mix `DoubleVector` and `ndarray` types. e.g. with the symbols above defined
+all of the following will work and do what is expected:
+
+.. code-block:: python
+
+ z=x_numpy-x_mspass.data
+ z=x_mspass.data+x_numpy
+ x_mspass.data -= z
+  z=x_numpy*x_mspass.data # numpy elementwise multiply - not a dot product
+ x_mspass.data *= 10.0
+
+The main thing to avoid is an assignment where the left hand side
+and right hand side resolve to different types. For instance, either of
+the following will fail with a `TypeError` exception if you try to run them:
+
+.. code-block:: python
+
+ z=np.ones(100)
+ x_mspass.data = x_numpy
+ x_mspass.data = x_numpy + z
+
+The solution is to use a python version of a type cast to a `DoubleVector`:
+
+.. code-block:: python
+
+  from mspasspy.ccore.seismic import DoubleVector
+
+  x_mspass.data = DoubleVector(x_numpy)
+  x_mspass.data = DoubleVector(x_numpy + z)
+
+That is necessary because in both cases the right hand side is an
+`ndarray`. The construct used is a call to the bindings of a copy constructor
+for the C++ std::vector container so it seems to come at the cost of a memory copy.
+On the other hand, that is orders of magnitude faster than a python loop to do the
+same operations.
+
+Be aware that if you mix types in a vector operation you can get some surprising
+results. For example, the following code will generate a `TypeError`
+exception:
+
+.. code-block:: python
+
+ ts=TimeSeries(100)
+ z=np.ones(100)
+ x_mspass.data = z + ts.data
+
+while simply reversing the order of `z` and `ts.data` like this
+
+.. code-block:: python
+
+ ts=TimeSeries(100)
+ z=np.ones(100)
+ x_mspass.data = ts.data + z
+
+works and does set `x_mspass.data` to the vector sum of the two symbols
+on the right hand side. The reason is that the + operator in python
+resolves the result to the type of the left side of the + operator
+(python tries the left operand's addition method first).
+Why that is so is "deep in the weeds", but the best advice is to avoid mixed type
+operations if possible. The more robust way to write the above is this:
+
+.. code-block:: python
+
+ ts=TimeSeries(100)
+ z = np.ones(100)
+ x = z + np.array(ts.data)
+ x_mspass.data = DoubleVector(x)
+
+Finally, users who are matlab fans may like to write python code using
+the vector ":" (slice) operator. For example, with numpy, operations like the
+following can be used in a typical numpy algorithm:
+
+.. code-block:: python
+
+ z=np.ones(100)
+ x=z[5:10]
+ print(len(x))
+
+That will print "5" - the length of the subvector extracted in x.
+The : operator DOES NOT work with a `DoubleVector`. If you have an
+algorithm using the colon operator, do the vector operations with
+numpy and then copy the result into the `TimeSeries` data member.
+An important warning about any such assignments to `TimeSeries.data` is that you
+MUST be sure the size of the vector on the right hand side is consistent
+with the `npts` attribute of the object on the left hand side. Consider this segment:
+
+.. code-block:: python
+
+  ts=TimeSeries(100) # creates a TimeSeries with a data vector 100 samples long
+ z=np.ones(100)
+ x=z[5:10]
+ ts.data = DoubleVector(x)
+ print(len(ts.data))
+
+The output of this is "5" showing that, as expected, ts.data is replaced
+by the subvector x. The BIG PROBLEM that can cause is it will create an
+inconsistency in the size set by the `npts` attribute of `TimeSeries`. The
+correct solution for the above is this minor variant calling the `set_npts` method
+of `TimeSeries`
+
+.. code-block:: python
+
+  ts=TimeSeries(100) # creates a TimeSeries with a data vector 100 samples long
+ z=np.ones(100)
+ x=z[5:10]
+ ts.set_npts(5)
+ ts.data = DoubleVector(x)
+ print(len(ts.data))
+
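+A convenient way to avoid that pitfall is to wrap the two steps in a
+small helper function. The following is only a sketch of that pattern;
+the function name is arbitrary and it assumes `DoubleVector` is imported
+from the same ccore module as `TimeSeries`:
+
+.. code-block:: python
+
+  import numpy as np
+  from mspasspy.ccore.seismic import TimeSeries, DoubleVector
+
+  def set_sample_data(ts, x):
+      """
+      Replace the sample vector of ts with the numpy array x while
+      keeping the npts attribute consistent with the new vector length.
+      """
+      x = np.asarray(x, dtype=np.float64)
+      ts.set_npts(len(x))
+      ts.data = DoubleVector(x)
+      return ts
+
+  ts = TimeSeries(100)
+  z = np.ones(100)
+  ts = set_sample_data(ts, z[5:10])
+  print(ts.npts, len(ts.data))   # both print 5
+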
+Seismogram Data
+-----------------
+MsPASS uses a different data type to hold data from three-component sensors.
+The motivation is described in the section titled
+:ref:`Data Object Design Concepts`.
+The result is that the array holding the sample data is a matrix
+instead of a vector. Some implementation details of note are:
+
+1. In MsPASS the matrix has 3 rows and `npts` columns. That means the
+   rows of the matrix can be extracted to create a `TimeSeries`
+   subset of the data while the columns are single "vector" samples
+   of ground motion (see the sketch following this list).
+2. Currently MsPASS uses a simple, lightweight matrix class to store the sample
+ data called :class:`mspasspy.ccore.utility.dmatrix`. A key reason we
+ made that choice was control over methods defined for the class. A
+ case in point is the `shape` attribute that we will see momentarily
+ is important for working with numpy.
+3. For performance `dmatrix` is implemented in C++. The class has methods
+   for all standard matrix operations with python bindings for those operators.
+ The corollary to that statement is that when working with `Seismogram` data
+ use the native operators when possible.
+4. A `dmatrix` stores the array data in a contiguous block of memory
+   in what numpy would call "FORTRAN order" ("C order" is the reverse).
+   That is, the matrix is stored with the row index "varying fastest".
+   Exchanging the sample data stored this way with numpy works like the
+   exchange with the std::vector used in `TimeSeries`: the bindings pass
+   a pointer to the first sample of the contiguous memory block held by
+   the `dmatrix` instance referenced with the symbol `Seismogram.data`.
+5. We use a variation of the same "magic" in the pybind11 binding code
+ that allows the `Seismogram.data` matrix to interact cleanly with
+ numpy and scipy functions that require a matrix. Like the scalar
+ case there are impedance mismatches, however, that can complicate
+ that exchange. We discuss these below.
+
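+The following short sketch illustrates points 1 and 2 of that list. It
+assumes, as described above, that the bindings let a `dmatrix` act as
+"array like" so it can be copied into a numpy array:
+
+.. code-block:: python
+
+  import numpy as np
+  from mspasspy.ccore.seismic import Seismogram
+
+  d = Seismogram(50)        # data matrix with 3 rows and 50 columns
+  print(d.data.shape)       # the shape attribute noted above: (3, 50)
+  # copy one row (one component) into a numpy vector
+  row0 = np.array(d.data)[0, :]
+  print(len(row0))          # 50
+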
+Almost everything discussed above about using `TimeSeries`
+arrays with numpy applies to `Seismogram` as well. For example, although the result is
+meaningless, the following code snippet will run with
+MsPASS and produce the expected result:
+
+.. code-block:: python
+
+ import numpy as np
+ from mspasspy.ccore.seismic import Seismogram
+  x_mspass = Seismogram(100)
+ # initialize the matrix to some meaningless, nonzero values
+ for i in range(3):
+ for j in range(x_mspass.npts):
+ x_mspass.data[i,j]=i+j+1
+ # these apply the math operation for each number in the data matrix
+ z=np.cos(x_mspass.data)
+ z=np.exp(x_mspass.data)
+ z=np.sqrt(x_mspass.data)
+ # matrix operation that multiplies all samples by 2.0
+ x_mspass.data *= 2.0
+ # matrix operator adding content of numpy matrix z to data matrix
+ y=x_mspass.data + z
+ # matrix operation adding y and x_mspass.data matrices
+ x_mspass.data += y
+ # Same thing done with a binary + operator
+ # use a copy to show more likely use
+  z=Seismogram(x_mspass)
+  x_mspass.data = z.data + y
+
+As with `TimeSeries`, an error will occur if the assignment
+yields a type mismatch. For example, the following minor variant of the
+above will run:
+
+.. code-block:: python
+
+ z=Seismogram(x_mspass)
+ y=np.ones([3,100])
+ x_mspass.data = z.data + y
+
+In contrast, the following that might seem completely equivalent
+will raise a `TypeError` exception:
+
+.. code-block:: python
+
+ z=Seismogram(x_mspass)
+ y=np.ones([3,100])
+ x_mspass.data = y + z.data
+
+The reason is that the right hand side resolves to a
+numpy `ndarray` while the left hand side is a `dmatrix`.
+Exactly as above, this simpler assignment will fail the same way:
+
+.. code-block:: python
+
+ x_mspass.data = y
+
+The solution is similar to that we used above for `TimeSeries`:
+
+.. code-block:: python
+
+  from mspasspy.ccore.utility import dmatrix
+
+  x_mspass.data = dmatrix(y)
+
+That is, `DoubleVector` is changed to `dmatrix`.
+
+Finally, the warnings above about the ":" operator with `TimeSeries` apply equally
+to `Seismogram`. If you write python code with the colon operator, do
+all those operations with numpy arrays and use the `dmatrix` copy constructor
+to load the result into the object's `data` array. The warning about
+making sure the `npts` member attribute is consistent with the result
+applies as well. The only thing to remember with a `Seismogram` is that
+`npts` is the number of columns of the data matrix, not the total number of
+samples.
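+
+To close, here is a sketch of that recommendation. It mirrors the
+`TimeSeries` example above: the slicing is done with numpy, `set_npts`
+keeps the column count consistent, and the `dmatrix` copy constructor
+loads the result:
+
+.. code-block:: python
+
+  import numpy as np
+  from mspasspy.ccore.seismic import Seismogram
+  from mspasspy.ccore.utility import dmatrix
+
+  d = Seismogram(100)
+  w = np.ones([3, 100])
+  x = w[:, 5:10]                # colon (slice) operations done with numpy
+  d.set_npts(5)                 # npts is the number of columns
+  d.data = dmatrix(x)
+  print(d.npts, d.data.shape)   # expect 5 and (3, 5)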
diff --git a/docs/source/user_manual/parallel_io.rst b/docs/source/user_manual/parallel_io.rst
new file mode 100644
index 000000000..68774541c
--- /dev/null
+++ b/docs/source/user_manual/parallel_io.rst
@@ -0,0 +1,913 @@
+.. _parallel_io:
+
+Parallel IO in MsPASS
+============================================
+*Gary L. Pavlis*
+--------------------
+
+What is parallel IO and why is it needed?
+--------------------------------------------
+
+There is an old adage in high performance computing that a supercomputer
+is a machine that turns CPU-bound jobs into IO-bound jobs. Any user of
+MsPASS who runs the framework on a sufficiently large number of nodes will
+learn the truth of that adage if they throw enough cores at a data
+processing problem they need to solve. Few seismologists have the
+background in computing to understand the fundamental reasons why that claim is
+true, so I will first briefly review the forms of IO in MsPASS and how they
+can limit performance.
+
+Axiom one to understand about IO is that ALL IO operations have a potential
+bottleneck defined by the size of the pipe (IO channel) the data are
+being pushed/pulled through. If you are moving the data through a
+network channel the highest throughput possible is the number of bytes/s
+that channel can handle. If you are reading/writing data to a local disk
+the most data you can push through is limited by the speed the disk can sustain for reads/writes.
+I next discuss factors in MsPASS that define how MsPASS may or may not saturate
+different IO channels.
+
+Like everything in computing, IO systems have evolved significantly with time.
+At present there are three fundamentally different IO systems that MsPASS
+handles.
+
+1. *Regular file IO*. For decades normal reads and writes to a file system
+   have been defined by a couple of key abstractions. The first is *buffering*.
+ When you "open" a file in any computing language the language creates a data
+ object we can generically call a "file handle" (a "FILE" in C and often
+ called a "file object" in python documentation). Under the hood that file
+ handle always includes a "buffer" that is a block of contiguous memory
+ used by any reads or writes with that file handle. A write happens very fast if
+ the buffer is not full because the pipe speed is defined by memory bandwidth.
+ Similarly, reads are very fast if the data are already loaded into memory.
+ That process has been tuned for decades in operating system code to
+ make such operations as fast as possible. Problems happen when you try to
+ push more data than the buffer size or try to read a block of data larger
+ than the buffer. Then the speed depends upon how fast the disk controller
+ (network file servers for distributed files)
+ can move the data from a disk to the buffer. Similar constraints occur when
+   reading more data than can fit in the buffer. The second concept
+   that is important to understand for
+ regular file IO is *blocking*. All current file IO operations in MsPASS use
+ blocking IO, which is the default for standard IO operations in C++ and python.
+ There are many online articles on this concept, but the simple idea is that
+ when you issue a read or write function call in C++ or python
+ (the two primary languages used directly in MsPASS) the program stops
+ until the operating system defines the operation as completed. That delay
+ will be small if the data written fit in the buffer or being read are already
+   loaded in the buffer. If not, the program sits in a wait state until
+ the operating system says it is ready.
+2. *Network connections*. Internet IO is a much more complex operation than
+   file-based IO. Complexity also means performance is wildly
+   variable. There are at least three fundamental reasons for that in the current
+ environment. First, the two ends of a connection are usually completely
+ independent. For example, MsPASS has support for web-service queries
+ from FDSN data centers. Those sources have wildly different performance, but
+ they all have one thing in common: they have glacial speeds relative to
+ any HPC cluster IO channel. Second, long-haul connections are always
+ slower than a local connection. Packets from remote sources often have
+ to move through dozens of routers while local internet connections in
+   a cluster/cloud system can be many orders of magnitude faster. Finally,
+   in every example I know of, internet data movement involves far more
+   lines of software to handle the complexity of the operations.
+   As a result internet IO operations always consume more CPU cycles than
+   simple file-based IO operations. In MsPASS internet operations are
+ currently limited to web-service interaction with FDSN data centers, but
+ it is expected that cloud computing support that we are planning to
+ develop will involve some form of internet-style IO within the cloud
+ service.
+3. *MongoDB transactions*. In MsPASS we use MongoDB as a database to
+   manage data. MongoDB, like all modern dbms systems, uses a
+ client-server model for all transactions with the database.
+ Without "getting into the weeds", to use a cliche, the key idea
+ is that interactions with MongoDB are all done through a manager
+ (the MongoDB server). That is, an application can't get any
+ data stored in the database without politely asking the manager (server).
+ (Experienced readers will recognize that almost all internet services
+ also use some form of this abstraction.) The manager model
+ will always introduce a delay because any operation requires
+ multiple IO interactions: request operation, server acknowledge,
+ ready, receive, acknowledge success. Some of those are not required in some
+ operations but the point is a conversation is required between the
+ application and the server that always introduces some delay. That
+ delay is nearly always a lifetime in CPU cycles. In our experience a
+   typical single, simple MongoDB transaction like `insert_one` takes on
+   the order of a millisecond or more. That may seem fast to you as a human,
+   but keep in mind that is about a million clock cycles on a modern CPU.
+   On the other hand, "bulk" operations (e.g. `insert_many`) with this model
+   often take about the same length of time as a single document
+   transaction like `insert_one`. In addition, MongoDB "cursor"
+   objects help the server anticipate the next request and also dramatically
+   improve database throughput. The point is that appropriate use
+   of bulk operators can significantly enhance database throughput
+   (a minimal sketch of the contrast follows this list). Below we
+   discuss how bulk reads and writes
+   have been exploited in the parallel readers and
+   writers of MsPASS.
+
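+The following minimal pymongo sketch illustrates the single-transaction
+versus bulk contrast described in item 3. The server address, database
+name, and document contents are placeholders:
+
+.. code-block:: python
+
+  from pymongo import MongoClient
+
+  client = MongoClient("localhost", 27017)
+  col = client["mspass_example"]["wf_TimeSeries"]
+  docs = [{"sta": "XYZ", "seq": i} for i in range(1000)]
+
+  # on the order of 1000 round trips to the server
+  for doc in docs:
+      col.insert_one(dict(doc))
+
+  # a single bulk call - roughly the cost of one transaction
+  col.insert_many(docs)
+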
+.. note::
+
+ At the time this document was written IRIS/Earthscope was in the process of
+ developing a fundamental revision of their system to run in a cloud system.
+   IO in cloud systems has its own complexity that we defer for now.
+   When that system matures and MsPASS evolves to use it, look for updates
+ to this section. Parallel IO and cloud systems have an intimate
+ relation, and we anticipate future use of parallel file systems in the
+ cloud environment could dramatically improve the performance of some workflows.
+
+With that introduction, what then is parallel IO? Like most modern
+computing ideas, the term can mean different things in different contexts.
+In the context of MsPASS at present it means exploiting the parallel
+framework to *reduce* IO bottlenecks. I emphasize *reduce* because
+almost any workflow can become IO bound if you throw enough processors at
+it. Tuning a workflow often means finding the right balance of memory and
+number of CPUs needed to get the most throughput possible. In any case,
+the approach used at present is to utilize multiple workers operating
+independently. Since almost all seismic processing jobs boil down to
+read, process, and write the results, the issue is how to balance the reads
+and writes at the start and end of the chain with CPU tasks in the middle.
+
+In the current implementation of MsPASS parallel IO is centered on
+two functions defined in the module `mspasspy.io.distributed` called
+`read_distributed_data` and `write_distributed_data`. At present the
+best strategy in MsPASS to reduce IO bottlenecks is to use these
+two functions as the first and last steps of a parallel workflow.
+I now describe details of how these two functions operate and limitations of what
+they can do. I finish this section with examples of how the parallel
+readers and writers can be used in a parallel workflow. Note that to
+simplify the API we designed this interface to handle both atomic
+and ensemble objects. The way the two are handled in reads and writes,
+however, are drastically different. Hence, there is a subsection for
+the reader and writer descriptions below for atomic (TimeSeries
+and Seismogram data) and ensemble (TimeSeriesEnsemble and SeismogramEnsemble data).
+
+read_distributed_data - MsPASS parallel reader
+------------------------------------------------
+Overview
+~~~~~~~~~~~
+The `docstring `__
+for this function is an important sidebar to this section. As usual it gives
+details about usage, but also discusses a bit of an oddity of this function
+I reiterate here. The design goal to use a common function name to create a
+parallel container for all seismic data objects and at the same time
+provide the mechanism to parallelize the reader created some anomalies in
+the argument structure. In particular, three arguments to this function
+interact to define the input needed to generate a parallel container
+when the lazy(spark)/delayed(dask) operations are initiated
+(the operation usually initiated by a dask `compute` or pyspark `collect`).
+The docstring for the `read_distributed_data` function gives more details,
+but a key point is that
+arg0 (`data` in the function definition) can be one of three things:
+(1) a :class:`mspasspy.db.database.Database` object, (2) an implementation of
+a `Dataframe` (dask, pyspark, or pandas), or (3) a list of python
+dictionaries that define valid MongoDB queries. The first two options
+can be used to create a parallel dataset of atomic seismic objects.
+Item 3 is the only direct mechanism in MsPASS to create a parallel dataset of
+ensembles. I describe how that works in the two sections below on atomic and
+ensemble data.
+
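+As a quick orientation, the three forms of arg0 look schematically like
+the fragment below. The symbols `db` and `wf_df` are assumed to have been
+created earlier (a :class:`mspasspy.db.database.Database` handle and a
+dask/pyspark/pandas Dataframe of wf documents, respectively):
+
+.. code-block:: python
+
+  from mspasspy.io.distributed import read_distributed_data
+
+  # (1) Database handle - creates a bag/RDD of atomic objects
+  data = read_distributed_data(db, collection="wf_TimeSeries")
+
+  # (2) Dataframe of wf documents - also creates atomic objects
+  data = read_distributed_data(wf_df, collection="wf_TimeSeries")
+
+  # (3) list of MongoDB queries - creates a bag/RDD of ensembles
+  queries = [{"source_id": sid} for sid in db.wf_TimeSeries.distinct("source_id")]
+  data = read_distributed_data(queries, collection="wf_TimeSeries")
+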
+Atomic data
+~~~~~~~~~~~~~
+A bag/RDD of atomic data can be constructed by `read_distributed_data`
+from one of two possible inputs: (1) MongoDB documents retrieved through
+a :class:`mspasspy.db.database.Database` handle driven by an optional
+query, or (2) an implementation of a Dataframe. It is
+important to realize that both cases set the initial
+content of the bag/RDD to the same thing: a sequence of python dictionaries
+that are assumed to be MongoDB documents with sufficient content
+to allow construction of one atomic seismic object from each document.
+
+Forming the initial bag/RDD of documents has different input delay issues for
+Dataframe versus Database input. It is important to recognize the strengths and
+weaknesses of the alternative inputs.
+
+1. *Dataframe*. Both dask and pyspark have parallel
+ implementations for Dataframe. For either scheduler creating the initial bag/RDD
+   amounts to calling converter methods defined for that scheduler. Specifically,
+ for dask we use the `to_bag` method of their Dataframe and for pyspark
+ we run the `to_dict` method to convert the Dataframe to a pyspark RDD.
+   Whether or not this input type is overall faster than reading from
+ a Database depends upon how you create the Dataframe. We implemented
+ Dataframe as an input option mainly to support import of data indexed
+ via a relational database system. In particular, dask and spark both have
+ well-polished interfaces for interaction with any SQL server. In addition,
+ although not fully tested at this writing,
+ an Antelope "wfdisc" table can be imported into a dask or spark Dataframe
+ through standard text file readers. I would warn any reader that the
+   devil is in the details of actually using a
+ relational database via this mechanism, but
+ prototypes demonstrate that approach is feasible for the framework.
+ You should just realize there is not yet any standard solution.
+
+2. *Database*. Creating a bag/RDD of atomic objects
+ from a Database is done with a completely different algorithm but
+ the algorithm uses a similar intermediate container to build the bag/RDD.
+ An important detail we implemented for
+ performance is that the process uses a MongoDB cursor to create an
+ intermediate (potentially large) list of python dictionaries. With
+ dask that list is converted to a bag with the `from_sequence` method.
+ With spark the RDD is created directly from the standard
+   `parallelize` method of `SparkContext`. A key point is that using a
+   cursor to sequentially load the entire data set has a huge impact on
+   speed. The same set of documents loaded sequentially with a MongoDB cursor
+   versus loaded randomly by single-document queries can differ in load time
+   by orders of magnitude (a sketch of the contrast follows this list).
+   The reason is that MongoDB stages (buffers) documents that
+ define the cursor sequence. A sequential read with a cursor is largely
+ limited by the throughput of the network connection between a worker
+ and the MongoDB server. On the other hand, that approach is memory
+ intensive as the `read_distributed_data` by default will attempt to
+ read the entire set of documents into the memory of the scheduler node.
+   Most wf documents when loaded are on the order of 100's of bytes. Hence,
+   a list of a million wf documents will require on the order of 0.1 Gbytes, which
+ on modern computers is relatively small. Anticipating the possibility of
+ even larger data sets in the future, however, `read_distributed_data` has a
+ `scratchfile` option that will first load
+ the documents into a scratch file and use an appropriate dask or
+ spark file-reader to reduce the memory footprint of creating the
+ bag/RDD.
+
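+The following sketch shows the contrast noted in item 2 using the
+MongoDB interface directly. Both fragments load the same documents, but
+the second issues one server round trip per document and is dramatically
+slower for a large collection:
+
+.. code-block:: python
+
+  # sequential load with a cursor - MongoDB buffers batches of documents
+  doclist = [doc for doc in db.wf_TimeSeries.find({})]
+
+  # the slow alternative - one query per document
+  idlist = [doc["_id"] for doc in doclist]
+  doclist_slow = [db.wf_TimeSeries.find_one({"_id": oid}) for oid in idlist]
+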
+Both input modes create an intermediate bag/RDD equivalent to a large
+list of documents. The function internally contains a map operator that
+calls the constructor for either `TimeSeries` or `Seismogram` objects
+from the attributes stored in each document. The output of the
+function with atomic data is then an bag/RDD of the atomic data
+defined by the specified collection argument. Note that any
+datum whose constructor fails in the reader will have the boolean
+attribute with the key `is_abortion` set True. The name is
+appropriate since objects that fail on construction are "unborn".
+
+Ensemble data
+~~~~~~~~~~~~~~~
+Building a bag/RDD of ensembles is a very different problem than building
+a bag/RDD of atomic objects. The reason is that ensembles are more-or-less
+grouped bundles of atomic data. In earlier versions of MsPASS we
+experimented with assembling ensembles with a reduce operator.
+That can be done and it works, BUT is subject to a very serious memory
+hogging problem as described in the section on
+:ref:`memory management`. For that reason, we
+implemented some complexity in `read_distributed_data` to reduce
+the memory footprint of a parallel job using ensembles.
+
+We accomplished that in `read_distributed_data` by using a completely
+different model to tell the function that the input is expected to
+define an ensemble. Specifically, the third option for the type of
+arg0 (`data` symbol in the function signature) is a list of python
+dictionaries. Each dictionary is ASSUMED to be a valid MongoDB query
+that defines the collection of documents that can be used to
+construct the group defining a particular ensemble. Between the oddity of
+MongoDB's query language and the abstraction of what an ensemble means
+it is probably best to provide an example to clarify what I mean.
+The following can be used to create a bag of `TimeSeriesEnsemble`
+objects that are a MsPASS version of a "common source/shot gather":
+
+.. code-block:: python
+
+ srcid_list = db.wf_TimeSeries.distinct('source_id')
+ querylist=[]
+ for srcid in srcid_list:
+ querylist.append({'source_id' : srcid})
+ data = read_distributed_data(querylist,collection='wf_TimeSeries')
+ ... processing code goes here ...
+
+Notice that the for loop creates a list of python dictionaries
+that when used with the MongoDB collection find method will
+yield a cursor. Internally the function iterates over that cursor
+to load the atomic data to create ensemble container holding
+that collection of data. A weird property of that concept
+in this context, however, is that when and where that happens
+is controlled by the scheduler. That is, I reiterate that
+`read_enemble_data` only creates the template defining the task the
+workflow has to complete to "read" the data and emit a container of,
+in this case, `TimeSeriesEnsemble` objects. That is why this
+is a parallel reader because for this workflow the scheduler would
+assign each worker a read operation for one ensemble as a task.
+Hence, constructing the ensembles, like the atomic case above, is
+always done with one each worker initiating a processing chain by
+constructing, in the case above, a `TimeSeriesEnsemble` that is passed
+down the processing chain.
+
+There is one other important bit of magic in the `read_distributed_data`
+that is important to recognize if you need to maximize input speed.
+`read_distributed_data` can exploit a feature of
+:class:`mspasspy.db.database.Database` that can dramatically reduce
+reading time for large ensembles. When reading ensembles, if the
+`storage_mode` argument is set to "file", the data were originally
+written with the (default) format of "binary", and the file grouping
+matches the ensemble grouping (e.g. for the source grouping example above
+the sample data are stored in files grouped by source_id), then
+there is an optimized algorithm to load the data. In detail, the
+algorithm sorts the inputs by the "foff" attribute and reads the sample
+data sequentially with a C++ function using the low-level binary
+C function `fread`. That algorithm can be very fast as buffering
+creates minimal delays in successive reads and, more importantly,
+reduces the number of file open/close pairs compared to
+a simpler iterative loop with atomic readers.
+See the docstring of :class:`mspasspy.db.database.Database` for details.
+
+write_distributed_data - MsPASS parallel writer
+---------------------------------------------------
+Overview
+~~~~~~~~~~
+Parallel writes present a different problem from reading.
+The simplest, but not fastest, approach to writing data is to use the
+`save_data` method of :class:`mspasspy.db.database.Database` in a loop.
+Here is a (not recommended) way to terminate a workflow in that way
+for a container of `TimeSeries` objects:
+
+.. code-block:: python
+
+ ... Processing workflow above with map/reduce operations ...
+ data = data.map(db.save_data,storage_mode='file')
+
+Although the save operation will operate in parallel it has two
+hidden inefficiencies that can increase run time.
+
+1. Every single datum will invoke at least one transaction with the MongoDB
+ server to save the wf document created from the Metadata of each
+ `TimeSeries` in the container. Worse, if we had used the default
+ storage_mode of "gridfs" the sample data for each datum would have to
+ be pushed through the same MongoDB server used for storing the wf documents.
+2. Each save in this algorithm requires an open, seek, write, close operation on a particular
+ file defined for that datum.
+
+The `write_distributed_data` function was designed to reduce these known
+inefficiencies. For the first, the approach used for both atomic and
+ensemble data is to do bulk database insertions. At present the only
+mechanism for reducing the impact of item 2 is to utilize ensembles
+and store the waveform data in naturally grouped files. (see examples below)
+
+.. note::
+
+   A possible strategy to improve IO performance with atomic operations
+   is to use one of several implementations of parallel file systems.
+   With that model the atomic-level open/close inefficiency could
+   potentially be removed. The MsPASS team has experimented with this
+   approach, but there is currently no standardized support
+   for that feature. Future releases may add that capability.
+
+An additional issue with saving data stored in a bag/RDD is memory
+use. That is, most online examples of using dask or spark terminate
+a workflow with a call to the scheduler's method used to convert a
+lazy/delayed/futures entity (bag or RDD) into a result.
+(The dask function is `compute` and the comparable pyspark function is `collect`.).
+Prior to V2 of MsPASS the `save_data` function, if used as above, would return
+a copy of the datum it saved. In working with large data sets we learned
+that following such a save with `compute` or `collect` could easily abort
+the workflow with a memory fault. The reason is that the return of `compute`
+and `collect` when called on a bag/RDD is an (in memory) python list of
+the data in the container. To reduce the incidence of this problem
+beginning in V2 `save_data` was changed to return only the ObjectId of the
+saved waveform by default. For the same reason `write_distributed_data`
+does something similar; it returns a list of the ObjectIds of saved
+wf documents. In fact, that is the only thing it ever returns.
+That has a very important corollary that all users must realize;
+`write_distributed_data` can ONLY be used as the termination of a
+distinct processing sequence. It cannot appear as the function to be applied in
+a map operator. In fact, users must recognize that unless it
+aborts with a usage exception, `write_distributed_data`
+always calls dask bag's `compute` method or pyspark's rdd `collect` method
+immediately before returning. That means that `write_distributed_data`
+always initiates any pending delayed/lazy computations defined
+earlier in the script for the container.
+Here is a typical fragment for atomic data:
+
+.. code-block:: python
+
+ data = read_distributed_data(db,collection='wf_TimeSeries')
+ data = data.map(detrend)
+ # other processing functions in map operators would typically go here
+ wfidslist = write_distributed_data(data,collection='wf_TimeSeries')
+ # wfidslist will be a python list of ObjectIds
+
+Saving an intermediate copy of a dataset within a workflow is
+currently considered a different problem than that solved by
+`write_distributed_data`. Example 4 in the "Examples" section below
+illustrates how to do an intermediate save.
+
+In addition to efficiency concerns, users should also always keep in mind
+that before starting a large processing task they should be sure the
+target of the save has sufficient storage to hold the processed data.
+The target of all saves is controlled at the top level by the
+`storage_mode` argument. There are currently two options.
+When using the default of "gridfs" keep in mind the data sample will be stored
+in the same file system as the database. When `storage_mode="file"` is
+used the storage target depends upon how the user chooses to set the two
+attributes `dir` and `dfile`. They control the file names where the
+sample data will be written. Below I describe how to set these two
+attributes in each datum of a parallel dataset.
+
+Atomic data
+~~~~~~~~~~~~~~
+Atomic data are handled in three stages by `write_distributed_data`.
+These three stages are a pipeline with a bag/RDD entering the top of the
+pipeline and a list of ObjectIds flowing out the bottom.
+
+1. The sample data of all live data are saved (the sample data for any datum
+   marked dead are always dropped). That operation occurs in a map operator
+ so each worker performs this operation independently. Note the limitation
+ that with gridfs storage all that data has to be pushed through the
+ MongoDB server. For file storage an open, seek, write, close operation is
+ required for each datum. If multiple workers attempt to write to the same
+ file, file locking can impact throughput. Note that is not, however,
+ at all a recommendation to create one file per datum. As discussed
+ elsewhere that is a very bad idea with large data sets.
+2. Documents to be saved are created from the Metadata of each live datum.
+ The resulting documents are returned in a map operation to overwrite the
+ data in the bag/RDD.
+ At the same stage dead data are either "buried" or "cremated". The
+ former can be a bottleneck with large numbers of data marked dead as it
+ initiates a transaction with MongoDB to save a skeleton of each dead datum in
+ the "cemetery" collection. If the data are "cremated" no record of them
+ will appear in the Database.
+3. The documents that now make up the bag/RDD are saved to the Database.
+ The speed of that operation is enhanced by using a bulk insert by
+ "partition" (bag and RDD both define the idea of a partition. See the
+ appropriate documentation for details.) That reduces the number of
+ transactions with the MongoDB to the order of :math:`N/N_p` where :math:`N` is the
+ number of atomic data and :math:`N_p` is the number of partitions defined for
+ the bag/RDD. Said another way, that algorithm reduces the time to save
+ the wf documents by approximately a factor of :math:`1/N_p`.
+
+What anyone should conclude from the above is that there are many complexities
+that can produce large variances in the performance of a write operation
+with `write_distributed_data`. The sketch below illustrates the partition
+concept that controls the granularity of the bulk inserts.
+
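+The following is a generic dask illustration of that partition concept;
+it is not the MsPASS reader or writer itself, and the partition count is
+an arbitrary choice. A bag of 1000 documents split into 20 partitions
+means the writer would issue roughly 20 bulk inserts rather than 1000
+single-document inserts:
+
+.. code-block:: python
+
+  import dask.bag as bag
+
+  doclist = [{"seq": i} for i in range(1000)]
+  docs = bag.from_sequence(doclist, npartitions=20)
+  print(docs.npartitions)   # 20
+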
+Ensembles
+~~~~~~~~~~~~
+Ensembles, in many respects, are simpler to deal with than atomic data.
+The grouping that is implicit in the concept of what defines
+an ensemble may, if properly exploited, add a level of
+homogeneity that can significantly improve write performance relative to the
+same quantity of data stored as a bag/RDD of atomic objects. With
+ensembles the bag/RDD is assumed to be a container full of a common type of
+ensemble. Like the atomic case the algorithm is a pipeline
+(set of map operators) with ensembles entering the top and ObjectIds
+exiting the pipeline that are returned by the function. An anomaly
+is that with ensembles the return is actually a list of lists of
+ids, with one list per ensemble containing
+the ids saved from that ensemble. In addition, the pipeline is
+streamlined to two stages (tasks) run through map operators:
+
+1. Like the atomic case the first thing done is to save the sample data.
+ A special feature of the ensemble writer is that if the
+ storage_mode argument is set to 'file' and the format is not
+ changed from the default ('binary'), the sample data will be written
+ in contiguous blocks provided 'dir' and 'dfile' are set in the
+ ensemble Metadata container. In that situation the operation is done with
+ a C++ function using fwrite in a mode limited only by the speed of
+ the target file system. As with the comparable feature noted
+ above for the reader a further huge advantage this gives is
+ reducing the number of file open/close pairs to the number of
+ ensembles in the data set.
+2. A second function applied through a map operator does two
+ different tasks that are done separately in the atomic algorithm:
+ (a) translation of each member's Metadata container to a
+ python dictionary that is suitable as a MongoDB document, and
+ (b) actually posting the documents to the
+ defined collection with the MongoDB interface.
+ There are two reasons these are merged in this algorithm.
+ The first is that grouping for a bulk insert is natural
+ with an ensemble. The function calls insert_many on the collection
+ of documents constructed from live members of the ensemble.
+ The second is an efficiency in handling dead data.
+ A problem arises because of the fact that
+ ensemble members can be killed one of two ways: (a) they arrive
+ at the writer dead, or (b) the conversion from Metadata to
+ a document has a flaw that the converter flags as invalid.
+ The first is normal. The second can happen if required Metadata
+ attributes are invalid or, more commonly, if the `mode` argument
+ is set as "cautious" or "pedantic". In both cases the contents of
+ dead data are, by default, "buried". Like the atomic case large
+ numbers of dead data can create a performance hit as each dead datum
+ has a separate funeral (inserting a document in the "cemetery"
+ collection). In contrast, the documents for live data are saved
+ with bulk write using the MongoDB `insert_many` collection method
+ as noted above.
+ With the same reasoning as above this algorithm reduces database transaction
+ time for this writer by a factor of approximately :math:`1/N_m`
+   where :math:`N_m` is the average number of live ensemble members
+ in the data set.
+
+
+
+Handling storage_mode=="file"
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+There are major complications in handling any data where the output is
+to be stored as a set of files. There are two technical issues
+users may need to consider if tuning performance is essential:
+
+1. Read/write speed of a target file system can vary by orders of
+ magnitude. In most cases speed comes at a cost and the storage space
+ available normally varies inversely with IO speed. Our experience
+ is that if multiple storage media are available the fastest
+ file system should be used as the target used by MongoDB for
+ data storage. The target of sample data defined by the schema
+ you use for `dir` and `dfile` may need to be different to assure
+ sufficient free space.
+2. Many seismologists favor the model required by SAC for data storage.
+ SAC requires a unique file name for each individual datum.
+ (SAC only supports what we call a `TimeSeries`.)
+ *For a large data set that is always a horrible model for defining
+ the data storage layout.* There are multiple reasons that
+ make that statement a universal truth today, but the reasons are
+ beside the point for this manual. Do a web search if you want to
+ know more. The point is your file definition model should
+   never be one file per atomic datum unless your data set is small.
+ On the other hand, the opposite end member of one file for the
+ entire data set is an equally bad idea for the present implementation
+ of MsPASS. If all workers are reading or writing to a single
+ file you are nearly guaranteed to throttle the workflow from
+ contention for a single resource (file locking). Note we have
+ experimented with parallel file containers that may make the
+ one file for the dataset model a good one, but that capability is
+ not ready for prime time. For now the general solution is to
+ define the granularity in whatever structure makes sense for your
+ data. e.g. if you are working with event data, it usually makes sense
+ to organize the files with one file per event.
+
+With those caveats, I now turn to the problem of how you actually
+define the file layout of files saved when you set `storage_mode='file'`
+when running `write_distributed_data`.
+
+The problem faced in producing a storage layout is that different
+research projects typically need a different layout to define some
+rational organization. MsPASS needs to support the range of options
+from a single unique file name for each atomic datum saved to
+all the data stored in one and only one file. As noted above, for
+most projects the layout requires some series of logically defined
+directories with files at the leaves of the directory tree.
+The approach we used utilizes Metadata (MongoDB document) attributes
+with key names
+borrowed from CSS3.0. They are:
+
+- `dir` directory name. This string can define a full or relative path.
+- `dfile` file name at the leaf node of a file system path.
+- `foff` is the file offset in bytes to the first byte of data for a given datum. It is essential when multiple data objects are saved in the same file. Readers use a "seek" method to initiate the read at that position.
+- `npts` the number of samples that define the signal for an atomic datum.
+ Note that for `TimeSeries` data with default raw output that translates to
+ :math:`8 \times npts` bytes and for `Seismogram` objects the
+  size is :math:`3 \times 8 \times npts` (a worked example follows this list).
+- When using a format other than the default of "binary", we use the
+ `nbytes` argument to define the total length of binary data to be loaded.
+ That is necessary with formatted data because every format has a different
+ formula to compute the size.
+
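+As a worked example of the size formulas in the list above, a datum with
+`npts` of 1000 occupies the following number of bytes when written with
+the default "binary" format:
+
+.. code-block:: python
+
+  npts = 1000
+  timeseries_bytes = 8 * npts        # 8000 bytes for a TimeSeries
+  seismogram_bytes = 3 * 8 * npts    # 24000 bytes for a Seismogram
+  # foff for each datum appended to a common file is the running sum of
+  # the sizes of the data written before it
+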
+In writing data to files, the first two attributes (`dir` and `dfile`)
+have to be defined for the writer as input.
+The others are computed and stored on writing in the document associated
+with that datum when the save is successful. Rarely, if ever, do
+you want to read from files and have the writer use the same file to write
+the processed result. That is, in fact, what will happen if you
+read from files and then run `write_distributed_data` with
+`storage_mode='file'`. Instead, you need a way to set/reset the values of
+`dir` and `dfile` for each datum.
+Note that "datum" in this context can
+be either each atomic datum or ensemble objects. The default behavior
+for ensembles is to have all ensemble members written to a common file
+name defined by the `dir` and `dfile` string defined in the ensemble's
+Metadata container. In either case, the recommended way to set the
+`dir` and `dfile` arguments is with a custom function passed through a
+map operator. Perhaps the easiest way to see this is to give an
+example that is a variant of that above:
+
+.. code-block:: python
+
+  # ErrorSeverity is needed for the elog entry below
+  from mspasspy.ccore.utility import ErrorSeverity
+
+  def set_names(d):
+      """
+      Example setting dir and dfile from Metadata attributes assumed
+ to have been set earlier. Example sets a constant dir value
+ with file names set by the string representation of source_id.
+ """
+ dir = 'wf/example_project' # sets dir the same for all data
+      # this makes setting dfile always resolve and not throw an exception
+      # the elog entry demonstrates good practice in handling such errors
+ if 'source_id' in d:
+ dfile = "source_{}.dat".format(str(source_id))
+ else:
+ dfile="UNDEFINED_source_id_data"
+ message = "set_names (WARNING): source_id value is undefined for this datum\n"
+ message += "Data written to default dfile name={}".format(dfile)
+ d.elog.log_error(message,ErrorSeverity.Complaint)
+ d['dir'] = dir
+ d['dfile'] = dfile
+ return d
+ data = read_distributed_data(db,collection='wf_TimeSeries')
+ data = data.map(detrend)
+ # other processing functions in map operators would typically go here
+ data = data.map(set_names)
+ wfidslist = write_distributed_data(data,collection='wf_TimeSeries')
+ # wfidslist will be a python list of ObjectIds
+
+A final point for this section is that to make the writer as robust as
+possible there is a default behavior to handle the case where
+`dir` and/or `dfile` are not defined. The default for `dir` is
+the current (run) directory. The handling of `dfile` is more elaborate.
+We use a "uuid generator" to create a unique string to define dfile.
+Although that makes the save robust, be aware this creates the very
+case we stated above should never ever be used: the SAC model with
+one file name per datum.
+
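+For reference, the fallback file name is conceptually like the fragment
+below; the exact form the writer uses may differ, so treat this only as
+an illustration of why every datum then ends up in its own file:
+
+.. code-block:: python
+
+  import uuid
+
+  # one unique name per datum - exactly the SAC-style layout to avoid
+  dfile = str(uuid.uuid4())
+  print(dfile)   # e.g. 1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed
+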
+Examples
+----------
+Example 1: Default read/write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This example illustrates the simplest way of initiating a workflow
+with `read_distributed_data` and terminating it with `write_distributed_data`.
+It also illustrates a couple of useful generic tests to verify
+things went as expected:
+
+.. code-block:: python
+
+ # Assumes imports and db defined above
+ data = read_distributed_data(db,collection='wf_TimeSeries')
+ # processing functions in map operators would go here
+ #
+  # note we don't call compute/collect after write_distributed_data
+  # because it initiates the lazy computations itself
+ wfidslist = write_distributed_data(data,
+ collection='wf_TimeSeries',
+ data_tag='example1',
+ )
+ # this will give the maximum number of data possible to compare to nwf
+ print("Size of list returned by write_distributed_data=",len(wfidslist))
+ # This is the number actually saved
+ nwf = db.wf_TimeSeries.count_documents({'data_tag' : 'example1'})
+ print("Number of saved wf documents=",nwf)
+ # this works only if cemetery was empty at the start of processing
+ ndead = db.cemetery.count_documents({})
+ print("Number of data killed in this run=",ndead)
+
+Note the reader always reads the data as directed by attributes of
+the documents in the `wf_TimeSeries` collection.
+The writer defaults to writing sample data to `gridfs` and wf documents to the same collection,
+but with a `data_tag` used to separate data being written from the
+input indexed in the same collection.
+
+Example 2: atomic writes to file storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This example is a minor variant of the example in the section discussing
+how `dir` and `dfile` are used with file IO above. There are three differences:
+
+1. It organizes output into directories defined by SEED station code and
+   writes all the data from a given year to a separate file named for that year
+   (e.g. path = "II_AAK_BHZ_00/1998").
+2. The reader accesses the wf_miniseed collection. That assures the seed
+ station codes should be defined for each datum.
+3. I use the pyspark variant which requires the SparkContext constructs
+   seen in this example.
+
+.. code-block:: python
+
+  # UTCDateTime (from obspy) is used below to extract the year from the start time
+  from obspy import UTCDateTime
+
+  def set_dir_dfile(d):
+ """
+ Function used to set dir and dfile for example 2.
+
+ dir is set to a net_sta_chan_loc name (e.g. II_AAK_BHZ_00) and
+ dfile is set to the year of the start time of each datum.
+      Used in a map call so the input d is assumed to be a TimeSeries.
+ Important: assumes the seed codes are set with the
+ fixed keys 'net','sta','chan', and 'loc'. That works in this
+ example because example uses miniseed data as an origin.
+ Edited copy is returned. Dead data are returned immediately with no change.
+ """
+ if d.dead():
+ return d
+ if d.is_defined('net'):
+ net=d['net']
+ else:
+ net=''
+ if d.is_defined('sta'):
+ sta=d['sta']
+ else:
+ sta=''
+ if d.is_defined('chan'):
+ chan=d['chan']
+ else:
+ chan=''
+ if d.is_defined('loc'):
+ loc=d['loc']
+ else:
+ loc=''
+ # notice that if none of the seed codes are defined the directory
+ # name is three "_" characters
+ dir = net+'_'+sta+'_'+chan+'_'+loc
+ d['dir'] = dir
+ t0 = d.t0
+ year = UTCDateTime(t0).year
+      d['dfile'] = str(year)
+ return d
+
+ # these are needed to enable spark instead of dask defaults
+ import pyspark
+ sc = pyspark.SparkContext('local','example2')
+ # Assume other imports and definition of db is above
+ data = read_distributed_data(db,
+ collection='wf_miniseed',
+ scheduler='spark',
+ spark_context=sc,
+ )
+ data = data.map(lambda d : set_dir_dfile(d)) # spark syntax
+ # processing functions in map operators would go here
+ #
+  # note we don't call compute/collect after write_distributed_data
+  # because it initiates the lazy computations itself
+ wfidslist = write_distributed_data(data,
+ collection='wf_TimeSeries',
+ storage_mode='file',
+ scheduler='spark',
+ spark_context=sc,
+ data_tag='example2',
+ )
+ # These are identical to example 1
+ # this will give the maximum number of data possible to compare to nwf
+ print("Size of list returned by write_distributed_data=",len(wfidslist))
+ # This is the number actually saved
+ nwf = db.wf_TimeSeries.count_documents({'data_tag' : 'example2'})
+ print("Number of saved wf documents=",nwf)
+ # this works only if cemetery was empty at the start of processing
+ ndead = db.cemetery.count_documents({})
+ print("Number of data killed in this run=",ndead)
+
+Example 3: Parallel read/write of ensembles
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This example illustrates some special considerations needed to handle
+ensembles. Features this example illustrates are:
+
+#. The example reads and forms `TimeSeriesEnsemble` objects grouped by
+ the `source_id` attribute. The algorithm shown will only work if
+ a previous workflow has set the `source_id` value in each datum.
+ Any datum without `source_id` defined would be dropped from this
+ dataset.
+#. We show the full set of options for normalization with ensembles.
+ Ensemble normalization is complicated by the fact that there are two
+ completely different targets for normalization: (a) ensemble Metadata, and
+ (b) each (atomic) ensemble member. The reader in this example
+ does that at load time driven by the two arguments:
+ `normalize_ensemble` and `normalize`. As the names imply
+ `normalize_ensemble` is applied to the ensemble's Metadata
+ container while the operators defined in `normalize` are applied in
+ a loop over members. This example loads source data in the
+ ensemble and channel data into ensemble members.
+#. This example uses the `container_to_merge` option, a complexity
+   required as an implementation detail to support normalization
+   by ensemble in the parallel reader. That option provides a generic way
+   to merge a consistent bag/RDD of other data into the container
+   constructed by `read_distributed_data`. By "consistent" I
+   mean the size and number of partitions in the bag/RDD passed
+   with that argument must match that of the container being constructed
+   by `read_distributed_data`. In this case, using that argument
+   loads a `source_id` value into the ensemble Metadata of each
+   component of the `data` container constructed by `read_distributed_data`.
+   The reader is structured so that the algorithm to merge the two
+   containers is run before attempting any normalization
+   (i.e. any normalization defined by either `normalize` or `normalize_ensemble`).
+#. The writer uses `storage_mode='file'` and the default "format". As
+   noted above, when undefined the format defaults to a raw binary
+   write of the sample data to files with the C fwrite function.
+   We set `dir` and `dfile` in the ensemble's Metadata container,
+   which the writer takes as a signal to write all ensemble data in
+   the same file defined by the ensemble `dir` and `dfile`.
+
+.. code-block:: python
+
+ def set_dir_dfile(d):
+ """
+ Function used to set dir and dfile for example 3.
+
+      This example sets dir as a constant and builds the file
+      name, which is used by the whole ensemble, from the source_id string
+ and a constant suffix of .dat
+ """
+ if d.dead():
+ return d
+ dir='wf_example3'
+ suffix='.dat'
+ # this example can assume source_id is set. Could not get
+ # here otherwise
+ srcid = d['source_id']
+ dfile = str(srcid)
+ dfile += suffix
+ d['dir']=dir
+ d['dfile']=dfile
+ return d
+
+ # Assume other imports and definition of db is above
+ # This loads a source collection normalizer using a cache method
+  # for efficiency. Note it is used in read_distributed_data below
+ source_matcher = ObjectIdMatcher(db,
+ "source",
+ attributes_to_load=['lat','lon','depth','time','_id'])
+
+ srcid_list = db.wf_miniseed.distinct('source_id')
+ querylist=[]
+ for srcid in srcid_list:
+ querylist.append({'source_id' : srcid})
+  # dask.bag is assumed imported above as:  import dask.bag as bag
+  source_bag = bag.from_sequence(querylist)
+ # This is used to normalize each member datum using miniseed
+ # station codes and time
+ mseed_matcher = MiniseedMatcher(db)
+ data = read_distributed_data(db,
+ collection='wf_miniseed',
+ normalize=[mseed_matcher],
+ normalize_ensemble=[source_matcher],
+ container_to_merge=source_bag,
+ )
+ # algorithms more appropriate for TimeSeries data would be run
+ # here with one or more map operators
+
+ # normalization with channel by mseed_matcher allows this
+  # fundamental algorithm to be run. Converts TimeSeriesEnsemble
+ # objects to SeismogramEnsemble objects
+ data = data.map(bundle)
+
+ # other processing functions for Seismogram in map operators would go here
+
+ # finally set the dir and dfile fields
+ data = data.map(set_dir_dfile)
+  # note we don't call compute/collect after write_distributed_data
+  # because it initiates the lazy computations itself
+ wfidslist = write_distributed_data(data,
+ collection='wf_Seismogram',
+ storage_mode='file',
+ data_tag='example3',
+ )
+ # These are identical to example 1
+ # this will give the maximum number of data possible to compare to nwf
+ print("Size of list returned by write_distributed_data=",len(wfidslist))
+ # This is the number actually saved
+ nwf = db.wf_Seismogram.count_documents({'data_tag' : 'example3'})
+ print("Number of saved wf documents=",nwf)
+ # this works only if cemetery was empty at the start of processing
+ ndead = db.cemetery.count_documents({})
+ print("Number of data killed in this run=",ndead)
+
+Example 4: Intermediate processing result save
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+It is sometimes necessary, particularly in a research context,
+to have a workflow save an intermediate result. In the context of
+a parallel workflow, that means one needs to do a save
+within a sequence of calls to map/reduce operators.
+As noted above `write_distributed_data` always is a terminator
+for a chain of lazy/delayed calculations. It always returns
+some version of a list of ObjectIds of the saved wf documents.
+
+One approach for an intermediate save is to immediately
+follow a call to `write_distributed_data` with a call to
+`read_distributed_data`. In practice that approach
+is often the most useful one in this context. Often the reason for
+an intermediate save is to verify things are working as you
+expected. In that case, you likely will want to
+explore the data a bit before moving on anyway.
+e.g. I usually structure work with MsPASS into a set of
+notebooks where each one ends up with the data set saved in a particular,
+often partially processed, state.
+
+An alternative that can be useful for intermediate saves
+is illustrated in the following example:
+
+.. code-block:: python
+
+ data = read_distributed_data(db,collection='wf_TimeSeries')
+ # set of map/reduce operators would go here
+ data = data.map(lambda d : db.save_data(
+ d,
+ collection='wf_TimeSeries',
+ data_tag='save1',
+ return_data=True,
+      ))
+ # more map/reduce operators
+ wfids = write_distributed_data(data,
+ collection='wf_TimeSeries',
+ data_tag='finalsave',
+ )
+
+where we used mostly defaults on all the function calls to keep the
+example simple. Rarely would that be the right usage.
+A critical feature is the `return_data=True` option sent to the
+`save_data` method of `Database`. With that option the method
+returns a copy of the atomic datum it received with additions/changes
+created by the saving operation.
+
+The approach above is most useful for production workflows where
+the only purpose of the intermediate save is as a checkpoint in the
+event something fails later in the workflow and you need the
+intermediate result because it was expensive to compute. As noted
+above, it may actually be faster to do the following instead:
+
+.. code-block:: python
+
+ data = read_distributed_data(db,collection='wf_TimeSeries')
+ # set of map/reduce operators would go here
+ write_distributed_data(data,
+ collection='wf_TimeSeries',
+ data_tag='save1',
+ )
+ data = read_distributed_data(db,
+ query={'data_tag' : 'save1'},
+ collection='wf_TimeSeries')
+
+ # more map/reduce operators
+ wfids = write_distributed_data(data,
+ collection='wf_TimeSeries',
+ data_tag='finalsave',
+ )
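+
+If the goal of the intermediate save is verification, a quick sanity
+check is to count the documents written with the intermediate
+`data_tag`, mirroring the checks shown in Example 3:
+
+.. code-block:: python
+
+    # number of wf documents created by the intermediate save
+    n_intermediate = db.wf_TimeSeries.count_documents({'data_tag' : 'save1'})
+    print("Number of wf_TimeSeries documents with data_tag=save1:", n_intermediate)
+    # data killed along the way are recorded in the cemetery collection
+    print("Number of documents in cemetery:", db.cemetery.count_documents({}))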
diff --git a/docs/source/user_manual/parallel_processing.rst b/docs/source/user_manual/parallel_processing.rst
index 12757f21e..1fa50e2c4 100644
--- a/docs/source/user_manual/parallel_processing.rst
+++ b/docs/source/user_manual/parallel_processing.rst
@@ -1,4 +1,4 @@
-.. _parallel_processesing:
+.. _parallel_processing:
Parallel Processing
===========================
@@ -27,7 +27,7 @@ while migration takes an input ensemble and produces a (modified) ensemble.
All these examples can be implemented in MsPASS framework to take advantage
of the repetitious nature of the processors. That is, all can be the thought of
as black boxes that take one or more seismic data objects as input and emit
-one or more modified data objects (not necessarily of the same type or number as
+one or more modified data objects (not necessarily of the same type or number as
the inputs). A full processing workflow is assembled by chaining a series of
processing steps with data moving between the processing steps by a system
dependent process. Dask and Spark generalize this by fragmenting the processing
@@ -36,7 +36,7 @@ process largely opaque to you as the user. An added benefit that is less
obvious is that lower-level parallel granularity is possible for a
python function designed to exploit the capability. This user manual,
however, focuses exclusively on algorithm level granularity. We refer
-the reader to extensive printed and online documentation for Dask and Spark
+the reader to extensive printed and online documentation for Dask and Spark
for functionality needed to add parallelism inside an individual python
function.
@@ -59,11 +59,11 @@ Resiliant Distributed Dataset (RDD) while Dask calls the same thing a "bag".
In MsPASS the normal content of a bag/RDD is a dataset made up of *N*
MsPASS data objects: TimeSeries, Seismogram, or one of the ensemble of
-either of the atomic types. An implicit assumption in the current
+either of the atomic types. An implicit assumption in the current
implementation is that any processing
was proceeded by a data assembly and validation phase.
Procedures for data import and data validation
-are described in the section :ref:`importing_data`.
+are described in the section :ref:`importing_data`.
A key concept here is that "importing" a data set to MsPASS means the
entire data set has been assembled and loaded either into the gridfs
file system internal to MongoDB or an index has been built for all data
@@ -72,7 +72,7 @@ be a mix of the two
because the MsPASS readers abstract the entire process of loading data
for processing. In addition, any required normalization data (i.e.
data in the *site*, *channel*, and/or *source* collections) must be
-loaded and cross-referencing validated as described in :ref:`database_normalization`.
+loaded and cross-referencing validated as described in :ref:`normalization`.
Once the data set is fully assembled the bag/RDD defining the abstraction of the
data is easily defined by a variant of the following code section:
@@ -110,9 +110,9 @@ Overview
-----------
A second key concept we utilize for processing algorithms in MsPASS is the
abstraction of
-`functional programming `__,
+`functional programming`__,
which is a branch of programming founded on
-`lambda calculus `__.
+`lambda calculus`__.
For most seismologists that abstraction is likely best treated only as
a foundational concept that may or may not be helpful depending on your
background. It is important, however,
@@ -133,25 +133,11 @@ and
Noting that Spark calls the later operation the (more common) name *reduce*.
These two constructs can be thought of as black boxes that handle inputs
-as illustrated below in Figure :numref:`map_reduce_figure`:
+as illustrated below:
-.. _map_reduce:
+ - simple figure here showing map and reduce in a graphical form -
-.. figure:: ../_static/figures/map_reduce.png
- :width: 600px
- :align: center
-
- Figure :numref:`map_reduce_figure`. Illustration of map-reduce concepts.
- The boxes labeled "Transformation" are map operators and those
- labeled "Action" are reduce operators. Both are alternative terms used
- by some sources for the same concepts. The top boxes illustrate the
- overall workflow as four steps. The middle section shows how that
- would be fragmented and handled as distinct tasks by Spark (Dask is similar).
- The lower set of boxes illustrate how other component of MsPASS interact
- with Spark/Dask.
-
-We expand on the map/transformation and reduce/action operators in subsection
-below.
+We expand on each of these constructs below.
The map operator
--------------------
@@ -165,10 +151,10 @@ the application of a simple filter in dask is:
.. code-block:: python
# Assume dbhandle is set as a Database class as above
- cursor = dbhandle.wf_TimeSeries.find({})
- d_in = read_distributed_data(dbhandle,cursor)
- d_out = d_in.map(signals.filter, "bandpass", freqmin=1, freqmax=5, object_history=True, alg_id='0')
- d_compute = d_out.compute()
+ cursor=dbhandle.wf_TimeSeries.find({})
+ d_in=read_distributed_data(dbhandle,cursor)
+ d_out=d_in.map(signals.filter, "bandpass", freqmin=1, freqmax=5, object_history=True, alg_id='0')
+ d_compute=d_out.compute()
This example applies the obspy default bandpass filter to all data
stored in the wf_TimeSeries collection for the database to which dbhandle
@@ -179,10 +165,9 @@ output in the created (new) bag *d_out*. The last line is way you tell dask t
"go" (i.e. proceed with the calculations) and store the computed result in the *d_compute*.
The idea and reasons for the concept of of "lazy" or "delayed"
operation is discussed at length in various sources on dask (and Spark).
-We refer the reader to online sources easily found by searching for
-the keyword for more on this general topic.
+We refer the reader to the online Dask and Spark documentation for more on this general topic.
The final output, which we chose above to give a new symbol name
-of :code:`d_compute`, is a list containing the processed data.
+of :code:`d_compute`, is a list containing the processed data.
The same construct in Spark, unfortunately, requires a different set of
constructs for two reasons: (1) pyspark demands a functional
@@ -205,7 +190,7 @@ That operator is more or less a constructor for the container that Spark
calls an RDD that is assigned the symbol d_out in the example above.
The following line, which from a programming perspective is a call to the map method of the RDD we call
d_out, uses the functional programming construct of a lambda function.
-This tutorial in `realpython.com `_
+This tutorial in `realpython.com `_
and `this one `_ by w3schools.com
are good starting points.
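+
+As a concrete illustration of that construct, the following is a minimal
+sketch, not code from the MsPASS examples: it assumes :code:`sc` is an
+initialized pyspark :code:`SparkContext`, :code:`d_list` is a python list
+of TimeSeries objects already in memory, and :code:`signals` is the MsPASS
+signals module imported as in the dask example above.
+
+.. code-block:: python
+
+    # build an RDD from an in-memory list of TimeSeries objects
+    d_out = sc.parallelize(d_list)
+    # apply the same bandpass filter as the dask example, wrapped in a lambda
+    d_out = d_out.map(lambda d: signals.filter(d, "bandpass", freqmin=1, freqmax=5))
+    # collect triggers the lazy computation and returns a python list
+    d_compute = d_out.collect()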
@@ -232,10 +217,10 @@ stacking all the members of an ensemble:
.. code-block:: python
- ensemble = db.read_ensemble_data(cursor)
- stack = TimeSeries(d.member[0])
- for i in range(len(d.member) - 1):
- stack += ensemble.member[i + 1]
+    ensemble = db.read_ensemble_data(cursor)
+    stack = TimeSeries(ensemble.member[0])
+    for i in range(len(ensemble.member) - 1):
+        stack += ensemble.member[i + 1]
That code is pretty simple because the += operator is defined for the TimeSeries
class and handles time mismatches. It is not robust for several reasons and
@@ -245,7 +230,8 @@ single result stored with the symbol :code:`stack`.
We will get to the rules that constrain `Reduce` operators in a moment, but
it might be more helpful to you as a user to see how that algorithm
-translates into dask/spark. MsPASS has a parallel :py:func:`stack ` algorithm.
+translates into dask/spark. MsPASS has a parallel stack algorithm found
+`here`_.
It is used in a parallel context as follows for dask:
.. code-block:: python
@@ -258,8 +244,8 @@ For spark the syntax is identical but the name of the method changes to reduce:
res = rdd.reduce(lambda a, b: stack(a, b))
-The :py:func:`stack ` symbol refers to a python function that is actually quite simple. You can view
-the source code `here `_.
+The :code:`stack` symbol refers to a python function that is actually quite simple. You can view
+the source code `here`_.
It is simple because most of the complexity is hidden behind the +=
symbol that invokes that operation in C++ (`TimeSeries::operator+=` for anyone
familiar with C++) to add the right hand side to the left hand side of
@@ -275,7 +261,7 @@ which is a generic wrapper to adapt any suitable reduce function to MsPASS.
The final issue we need to cover in this section is what exactly is meant
by the phrase "any suitable reduce function" at the end of the previous paragraph?
To mesh with the reduce framework used by spark and dask a function has
-to satisfy `the following rules `_
+to satisfy `the following rules`_:
1. The first two arguments (a and b symbols in the example above)
must define two instances of the same type
@@ -342,17 +328,16 @@ overhead is relatively small unless the execution time for
processing is trivial.
For more information, the dask documentation found
-`here `_ is a good
-starting point..
+`here`_ is a good
+starting point.
Examples:
~~~~~~~~~~~~~
-
Atomic Data Example
-------------------------------
The simplest workflow is one that works only with atomic
data (i.e. TimeSeries or Seismogram objects). The example
-in the Data Model section above is of this type.
+in the Data Model section above is of this type.
The following fragment is similar with a few additional processing steps.
It reads all data indexed in the data base as Seismogram objects,
runs a demean operator,
@@ -362,32 +347,21 @@ using the data start time, and then saves the results.
.. code-block:: python
- cursor = db.wf_Seismogram.find({})
+ cursor=db.wf_Seismogram.find({})
# read -> detrend -> filter -> window
# example uses dask scheduler
data = read_distributed_data(db, cursor)
data = data.map(signals.detrend,'demean')
- data = data.map(signals.filter,"bandpass", freqmin=0.01, freqmax=2.0)
+ data = data.map(signals.filter,"bandpass",freqmin=0.01,freqmax=2.0)
# windowing is relative to start time. 300 s window starting at d.t0+200
- data = data.map(lambda d : WindowData(d, 200.0, 500.0, t0shift=d.t0))
- res = data.map(db.save_data,collection="wf_Seismogram")
+ data = data.map(lambda d : WindowData(d,200.0,500.0,t0shift=d.t0))
data_out = data.compute()
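+
+The description above ends with a save of the results.  One way to add
+that step is to replace the final :code:`compute` call with
+:code:`write_distributed_data`, which saves the data and initiates the
+lazy computations in one step.  A minimal sketch (the :code:`data_tag`
+value is arbitrary and only illustrative):
+
+.. code-block:: python
+
+    # saves the windowed Seismogram objects and triggers the computation;
+    # returns a list of ObjectIds of the saved wf documents
+    wfids = write_distributed_data(data,
+                                   collection='wf_Seismogram',
+                                   data_tag='atomic_example',
+                                   )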
Ensemble Example
----------------------
-This example shows a common construct used to build ensembles as the
-working data object. In this case the workflow is working with what
-in reflection processing would be called a common-source-gather.
-It uses a construct that exploits MongoDB's functionality for selecting
-data. It also demonstrates how a map operator can have drastically
-different inputs than outputs. In this case, we first assemble
-a list of sources to be process defined by the :code:`source_id` attribute.
-This example uses a MongoDB incantation to get a list of unique source_id values.
-It then converts the list to a dask bag (:code:`bag.from_sequence` line).
-We then apply the custom function defined at the top of this code block to
-query MongoDB and return a ensemble of all data with that particular source id.
-It then applies a set of signal processing algorithms similar to above noting
-how the MsPASS functions automatically handle the type switch.
+This example needs to use a function to build a query, put the query
+in a map call, and then run an ensemble process.
+Here is an untested prototype for this manual:
.. code:: python
@@ -397,15 +371,211 @@ how the MsPASS functions automatically handle the type switch.
# note with logic of this use we don't need to test for
# no matches because distinct returns only not null source_id values dbcol
cursor = dbcol.find(query)
- ensemble = db.read_ensemble(db, collection=collection)
+    ensemble = db.read_ensemble_data(cursor, collection=collection)
return ensemble
dbcol = db.wf_Seismogram
srcidlist = db.wf_Seismogram.distinct("source_id")
data = dask.bag.from_sequence(srcidlist)
- data = data.map(lambda srcid : read_common_source_gather(db, "wf_Seismogram", srcid))
- data = data.map(signals.detrend, 'demean')
- data = data.map(signals.filter, "bandpass", freqmin=0.01, freqmax=2.0)
+ data = data.map(lambda srcid : read_common_source_gather(db,"wf_Seismogram",srcid))
+ data = data.map(signals.detrend,'demean')
+ data = data.map(signals.filter,"bandpass",freqmin=0.01,freqmax=2.0)
# windowing is relative to start time. 300 s window starting at d.t0+200
- data = data.map(lambda d : WindowData(d, 200.0, 500.0, t0shift=d.t0))
+ data = data.map(lambda d : WindowData(d,200.0,500.0,t0shift=d.t0))
data_out = data.compute()
+
+New Organization for discussion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Cluster fundamentals
+~~~~~~~~~~~~~~~~~~~~~~~
+Overview of what one has to deal with to configure a parallel system
+in a distributed cluster versus a multicore workstation. Here are things
+I can think of that we need to discuss:
+
+- batch Schedulers
+- node-to-node communications
+- containers in a distributed environment
+- to shard or not to shard, that is the question
+- io performance issues and choices (relates to file system related configuration)
+
+Configuration
+~~~~~~~~~~~~~~~
+subsections for each of the above topics centered on example.
+
+I think we should reorganize the script to have related
+setups grouped by the categories we choose for this
+user manual section (as much as possible - there may
+be some order dependence)
+
+Start of old section
+~~~~~~~~~~~~~~~~~~~~~~
+Configuration
+~~~~~~~~~~~~~~~~~~
+Overview
+------------
+Some configuration will be needed to run MsPASS in an HPC system or
+a departmental cluster. The reason is that the
+environment of an HPC cluster has numerous complications not found on a
+desktop system. The example we give
+here is what we use for testing the system on Stampede2 at TACC.
+This section can be thought of as a lengthy explanation centered on
+the example in our github page for configuring MsPASS in a
+large, distributed memory system like TACC's Stampede2.
+To read this page we recommend you open a second window or tab in
+your web browser showing the current file in the mspass source code
+directory called :code:`scripts/tacc_examples/distributed_node.sh`.
+The link to that file, which you can view in your web browser, is
+`here`__.
+We note there is an additional example there for running MsPASS
+on a single node at TACC called :code:`scripts/tacc_examples/single_node.sh`
+you can access directly
+`here`__.
+The single node setup is useful for testing and may help your understanding
+of what is needed by being much simpler. We do not discuss that
+example further here, however, because a primary purpose for using
+MsPASS is processing data in a large HPC cluster like TACC.
+
+next para needs to say this is a shell script and the sections below
+are grouped by functional issue, then list them (singularity, mongodb, and ?)
+
+
+Workload Manager Setup
+-------------------------
+Stampede2 uses the workload manager software called :code:`Slurm`
+and the associated command keyword :code:`SBATCH`. If your
+system does not have Slurm there will be something similar
+(notably Moab or Torque) that
+you will need to substitute. Perhaps obvious but things like
+file system configuration will need changes to match your local environment.
+
+:code:`Slurm` is used as a batch control system to schedule a "batch" job on
+a large cluster like Stampede2. Batch jobs are submitted to be run on
+compute nodes by submitting a file with the command line tool called :code:`sbatch`.
+The submitted file is expected to be a unix shell script that runs
+your "batch job". To be run under :code:`Slurm` the
+shell script normally defines a set of run configuration parameters
+in the first few lines of the script. Here is a typical example:
+
+.. code-block:: bash
+
+ #!/bin/bash
+    #SBATCH -J mspass           # Job name - change as appropriate
+ #SBATCH -o mspass.o%j # Name of stdout output file redirection
+ #SBATCH -p normal # Queue (partition) name
+ #SBATCH -N 2 # Total # of nodes requested (2 for this example)
+ #SBATCH -n 2 # Total # of mpi tasks
+ #SBATCH -t 02:00:00 # Run time (hh:mm:ss)
+
+This example requests 2 nodes (-N 2) for a run time of 2 hours (-t line) submitted
+to TACC's "normal" queue (-p normal). Note the :code:`Slurm` configuration parameters
+are preceded by the keyword :code:`#SBATCH`. The lines begin with the "#"
+symbol, which the unix shell treats as a comment. That is done for a
+variety of reasons, but one important practical one is that it lets you test
+the syntax of a script on a head node without having to submit the full job.
+
+MsPASS was designed to be run in a container. For a workstation environment
+we assume the container system being used is docker. Running
+MsPASS with docker is described on
+`this wiki page`__.
+All HPC systems we know of have a docker compatible system called
+:code:`singularity`. Singularity can be thought of as docker for a large
+HPC cluster. The most important feature of singularity for you as a user
+is that it uses exactly the same container file as docker; i.e. you "pull" the
+docker container and singularity uses it in a very similar fashion to
+the way it is used by docker, as follows:
+
+.. code-block:: bash
+
+    singularity build mspass.simg docker://wangyinz/mspass
+
+For more about running MsPASS with singularity consult our
+wiki page found
+`here`__.
+Since our examples here were constructed on TACC's Stampede2 you may also
+find it useful to read their page on using singularity found
+`here`__.
+
+There is a single node mode you may want to run for testing.
+You can find an example of how to configure Stampede2 to run on a single
+node in the MsPASS scripts/tacc_examples directory found on github
+`here`__.
+We focus in this manual on configuration for a production run using multiple
+nodes, as that is a primary purpose of using MsPASS for data processing.
+The example we give here is the distributed (multiple node) configuration.
+
+There are two ways to deploy MsPASS on Stampede2: single node mode and distributed mode.
+You can refer to the two job scripts in the /scripts/tacc_examples folder of our source code. Here we
+introduce the parts and elements common to both scripts.
+
+In both modes, we must specify the working directory and the place where we store the docker image. That is why
+these two lines are in the job scripts:
+
+.. code-block:: bash
+
+ # working directory
+ WORK_DIR=$SCRATCH/mspass/single_workdir
+ # directory where contains docker image
+ MSPASS_CONTAINER=$WORK2/mspass/mspass_latest.sif
+
+The paths for these two variables can be changed to match your setup and where you want to store the image.
+It does not matter if the directory does not yet exist; the job script will create one if needed.
+
+Then we define the SING_COM variable to simplify the workflow in our job script. On Stampede2, as on most HPC
+systems, we use Singularity to manage and run the docker images. There are many options for starting a container
+with singularity; refer to the Singularity documentation for details. For those who are not familiar with Singularity,
+here is a good `source`_
+to get started with.
+
+.. code-block:: bash
+
+ # command that start the container
+ module load tacc-singularity
+ SING_COM="singularity run $MSPASS_CONTAINER"
+
+Then we create a login port based on the hostname of the primary compute node we have requested. The
+port number is created in a way that is guaranteed to be unique and available on your own machine. After the
+job script executes, you will get the output file, from which you can get the url for accessing the
+notebook running on your compute node. However, from your own computer, you should use this login port to
+access it instead of 8888, which typically is the port used by a jupyter notebook, because
+we reserve that port for transmitting all the data through the reverse tunnel.
+
+.. code-block:: bash
+
+ NODE_HOSTNAME=`hostname -s`
+ LOGIN_PORT=`echo $NODE_HOSTNAME | perl -ne 'print (($2+1).$3.$1) if /c\d(\d\d)-(\d)(\d\d)/;'`
+
+Next, we create a reverse tunnel port to the login nodes and make one tunnel for each login node so the user can
+connect to stampede.tacc directly. A reverse ssh tunnel is a technique that lets your own machine connect to
+machines in the private TACC network.
+
+.. code-block:: bash
+
+ for i in `seq 4`; do
+ ssh -q -f -g -N -R $LOGIN_PORT:$NODE_HOSTNAME:8888 login$i
+ done
+
+For single node mode, the last thing we need to do is to start a container using the command we defined
+before:
+
+.. code-block:: bash
+
+ DB_PATH='scratch'
+ SINGULARITYENV_MSPASS_DB_PATH=$DB_PATH \
+ SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR $SING_COM
+
+Here we set the environment variables inside the container using the SINGULARITYENV_XXX syntactic sugar.
+For more information, you can view the usage `here`_.
+We define and set different variables in the different containers we start because our start-mspass.sh
+defines different behavior for each *MSPASS_ROLE*, so that each role executes the bash
+script we define in start-mspass.sh. Though it looks complicated and hard to extend, this is probably
+the best approach available in the stampede2 environment. In the above code snippet, we start the container
+in the all-in-one way.
+
+There are other ways to start a container, depending on what the deployment needs. You can
+find more in the distributed_node.sh job script. For example, we start a scheduler, a dbmanager and a front-end
+jupyter notebook on our primary compute node and start a spark/dask worker and a MongoDB shard replica on each
+of our worker nodes. The environment variables needed are also different; you can find
+the corresponding usage in the start-mspass.sh script in our source code. We hide the implementation detail and
+encapsulate it inside the Dockerfile. One more thing: when we specify the number of nodes in our sbatch options,
+for example 4, stampede2 reserves 4 compute nodes for us, and we use 1 node as our primary
+compute node and 3 nodes as our worker nodes. Therefore, if you need 4 worker nodes, you should specify 5
+as your sbatch option for nodes.
diff --git a/docs/source/user_manual/schema_choices.rst b/docs/source/user_manual/schema_choices.rst
new file mode 100644
index 000000000..cc4e96900
--- /dev/null
+++ b/docs/source/user_manual/schema_choices.rst
@@ -0,0 +1,41 @@
+.. _schema_choices:
+
+What database schema should I use?
+=======================================
+
+This is another question with endmembers. At one end is a research workflow
+that is highly experimental and subject to large changes. At the other end
+is a production workflow you are constructing to run on a large dataset
+and a cluster with many processors.
+
+For the former (research code), the answer is usually
+none. The baggage of a schema mostly gets in the way.
+The best default for that situation is the one called `mspass-lite`.
+It differs from the global default, `mspass`, in that it imposes no
+read-only restriction that can cause confusing errors in some situations.
+For this situation if you aim to use the database with the name tag
+`mydb` use the following as the top box of your jupyter notebook file
+defining your workflow:
+
+.. code-block:: python
+
+ from mspasspy.db.client import DBClient
+ dbclient = DBClient()
+ db = dbclient.get_database("mydb",schema="mspass_lite.yaml")
+
+For a more tuned application, you should first consider a related
+question: Does my workflow need schema enforcement? If the answer is no,
+just use the `mspass-lite` schema as illustrated above.
+In most cases, however, it is our experience that for large datasets
+schema constraints can provide an important way to avoid mysterious errors.
+For the case with a yes answer there are two options. For most
+applications the standard `mspass` schema should be sufficient. It provides
+base functionality for source and receiver metadata. It also defines
+the `wf_miniseed` collection that is essential when the processing
+sequence uses miniseed data as the parent data. If the standard schema does not
+work for you, look at the `data/yaml` directory of the mspass source code
+tree for alternatives you may find there. If none of those work for you
+it is relatively easy to extend any of the existing schema files.
+See related sections in this User's Manual for guidance and the docstring
+for the class :py:class:`DatabaseSchema` and
+:py:class:`MetadataSchema`.
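+
+If you settle on one of the alternative schema files in `data/yaml`
+(or one you extend yourself), selecting it follows the same pattern
+shown above.  The file name below is hypothetical; substitute the name
+of the yaml schema file you actually want to use:
+
+.. code-block:: python
+
+    from mspasspy.db.client import DBClient
+
+    dbclient = DBClient()
+    # "myschema.yaml" is a placeholder for your own schema file name
+    db = dbclient.get_database("mydb", schema="myschema.yaml")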
diff --git a/docs/source/user_manual/signal_to_noise.rst b/docs/source/user_manual/signal_to_noise.rst
new file mode 100644
index 000000000..6cc1481ec
--- /dev/null
+++ b/docs/source/user_manual/signal_to_noise.rst
@@ -0,0 +1,210 @@
+.. _signal_to_noise:
+
+Signal to Noise Ratio Estimation
+===================================
+Concepts
+_____________
+When analyzing passive array data most seismology analysis focuses
+on "transient signals", the wiggly lines that are the core
+data of seismology. One of the most basic quantitative measures of
+"goodness" of a particular signal is what is universally called
+signal-to-noise ratio. Robust estimates of signal-to-noise ratio are
+an essential component of any workflow that doesn't resort to brute
+force interactive trace editing by a human. A fundamental problem we
+recognized in building this component of MsPASS is that there is no
+standard definition of how to measure signal-to-noise ratio. The reason is
+that not all metrics perform equally on all signals. For that reason we
+decided it was necessary to supply a range of options for signal-to-noise
+estimation to allow anyone to develop a "robust" set of automated
+criteria for data editing.
+
+It is useful to first give a generic definition of signal-to-noise
+ratio. Sections below describe specific metrics based on the following
+generic formula:
+
+.. math::
+
+ snr = \frac{signal\_metric}{noise\_metric}
+
+i.e. we divide some measure of the signal amplitude by some measure of
+noise amplitude. There are complexities in defining what "the signal"
+and "noise" mean and what measure is used for each. We can address the
+first in a common way for all MsPASS implementations. Choices for the
+second are discussed below.
+
+All MsPASS algorithms are driven by definitions of the time windows that
+define which parts of an input datum are signal and noise. All current
+implementations are driven by a simple data object we call a `TimeWindow`.
+You can read the docstring but everything useful about this object
+can be gleaned from running this code fragment in the mspass container:
+
+.. code-block:: python
+
+ from mspasspy.ccore.algorithms.basic import TimeWindow
+ win = TimeWindow(-5.0,20.0)
+ print("Time window range: ",win.start,"<=t<=",win.end)
+
+The time window here would be appropriate if the datum it is used for
+had been converted to "relative time"
+(see section :ref:`time_standard_constraints`).
+For example, it could define a signal window for data shifted so that
+zero is a measured or predicted phase arrival time.
+
+All algorithms to estimate *snr* require `TimeWindow` objects defining what sections of
+data should be used for computing the signal and noise metrics that
+form the ratio. The metric applied to each window may or may not
+be the same.
+
+Time Window Amplitude Metrics
+______________________________
+
+The most common measure of amplitude is the root mean square (rms) value,
+which is a scaled version of the L2 vector norm:
+
+.. math::
+
+    \| \mathbf{x} \|_{rms} = \sqrt{\frac{\sum\limits_{i=n_s}^{n_e} x_i^2 }{n_e - n_s -1}}
+
+where :math:`n_s` is the data index for the start time of a time window and
+:math:`n_e` is the index of the end time. In the snr docstrings this metric
+is given the tag `rms`.
+
+The traditional metric for most earthquake magnitude formulas is
+peak amplitude. In linear algebra language peak amplitude in a time
+window is called the :math:`L_\infty` norm defined as:
+
+.. math::
+
+    \| \mathbf{x} \|_{\infty} = \max ( \mid x_{n_s}\mid , \mid x_{n_s + 1}\mid , \cdots , \mid x_{n_e}\mid )
+
+For those unfamiliar with this jargon it is little more than
+a mathematical statement that we measure the largest sample amplitude.
+In the snr docstrings this metric is given the tag `peak`.
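+
+For readers who find code clearer than norms, the following numpy
+fragment evaluates both of these metrics on a window of samples.  It is
+an illustration of the definitions only, not the MsPASS implementation,
+and the exact scaling MsPASS applies to the rms value may differ slightly:
+
+.. code-block:: python
+
+    import numpy as np
+
+    # x holds the samples between the window start and end indices
+    x = np.random.randn(1000)   # stand-in for a real data window
+    # rms: scaled L2 norm of the window
+    rms_amplitude = np.sqrt(np.mean(x ** 2))
+    # peak: L-infinity norm (largest absolute sample value)
+    peak_amplitude = np.max(np.abs(x))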
+
+MsPASS also provides a facility to calculate a few less common metrics based
+on ranked (sorted) lists of amplitudes. All create a vector of
+:math:`N=n_e - n_s + 1` absolute values of each sample in the time window
+(:math:`\mid x_i \mid`) and sort the values in increasing order.
+The MsPASS snr module supports two metrics based on a fully sorted list
+of amplitudes:
+
+1. The tag `median` is used to compute the standard quantity called the
+   median, which is defined as the center (50% point) of the sorted list of amplitudes.
+2. The generic tag `perc` is used for a family of metrics using sorted
+ amplitudes. Users familiar with seismic unix will recognize this
+ keyword as a common argument for plot scaling in that package. MsPASS,
+ in fact, adapted the perc metric concept from seismic unix.
+ If used it always requires specifying a "percentage"
+   describing the level it should define. e.g. the default of 95 means
+ return the amplitude for which 95% of the samples have a lower amplitude.
+ Similarly, perc of 50 would be the same as the median and 100 would
+ be the same as the peak amplitude.
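+
+The following numpy fragment illustrates how these ranked-amplitude
+metrics are defined.  Again this is a sketch of the definitions only,
+not the MsPASS implementation:
+
+.. code-block:: python
+
+    import numpy as np
+
+    x = np.random.randn(1000)        # stand-in for a real data window
+    amplitudes = np.sort(np.abs(x))  # ranked absolute amplitudes
+    median_amplitude = np.median(amplitudes)
+    # perc=95 returns the amplitude that 95% of the samples fall below
+    perc_amplitude = np.percentile(amplitudes, 95.0)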
+
+Low-level SNR Estimation
+________________________________
+The basic function to compute signal-to-noise ratios has the
+simple name :py:func:`snr`. The docstring
+describes the detailed use, but you will normally need to specify or
+default four key parameters: (1) signal time window, (2) noise time window,
+(3) metric to use for the signal window, and (4) metric to use for the
+noise window. The metric choice is made by using the tag strings
+defined above: `rms`, `peak`, `mad`, or `perc`. Note the `perc`
+option level has a default (95) but can be set to any level that makes sense.
+If `perc` is used for both signal and noise the same level will be used
+for both. The following is a typical usage example. The example uses
+the `peak` metric for the signal and rms for noise and returns the estimate
+in the symbol `dsnr`.
+
+.. code-block:: python
+
+ swin = TimeWindow(-2.0,10.0)
+ nwin = TimeWindow(-120.0,-5.0)
+ # assume d is a TimeSeries object defined above
+ dsnr = snr(d,noise_window=nwin,signal_window=swin,
+ noise_metric="rms",signal_metric="peak")
+
+Broadband SNR Estimation
+____________________________
+MsPASS implements a set of more elaborate signal-to-noise metrics
+designed for quality control editing of modern broadband data.
+An issue not universally appreciated about modern passive array
+data recorded with broadband instruments is that traditional measures of
+signal-to-noise ratio are usually meaningless if applied to raw data.
+From a signal processing perspective the fundamental reason is that
+broadband seismic noise and earthquake signals are both strongly
+"colored". Broadband noise is always dominated by microseisms and/or
+cultural noise at frequencies above a few Hz. Earthquakes have
+characteristic spectra. The "colors" earthquakes generate are commonly
+used, in fact, to measure source properties. As a result most earthquake records
+have wildly variable signal-to-noise ratios across the recording
+band of modern instruments. MsPASS addresses this issue through
+a novel implementation in the function
+:py:func:`FD_snr_estimator`
+and two higher level functions that use it internally called
+:py:func:`arrival_snr`
+and :py:func:`arrival_snr_QC`.
+The focus of this section is the algorithm used in
+:py:func:`FD_snr_estimator`.
+The others should be thought of as convenient wrappers to run
+`FD_snr_estimator`.
+
+The "FD" in the `FD_snr_estimator` function name is short for "Frequency Domain",
+emphasizing the key idea of the algorithm. Specifically,
+the function computes the
+power spectra of the signal and noise windows that are used to compute a series of
+broadband snr metrics you can use to sort out signals worth processing further.
+The algorithm makes a fundamental assumption that the optimal frequency band
+of a given signal can be defined by a high and low corner frequency.
+In marginal snr conditions that assumption can easily be wrong.
+A typical example is teleseismic P waves that have signals in the
+traditional short period and long period bands, but where the signal level does
+not exceed the microseism peak. Nonetheless, the single passband assumption
+is an obvious first order approach we have found useful for automated
+data winnowing.
+
+The function attempts to determine the data passband by a process illustrated in the
+Figure below. The algorithm is a bidirectional search. The lower passband edge
+initiates at the highest frequency distinguishable from zero as
+defined by the time-bandwidth product (`tbp` input parameter). The upper
+passband edge search is initiated either from a user specified frequency
+or the default of 80% of Nyquist. Both searches increment/decrement through
+frequency bins computing the ratio of the signal power spectral density to
+the estimated noise power spectral density. An input parameter with the
+tag `band_cutoff_snr` defines the minimum snr the algorithm assumes
+indicates a signal is present. A special feature of the search, made possible
+by the use of multitaper spectra, uses the `tbp` parameter.
+A property of multitaper spectra is that they are smoothed at a scale of the
+number of frequency bins defined by the time-bandwidth product through
+the formula:
+
+.. math::
+x = need this formula
+
+A key point is that increasing the time-bandwidth product causes the
+spectral estimates to become progressively smoother. For this application
+we found using `tbp=4` with 8 tapers or `tbp=5` with 10 tapers
+are good choices as they produce
+smoother spectra that yield more stable results.
+Readers unfamiliar with
+multitaper spectral methods may find it useful to begin with
+the `matlab help file in their multitaper estimator `__.
+There is also a large literature applying the technique to a range of
+practical problems easily found by any library search engine.
+We chose the multitaper method because its reliable smoothing
+properties produce a more stable estimator for this
+application.
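+
+To make the band search concrete, the following is a simplified sketch
+of the edge search described above.  It operates on precomputed signal
+and noise power spectra stored as numpy arrays on a common frequency
+grid; it ignores the multitaper smoothing details and is not the
+:py:func:`FD_snr_estimator` implementation:
+
+.. code-block:: python
+
+    import numpy as np
+
+    def simple_band_search(freqs, signal_psd, noise_psd,
+                           band_cutoff_snr=2.0, f_start_high=None):
+        """Simplified illustration of the bidirectional band edge search."""
+        snr = signal_psd / noise_psd
+        if f_start_high is None:
+            f_start_high = 0.8 * freqs[-1]   # default: 80% of Nyquist
+        # lower edge: scan upward; the real algorithm starts at the lowest
+        # frequency resolvable for the chosen time-bandwidth product
+        f_low = None
+        for i in range(len(freqs)):
+            if snr[i] >= band_cutoff_snr:
+                f_low = freqs[i]
+                break
+        # upper edge: scan downward from the starting high frequency
+        f_high = None
+        for i in range(len(freqs) - 1, -1, -1):
+            if freqs[i] <= f_start_high and snr[i] >= band_cutoff_snr:
+                f_high = freqs[i]
+                break
+        return f_low, f_high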
+
+
+need to note window normalization as a postscript
+
+Practical Advice about Amplitude Metrics
+____________________________________________
+THIS BELONGS LATER
+:math:`\| \mathbf{x} \|_{rms}` is
+the stock measure of noise amplitude.
+It is a useful measure of signal
+amplitude only if the signal window is restricted to the highest amplitude
+of the signal. In particular, if a window is defined relative to an arrival
+time predicted from a model, the signal start time should be well before any
+time shift from travel time errors. That will, however, increase the
+number of points