Commit

Merge pull request #459 from mspass-team/documentation_practical_advice
Documentation Update and Changes for v2 release
wangyinz authored Mar 1, 2024
2 parents 9446d54 + 0e1cb5a commit 6a4613f
Showing 27 changed files with 5,447 additions and 365 deletions.
Binary file added docs/source/_static/figures/MapDAGFigure.png
Binary file added docs/source/_static/figures/MapProcessing.gif
Binary file added docs/source/_static/figures/ReduceFigure.png
1 change: 1 addition & 0 deletions docs/source/cxx_api/index.rst
Expand Up @@ -5,6 +5,7 @@ The MsPASS C++ API's key components are the following classes.

.. toctree::

../cxx_api/mspass
../cxx_api/mspass.utility.Metadata
../cxx_api/mspass.seismic.TimeSeries
../cxx_api/mspass.seismic.Seismogram
Expand Down
5 changes: 3 additions & 2 deletions docs/source/cxx_api/mspass.rst
Expand Up @@ -7,11 +7,12 @@ MsPASS C++ API

<script language="javascript" type="text/javascript">
function resizeIframe(obj) {
obj.style.height = obj.contentWindow.document.body.scrollHeight + 'px';
obj.style.height = "500px"; // Temporarily reset height for recalculation
obj.style.height = obj.contentWindow.document.documentElement.scrollHeight + 'px';
}
</script>

<iframe src="../_static/html/hierarchy.html" width="100%" marginheight="0" frameborder="0" scrolling="no" id="iframe" onload='javascript:resizeIframe(this);'></iframe>
<iframe src="../_static/html/hierarchy.html" width="100%" height="600px" marginheight="0" frameborder="0" scrolling="yes" id="iframe" onload="javascript:resizeIframe(this);"></iframe>

.. .. doxygennamespace:: mspass
.. :members:
Expand Down
33 changes: 28 additions & 5 deletions docs/source/getting_started/run_mspass_with_docker.rst
Expand Up @@ -6,15 +6,18 @@ Run MsPASS with Docker
Prerequisites
-------------

Docker is required for users to run MsPASS on desktop systems.
It is the piece of software you will use to run and manage
Docker is required in normal use to run MsPASS on desktop systems.
The alternative is a more complicated installation of the components
built from source as described on
`this wiki page <https://github.com/mspass-team/mspass/wiki/Compiling-MsPASS-from-source-code>`__.
Docker is the piece of software you will use to run and manage
any containers on your desktop system.

Docker is well-supported on all current desktop operating systems and
has simple install procedures described in detail in the
product's documentation found `here <https://docs.docker.com/get-docker/>`__.
The software can currently be downloaded at no cost, but you must have
administrative priveleges to install the software.
administrative privileges to install the software.
The remainder of this page assumes you have successfully installed
docker. For Windows or Apple users it may be convenient to launch the
"docker desktop" as an alternative to command line tools.
Expand Down Expand Up @@ -85,8 +88,8 @@ to save your results to your local system. Without the
``--mount`` incantation any results
you produce in a run will disappear when the container exits.

An useful, alternative way to launch docker on a linux or MacOS system
is use the shell ``cd`` command in the terminal you are using to make
A useful, alternative way to launch docker on a linux or MacOS system
is to use the shell ``cd`` command in the terminal you are using to make
your project directory the "current directory". Then you can
cut-and-paste the following variation of the above into that terminal
window and */home* in the container will be mapped to your
Expand Down Expand Up @@ -159,6 +162,26 @@ you are doing before you alter any files with bash commands in this
terminal. A more standard use is to run common monitoring commands like
``top`` to monitor memory and cpu usage by the container.

If you are using dask on a desktop, we have found many algorithms perform
badly because of a subtle issue with python and threads. That is, by
default dask uses a "thread pool" for workers with the number of threads
equal to the number of cores defined for the docker container.
Threading with python is subject to poor performance because of
something called the Global Interpreter Lock (GIL), which prevents
multithreaded python functions from running in parallel under dask. The solution
is to tell dask to run each worker task as a "process", not a thread.
(Note pyspark does this by default.) A way to do that with dask is to
launch docker with the following variant of the command above:

.. code-block::

   docker run -p 8888:8888 -e MSPASS_WORKER_ARG="--nworkers 4 --nthreads 1" --mount src=`pwd`,target=/home,type=bind mspass/mspass

where the value after ``--nworkers`` should be the number of worker processes
you want the container to run. Normally that would be the number of
cores defined for the container, which by default is less than the number of
cores of the machine running docker.
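
If you prefer to make this change from inside a running notebook rather
than at container launch, a minimal alternative sketch is the following.
The ``dask.config`` call is a standard dask API, not a MsPASS-specific
one; treat it as a suggestion rather than the only way to do this.

.. code-block:: python

   import dask

   # Ask dask to schedule tasks as separate processes so the GIL does
   # not serialize CPU-bound python functions run by the workers.
   dask.config.set(scheduler="processes")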

Finally, to exit, close any notebook windows and the Jupyter notebook
home page. You will usually need to type a `ctrl-C` in the terminal
window you used to launch MsPASS via docker.
67 changes: 51 additions & 16 deletions docs/source/index.rst
Expand Up @@ -6,51 +6,87 @@
MsPASS Documentation
====================

The Massive Parallel Analysis System for Seismologists
is an open source framework for seismic data processing
The Massive Parallel Analysis System for Seismologists
is an open source framework for seismic data processing
and management. It has three core components:

* A scalable parallel processing framework based on a
* A scalable parallel processing framework based on a
dataflow computation model.
* A NoSQL database system centered on document store.
* A container-based virtualization environment.

The system builds on the `ObsPy <http://obspy.org>`_
toolkit, with extension built on a rewrite of the
`SEISPP <http://www.indiana.edu/~pavlab/software/seispp/html/index.html>`_
package.
The system builds on the `ObsPy <http://obspy.org>`_
toolkit, with extension built on a rewrite of the
`SEISPP <http://www.indiana.edu/~pavlab/software/seispp/html/index.html>`_
package.

.. .. mdinclude:: ../../README.md
.. toctree::
:maxdepth: 1
:caption: Getting Started

getting_started/quick_start
getting_started/run_mspass_with_docker
getting_started/deploy_mspass_with_docker_compose
getting_started/deploy_mspass_on_HPC
getting_started/getting_started_overview

.. toctree::
:maxdepth: 1
:caption: User Manual
:caption: Introduction

user_manual/introduction
user_manual/data_object_design_concepts
user_manual/time_standard_constraints
user_manual/obspy_interface

.. toctree::
:maxdepth: 1
:caption: Data Management

user_manual/database_concepts
user_manual/CRUD_operations
user_manual/mongodb_and_mspass
user_manual/normalization

.. toctree::
:maxdepth: 1
:caption: Seismic Data Objects

user_manual/data_object_design_concepts
user_manual/numpy_scipy_interface
user_manual/obspy_interface
user_manual/time_standard_constraints
user_manual/processing_history_concepts
user_manual/continuous_data
user_manual/schema_choices

.. toctree::
:maxdepth: 1
:caption: Data Processing

user_manual/algorithms
user_manual/importing_data
user_manual/handling_errors
user_manual/data_editing
user_manual/header_math
user_manual/graphics
user_manual/processing_history_concepts
user_manual/parallel_processing
user_manual/normalization
user_manual/signal_to_noise
user_manual/adapting_algorithms

.. toctree::
:maxdepth: 1
:caption: System Tuning

user_manual/parallel_processing
user_manual/memory_management
user_manual/io
user_manual/parallel_io

.. toctree::
:maxdepth: 2
:caption: FAQ

user_manual/FAQ
user_manual/development_strategies


.. toctree::
Expand All @@ -60,4 +96,3 @@ package.
python_api/index
cxx_api/index
mspass_schema/mspass_schema

94 changes: 22 additions & 72 deletions docs/source/user_manual/CRUD_operations.rst
Expand Up @@ -48,7 +48,7 @@ alternative construct:
from mspasspy.db.client import Client
from mspasspy.db.database import Database
dbclient = Client()
db = Database(dbclient, 'database_name', db_schema='wf_Seismogram')
db=Database(dbclient, 'database_name', db_schema='wf_Seismogram')
If your workflow requires reading both TimeSeries and Seismogram
data, best practice (i.e. it isn't required but a good idea)
Expand All @@ -63,7 +63,7 @@ the synonymous word "save". Here we list all save methods with a brief
description of each method. Consult the docstring pages for detailed
and most up to date usage:

1. :py:meth:`save_data <mspasspy.db.database.Database.save_data>` is probably the most common method you will use. The
1. :code:`save_data` is probably the most common method you will use. The
first argument is one of the atomic objects defined in MsPASS
(Seismogram or TimeSeries) that you wish to save. Options are
described in the docstring. Here is an example usage:
Expand All @@ -84,7 +84,7 @@ and most up to date usage:
normalized collections (:code:`source`, :code:`channel`, and/or :code:`site`) with no
safety checks. We discuss additional common options in a later section.

2. :py:meth:`save_ensemble_data <mspasspy.db.database.Database.save_ensemble_data>` is similar to :code:`save_data` except the first argument
2. :code:`save_ensemble_data` is similar to :code:`save_data` except the first argument
is an Ensemble object. There are currently two of them: (1) TimeSeriesEnsemble
and (2) SeismogramEnsemble. As discussed in the section
:ref:`data_object_design_concepts` an Ensemble
Expand All @@ -101,11 +101,7 @@ and most up to date usage:
are copied verbatim to each member. If previous values existed in any
of the members they will be silently replaced by the ensemble group's version.

:py:meth:`save_ensemble_data_binary_file <mspasspy.db.database.Database.save_ensemble_data_binary_file>`
is an optimized version of save_ensemble_data. It saves all objects of the
ensemble into one file, and only opens the file once.

3. :py:meth:`save_catalog <mspasspy.db.database.Database.save_catalog>` should be viewed mostly as a convenience method to build
3. :code:`save_catalog` should be viewed mostly as a convenience method to build
the :code:`source` collection from QUAKEML data downloaded from FDSN data
centers via obspy's web services functions. :code:`save_catalog` can be
thought of as a converter that translates the contents of a QUAKEML
Expand Down Expand Up @@ -137,7 +133,7 @@ and most up to date usage:
This particular example pulls 11 large aftershocks of the 2011 Tohoku
Earthquake.

4. :py:meth:`save_inventory <mspasspy.db.database.Database.save_inventory>` is similar in concept to :code:`save_catalog`, but instead of
4. :code:`save_inventory` is similar in concept to :code:`save_catalog`, but instead of
translating data for source information it translates information to
MsPASS for station metadata. The station information problem is slightly
more complicated than the source problem because of an implementation
Expand Down Expand Up @@ -199,31 +195,6 @@ and most up to date usage:
collection that has invalid documents you will need to write a custom function to override that
behaviour or rebuild the collection as needed with web services.

5. :code:`write_distributed_data` is a parallel equivalent of :code:`save_data` and :code:`save_ensemble_data`.
MsPASS supports two parallel frameworks called SPARK and DASK.
Both abstract the concept of the parallel data set in
a container they call an RDD and Bag respectively. Both are best thought
of as a handle to the entire data set that can be passed between
processing functions. The function can be thought of as writing the entire data set
from a parallel container to storage. The input is SPARK RDD or DASK BAG of objects (TimeSeries or Seismogram), and the
output is a dataframe of metadata. From the container, it will firstly write to files distributedly
using SPARK or DASK, and then write to the database sequentially. The two parts are done in two
functions: :code:`write_files`, and :code:`write_to_db`. It returns a dataframe of metadata for
each object in the original container. The return value can be used as input for :code:`read_distributed_data`
function.

Note that the objects should be written to different files, otherwise it may overwrite each other.
dir and dfile should be stored in each object.

:code:`write_files` is the writer for writing the object to storage. Input is an object (TimeSeries/Seismogram),
output is the metadata of the original object with some more parameters added. This is
the reverse of :code:`read_files`.

:code:`write_to_db` is to save a list of atomic data objects (TimeSeries or Seismogram)
to be managed with MongoDB. It will write to the doc and to the database for every metadata of the
target mspass object. Then return a dataframe of the metadata for target mspass objects.
The function is the reverse of :code:`read_to_dataframe`.

Read
~~~~~~~

Expand All @@ -233,7 +204,7 @@ and Seismogram. There are also convenience functions for reading ensembles.
As with the save operators we discuss here the key methods, but refer the
reader to the sphinx documentation for full usage.

1. :py:meth:`read_data <mspasspy.db.database.Database.read_data>` is the core method for reading atomic data. The method has
1. :code:`read_data` is the core method for reading atomic data. The method has
one required argument. That argument is an ObjectID for the document used
to define the read operation OR a MongoDB document (python dict) that
contains the ObjectID. The ObjectID is guaranteed to provide a
Expand All @@ -244,10 +215,10 @@ reader to the sphinx documentation for full usage.

.. code-block:: python
query = {...Some MongoDB query dict entry...}
cursor = db.wf_TimeSeries.find(query) # Changed to wf_Seismogram for 3D data
for doc in cursor:
d = db.read_data(doc) # Add option collection='wf_Seismogram' for 3C reads
query={...Some MongoDB query dict entry...}
cursor=db.wf_TimeSeries.find(query) # Changed to wf_Seismogram for 3D data
for doc in cursor:
d=db.read_data(doc) # Add option collection='wf_Seismogram' for 3C reads

By default :code:`read_data` will use the waveform collection defined
in the schema defined for the handle. The default for the standard
Expand Down Expand Up @@ -312,7 +283,7 @@ reader to the sphinx documentation for full usage.
3. The "pedantic" mode is mainly of use for data export where a
type mismatch could produce invalid data required by another package.
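
For example, here is a minimal sketch of selecting a mode. It assumes
the keyword argument is named ``mode`` and accepts the strings listed
above; consult the :code:`read_data` docstring to confirm.

.. code-block:: python

   # Hypothetical sketch:  read one datum with stricter type checking.
   d = db.read_data(doc, mode="cautious")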

2. A closely related function to :code:`read_data` is :py:meth:`read_ensemble_data <mspasspy.db.database.Database.read_ensemble_data>`. Like
2. A closely related function to :code:`read_data` is :code:`read_ensemble_data`. Like
:code:`save_ensemble_data` it is mostly a loop to assemble an ensemble of
atomic data using a sequence of calls to :code:`read_data`. The sequence of
what to read is defined by arg 0. That arg must be one of two things:
Expand All @@ -337,17 +308,10 @@ reader to the sphinx documentation for full usage.
cursor = db.wf_TimeSeries.find(query)
ens = db.read_ensemble_data(cursor)
:py:meth:`read_ensemble_data_group <mspasspy.db.database.Database.read_ensemble_data_group>`
is an optimized version of :code:`save_ensemble_data`. It groups the files firstly to avoid
duplicate open for the same file. Open and close the file only when the dir or dfile change.
When multiple objects store in the same file, this function will group the files first
and collect their foffs in that file. Then open the file once, and sequentially read the data
according to the foffs. This function only supports reading from binary files.

3. A workflow that needs to read and process large data sets in
a parallel environment should use
the parallel equivalent of :code:`read_data` and :code:`read_ensemble_data` called
:py:meth:`read_distributed_data <mspasspy.db.database.Database.read_distributed_data>`. MsPASS supports two parallel frameworks called
:code:`read_distributed_data`. MsPASS supports two parallel frameworks called
SPARK and DASK. Both abstract the concept of the parallel data set in
a container they call an RDD and Bag respectively. Both are best thought
of as a handle to the entire data set that can be passed between
Expand Down Expand Up @@ -385,25 +349,6 @@ reader to the sphinx documentation for full usage.
If you are using DASK instead of SPARK you would add the optional
argument :code:`format='dask'`.
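
The following is a rough sketch of the calling pattern. The import path,
argument order, and the :code:`my_processing_function` placeholder are
assumptions for illustration only; check the :code:`read_distributed_data`
docstring before using it.

.. code-block:: python

   # Sketch only - import location and signature are assumptions.
   from mspasspy.db.database import read_distributed_data

   cursor = db.wf_TimeSeries.find({})
   # Returns a handle (dask bag here) to the entire data set
   ddata = read_distributed_data(db, cursor, format="dask")
   # my_processing_function stands in for any function on atomic data
   ddata = ddata.map(my_processing_function)
   results = ddata.compute()   # dask is lazy; compute() runs the chain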

:code:`read_distributed_data` divide the process of reading into two parts:
reading from database and reading from file, where reading from database is
done in sequence, and reading from file is done with DASK or SPARK. The two parts
are done in two functions: :code:`read_to_dataframe`, and :code:`read_files`.
The division is to avoid using database in DASK or SPARK to improve efficiency.

The input can also be a dataframe, which stores the information of the metadata.
It will read from file/gridfs according to the metadata and construct the objects.

:code:`read_to_dataframe` firstly construct a list of objects using cursor.
Then for each object, constrcut the metadata and add to the list. Finally it will
convert the list to a dataframe.

:code:`read_files` is the reader for constructing the object from storage. Firstly construct the object,
either TimeSeries or Seismogram, then read the stored data from a file or in gridfs and
loads it into the mspasspy object. It will also load history in metadata. If the object is
marked dead, it will not read and return an empty object with history. The logic of reading
is same as :code:`Database.read_data`.

Update
~~~~~~

Expand Down Expand Up @@ -475,7 +420,12 @@ In MsPASS we adopt these rules to keep delete operations under control.
We trust rules 1 and 2 require no further comment. Rule 3, however,
needs some clarification to understand how we handle deletes.
A good starting point is to look at the signature of the simple core delete
method of the Database class: :py:meth:`delete_data <mspasspy.db.database.Database.delete_data>`
method of the Database class:

.. code-block:: python

   def delete_data(self, id, remove_unreferenced_files=False,
                   clear_history=True, clear_elog=False):

As with the read methods id is the ObjectID of the wf collection document
that references the data to be deleted.
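
A minimal sketch of the typical calling pattern follows. The query is
hypothetical; the point is that each call receives the ObjectID of one
wf document, and the keyword arguments are the options shown above.

.. code-block:: python

   # Sketch only - select some wf_TimeSeries documents and delete them.
   query = {"sta": "XYZ"}   # hypothetical selection
   cursor = db.wf_TimeSeries.find(query)
   for doc in cursor:
       # clear_elog shown only to illustrate the optional arguments
       db.delete_data(doc["_id"], clear_elog=True)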
Expand Down Expand Up @@ -681,12 +631,12 @@ list of elog messages:
# This needs to be checked for correctness - done while off the grid
query = {'$def' : 'tombstone'}
cursor = db.elog.find(query)
cursor=db.elog.find(query)
for doc in cursor:
wfmd = doc['tombstone']
wfmd=doc['tombstone']
print('Error log contents for this Seismogram marked dead:',
wfmd['net'], wfmd['sta'], UTCDateTime(wfmd['startime']))
err = doc['logdata']
wfmd['net'],wfmd['sta'],UTCDateTime(wfmd['startime'])
err=doc['logdata']
for e in err:
print(e.message)
Expand Down
15 changes: 15 additions & 0 deletions docs/source/user_manual/FAQ.rst
@@ -0,0 +1,15 @@
.. _FAQ:

Frequently Asked Questions (FAQ)
=====================================

This page collects links to pages on pragmatic
topics that can, we hope, make MsPASS more approachable. The topic of each
page is used as the hyperlink text. Click on topics of interest to learn
more.

:ref:`How do I develop a new workflow from scratch? <development_strategies>`

:ref:`How does MsPASS handle continuous data? <continuous_data>`

:ref:`What database schema should I use? <schema_choices>`