Issue/18/dataset types (#24)

* Added RailDataset, to allow for strong checking of type-matching between plotters and the datasets they use
* isort
* fixing up docs
* use class's own name in generate_dataset_dict()
LSSTDESC · Feb 6, 2025 · 6c8a0ca
1 parent b5c282c
commit 6c8a0ca

Showing 49 changed files with 545 additions and 673 deletions.
diff --git a/docs/conf.py b/docs/conf.py
@@ -16,14 +16,14 @@
 import subprocess
 import sys
 import pkgutil
-import rail.projects
-import rail.plotting
-import rail.cli.rail_plot
-import rail.cli.rail_project
+import rail
+#import rail.plotting
+#import rail.cli.rail_project
+#import rail.cli.rail_plot
 
 
 sys.path.insert(0, os.path.abspath('..'))
-sys.path.insert(0, os.path.abspath('../src/rail/cli'))
+sys.path.insert(0, os.path.abspath('../src'))
 
 print(sys.path)
 
@@ -114,11 +114,6 @@
 nbsphinx_allow_errors = True
 
 
-autodoc_default_options = {
-    'special-members': '__call__',
-}
-
-
 # use type hints in autodoc
 autodoc_typehints = "description"
 
@@ -193,8 +188,10 @@ def run_apidoc(_):
     cur_dir = os.path.normpath(os.path.dirname(__file__))
     output_path = os.path.join(cur_dir, 'api')
 
-    src_path = os.path.normpath(os.path.join(os.path.dirname(__file__), '..', 'src', 'rail'))    
-    paramlist = ['--separate', '--implicit-namespaces', '-M', '-o', output_path, '-f', src_path]
+    base_path = os.path.normpath(os.path.join(os.path.dirname(__file__), '..', 'src'))
+
+    srcpath = os.path.normpath(os.path.join(base_path, 'rail'))
+    paramlist = ['--separate', '--implicit-namespaces', '--no-toc', '-M', '-o', output_path, '-f', srcpath]
     print(f"running {paramlist}")
     apidoc_main(paramlist)
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -62,8 +62,8 @@ guidance on citing RAIL and the underlying algorithms.
 
    source/contributing
    source/fix_an_issue
+   source/new_dataset
    source/new_plotter
-   source/new_data_extractor
    source/new_dataset_holder
 
 .. toctree::
@@ -73,7 +73,10 @@ guidance on citing RAIL and the underlying algorithms.
    demos
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 4
    :caption: API
 
-   api/modules
+   api/rail
+
+
+
diff --git a/docs/source/analysis_components.rst b/docs/source/analysis_components.rst
@@ -67,49 +67,49 @@ There are several sub-classes of `RailAlgorithmHolder` for different types of al
 PZAlgorithm
 -----------
 
-.. autoclass:: rail.projects. algorithm_holder.RailPZAlgorithmHolder
+.. autoclass:: rail.projects.algorithm_holder.RailPZAlgorithmHolder
     :noindex:
 
 
 Summarizer
 ----------
 
-.. autoclass:: rail.projects. algorithm_holder.RailSummarizerAlgorithmHolder
+.. autoclass:: rail.projects.algorithm_holder.RailSummarizerAlgorithmHolder
     :noindex:
 
 
 Classifier
 ----------
 
-.. autoclass:: rail.projects. algorithm_holder.RailClassificationAlgorithmHolder
+.. autoclass:: rail.projects.algorithm_holder.RailClassificationAlgorithmHolder
     :noindex:
 
 
 SpecSelection
 -------------
 
-.. autoclass:: rail.projects. algorithm_holder.RailSpecSelectionAlgorithmHolder
+.. autoclass:: rail.projects.algorithm_holder.RailSpecSelectionAlgorithmHolder
     :noindex:
 
 
 ErrorModel
 ----------
 
-.. autoclass:: rail.projects. algorithm_holder.RailErrorModelAlgorithmHolder
+.. autoclass:: rail.projects.algorithm_holder.RailErrorModelAlgorithmHolder
     :noindex:
 
 
 Reducer
 -------
 
-.. autoclass:: rail.projects. algorithm_holder.RailReducerAlgorithmHolder
+.. autoclass:: rail.projects.algorithm_holder.RailReducerAlgorithmHolder
     :noindex:
 
 
 Subsampler
 ----------
 
-.. autoclass:: rail.projects. algorithm_holder.RailSubsamplerAlgorithmHolder
+.. autoclass:: rail.projects.algorithm_holder.RailSubsamplerAlgorithmHolder
     :noindex:
 
 

diff --git a/docs/source/cli.rst b/docs/source/cli.rst
diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst
@@ -145,7 +145,7 @@ We anticipate a few types of contributions, and provide separate instructions
 for those workflows:
 
 * :ref:`Fix an Issue` in the codebase
+* :ref:`Adding a new RailDataset type`
 * :ref:`Adding a new RailPlotter` 
-* :ref:`Adding a new DataExtractor`
 * :ref:`Adding a new RailDatasetHolder`
 
diff --git a/docs/source/new_data_extractor.rst b/docs/source/new_data_extractor.rst
diff --git a/docs/source/new_dataset.rst b/docs/source/new_dataset.rst
@@ -0,0 +1,33 @@
+=============================
+Adding a new RailDataset type
+=============================
+
+Because of the variety of formats of files in RAIL, and the variety of analysis flavors
+in a ``RailProject``, it is useful to be able to define the particular types of
+datasets that are needed to make specific plots. These are implemented as subclasses of the :py:class:`rail.plotting.dataset.RailDataset` class.
+A ``RailDataset`` is intended to define the quantities needed to make a particular type of plot.
+
+
+New RailDataset Example
+-----------------------
+
+The following example has all of the required pieces of a ``RailDataset`` and almost nothing else.
+
+.. code-block:: python
+
+    class RailPZPointEstimateDataset(RailDataset):
+        """Dataset to hold a vector of p(z) point estimates and corresponding
+        true redshifts
+        """
+
+        data_types = dict(
+            truth=np.ndarray,
+            pointEstimate=np.ndarray,
+        )
+
+
+The required pieces, in the order that they appear are:
+
+#. The ``class RailPZPointEstimateDataset(RailDataset):`` line defines a class called ``RailPZPointEstimateDataset`` and specifies that it inherits from ``RailDataset``.
+
+#. The ``data_types`` dictionary defines the names and expected data types of the required data.
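
Note on the ``data_types`` contract documented above: the following is a minimal sketch of how a declared ``data_types`` mapping could be used to check a dataset before it reaches a plotter. ``SketchDataset`` and its ``validate()`` method are hypothetical stand-ins written for illustration, not the ``rail.plotting.dataset.RailDataset`` implementation, and plain lists replace ``np.ndarray`` so that the example has no dependencies.

```python
from __future__ import annotations

from typing import Any


class SketchDataset:
    """Hypothetical stand-in for RailDataset (not the real rail class)."""

    # Subclasses declare the required item names and their expected types.
    data_types: dict[str, type] = {}

    @classmethod
    def validate(cls, data: dict[str, Any]) -> None:
        """Raise if a required item is missing or has the wrong type."""
        for key, expected in cls.data_types.items():
            if key not in data:
                raise KeyError(f"{cls.__name__} is missing required item '{key}'")
            if not isinstance(data[key], expected):
                raise TypeError(
                    f"{cls.__name__} item '{key}' should be {expected.__name__}, "
                    f"got {type(data[key]).__name__}"
                )


class SketchPointEstimateDataset(SketchDataset):
    """Mirrors the documented example, with list standing in for np.ndarray."""

    data_types = dict(truth=list, pointEstimate=list)


# A well-formed dataset passes silently.
SketchPointEstimateDataset.validate(
    dict(truth=[0.1, 0.5], pointEstimate=[0.12, 0.48])
)
```

Keeping the declaration as a class attribute means a dataset type can be inspected without instantiating it, which is what makes class-level matching between plotters and datasets possible.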
diff --git a/docs/source/new_dataset_holder.rst b/docs/source/new_dataset_holder.rst
@@ -18,14 +18,11 @@ The following example has all of the required pieces of a ``RailDatasetHolder``
 
 .. code-block:: python
 
-    class RailProjectDatasetHolder(RailDatasetHolder):
+    class RailPZPointEstimateDataHolder(RailDatasetHolder):
         """Simple class for holding a dataset for plotting data that comes from a RailProject"""
 
         config_options: dict[str, StageParameter] = dict(
             name=StageParameter(str, None, fmt="%s", required=True, msg="Dataset name"),
-            extractor=StageParameter(
-                str, None, fmt="%s", required=True, msg="Dataset extractor class name"
-            ),
             project=StageParameter(
                 str, None, fmt="%s", required=True, msg="RailProject name"
             ),
@@ -45,17 +42,17 @@ The following example has all of the required pieces of a ``RailDatasetHolder``
 
         extractor_inputs: dict = {
             "project": RailProject,
-            "extractor": RailProjectDataExtractor,
             "selection": str,
             "flavor": str,
             "tag": str,
             "algo": str,
         }
 
+        output_type: type[RailDataset] = RailPZPointEstimateDataset
+
         def __init__(self, **kwargs: Any):
             RailDatasetHolder.__init__(self, **kwargs)
             self._project: RailProject | None = None
-            self._extractor: RailProjectDataExtractor | None = None
 
         def __repr__(self) -> str:
             ret_str = (
@@ -69,14 +66,9 @@ The following example has all of the required pieces of a ``RailDatasetHolder``
 
         def get_extractor_inputs(self) -> dict[str, Any]:
             if self._project is None:
-                self._project = RailDatasetFactory.get_project(self.config.project)()
-            if self._extractor is None:
-                self._extractor = RailProjectDataExtractor.create_from_dict(
-                    dict(name=self.config.name, class_name=self.config.extractor),
-                )
+                self._project = RailDatasetFactory.get_project(self.config.project).resolve()
             the_extractor_inputs = dict(
                 project=self._project,
-                extractor=self._extractor,
                 selection=self.config.selection,
                 flavor=self.config.flavor,
                 tag=self.config.tag,
@@ -85,6 +77,15 @@ The following example has all of the required pieces of a ``RailDatasetHolder``
             self._validate_extractor_inputs(**the_extractor_inputs)
             return the_extractor_inputs
 
+        def _get_data(self, **kwargs: Any) -> dict[str, Any] | None:
+            return get_pz_point_estimate_data(**kwargs)
+
+        @classmethod
+        def generate_dataset_dict(
+            cls,
+            **kwargs: Any,
+        ) -> list[dict[str, Any]]:
+
 
 The required pieces, in the order that they appear are:
 
@@ -94,8 +95,19 @@ The required pieces, in the order that they appear are:
 
 #. The ``extractor_inputs = [('input', PqHandle)]`` and ``outputs = [('output', PqHandle)]``  define the inputs that will be based to the 
 
+#. The ``output_type: type[RailDataset] = RailPZPointEstimateDataset``
+   line specifies that this class will return a
+   RailPZPointEstimateDataset dataset.
+
 #. The ``__init__`` method does any class-specific initialization, in this case defining that this class will store and project and extractor 
 
 #. The ``__repr__`` method is optional, here it gives a useful representation of the class
 
-#. The ``get_extractor_inputs()`` method does the actual work, note that it doesn't take any arguments, that it uses the factories to find the helper objects and passes algo it's configuration and validates it's outputs
+#. The ``get_extractor_inputs()`` method does the first part of the actual work. Note
+   that it doesn't take any arguments, that it uses the factories to
+   find the helper objects and passes along its configuration, and
+   that it validates its outputs.
+
+#. The ``_get_data()`` method does the rest of the actual work (in this case it passes it off to a utility function, ``get_pz_point_estimate_data``, which knows how to extract data from the ``RailProject``).
+
+#. The ``generate_dataset_dict()`` class method can scan a ``RailProject`` and generate a dictionary of all the available datasets.
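
Note on ``output_type``: the commit message describes the goal as strong checking of type-matching between plotters and the datasets they use. The sketch below shows one way such a check could work, by comparing a holder's ``output_type`` with a plotter's ``input_type``. All class names and the ``is_compatible()`` helper are hypothetical and written only for illustration; the actual check in ``rail`` may be implemented differently.

```python
class SketchDataset:
    """Hypothetical base class standing in for RailDataset."""


class SketchPointEstimateDataset(SketchDataset):
    """Hypothetical dataset type for p(z) point estimates."""


class SketchOtherDataset(SketchDataset):
    """Hypothetical unrelated dataset type."""


class SketchHolder:
    # The dataset type this holder promises to return.
    output_type = SketchPointEstimateDataset


class SketchOtherHolder:
    output_type = SketchOtherDataset


class SketchPlotter:
    # The dataset type this plotter expects to receive.
    input_type = SketchPointEstimateDataset


def is_compatible(plotter_cls: type, holder_cls: type) -> bool:
    """Return True if the holder's output can be fed to the plotter."""
    return issubclass(holder_cls.output_type, plotter_cls.input_type)


print(is_compatible(SketchPlotter, SketchHolder))
print(is_compatible(SketchPlotter, SketchOtherHolder))
```

Using ``issubclass`` rather than equality lets a holder return a more specialized dataset than the plotter strictly requires.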
diff --git a/docs/source/new_plotter.rst b/docs/source/new_plotter.rst
@@ -29,10 +29,7 @@ The following example has all of the required pieces of a ``RailPlotter`` and al
             n_zbins=StageParameter(int, 150, fmt="%i", msg="Number of z bins"),
         )
 
-        inputs: dict = {
-            "truth": np.ndarray,
-            "pointEstimate": np.ndarray,
-        }
+        input_type = RailPZPointEstimateDataset
 
         def _make_2d_hist_plot(
             self,
@@ -90,7 +87,11 @@ The required pieces, in the order that they appear are:
 
 #. The ``config_options`` lines define the configuration parameters for this class, as well as their default values.  Note that here we are copying the configuration parameters from the ``RailPlotter`` as well as defining some new ones.
 
-#. The ``inputs: dict = ...`` define the inputs and expected data types for those, in this case two numpy arrays
+#. The ``input_type = RailPZPointEstimateDataset`` specifies that this
+   plotter expects a :py:class:`rail.plotting.pz_plotters.RailPZPointEstimateDataset` type dataset, which in
+   this case is a dict with one item (called ``truth``) that is a
+   numpy array, and a second item (called ``pointEstimate``) that is
+   also a numpy array.
 
 #. The ``__init__`` method does any class-specific initialization.  In this case there isn't any and the method is superfluous.