
Added docs
Signed-off-by: Stefano Savare <[email protected]>
deatinor committed Jun 14, 2021
1 parent fc8a471 commit 0612116
Showing 11 changed files with 319 additions and 21 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -71,6 +71,9 @@ instance/
 
 # Sphinx documentation
 docs/_build/
+docs/build
+docs/documentation
+make.bat
 
 # PyBuilder
 target/
21 changes: 13 additions & 8 deletions Makefile
@@ -1,18 +1,17 @@
-install:
-	pip install -r requirements.txt
+cloud-install:
+	pip install -r cloud-requirements.txt
 	pre-commit install
 
-local-install: install
-	pip install -r local-requirements.txt
+install: cloud-install
+	pip install -r requirements.txt
 	conda install -y jupyter
 	conda install -y -c conda-forge jupyter_contrib_nbextensions
 	jupyter contrib nbextension install --user
 
-cloud-install:
-	source activate pytorch_p36
-	local-install
+dev-install: install
+	pip install -r doc-requirements.txt
 
-streamlit:
+streamlit-install:
 	pip install -r streamlit-requirements.txt
 	pip install git+https://github.com/casics/nostril.git
@@ -26,3 +25,9 @@ streamlit-deploy:
 	aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 961104659532.dkr.ecr.us-east-1.amazonaws.com
 	docker tag deatinor/streamlit 961104659532.dkr.ecr.us-east-1.amazonaws.com/streamlit
 	docker push 961104659532.dkr.ecr.us-east-1.amazonaws.com/streamlit
+
+documentation:
+	sphinx-build -b html docs/source docs/documentation
+
+documentation-push:
+	aws s3 sync docs/documentation s3://deep-documentation/deep-experiments
13 changes: 0 additions & 13 deletions README.md

This file was deleted.

101 changes: 101 additions & 0 deletions README.rst
@@ -0,0 +1,101 @@
Deep Experiments
================

This repository is dedicated to the NLP part of the DEEP project.
The code is tightly coupled with AWS Sagemaker.

Quick-start
-----------

Local development
~~~~~~~~~~~~~~~~~

Contact Stefano to get the AWS credentials and install the
`AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html>`_.

Clone the repo:

.. code-block:: bash

    git clone <deep_experiments_repo>
    cd deep-experiments

Create a new conda environment:

.. code-block:: bash

    conda create -n deepl python=3.9.1

Install the necessary libraries:

.. code-block:: bash

    make dev-install

Pull the data:

.. code-block:: bash

    dvc pull

Notebook instances on AWS
~~~~~~~~~~~~~~~~~~~~~~~~~

Ask Stefano for an AWS user account and a new Notebook instance on AWS.
The notebook instance comes with the repo already cloned.

Once it is ready, start the instance and click on *Open Jupyter*.
Open the Jupyter terminal and ``cd`` to the ``deep-experiments`` repo. It should be:

.. code-block:: bash

    cd SageMaker/deep-experiments

Run:

.. code-block:: bash

    make cloud-install

(This must be run every time the instance is activated.)

Pull the data:

.. code-block:: bash

    dvc pull

Streamlit
~~~~~~~~~

We have incorporated the ``streamlit`` web application into this repo. In the future we will move it
to another repo.

To use it locally:

.. code-block:: bash

    make streamlit-install
    streamlit run scripts/testing/subpillar_pred_with_st.py

You can also build and deploy a Docker application to ECR and Beanstalk:

.. code-block:: bash

    make streamlit-build
    make streamlit-deploy

You may need to change the local image name (WIP).
We also plan to add GitHub Actions to automate this procedure.


Folder structure
----------------

- ``data`` contains the data
- ``deep`` contains the code
- ``notebooks`` contains all the Jupyter notebooks, divided by category and by the person working on them
- ``scripts`` contains the training scripts necessary for Sagemaker
File renamed without changes.
6 changes: 6 additions & 0 deletions doc-requirements.txt
@@ -0,0 +1,6 @@
sphinx==4.0.2
numpydoc==1.1.0
sphinx_issues==1.2.0
sphinx_rtd_theme==0.5.2
nbsphinx==0.8.6
nbsphinx_link==1.3.0
65 changes: 65 additions & 0 deletions docs/source/conf.py
@@ -0,0 +1,65 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys

sys.path.insert(0, os.path.abspath("../../"))


# -- Project information -----------------------------------------------------

project = "Deep Experiments"
copyright = ""
author = "DFS"

# The full version, including alpha/beta/rc tags
release = "0.1.0"


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.autodoc",
    "nbsphinx",
    "nbsphinx_link",
    "numpydoc",
    "sphinx.ext.napoleon",
    "sphinx.ext.mathjax",
    "sphinx.ext.intersphinx",
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

autosummary_generate = True
26 changes: 26 additions & 0 deletions docs/source/index.rst
@@ -0,0 +1,26 @@
.. Typewriter documentation master file, created by
   sphinx-quickstart on Wed Jul 22 15:32:08 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. include:: ../../README.rst

.. toctree::
   :maxdepth: 1
   :caption: Data:

   data/description

.. toctree::
   :maxdepth: 1
   :caption: Modeling:

   modeling/current_work
   modeling/testing_environment

.. toctree::
   :maxdepth: 1
   :caption: Sagemaker:

   sagemaker/sagemaker

58 changes: 58 additions & 0 deletions docs/source/modeling/current_work.rst
@@ -0,0 +1,58 @@
Current Work
============

Results
--------

An (incomplete) summary of the results of the models we have tried can be found
`here <https://docs.google.com/spreadsheets/d/1zCbyZNb-Smz3GsEeJO6oyodvEK3rjgxJSDiC47jSh-o/edit#gid=299270945>`_.

We tested the following:

- sectors
- subpillars

Sectors
~~~~~~~~~

The current performance is already good enough to test in production.

Subpillars
~~~~~~~~~~

The current performance is not yet good enough. It is acceptable for the pillars but not for the
subpillars, in particular for the less frequent ones.

Models
~~~~~~

We tested recent deep learning models, in particular transformers fine-tuned for text classification.
We tried basic techniques for imbalanced text classification (oversampling, entailment) without success.
The best results so far have been obtained with a multilingual transformer.

Metric
~~~~~~

We report F1 score, recall and precision.
Recall may be the most important of the three, because the main goal is to spare the taggers from
having to open and read the PDF/web article, and only a good recall makes that possible.
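
As a rough illustration, these metrics can be computed for a multilabel setting with scikit-learn;
the label matrices below are toy values, not our data:

.. code-block:: python

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    # Toy binary indicator matrices of shape (n_samples, n_labels)
    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")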

Training
---------

You have two options to train a model:

- Use a Sagemaker notebook instance with GPU
- Use the `Sagemaker training feature <https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html>`_ (recommended)

The first way is more intuitive and interactive.
However, once you have set up the main parameters of the training
and reached a semi-definitive version of the training script, we advise you to move to training jobs.
They can use faster GPUs, they are cheaper (you can launch a training job from your
local PC and pay only for the training time), and they allow for better tracking of the results and
model artifacts. A minimal launch sketch follows.
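
A minimal sketch of launching a training job with the Sagemaker Python SDK, assuming a hypothetical
entry point, source directory, and S3 bucket (adapt these to the actual scripts):

.. code-block:: python

    import sagemaker
    from sagemaker.pytorch import PyTorch

    # Hypothetical paths and instance type -- placeholders, not our exact setup
    estimator = PyTorch(
        entry_point="train.py",
        source_dir="scripts/training/example",
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        framework_version="1.8",
        py_version="py36",
        hyperparameters={"epochs": 3},
    )

    # Billed only for the duration of the job
    estimator.fit({"train": "s3://<bucket>/train"})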

Experiment tracking
-------------------

We are currently discussing which experiment tracking system to adopt.
The main options are MLflow and the built-in Sagemaker Experiments/Pipelines.
9 changes: 9 additions & 0 deletions docs/source/modeling/testing_environment.rst
@@ -0,0 +1,9 @@
Testing Environment
===================

We have built a simple online test environment to check the predictions of our models on the subpillars.
You can find it `here <http://test-env.eba-crsiq2wb.us-east-1.elasticbeanstalk.com>`_.

It is useful for showing our results to external people. We plan to add more features to it.

We used Streamlit as the Python library, and Docker + ECR + Elastic Beanstalk as the deployment stack.
38 changes: 38 additions & 0 deletions docs/source/sagemaker/sagemaker.rst
@@ -0,0 +1,38 @@
Sagemaker
=========

We mainly use AWS Sagemaker as our training and deployment platform.
Many features are built into the platform; we go over the ones we use:

Notebook Instances
-------------------

The easiest option: essentially EC2 instances with Jupyter notebook preinstalled. They should not be
difficult to use.

They also come with Git repositories already integrated.

Training Jobs
-------------

A useful feature to train models quickly and cheaply.
An example notebook is ``ssa-1.8-fastai-multilabel-remote.ipynb`` with the corresponding scripts in
``scripts/training/stefano/multiclass-fastai``.

There are many details in the notebook and in the scripts that I will document here.

Hyperparameter tuning jobs
--------------------------

Very similar to training jobs, they allow tuning the hyperparameters. They should not be difficult
to use, but we have not tried them yet; a sketch of how a tuning job might be launched follows.
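
Reusing the hypothetical estimator from the training-job sketch, a tuning job might be launched as
follows; the objective metric, regex, and parameter range are placeholders:

.. code-block:: python

    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

    # `estimator` is the hypothetical PyTorch estimator defined earlier
    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:f1",
        metric_definitions=[{"Name": "validation:f1", "Regex": "f1=([0-9\\.]+)"}],
        hyperparameter_ranges={"lr": ContinuousParameter(1e-5, 1e-3)},
        objective_type="Maximize",
        max_jobs=8,
        max_parallel_jobs=2,
    )
    tuner.fit({"train": "s3://<bucket>/train"})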

Inference - API
---------------

Quickly deploy one of the models trained in a training job.
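
A minimal sketch of deploying the hypothetical estimator above behind a real-time endpoint
(the instance type and payload are assumptions):

.. code-block:: python

    # Deploy the trained model behind an HTTPS endpoint
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
    )

    # The expected payload format depends on the inference script
    print(predictor.predict({"text": "Example lead extracted from a PDF."}))

    # Endpoints are billed while running, so tear down when done
    predictor.delete_endpoint()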

Inference - Batch Transform Job
-------------------------------
Associate a batch inference job to one of the models trained in a training job; the job can then be
triggered on demand. This is probably what we will use for production in the future.
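
A hedged sketch, again based on the hypothetical estimator; the S3 paths and content type are
assumptions:

.. code-block:: python

    # Create a batch transform job from the trained model
    transformer = estimator.transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://<bucket>/predictions",
    )
    transformer.transform(
        "s3://<bucket>/leads.jsonl",
        content_type="application/jsonlines",
        split_type="Line",
    )
    transformer.wait()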
