From 06121167ec5981e899fcc1890877bac0756ef216 Mon Sep 17 00:00:00 2001
From: Stefano Savare
Date: Mon, 14 Jun 2021 19:15:27 +0200
Subject: [PATCH] Added docs

Signed-off-by: Stefano Savare
---
 .gitignore                                   |   3 +
 Makefile                                     |  21 ++--
 README.md                                    |  13 ---
 README.rst                                   | 101 +++++++++++++++++++
 requirements.txt => cloud-requirements.txt   |   0
 doc-requirements.txt                         |   6 ++
 docs/source/conf.py                          |  65 ++++++++++++
 docs/source/index.rst                        |  26 +++++
 docs/source/modeling/current_work.rst        |  58 +++++++++++
 docs/source/modeling/testing_environment.rst |   9 ++
 docs/source/sagemaker/sagemaker.rst          |  38 +++++++
 11 files changed, 319 insertions(+), 21 deletions(-)
 delete mode 100644 README.md
 create mode 100644 README.rst
 rename requirements.txt => cloud-requirements.txt (100%)
 create mode 100644 doc-requirements.txt
 create mode 100644 docs/source/conf.py
 create mode 100644 docs/source/index.rst
 create mode 100644 docs/source/modeling/current_work.rst
 create mode 100644 docs/source/modeling/testing_environment.rst
 create mode 100644 docs/source/sagemaker/sagemaker.rst

diff --git a/.gitignore b/.gitignore
index 9978a8d..06346f3 100644
--- a/.gitignore
+++ b/.gitignore
@@ -71,6 +71,9 @@ instance/
 
 # Sphinx documentation
 docs/_build/
+docs/build
+docs/documentation
+make.bat
 
 # PyBuilder
 target/
diff --git a/Makefile b/Makefile
index a5d03c0..050cea5 100644
--- a/Makefile
+++ b/Makefile
@@ -1,18 +1,17 @@
-install:
-	pip install -r requirements.txt
+cloud-install:
+	pip install -r cloud-requirements.txt
 	pre-commit install
 
-local-install: install
-	pip install -r local-requirements.txt
+install: cloud-install
+	pip install -r requirements.txt
 	conda install -y jupyter
 	conda install -y -c conda-forge jupyter_contrib_nbextensions
 	jupyter contrib nbextension install --user
 
-cloud-install:
-	source activate pytorch_p36
-	local-install
+dev-install: install
+	pip install -r doc-requirements.txt
 
-streamlit:
+streamlit-install:
 	pip install -r streamlit-requirements.txt
 	pip install git+https://github.com/casics/nostril.git
 
@@ -26,3 +25,9 @@ streamlit-deploy:
 	aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 961104659532.dkr.ecr.us-east-1.amazonaws.com
 	docker tag deatinor/streamlit 961104659532.dkr.ecr.us-east-1.amazonaws.com/streamlit
 	docker push 961104659532.dkr.ecr.us-east-1.amazonaws.com/streamlit
+
+documentation:
+	sphinx-build -b html docs/source docs/documentation
+
+documentation-push:
+	aws s3 sync docs/documentation s3://deep-documentation/deep-experiments
\ No newline at end of file
diff --git a/README.md b/README.md
deleted file mode 100644
index 9c49529..0000000
--- a/README.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Deep Experiments
-
-This repository is dedicated to the NLP part of the DEEP project.
-
-## Quick-start
-
-TODO
-
-## Folder structure
-
-- `notebooks` contains all the Jupyter Notebook, divided by category and person working on them
-- `scripts` contains the training scripts necessary for Sagemaker
-- `deep` contains the code
diff --git a/README.rst b/README.rst
new file mode 100644
index 0000000..fd01203
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,101 @@
+Deep Experiments
+================
+
+This repository is dedicated to the NLP part of the DEEP project.
+The code is tightly coupled with AWS Sagemaker.
+
+Quick-start
+-----------
+
+Local development
+~~~~~~~~~~~~~~~~~
+
+Contact Stefano to get the AWS credentials, then install the
+`AWS CLI <https://aws.amazon.com/cli/>`_.
+
+Clone the repo:
+
+.. code-block:: bash
+
+    git clone
+    cd deep-experiments
+
+Create a new conda environment:
+
+.. code-block:: bash
+
+    conda create -n deepl python=3.9.1
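+
+Activate the environment before installing anything into it (a standard conda step that the
+steps above leave implicit):
+
+.. code-block:: bash
+
+    conda activate deepl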
+
+Install the necessary libraries:
+
+.. code-block:: bash
+
+    make dev-install
+
+Pull the data:
+
+.. code-block:: bash
+
+    dvc pull
+
+Notebook instances on AWS
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ask Stefano for an AWS user account and a new Notebook instance on AWS.
+The notebook instance comes with the repo already cloned.
+
+Once it is ready, start the instance and click on *Open Jupyter*.
+Open the Jupyter terminal and ``cd`` to the ``deep-experiments`` repo. It should be:
+
+.. code-block:: bash
+
+    cd SageMaker/deep-experiments
+
+Run:
+
+.. code-block:: bash
+
+    make cloud-install
+
+(This must be run every time the instance is activated.)
+
+Pull the data:
+
+.. code-block:: bash
+
+    dvc pull
+
+Streamlit
+~~~~~~~~~
+
+The repo also contains the ``streamlit`` web application; in the future we will move it to
+a separate repo.
+
+To use it locally:
+
+.. code-block:: bash
+
+    make streamlit-install
+    streamlit run scripts/testing/subpillar_pred_with_st.py
+
+You can also build and deploy a Docker application to ECR and Beanstalk:
+
+.. code-block:: bash
+
+    make streamlit-build
+    make streamlit-deploy
+
+You may need to change the local image name (WIP).
+We also plan to add GitHub Actions to automate this procedure.
+
+Folder structure
+----------------
+
+- ``data`` contains the data
+- ``deep`` contains the code
+- ``notebooks`` contains all the Jupyter notebooks, divided by category and person working on them
+- ``scripts`` contains the training scripts necessary for Sagemaker
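+
+Documentation
+-------------
+
+The ``Makefile`` also provides targets to build this documentation with Sphinx and push it to S3:
+
+.. code-block:: bash
+
+    make documentation
+    make documentation-push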
diff --git a/requirements.txt b/cloud-requirements.txt
similarity index 100%
rename from requirements.txt
rename to cloud-requirements.txt
diff --git a/doc-requirements.txt b/doc-requirements.txt
new file mode 100644
index 0000000..008ea68
--- /dev/null
+++ b/doc-requirements.txt
@@ -0,0 +1,6 @@
+sphinx==4.0.2
+numpydoc==1.1.0
+sphinx_issues==1.2.0
+sphinx_rtd_theme==0.5.2
+nbsphinx==0.8.6
+nbsphinx_link==1.3.0
diff --git a/docs/source/conf.py b/docs/source/conf.py
new file mode 100644
index 0000000..144e2f5
--- /dev/null
+++ b/docs/source/conf.py
@@ -0,0 +1,65 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import sys
+
+sys.path.insert(0, os.path.abspath("../../"))
+
+
+# -- Project information -----------------------------------------------------
+
+project = "Deep Experiments"
+copyright = ""
+author = "DFS"
+
+# The full version, including alpha/beta/rc tags
+release = "0.1.0"
+
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+    "sphinx.ext.autodoc",
+    "nbsphinx",
+    "nbsphinx_link",
+    "numpydoc",
+    "sphinx.ext.napoleon",
+    "sphinx.ext.mathjax",
+    "sphinx.ext.intersphinx",
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ["_templates"]
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = []
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = "sphinx_rtd_theme"
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ["_static"]
+
+autosummary_generate = True
diff --git a/docs/source/index.rst b/docs/source/index.rst
new file mode 100644
index 0000000..1265d7a
--- /dev/null
+++ b/docs/source/index.rst
@@ -0,0 +1,26 @@
+.. Deep Experiments documentation master file, created by
+   sphinx-quickstart.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+.. include:: ../../README.rst
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Data:
+
+   data/description
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Modeling:
+
+   modeling/current_work
+   modeling/testing_environment
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Sagemaker:
+
+   sagemaker/sagemaker
+
diff --git a/docs/source/modeling/current_work.rst b/docs/source/modeling/current_work.rst
new file mode 100644
index 0000000..f703e20
--- /dev/null
+++ b/docs/source/modeling/current_work.rst
@@ -0,0 +1,58 @@
+Current Work
+============
+
+Results
+-------
+
+An (incomplete) summary of the results of the models we have tried can be found
+`here `_.
+
+We tested the following:
+
+- sectors
+- subpillars
+
+Sectors
+~~~~~~~
+
+The current performance is already good enough to test it in production.
+
+Subpillars
+~~~~~~~~~~
+
+The current performance is not good enough yet: it is acceptable for the pillars but not for
+the subpillars, in particular the less frequent ones.
+
+Models
+~~~~~~
+
+We tested recent deep learning models, in particular transformers fine-tuned for text
+classification. We tried basic techniques for unbalanced text classification (oversampling,
+entailment) with no success. The best results so far have been obtained with a multilingual
+transformer.
+
+Metric
+~~~~~~
+
+We report F1 score, recall and precision.
+Recall may be the most important of the three: the main goal is to spare the taggers from
+having to open and read the PDF/web article, and only a good recall makes that possible.
+
+Training
+--------
+
+You have two options to train a model:
+
+- Use a Sagemaker notebook instance with GPU
+- Use the `Sagemaker training feature `_ (recommended)
+
+The first way is more intuitive and interactive.
+However, once you have set up the main parameters of the training and reached a
+semi-definitive version of the training script, we advise you to move to training jobs.
+They can use faster GPUs, are cheaper (you can launch a training job from your local PC and
+pay only for the training time) and allow for better tracking of the results and model
+artifacts.
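+
+For illustration, this is roughly what launching a training job with the ``sagemaker`` Python
+SDK looks like. It is a minimal sketch, not our exact setup: the entry point, role, instance
+type and hyperparameters below are placeholders.
+
+.. code-block:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    # Placeholder script and settings; adapt them to the actual training code.
+    estimator = PyTorch(
+        entry_point="train.py",
+        source_dir="scripts/training/stefano/multiclass-fastai",
+        role="<your-sagemaker-execution-role>",
+        instance_count=1,
+        instance_type="ml.p3.2xlarge",
+        framework_version="1.8",
+        py_version="py36",
+        hyperparameters={"epochs": 10},
+    )
+
+    # Each channel maps a name to an S3 prefix holding the training data.
+    estimator.fit({"train": "s3://<bucket>/train"})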
+
+Experiment tracking
+-------------------
+
+We are currently discussing the preferred tracking system.
+The main options are MLFlow and the built-in Sagemaker Experiments/Pipelines.
diff --git a/docs/source/modeling/testing_environment.rst b/docs/source/modeling/testing_environment.rst
new file mode 100644
index 0000000..a3ff5e8
--- /dev/null
+++ b/docs/source/modeling/testing_environment.rst
@@ -0,0 +1,9 @@
+Testing Environment
+===================
+
+We have built a simple online test environment to check the predictions of our models on the
+subpillars. You can find it `here `_.
+
+It is useful for showing our results to external people. We plan to add more features to it.
+
+We used Streamlit as the Python library, and Docker + ECR + Beanstalk for deployment.
\ No newline at end of file
diff --git a/docs/source/sagemaker/sagemaker.rst b/docs/source/sagemaker/sagemaker.rst
new file mode 100644
index 0000000..92043ce
--- /dev/null
+++ b/docs/source/sagemaker/sagemaker.rst
@@ -0,0 +1,38 @@
+Sagemaker
+=========
+
+We mainly use AWS Sagemaker as our training and deployment platform.
+Many features are built into the platform; we go over the ones we use.
+
+Notebook Instances
+------------------
+
+The easiest feature: essentially EC2 instances with Jupyter already installed. They should not
+be difficult to use, and they come with integrated Git repositories.
+
+Training Jobs
+-------------
+
+A useful feature to train models quickly and cheaply.
+An example notebook is ``ssa-1.8-fastai-multilabel-remote.ipynb``, with the corresponding
+scripts in ``scripts/training/stefano/multiclass-fastai``.
+
+There are many details in the notebook and in the scripts that I will document here.
+
+Hyperparameter tuning jobs
+--------------------------
+
+Very similar to training jobs, they allow tuning the hyperparameters. They should not be
+difficult to use, but we have not tried them yet.
+
+Inference - API
+---------------
+
+Quickly deploy one of the models produced by a training job as a real-time endpoint.
+
+Inference - Batch Transform Job
+-------------------------------
+
+Associate to one of the models produced by a training job a corresponding inference job
+that can be triggered on demand. This is probably what we will use for production in the
+future.
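+
+As an illustration only, a batch transform can be run with the ``sagemaker`` Python SDK.
+This is a minimal sketch, not our pipeline: the image URI, role, S3 paths and instance type
+are placeholders, and it assumes the model artifact was produced by a training job.
+
+.. code-block:: python
+
+    import sagemaker
+
+    # Placeholder names and paths; adapt them to the actual artifacts.
+    model = sagemaker.model.Model(
+        image_uri="<inference-image-uri>",
+        model_data="s3://<bucket>/model.tar.gz",
+        role="<your-sagemaker-execution-role>",
+    )
+
+    # A transformer wraps the model for offline (batch) inference.
+    transformer = model.transformer(
+        instance_count=1,
+        instance_type="ml.m5.xlarge",
+        output_path="s3://<bucket>/predictions/",
+    )
+
+    # Run the job on a CSV of entries stored in S3 and wait for completion.
+    transformer.transform(data="s3://<bucket>/entries.csv", content_type="text/csv")
+    transformer.wait()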