
Added docs
Signed-off-by: Stefano Savare <[email protected]>
deatinor committed Jun 14, 2021
1 parent fc8a471 commit 0612116
Showing 11 changed files with 319 additions and 21 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -71,6 +71,9 @@ instance/
 
 # Sphinx documentation
 docs/_build/
+docs/build
+docs/documentation
+make.bat
 
 # PyBuilder
 target/
21 changes: 13 additions & 8 deletions Makefile
@@ -1,18 +1,17 @@
-install:
-	pip install -r requirements.txt
+cloud-install:
+	pip install -r cloud-requirements.txt
 	pre-commit install
 
-local-install: install
-	pip install -r local-requirements.txt
+install: cloud-install
+	pip install -r requirements.txt
 	conda install -y jupyter
 	conda install -y -c conda-forge jupyter_contrib_nbextensions
 	jupyter contrib nbextension install --user
 
-cloud-install:
-	source activate pytorch_p36
-	local-install
+dev-install: install
+	pip install -r doc-requirements.txt
 
-streamlit:
+streamlit-install:
 	pip install -r streamlit-requirements.txt
 	pip install git+https://github.com/casics/nostril.git
@@ -26,3 +25,9 @@ streamlit-deploy:
 	aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 961104659532.dkr.ecr.us-east-1.amazonaws.com
 	docker tag deatinor/streamlit 961104659532.dkr.ecr.us-east-1.amazonaws.com/streamlit
 	docker push 961104659532.dkr.ecr.us-east-1.amazonaws.com/streamlit
+
+documentation:
+	sphinx-build -b html docs/source docs/documentation
+
+documentation-push:
+	aws s3 sync docs/documentation s3://deep-documentation/deep-experiments
13 changes: 0 additions & 13 deletions README.md

This file was deleted.

101 changes: 101 additions & 0 deletions README.rst
@@ -0,0 +1,101 @@
Deep Experiments
================

This repository is dedicated to the NLP part of the DEEP project.
The code is tightly coupled with AWS Sagemaker.

Quick-start
-----------

Local development
~~~~~~~~~~~~~~~~~

Contact Stefano to get the AWS credentials and install the
`AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html>`_.

Clone the repo:

.. code-block:: bash

    git clone <deep_experiments_repo>
    cd deep-experiments

Create a new conda environment:

.. code-block:: bash

    conda create -n deepl python=3.9.1

Install the necessary libraries:

.. code-block:: bash

    make dev-install

Pull the data:

.. code-block:: bash

    dvc pull

Notebook instances on AWS
~~~~~~~~~~~~~~~~~~~~~~~~~

Ask Stefano for an AWS user account and a new Notebook instance on AWS.
The notebook instance comes with the repo already cloned.

Once it is ready, start the instance and click on *Open Jupyter*.
Open the Jupyter terminal and ``cd`` to the ``deep-experiments`` repo. It should be:

.. code-block:: bash

    cd SageMaker/deep-experiments

Run:

.. code-block:: bash

    make cloud-install

(This must be run every time the instance is activated.)

Pull the data:

.. code-block:: bash

    dvc pull

Streamlit
~~~~~~~~~

We have incorporated the ``streamlit`` web application into this repo. In the future we will move it
to another repo.

To use it locally:

.. code-block:: bash

    make streamlit-install
    streamlit run scripts/testing/subpillar_pred_with_st.py

You can also build and deploy a Docker application to ECR and Beanstalk:

.. code-block:: bash

    make streamlit-build
    make streamlit-deploy

You may need to change the local image name (WIP).
We also plan to add GitHub Actions to automate this procedure.


Folder structure
----------------

- ``data`` contains the data
- ``deep`` contains the code
- ``notebooks`` contains all the Jupyter notebooks, divided by category and by the person working on them
- ``scripts`` contains the training scripts necessary for Sagemaker
File renamed without changes.
6 changes: 6 additions & 0 deletions doc-requirements.txt
@@ -0,0 +1,6 @@
sphinx==4.0.2
numpydoc==1.1.0
sphinx_issues==1.2.0
sphinx_rtd_theme==0.5.2
nbsphinx==0.8.6
nbsphinx_link==1.3.0
65 changes: 65 additions & 0 deletions docs/source/conf.py
@@ -0,0 +1,65 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys

sys.path.insert(0, os.path.abspath("../../"))


# -- Project information -----------------------------------------------------

project = "Deep Experiments"
copyright = ""
author = "DFS"

# The full version, including alpha/beta/rc tags
release = "0.1.0"


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.autodoc",
    "nbsphinx",
    "nbsphinx_link",
    "numpydoc",
    "sphinx.ext.napoleon",
    "sphinx.ext.mathjax",
    "sphinx.ext.intersphinx",
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

autosummary_generate = True
26 changes: 26 additions & 0 deletions docs/source/index.rst
@@ -0,0 +1,26 @@
.. Typewriter documentation master file, created by
   sphinx-quickstart on Wed Jul 22 15:32:08 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. include:: ../../README.rst

.. toctree::
   :maxdepth: 1
   :caption: Data:

   data/description

.. toctree::
   :maxdepth: 1
   :caption: Modeling:

   modeling/current_work
   modeling/testing_environment

.. toctree::
   :maxdepth: 1
   :caption: Sagemaker:

   sagemaker/sagemaker

58 changes: 58 additions & 0 deletions docs/source/modeling/current_work.rst
@@ -0,0 +1,58 @@
Current Work
============

Results
--------

An (incomplete) summary of the results of the models we have tried can be found
`here <https://docs.google.com/spreadsheets/d/1zCbyZNb-Smz3GsEeJO6oyodvEK3rjgxJSDiC47jSh-o/edit#gid=299270945>`_.

We tested the following:

- sectors
- subpillars

Sectors
~~~~~~~~~

The current performance is already good enough to test in production.

Subpillars
~~~~~~~~~~

The current performance is not yet good enough. It is acceptable for the pillars but not for the
subpillars, in particular for the less frequent ones.

Models
~~~~~~

We tested recent deep learning models, in particular transformers fine-tuned for text classification.
We tried basic techniques for imbalanced text classification (oversampling, entailment) without success.
The best results so far have been obtained with a multilingual transformer.

Metric
~~~~~~

We report F1 score, recall and precision.
Recall may be the most important of the three, because the main goal is to spare the taggers from
having to open and read the PDF/web article, and only a good recall makes that possible.
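
As a rough illustration, these metrics can be computed for a multilabel setting with scikit-learn;
the label matrices below are toy values, not our data:

.. code-block:: python

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    # Toy binary indicator matrices of shape (n_samples, n_labels)
    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")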

Training
---------

You have two options to train a model:

- Use a Sagemaker notebook instance with GPU
- Use the `Sagemaker training feature <https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html>`_ (recommended)

The first way is more intuitive and interactive.
However, once you have set up the main parameters of the training
and reached a semi-definitive version of the training script, we advise you to move to training jobs.
They can use faster GPUs, they are cheaper (you can launch a training job from your
local PC and pay only for the training time), and they allow for better tracking of the results and
model artifacts. A minimal launch sketch follows.
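
A minimal sketch of launching a training job with the Sagemaker Python SDK, assuming a hypothetical
entry point, source directory, and S3 bucket (adapt these to the actual scripts):

.. code-block:: python

    import sagemaker
    from sagemaker.pytorch import PyTorch

    # Hypothetical paths and instance type -- placeholders, not our exact setup
    estimator = PyTorch(
        entry_point="train.py",
        source_dir="scripts/training/example",
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        framework_version="1.8",
        py_version="py36",
        hyperparameters={"epochs": 3},
    )

    # Billed only for the duration of the job
    estimator.fit({"train": "s3://<bucket>/train"})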

Experiment tracking
-------------------

We are currently discussing which experiment tracking system to adopt.
The main options are MLflow and the built-in Sagemaker Experiments/Pipelines.
9 changes: 9 additions & 0 deletions docs/source/modeling/testing_environment.rst
@@ -0,0 +1,9 @@
Testing Environment
===================

We have built a simple online test environment to check the predictions of our models on the subpillars.
You can find it `here <http://test-env.eba-crsiq2wb.us-east-1.elasticbeanstalk.com>`_.

It is useful for showing our results to external people. We plan to add more features to it.

We used Streamlit as the Python library, and Docker + ECR + Elastic Beanstalk as the deployment stack.
38 changes: 38 additions & 0 deletions docs/source/sagemaker/sagemaker.rst
@@ -0,0 +1,38 @@
Sagemaker
=========

We mainly use AWS Sagemaker as our training and deployment platform.
Many features are built into the platform; we go over the ones we use:

Notebook Instances
-------------------

The easiest option: essentially EC2 instances with Jupyter notebook preinstalled. They should not be
difficult to use.

They also come with Git repositories already integrated.

Training Jobs
-------------

A useful feature to train models quickly and cheaply.
An example notebook is ``ssa-1.8-fastai-multilabel-remote.ipynb`` with the corresponding scripts in
``scripts/training/stefano/multiclass-fastai``.

There are many details in the notebook and in the scripts that I will document here.

Hyperparameter tuning jobs
--------------------------

Very similar to training jobs, they allow tuning the hyperparameters. They should not be difficult
to use, but we have not tried them yet; a sketch of how a tuning job might be launched follows.
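
Reusing the hypothetical estimator from the training-job sketch, a tuning job might be launched as
follows; the objective metric, regex, and parameter range are placeholders:

.. code-block:: python

    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

    # `estimator` is the hypothetical PyTorch estimator defined earlier
    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:f1",
        metric_definitions=[{"Name": "validation:f1", "Regex": "f1=([0-9\\.]+)"}],
        hyperparameter_ranges={"lr": ContinuousParameter(1e-5, 1e-3)},
        objective_type="Maximize",
        max_jobs=8,
        max_parallel_jobs=2,
    )
    tuner.fit({"train": "s3://<bucket>/train"})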

Inference - API
---------------

Quickly deploy one of the models trained in a training job.
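
A minimal sketch of deploying the hypothetical estimator above behind a real-time endpoint
(the instance type and payload are assumptions):

.. code-block:: python

    # Deploy the trained model behind an HTTPS endpoint
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
    )

    # The expected payload format depends on the inference script
    print(predictor.predict({"text": "Example lead extracted from a PDF."}))

    # Endpoints are billed while running, so tear down when done
    predictor.delete_endpoint()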

Inference - Batch Transform Job
-------------------------------
Associate a batch inference job to one of the models trained in a training job; the job can then be
triggered on demand. This is probably what we will use for production in the future.
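
A hedged sketch, again based on the hypothetical estimator; the S3 paths and content type are
assumptions:

.. code-block:: python

    # Create a batch transform job from the trained model
    transformer = estimator.transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://<bucket>/predictions",
    )
    transformer.transform(
        "s3://<bucket>/leads.jsonl",
        content_type="application/jsonlines",
        split_type="Line",
    )
    transformer.wait()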
