Signed-off-by: Stefano Savare <[email protected]>
Deep Experiments
================

This repository is dedicated to the NLP part of the DEEP project.
The code is tightly coupled with AWS Sagemaker.

Quick-start
-----------

Local development
~~~~~~~~~~~~~~~~~

Contact Stefano to get the AWS credentials, and install the
`AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html>`_.

Clone the repo:

.. code-block:: bash

   git clone <deep_experiments_repo>
   cd deep-experiments

Create and activate a new conda environment:

.. code-block:: bash

   conda create -n deepl python=3.9.1
   conda activate deepl

Install the necessary libraries:

.. code-block:: bash

   make dev-install

Pull the data:

.. code-block:: bash

   dvc pull

Notebook instances on AWS
~~~~~~~~~~~~~~~~~~~~~~~~~

Ask Stefano for an AWS user account and a new notebook instance on AWS.
The notebook instance comes with the repo already cloned.

Once it is ready, start the instance and click on *Open Jupyter*.
Open the Jupyter terminal and ``cd`` to the ``deep-experiments`` repo:

.. code-block:: bash

   cd SageMaker/deep-experiments

Run:

.. code-block:: bash

   make cloud-install

(This must be run every time the instance is activated.)

Pull the data:

.. code-block:: bash

   dvc pull

Streamlit
~~~~~~~~~

The repo also contains the ``streamlit`` web application. In the future we will move it to
another repo.

To use it locally:

.. code-block:: bash

   make streamlit-install
   streamlit run scripts/testing/subpillar_pred_with_st.py

You can also build and deploy a Docker application to ECR and Beanstalk:

.. code-block:: bash

   make streamlit-build
   make streamlit-deploy

You may need to change the local image name (WIP).
We also plan to add GitHub Actions to automate this procedure.

Folder structure
----------------

- ``data`` contains the data
- ``deep`` contains the code
- ``notebooks`` contains all the Jupyter notebooks, divided by category and by the person working on them
- ``scripts`` contains the training scripts necessary for Sagemaker
sphinx==4.0.2
numpydoc==1.1.0
sphinx_issues==1.2.0
sphinx_rtd_theme==0.5.2
nbsphinx==0.8.6
nbsphinx_link==1.3.0
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys

sys.path.insert(0, os.path.abspath("../../"))


# -- Project information -----------------------------------------------------

project = "Deep Experiments"
copyright = ""
author = "DFS"

# The full version, including alpha/beta/rc tags
release = "0.1.0"


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.autodoc",
    "nbsphinx",
    "nbsphinx_link",
    "numpydoc",
    "sphinx.ext.napoleon",
    "sphinx.ext.mathjax",
    "sphinx.ext.intersphinx",
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

autosummary_generate = True
.. Deep Experiments documentation master file, created by
   sphinx-quickstart on Wed Jul 22 15:32:08 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. include:: ../../README.rst

.. toctree::
   :maxdepth: 1
   :caption: Data:

   data/description

.. toctree::
   :maxdepth: 1
   :caption: Modeling:

   modeling/current_work
   modeling/testing_environment

.. toctree::
   :maxdepth: 1
   :caption: Sagemaker:

   sagemaker/sagemaker
Current Work
============

Results
-------

An (incomplete) summary of the results of the models we have tried can be found
`here <https://docs.google.com/spreadsheets/d/1zCbyZNb-Smz3GsEeJO6oyodvEK3rjgxJSDiC47jSh-o/edit#gid=299270945>`_.

We tested the following targets:

- sectors
- subpillars

Sectors
~~~~~~~

The current performance is already good enough to test the model in production.

Subpillars
~~~~~~~~~~

The current performance is not yet good enough. It is acceptable for the pillars but not for the
subpillars, in particular the less frequent ones.

Models
~~~~~~

We tested recent deep learning models, in particular transformers fine-tuned for text classification.
We tried basic techniques for unbalanced text classification (oversampling, entailment) without success.
The best results so far have been obtained with a multilingual transformer.
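
For illustration, the oversampling technique mentioned above can be sketched in a few lines (a generic random-oversampling example with made-up labels, not the repo's actual implementation):

```python
import random
from collections import Counter

def oversample(examples, labels, seed=0):
    """Duplicate minority-class examples (sampling with replacement)
    until every class matches the majority-class count."""
    rng = random.Random(seed)
    by_label = {}
    for ex, lab in zip(examples, labels):
        by_label.setdefault(lab, []).append(ex)
    target = max(len(exs) for exs in by_label.values())
    out = []
    for lab, exs in by_label.items():
        # keep the originals, then pad the class up to the target size
        extra = [rng.choice(exs) for _ in range(target - len(exs))]
        out.extend((ex, lab) for ex in exs + extra)
    rng.shuffle(out)
    return out

balanced = oversample(["text 1", "text 2", "text 3"],
                      ["Health", "Health", "Shelter"])
print(Counter(lab for _, lab in balanced))  # both classes now have 2 examples
```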

Metric
~~~~~~

We report F1 score, recall, and precision.
Recall may be the most important of these: the main goal is to spare the taggers from having to open
and read each PDF/web article, and only a good recall makes that possible.
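
To make the reported numbers concrete, micro-averaged precision, recall, and F1 over multi-label predictions can be computed as follows (a plain-Python sketch with made-up labels, not the repo's evaluation code):

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 for multi-label predictions.

    true_sets / pred_sets: one set of labels per document.
    """
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # labels predicted and correct
        fp += len(pred - true)   # labels predicted but wrong
        fn += len(true - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = micro_prf(
    [{"Health"}, {"Shelter", "WASH"}],
    [{"Health", "WASH"}, {"Shelter"}],
)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.67 0.67 0.67
```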

Training
--------

You have two options to train a model:

- use a Sagemaker notebook instance with a GPU
- use the `Sagemaker training feature <https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html>`_ (recommended)

The first option is more intuitive and interactive.
However, once you have set up the main training parameters and reached a semi-definitive version of
the training script, we advise you to move to training jobs.
They can use faster GPUs, are cheaper (you can launch a training job from your local PC and pay only
for the training time), and allow better tracking of results and model artifacts.
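
As an illustration, launching a training job from a local machine looks roughly like this (a sketch assuming the ``sagemaker`` Python SDK; the entry point, role ARN, S3 path, and hyperparameters below are placeholders, not the repo's actual values):

```python
from sagemaker.pytorch import PyTorch

# Placeholder configuration: replace the role ARN, scripts and S3 paths
# with your own before running.
estimator = PyTorch(
    entry_point="train.py",            # hypothetical training script
    source_dir="scripts/training",     # uploaded into the job's container
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",     # GPU instance, billed only for the job's duration
    framework_version="1.8",
    py_version="py36",
    hyperparameters={"epochs": 3, "lr": 2e-5},
)

# Starts a remote training job; the model artifact lands in S3.
estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path
```

Because the job runs remotely, the local machine only pays for the call, not for the GPU time of an always-on notebook instance.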

Experiment tracking
-------------------

We are currently discussing our preferred tracking system.
The main options are MLflow and the built-in Sagemaker Experiments/Pipelines.
Testing Environment
===================

We have built a simple online test environment to check the predictions of our models on the subpillars.
You can find it `here <http://test-env.eba-crsiq2wb.us-east-1.elasticbeanstalk.com>`_.

It is useful for showing our results to external people. We plan to add more features to it.

We used Streamlit as the Python library, and Docker + ECR + Beanstalk as the deployment option.
Sagemaker
=========

We mainly use AWS Sagemaker as our training and deployment platform.
Many features are built into the platform; we go over the ones we use.

Notebook Instances
------------------

The easiest feature: essentially an EC2 instance with Jupyter notebook already installed. It should
not be difficult to use.

Notebook instances also come with Git repositories already integrated.

Training Jobs
-------------

A useful feature to train models quickly and cheaply.
An example notebook is ``ssa-1.8-fastai-multilabel-remote.ipynb``, with the corresponding scripts in
``scripts/training/stefano/multiclass-fastai``.

There are many details in the notebook and in the scripts that are not documented here yet.

Hyperparameter tuning jobs
--------------------------

Very similar to training jobs, they allow tuning the hyperparameters. They should not be difficult
to use, but we have not tried them yet.

Inference - API
---------------

Quickly deploy one of the models trained in a training job behind an API endpoint.

Inference - Batch Transform Job
-------------------------------

Associate to one of the models trained in a training job a corresponding inference job that can be
triggered on demand. This is probably what we will use for production in the future.
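
As an illustration, creating a batch transform from a trained model looks roughly like this (a sketch assuming the ``sagemaker`` Python SDK; the model artifact, role ARN, script name, and S3 paths are placeholders, not the repo's actual values):

```python
from sagemaker.pytorch import PyTorchModel

# Placeholder configuration: point model_data at an artifact produced
# by one of your training jobs.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder artifact path
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    entry_point="inference.py",                # hypothetical inference script
    framework_version="1.8",
    py_version="py36",
)

# Spin up a transient fleet, run inference over the whole S3 prefix,
# write predictions back to S3, then tear the fleet down.
transformer = model.transformer(instance_count=1, instance_type="ml.m5.xlarge")
transformer.transform("s3://my-bucket/input/", content_type="application/json")
transformer.wait()
```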