diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
index c39ca04..197a774 100644
--- a/.github/workflows/documentation.yml
+++ b/.github/workflows/documentation.yml
@@ -21,7 +21,6 @@ jobs:
           make doc-install
       - name: Build sphinx documentation
         run: |
-          ls ./docs
           make documentation
       - name: Configure AWS Credentials
         uses: aws-actions/configure-aws-credentials@v1
@@ -31,7 +30,6 @@ jobs:
           aws-region: us-east-1
       - name: Deploy static site to S3 bucket
         run: |
-          ls ./docs
           aws s3 rm s3://${{ secrets.AWS_DOCUMENTATION_BUCKET }}/documentation --recursive
           aws s3 sync ./docs/documentation/ s3://${{ secrets.AWS_DOCUMENTATION_BUCKET }}/deep-experiments/ --delete
       # - name: gcloud auth
diff --git a/Makefile b/Makefile
index 3e09790..35183ef 100644
--- a/Makefile
+++ b/Makefile
@@ -29,6 +29,7 @@ streamlit-deploy:
 	docker push 961104659532.dkr.ecr.us-east-1.amazonaws.com/streamlit
 
 documentation:
+	rm -rf docs/documentation
 	sphinx-build -b html docs/source docs/documentation
 
 documentation-push:
diff --git a/docs/source/dfs-data/description.rst b/docs/source/dfs-data/description.rst
new file mode 100644
index 0000000..21b89b6
--- /dev/null
+++ b/docs/source/dfs-data/description.rst
@@ -0,0 +1,57 @@
+Data
+====
+
+The data for the project is in the folder `data`.
+The `immap` subfolder contains only IMMAP data, while `frameworks_data` contains all the rest.
+
+What is our data?
+------------------------
+
+Our data is composed of ~100.000 sentences extracted from PDF and web articles. Each sentence has been manually
+labeled by taggers. Multiple labels are associated to the same sentence:
+
+- Sector
+- Pillar
+- Subpillar
+
+We want to predict all of them. Each subpillar belongs to one and only one pillar.
+More labels are coming.
+
+How good is the data
+--------------------
+
+Not so much, some classes are ambiguous and, since we have multiple taggers, their tags are not
+always consistent.
+However, we have a lot of data, which is good.
+
+
+Which data should I use?
+------------------------
+
+We are currently working only with `frameworks_data` and the most recent data version.
+
+We advise you to import the file:
+
+.. code-block:: python
+
+    from deep.constants import *
+
+The variable ``LATEST_DATA_PATH`` points to the most recent version of the data.
+
+What are the differences between the data versions?
+----------------------------------------------------
+
+Alongside the dataset queried (IMMAP vs all) mainly bug fixes and better definition of the classes.
+Please use the latest version.
+
+How do I get the most recent version?
+-------------------------------------
+
+We use `DVC `_ to deal with data. Simply run
+
+.. code-block:: bash
+
+    dvc pull
+
+to get the most recent version.
+
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 87f3764..ecf8b2f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -9,7 +9,7 @@
    :maxdepth: 1
    :caption: Data:
 
-   data/description
+   dfs-data/description
 
 .. toctree::
    :maxdepth: 1