
New tutorials and polish (#23)
- Add / update new tutorials formatted as Jupyter notebooks
- Add "scripts to rule them all", using the `em-core` scripts as a template
- Updated and defined the S3 client
- Added persistent system environment variable instructions
- README Updates
- Replaced static links to original repo with dynamic links for this repo
- Improved organization
- Various markdown linting

Co-authored-by: Ryan Earley <[email protected]>
Co-authored-by: Nadia Dimitrova <[email protected]>
Co-authored-by: imgbot[bot] <31301654+imgbot[bot]@users.noreply.github.com>
4 people authored Sep 30, 2020
1 parent 67ea474 commit 77da4ae
Showing 52 changed files with 3,765 additions and 80 deletions.
72 changes: 47 additions & 25 deletions README.md
# LADI Tutorials

Tutorials for the Low Altitude Disaster Imagery (LADI) dataset. This tutorial was originally forked from a [Penn State Learning Factory](https://www.lf.psu.edu/) capstone project.

- [LADI Tutorials](#ladi-tutorials)
  - [Point of Contact](#point-of-contact)
  - [Initial Setup](#initial-setup)
    - [Persistent System Environment Variables](#persistent-system-environment-variables)
    - [Scripts](#scripts)
  - [Tutorials - Accessing the Dataset](#tutorials---accessing-the-dataset)
  - [Tutorials - Metadata Analysis](#tutorials---metadata-analysis)
  - [Tutorials - Machine Learning](#tutorials---machine-learning)
  - [Distribution Statement](#distribution-statement)

## Point of Contact

We encourage the use of [GitHub Issues](https://guides.github.com/features/issues/), but when email is required, please contact the administrators at [[email protected]](mailto:[email protected]). As the public safety and computer vision communities adopt the dataset, a separate mailing list for development may be created.

## Initial Setup

This section specifies the run order and requirements for the initial setup of the repository. Other repositories in this organization may rely on this setup being completed.

### Persistent System Environment Variables

Immediately after cloning this repository, [create a persistent system environment variable](https://superuser.com/q/284342/44051) titled `LADI_DIR_TUTORIAL` whose value is the full path to this repository's root directory.

On unix there are many ways to do this; here is an example using [`/etc/profile.d`](https://unix.stackexchange.com/a/117473). Create a new file `ladi-env.sh` using `sudo vi /etc/profile.d/ladi-env.sh` and add the command to set the variable:

```bash
export LADI_DIR_TUTORIAL=/path/to/ladi-tutorial
```

You can confirm that `LADI_DIR_TUTORIAL` was set on unix by inspecting the output of `env`.

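Tutorial code can then resolve paths against this variable at runtime. Here is a minimal Python sketch; the `data` subdirectory is an assumption based on this repository's layout, not a requirement.

```python
import os
from pathlib import Path

# Resolve the repository root from the persistent environment variable.
ladi_dir = os.environ.get("LADI_DIR_TUTORIAL")
if ladi_dir is None:
    raise EnvironmentError("LADI_DIR_TUTORIAL is not set; see the instructions above.")

# Example: point at the data directory populated by script/setup.sh (assumed layout).
data_dir = Path(ladi_dir) / "data"
print(f"Reading LADI tutorial data from: {data_dir}")
```
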
[PyTorch Data Loading](./Tutorials/Pytorch_Data_Load.md)
### Scripts

This documentation is about loading LADI dataset in PyTorch framework including examples of writing custom `Dataset`, `Transforms` and `Dataloader`.
This is a set of boilerplate scripts describing the [normalized script pattern that GitHub uses in its projects](https://github.blog/2015-06-30-scripts-to-rule-them-all/). The [GitHub Scripts To Rule Them All](https://github.com/github/scripts-to-rule-them-all) was used as a template. Refer to the [script directory README](./script/README.md) for more details.

You will need to run these scripts in this order to download package dependencies and download all of the necessary data to get you started.

## Tutorials - Accessing the Dataset

A set of tutorials focused on installing AWS tools and configuring the AWS environment to download the LADI dataset and load it in Python, locally and remotely. There is also a short tutorial on how to clean and validate the data.

- [Getting Started](./tutorials/Get_Started.md)
- [Clean and Validate LADI Dataset](./tutorials/Clean_Validate.md)

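The Getting Started tutorial covers the AWS setup in detail. As a preview, here is a minimal boto3 sketch for browsing a public bucket anonymously; the bucket name and prefix are illustrative assumptions, so confirm the actual values in the tutorial.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) client for a public bucket; no AWS credentials required.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Bucket and prefix below are illustrative -- see the Getting Started tutorial
# for the actual LADI bucket layout.
response = s3.list_objects_v2(Bucket="ladi", Prefix="Images/", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a single object to a local file:
# s3.download_file("ladi", "Images/example.jpg", "example.jpg")
```
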
## Tutorials - Metadata Analysis

These tutorials are Jupyter (Python 3) notebooks that demonstrate how to perform geospatial analysis by enriching the LADI metadata with third-party GIS information. One tutorial counts the images taken within each administrative boundary (e.g. US states) and assigns each state a color based on that count. The other filters images by a specific annotation and performs various geospatial measurements on the resulting subset.

- [ISO-3166-2 Administrative Boundaries](./tutorials/Geospatial-Hurricane-Analysis.ipynb)
- [Geospatial Hurricane Analysis](./tutorials/Geospatial-Hurricane-Analysis.ipynb)

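The core point-in-polygon workflow behind these notebooks can be sketched with geopandas. This is a minimal illustration rather than the notebooks' actual code; the metadata and shapefile names are placeholder assumptions (the state boundaries correspond to the Census TIGER/Line download described under `data/Census-State`).

```python
import pandas as pd
import geopandas as gpd

# Placeholder file names -- script/setup.sh downloads the actual metadata and boundaries.
meta = pd.read_csv("ladi_metadata.csv")  # assumed to contain image lat/lon columns
images = gpd.GeoDataFrame(
    meta,
    geometry=gpd.points_from_xy(meta["gps_lon"], meta["gps_lat"]),
    crs="EPSG:4326",
)

# TIGER/Line state boundaries (see data/Census-State); reproject to match the points.
states = gpd.read_file("tl_2017_us_state.shp").to_crs(images.crs)

# Point-in-polygon spatial join, then count images per state.
# Note: op= is the geopandas 0.6 keyword; newer releases call it predicate=.
joined = gpd.sjoin(images, states, how="inner", op="within")
print(joined.groupby("NAME").size().sort_values(ascending=False).head())
```
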
## Tutorials - Machine Learning

These tutorials focus on training and testing a classifier, either with a Convolutional Neural Network (CNN) built from scratch or with pre-trained ResNet and AlexNet models.

- [PyTorch Data Loading](./tutorials/Pytorch_Data_Load.md)
- [Train and Test A Classifier](./tutorials/Train_Test_Classifier.md)
- [Fine Tuning Torchvision Models](./tutorials/Fine_Tune_Torchvision_Models.md)

The data loading tutorial includes examples of writing custom `Dataset`, `Transforms`, and `DataLoader` classes for loading the LADI dataset in the PyTorch framework.

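For orientation, a custom PyTorch `Dataset` generally follows the skeleton below. This is an illustrative sketch with assumed CSV column names, not the tutorial's exact code.

```python
import os
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class LadiDataset(Dataset):
    """Minimal custom Dataset sketch; the CSV columns below are assumptions."""

    def __init__(self, csv_file, root_dir, transform=None):
        self.frame = pd.read_csv(csv_file)  # assumed: one row per image (img_path, label)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        image = Image.open(os.path.join(self.root_dir, row["img_path"])).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, torch.tensor(row["label"], dtype=torch.long)

# Usage sketch:
# loader = DataLoader(LadiDataset("labels.csv", "images/"), batch_size=32, shuffle=True)
```
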
## Distribution Statement

[BSD 3-Clause License](LICENSE)
7 changes: 0 additions & 7 deletions Tutorials/Model Script/README.md

This file was deleted.

19 changes: 19 additions & 0 deletions data/Census-AHS/README.md
# Census-AHS

The Census Bureau's American Housing Survey (AHS) provides information about the physical costs and conditions of homes, the characteristics of the people living in those homes, and characteristics relevant for disaster response, covering more than 60,000 Americans.

## Download Instructions

### Script (Recommended)

[`script/setup.sh`](../../script/setup.sh) is used to set up the project in an initial state. It will download and extract the data.

### Manual

Although not recommended, the data can be downloaded manually:

1. Go to the US Census national public use file page: [Census](https://www.census.gov/programs-surveys/ahs/data/2017/ahs-2017-public-use-file--puf-/ahs-2017-national-public-use-file--puf-.html)
2. Download the AHS 2017 National PUF v3.0 CSV (or the latest version)

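Once extracted, the PUF is a plain CSV that can be loaded with pandas. A minimal sketch follows; the file name is an assumption, so substitute whatever the download extracts.

```python
import pandas as pd

# File name is an assumption -- substitute the CSV extracted from the PUF zip.
ahs = pd.read_csv("ahs2017n.csv", low_memory=False)  # wide survey table
print(ahs.shape)
print(ahs.columns[:10].tolist())
```
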
## Distribution Statement

[BSD 3-Clause License](https://github.com/LADI-Dataset/ladi-tutorial/blob/master/LICENSE)
20 changes: 20 additions & 0 deletions data/Census-CBSA/README.md
# Census-CBSA

Census-CBSA defines US Metropolitan and Micropolitan statistical areas. This Census data is provided as a geospatial-enabled file format.

## Download Instructions

### Script (Recommended)

[`script/setup.sh`](../../script/setup.sh) is used to set up the project in an initial state. It will download and extract the data.

### Manual

Although not recommended, the data can be downloaded manually:

1. Go to the US Census Bureau open data website: [US-CBSA](https://catalog.data.gov/dataset/tiger-line-shapefile-2019-nation-u-s-current-metropolitan-statistical-area-micropolitan-statist)
2. Download the latest version of the shapefile zip file

## Distribution Statement

[BSD 3-Clause License](https://github.com/LADI-Dataset/ladi-tutorial/blob/master/LICENSE)
21 changes: 21 additions & 0 deletions data/Census-State/README.md
# Census-State

Census-State defines statistical areas for all US states. This Census data is provided as a geospatial-enabled file format.


## Download Instructions

### Script (Recommended)

[`script/setup.sh`](../../script/setup.sh) is used to set up the project in an initial state. It will download and extract the data.

### Manual

Although not recommended, the data can be downloaded manually:

1. Go to the US Census Bureau open data website: [US-State](https://catalog.data.gov/dataset/tiger-line-shapefile-2017-nation-u-s-current-state-and-equivalent-national)
2. Download the latest version of the shapefile zip file

## Distribution Statement

[BSD 3-Clause License](https://github.com/LADI-Dataset/ladi-tutorial/blob/master/LICENSE)
16 changes: 16 additions & 0 deletions data/FAA-Airports/README.md
# FAA Airports

An airport is an area on land or water intended to be used, wholly or in part, for the arrival, departure, and surface movement of aircraft and helicopters. This airport data is provided in a vector geospatial-enabled file format.

## Download Instructions

### Script (Recommended)

[`script/setup.sh`](../../script/setup.sh) is used to set up the project in an initial state. It will download and extract the data.

### Manual

Although not recommended, the data can be downloaded manually:

1. Go to the FAA open data website: [Airports](https://ais-faa.opendata.arcgis.com/datasets/e747ab91a11045e8b3f8a3efd093d3b5_0)
2. Select Shapefile from the Download drop-down
17 changes: 17 additions & 0 deletions data/Natural-Earth/README.md
# Natural-Earth

The Natural Earth file is a comprehensive map of the world and its administrative boundaries (states, FIPS areas, countries). This map is provided as a geospatial-enabled file format.

## Download Instructions

### Script (Recommended)

[`script/setup.sh`](../../script/setup.sh) is used to set up the project in an initial state. It will download and extract the data.

### Manual

Although not recommended, the data can be downloaded manually:

1. Go to the Natural Earth download page: [Natural-Earth](https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-1-states-provinces/)
2. Download states and provinces version 4.1.0 (or the latest version)

## Distribution Statement

[BSD 3-Clause License](https://github.com/LADI-Dataset/ladi-tutorial/blob/master/LICENSE)
File renamed without changes.
4 changes: 2 additions & 2 deletions Tutorials/README.md → data/metadata/README.md
# Data

Default directory for sample data used by tutorials.

## Distribution Statement

18 files renamed without changes.
18 changes: 18 additions & 0 deletions requirements.txt
boto3 == 1.9.66
contextily == 1.0.0
contextlib2 == 0.6.0.post1
fiona == 1.8.13
jupyter >= 1.0.0
jupyter-client >= 6.1.6
jupyter-console >= 6.1.0
jupyter-core >= 4.6.3
jupyterlab >= 2.2.5
jupyterlab-server >= 1.2.0
geopandas == 0.6.1
geopy == 2.0.0
matplotlib == 3.3.1
numpy == 1.19.1
pandas == 1.1.0
path == 15.0.0
pyproj == 2.6.1.post1
shapely == 1.7.0
156 changes: 156 additions & 0 deletions script/README.md
# Scripts

This is a set of boilerplate scripts describing the [normalized script pattern that GitHub uses in its projects](https://github.blog/2015-06-30-scripts-to-rule-them-all/). The [GitHub Scripts To Rule Them All](https://github.com/github/scripts-to-rule-them-all) repository was used as a template. They were tested using Ubuntu 18.04.3 LTS on Windows 10.

- [Scripts](#scripts)
  - [`LADI_DIR_TUTORIAL` and Execution](#ladi_dir_tutorial-and-execution)
  - [Dependencies](#dependencies)
    - [Linux Shell](#linux-shell)
    - [Proxy and Internet Access](#proxy-and-internet-access)
    - [Superuser Access](#superuser-access)
  - [The Scripts](#the-scripts)
    - [script/bootstrap](#scriptbootstrap)
      - [Packages](#packages)
    - [script/setup](#scriptsetup)
      - [Data](#data)

## `LADI_DIR_TUTORIAL` and Execution

These scripts assume that `LADI_DIR_TUTORIAL` has been set. Refer to the repository root [README](../README.md) for instructions.

## Dependencies

### Linux Shell

The scripts need to be run in a Linux shell. For Windows 10 users, you can use [Ubuntu on Windows](https://tutorials.ubuntu.com/tutorial/tutorial-ubuntu-on-windows#0). For Windows users specifically, the system drive and other connected drives are exposed in the `/mnt/` directory; for example, you can access the Windows C: drive via `cd /mnt/c`.

If you modify these scripts, please follow the [convention guide](https://github.com/LADI-Dataset/ladi-overview/blob/master/CONTRIBUTING.md#convention-guide), which specifies an end-of-line character of `LF (\n)`. If the end-of-line characters are changed to `CRLF (\r\n)`, the shell will choke on the carriage returns with an error such as `$'\r': command not found`.

### Proxy and Internet Access

The scripts will download data using [`curl`](https://curl.haxx.se/docs/manpage.html) and [`wget`](https://manpages.ubuntu.com/manpages/trusty/man1/wget.1.html), which, depending on your security policy, may require a proxy.

The scripts assume that the `http_proxy` and `https_proxy` Linux environment variables have been set:

```bash
export http_proxy=proxy.mycompany:port
export https_proxy=proxy.mycompany:port
```

You may also need to [configure git to use a proxy](https://stackoverflow.com/q/16067534). This information is stored in `.gitconfig`, for example:

```git
[http]
proxy = http://proxy.mycompany:port
[https]
proxy = http://proxy.mycompany:port
```

### Superuser Access

Depending on your security policy, you may need to run some scripts as a superuser or another user. These scripts have been tested using [`sudo`](https://manpages.ubuntu.com/manpages/disco/en/man8/sudo.8.html). Depending on how you set up the system variable `LADI_DIR_TUTORIAL`, you may need to call [sudo with the `-E` flag](https://stackoverflow.com/a/8633575/363829), which preserves the environment.

If running without administrator or sudo access, try running these scripts using `bash` directly, for example:

```bash
bash ./setup.sh
```

## The Scripts

Each of these scripts is responsible for a unit of work. This way they can be called from other scripts.

This not only cleans up a lot of duplicated effort, it also means contributors can do the things they need to do without extensive knowledge of how the project works. Lowering friction like this is key to faster and happier contributions.

The following is a list of scripts and their primary responsibilities.

### script/bootstrap

[`script/bootstrap`][bootstrap] is used solely for fulfilling dependencies of the project, such as packages, software versions, and git submodules. The goal is to make sure all required dependencies are installed. This script should be run before [`script/setup`][setup].

#### Packages

Using [`apt`](https://help.ubuntu.com/lts/serverguide/apt.html), the following Linux packages are installed:

| Package | Use |
| :------ | :-- |
| `unzip` | extracting zip archives |

The LADI team has not knowingly modified any of these packages. Any modifications to these packages shall comply with their respective licenses and are outside the scope of this repository.

### script/setup

[`script/setup`][setup] is used to set up a project in an initial state. This is typically run after an initial clone or to reset the project back to its initial state. This is also useful for ensuring that your bootstrapping actually works well.

#### Data

Commonly used datasets are downloaded by [`script/setup`][setup]. Refer to the [data directory README](../data/README.md) for more details.

<!-- NOT YET IMPLEMENTED BUT COMMENTED FOR FUTURE REFERENCE
### script/update
[`script/update`][update] is used to update the project after a fresh pull.
If you have not worked on the project for a while, running [`script/update`][update] after
a pull will ensure that everything inside the project is up to date and ready to work.
Typically, [`script/bootstrap`][bootstrap] is run inside this script. This is also a good
opportunity to run database migrations or any other things required to get the
state of the app into shape for the current version that is checked out.
### script/server
[`script/server`][server] is used to start the application.
For a web application, this might start up any extra processes that the
application requires to run in addition to itself.
[`script/update`][update] should be called ahead of any application booting to ensure that
the application is up to date and can run appropriately.
### script/test
[`script/test`][test] is used to run the test suite of the application.
A good pattern to support is having an optional argument that is a file path.
This allows you to support running single tests.
Linting (i.e. rubocop, jshint, pmd, etc.) can also be considered a form of testing. These tend to run faster than tests, so put them towards the beginning of a [`script/test`][test] so it fails faster if there's a linting problem.
[`script/test`][test] should be called from [`script/cibuild`][cibuild], so it should handle
setting up the application appropriately based on the environment. For example,
if called in a development environment, it should probably call [`script/update`][update]
to always ensure that the application is up to date. If called from
[`script/cibuild`][cibuild], it should probably reset the application to a clean state.
### script/cibuild
[`script/cibuild`][cibuild] is used for your continuous integration server.
This script is typically only called from your CI server.
You should set up any specific things for your environment here before your tests
are run. Your test are run simply by calling [`script/test`][test].
### script/console
[`script/console`][console] is used to open a console for your application.
A good pattern to support is having an optional argument that is an environment
name, so you can connect to that environment's console.
You should configure and run anything that needs to happen to open a console for
the requested environment.
-->

<!-- Relative Links -->
[bootstrap]: bootstrap.sh
[setup]: setup.sh
[update]: update.sh
[server]: server.sh
[test]: test.sh
[cibuild]: cibuild.sh
[console]: console.sh
