aws s3 sync /local/path/to/data s3://qic003/target/dir --profile prp # specify user profile (check ~/.aws/credentials)
k create -f ./kubectl/cephfs_pvc.yml # create only once; can be deleted when no longer needed
k create -f ./kubectl/pod_example.yml # create pod
k get pods # get pod information
k describe pod pod_name(qic003-pod) # pod description
k exec -it pod_name(qic003-pod) bash # open an interactive shell in the pod
k delete pod pod_name(qic003-pod) # delete pod
k create -f ./kubectl/job_example.yml # create job
k get jobs # get job information
k describe pod qic003-job # get job/pod description
k exec -it pod_name_in_this_job(qic003-job-zprb4) bash # open an interactive shell in the job's pod
k delete job job_name(qic003-job) # delete job; remember to delete the job once training is finished
Export conda environment coralnet: conda env export > environment.yml # only export once, or when the environment changes
- Create a Job with kubectl and the required computing resources
- Clone this repo to the target directory:
git clone https://github.com/qiminchen/CoralNet.git
- Create conda environment coralnet:
conda env create -f environment.yml
- Configure AWS:
NOTE: installing awscli and awscli-plugin-endpoint in the local (system) environment is NOT recommended; install them in the conda environment instead:
pip install awscli awscli-plugin-endpoint
aws configure set plugins.endpoint awscli_plugin_endpoint
vim ~/.aws/credentials
# Add profile to ~/.aws/credentials
[default]
aws_access_key_id = xxx
aws_secret_access_key = xxx
# Add profile to ~/.aws/config
aws configure set s3api.endpoint_url https://s3.nautilus.optiputer.net
aws configure set s3.endpoint_url https://s3.nautilus.optiputer.net
aws s3 ls s3://qic003/ --profile prp # List dir to test s3 connection
# Sync repository from s3
aws s3 sync s3://qic003/CoralNet ./qic003
For 1 CPU (8 GB RAM) and 1 1080ti GPU, the table below lists the maximum batch size with an input size of 224x224 that fills up GPU memory, and the minimum number of dataloader workers that ensures >90% GPU utilization.
Network | CPU (count - RAM) | GPU (count - model) | Batch size | Workers | Training time |
---|---|---|---|---|---|
VGG16 | 3 - 64G | 2 - 1080ti | 144 | 36 | ~ 19 hrs/epoch |
ResNet50 | 3 - 64G | 2 - 1080ti | 196 | 108 | ~ 19 hrs/epoch |
ResNet101 | 3 - 64G | 2 - 1080ti | 156 | 108 | ~ 24 hrs/epoch |
EfficientNetb0 - 1280 | 3 - 64G | 2 - 1080ti | 156 | 128 | ~ 18 hrs/epoch |
EfficientNetb0 - 2048 | 3 - 64G | 2 - 1080ti | 156 | 128 | ~ 24 hrs/epoch |
EfficientNetb0 - 4096 | 3 - 64G | 2 - 1080ti | 156 | 128 | ~ 26 hrs/epoch |
cd ../CoralNet/
./scripts/train.sh --net resnet --net_version "resnet50" --nclasses 1279 --max_lr 0.1 --batch_size 64 --workers 24 --epoch 10 --logdir /path/to/log
cd ../CoralNet/
./scripts/extract.sh --net resnet --net_version "resnet50" --net_path /path/to/well/trained/model --source "sxxx" --logdir /path/to/save/features/dir
NOTE that `outdir` in `extract_features.py` has to be the same as `data_root` in `eval_local.py` when evaluating the new CoralNet.
cd ../CoralNet/
./scripts/eval.sh
-- label_set.json               # contains all the label ids and their corresponding label names
-- s102:
   -- meta.json                 # contains source information, including number of images, create date,
                                # source name, number of valid robots, longitude, latitude, affiliation,
                                # number of confirmed images and source description
   -- images:
      |-- i60774.jpg            # image 1, a high-resolution RGB image
      |-- i60774.meta.json      # contains image information, including height in cm, image name,
                                # aux1, aux2, aux3, photo date, etc.
      |-- i60774.anns.json      # contains the annotations of the image; each annotation is composed
                                # of label, col and row (coordinates in the image)
      |-- i60774.features.json  #
      |-- ...                   # image 2
      |-- ...
-- s109:
   -- meta.json
   -- images:
      |-- ...
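A minimal sketch of walking this layout in Python is shown below; the root path and the exact JSON keys (`label`, `row`, `col`) are assumptions inferred from the comments above, not a documented schema.

```python
import json
from pathlib import Path

# Assumed root; adjust to wherever the s3 export was synced.
data_root = Path("./qic003/CoralNet")

# label_set.json is assumed here to map label ids to label names.
label_set = json.loads((data_root / "label_set.json").read_text())

for source_dir in sorted(data_root.glob("s*")):
    meta = json.loads((source_dir / "meta.json").read_text())  # per-source metadata
    for ann_file in (source_dir / "images").glob("*.anns.json"):
        image_path = ann_file.with_name(ann_file.name.replace(".anns.json", ".jpg"))
        for ann in json.loads(ann_file.read_text()):
            # keys assumed from the description above: one point annotation per entry
            label, row, col = ann["label"], ann["row"], ann["col"]
```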
- Divide the data sources into "source" and "target" sets. Make sure that the "target" set does not include any sources Oscar gave me earlier this year. In particular, make sure source 16 (https://coralnet.ucsd.edu/source/16/) is in the "target" set. This is to enable a comparison on the PLC dataset.
- There are around 1 million images in CoralNet with more than 1,000 labels; we should talk to NOAA about identifying and merging them into the 1,000 (TBD) labels that are most common and used across the most sources, and discard the rare labels.
- Once the labels are set, use 224x224 windows to crop the images, centered at each annotation coordinate; do the cropping when loading the Dataset. Be careful with the `__getitem__` function (see the Dataset sketch after this list).
- Write the image paths and the "source"/"target" chunks to all_images.txt and is_train.txt for convenient data loading.
- Use the EfficientNet b4 configuration pre-trained on ImageNet as the pre-trained model and replace the last fully connected layer (see the sketch after this list).
- Use the "source" chunks to fine-tune the entire network.
- Use the "target" set to evaluate the fine-tuned EfficientNet. For each target source, split it into a training set and a testing set (use 1/8 for testing, 7/8 for training). Push all patches through the base-network, store the features, then train a logistic regression classifier (e.g. with scikit-learn). Evaluate the classifier on the test set (see the sketch after this list).
- [BONUS] Replace the last layer in the pre-trained net and fine-tune on each source.
- "Target" set performance versus the beta features. The key plot is a histogram of the per-source difference in performance compared to the beta features (see the plotting sketch after this list). This plot is directly relevant to the performance of CoralNet.
- A comparison on Pacific Labeled Corals. Results in the same format as the following would be great:
a) Intra-expert: self-consistency of experts.
b) Inter-expert: agreement level across experts.
c) Texton: Oscar's CVPR 2012 classifier (HOG+SVM).
d) ImageNet feat: features from ImageNet backbone. Logistic regression classifier trained on each source.
e) ImageNet BP: net pre-trained on ImageNet. Fine-tuned on each source in the "target" set.
f) CoralNet feat: net pre-trained on CoralNet data. Logistic regression classifier trained on each source.
g) CoralNet BP: net pre-trained on CoralNet data. Fine-tuned on each source in the "target" set.
h) [THIS PROJECT]: Res (Efficient) Net pre-trained on new CoralNet data export. Logistic regression classifier trained on each source.
i) [THIS PROJECT]: Res (Efficient) Net pre-trained on new CoralNet data export. Fine-tuned on each source in the "target" set. (Same as training step 4)
- Create base-net with smaller / larger receptive field.
- Try ResNet50 or deeper ResNet
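A minimal sketch of the patch-cropping Dataset mentioned above, assuming each item is an (image path, row, col, label index) tuple built from the *.anns.json files; padding handles annotations near the image border. This is an illustration, not the repo's actual Dataset class.

```python
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class CoralPatchDataset(Dataset):
    """One 224x224 crop per annotation, centered at the annotated pixel."""

    def __init__(self, items, crop_size=224):
        # items: list of (image_path, row, col, label_index) tuples (assumed bookkeeping)
        self.items = items
        self.crop_size = crop_size
        self.half = crop_size // 2
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        image_path, row, col, label = self.items[idx]
        image = Image.open(image_path).convert("RGB")
        # Pad the image so that crops centered near the border keep the full size.
        padded = Image.new("RGB", (image.width + self.crop_size,
                                   image.height + self.crop_size))
        padded.paste(image, (self.half, self.half))
        # After padding, the annotation center (col, row) shifts by `half`,
        # so the crop box below is exactly centered on the annotation.
        patch = padded.crop((col, row, col + self.crop_size, row + self.crop_size))
        return self.transform(patch), torch.tensor(label)
```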
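A hedged sketch of loading EfficientNet b4 pre-trained on ImageNet and replacing the final fully connected layer; torchvision is used here purely for illustration (the repo may rely on a different EfficientNet implementation), and the class count of 1279 is taken from the train.sh example above as a placeholder until the label merging is settled.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 1279  # placeholder from the train.sh example; TBD after label merging

# EfficientNet b4 pre-trained on ImageNet, with the last FC layer replaced.
model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.IMAGENET1K_V1)
in_features = model.classifier[1].in_features  # 1792 for the b4 configuration
model.classifier[1] = nn.Linear(in_features, NUM_CLASSES)

# Fine-tune the entire network on the "source" chunks (all parameters trainable).
for param in model.parameters():
    param.requires_grad = True
```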
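A sketch of the per-source "target" evaluation: features are assumed to have already been pushed through the frozen base-network and stored per source. The `load_features` helper and its .npz layout are hypothetical, not part of the repo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_features(source_id, feature_dir):
    # Hypothetical storage layout: one .npz per target source holding the
    # patch features from the frozen base-network and their labels.
    data = np.load(f"{feature_dir}/{source_id}.npz")
    return data["features"], data["labels"]


def evaluate_target_source(source_id, feature_dir):
    features, labels = load_features(source_id, feature_dir)
    # Hold out 1/8 of the patches for testing, train on the remaining 7/8.
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=1 / 8, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(x_train, y_train)
    return accuracy_score(y_test, clf.predict(x_test))
```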
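A minimal sketch of the key plot mentioned above: a histogram of the per-source accuracy difference between the new features and the beta features. Both dictionaries (source id to accuracy) are assumed to come from the evaluation step.

```python
import matplotlib.pyplot as plt


def plot_accuracy_delta(new_acc, beta_acc, out_path="target_vs_beta_hist.png"):
    # new_acc / beta_acc: assumed dicts mapping target source id -> accuracy
    deltas = [new_acc[s] - beta_acc[s] for s in sorted(beta_acc) if s in new_acc]
    plt.hist(deltas, bins=20)
    plt.axvline(0.0, color="k", linestyle="--")
    plt.xlabel("accuracy(new features) - accuracy(beta features), per source")
    plt.ylabel("number of target sources")
    plt.title('"Target" set: new features vs. beta features')
    plt.savefig(out_path, dpi=150)
```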
- Successfully ran the code with the EfficientNet and ResNet50 models on the Moorea Labelled Dataset.
- Confusion matrix of EfficientNet and ResNet50:
- CPU runtime:
  - ResNet50: a maximum of 11 seconds per batch and an average of 6 seconds per batch at batch size 32; 10m 49s overall to forward 100 batches (3,200 images).
  - EfficientNet: a maximum of 20 seconds per batch and an average of 8 seconds per batch at batch size 32; 13m 6s overall to forward 100 batches (3,200 images).
- Going on ...