Columbia Data Science Institute Capstone Project, Fall 2020
Mentor: Dr. Adler Perotte
Instructor: Dr. Adam S. Kelleher
Team members:
Yihao Li, Chao Huang, Yufeng Ma, Xiaoyun Zhu, Shuo Yang
This project aims to create a machine learning-driven user interface for the annotation of very large pathology images. Each image may measure tens of thousands of pixels on each side. As a result, annotating an entire slide for object recognition or semantic/instance segmentation can be time-consuming when the entities of interest are only a few pixels in diameter. The goal is to build a framework that makes the most of expert annotator (clinician) time by interleaving annotation (label generation) with inference, giving the annotator an intuitive sense of model fit and of the minimal amount of labeling required for acceptable model performance.
The final report for this project is available here: Final Report
A video presentation with slides can be found on YouTube at https://youtu.be/XTHRxxOoG-k.
- Required packages are listed in the requirements file. It's recommended to install them with pip inside a virtual environment.
- Note that although detectron2 is used in this repository, it's NOT explicitly listed in the requirements because its installation depends on the specific versions of PyTorch and CUDA. It's therefore better to build it from source by following the official guide.
- Collage Generator: the module for generating synthetic whole slide images (a.k.a. collages) from vignettes. The generation algorithm is fully described and explained in the sub-directory called illustration.
- Vignettes Data: contains vignettes used for generating synthetic whole slide images.
- COCO-Format Converter: the module for generating instance segmentation datasets from collages in a COCO-compatible format.
- Core ML Components: the module storing essential functions and tools for training and serving UNet models for segmentation.
  - preprocessing: contains functions for the preprocessing pipeline, namely cropping images into patches, saving patches as HDF5 files, and loading data as PyTorch Datasets with augmentations.
  - modeling: contains the UNet model architecture, wrapped as a PyTorch Lightning model, along with essential postprocessing functions.
  - utils: contains essential utility functions for manipulating slides and annotations.
  - api: high-level APIs exposed for the model serving component.
  - config: a configuration file specifying target classes and parameters for the segmentation task.
- Scripts: contains scripts for tuning (using Optuna) and testing models. They can also be used as a reference for calling low-level functions.
- Demo Notebooks: contains several demo notebooks showing the usage of core components.
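
For readers unfamiliar with the COCO-compatible format mentioned for the COCO-Format Converter, the following is a minimal sketch of what one instance segmentation record looks like. It is not the converter shipped in this repository: the function names `polygon_to_coco_annotation` and `build_coco_dataset`, the file name `collage_0001.png`, and the `cell` category are hypothetical and chosen only for illustration. Only the field names (`images`, `annotations`, `categories`, `bbox`, `segmentation`, `area`, `iscrowd`) come from the COCO annotation format itself.

```python
import json


def polygon_to_coco_annotation(ann_id, image_id, category_id, polygon):
    """Build one COCO-style annotation from a polygon (hypothetical helper).

    polygon: list of (x, y) vertices in pixel coordinates.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x_min, y_min = min(xs), min(ys)
    width, height = max(xs) - x_min, max(ys) - y_min

    # Polygon area via the shoelace formula.
    twice_area = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1

    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        # COCO stores each polygon as a flat list [x1, y1, x2, y2, ...].
        "segmentation": [[float(c) for point in polygon for c in point]],
        # COCO bounding boxes are [x, y, width, height] from the top-left corner.
        "bbox": [float(x_min), float(y_min), float(width), float(height)],
        "area": abs(twice_area) / 2.0,
        "iscrowd": 0,
    }


def build_coco_dataset(images, categories, annotations):
    """Assemble the three top-level lists of a COCO-style dataset."""
    return {"images": images, "categories": categories, "annotations": annotations}


# Illustrative values only: one collage containing one rectangular instance.
example_polygon = [(10, 20), (50, 20), (50, 40), (10, 40)]
dataset = build_coco_dataset(
    images=[{"id": 1, "file_name": "collage_0001.png", "width": 1024, "height": 1024}],
    categories=[{"id": 1, "name": "cell"}],
    annotations=[polygon_to_coco_annotation(1, 1, 1, example_polygon)],
)
coco_json = json.dumps(dataset, indent=2)
```

The resulting `coco_json` string can be written to a `.json` file and loaded by tools that read COCO-style annotations. Storing the vertices as a flat list and the box as `[x, y, width, height]` follows the COCO convention rather than a corner-pair box.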