This is a minimal implementation that simply contains these files:
- train.py, predict.py: main entry scripts
- modeling/generalized_rcnn.py: implement variants of generalized R-CNN architecture
- modeling/backbone.py: implement backbones
- modeling/model_{fpn,rpn,frcnn,mrcnn,cascade}.py: implement FPN, RPN, Fast/Mask/Cascade R-CNN models
- modeling/model_box.py: implement box-related symbolic functions
- dataset/dataset.py: the dataset interface
- dataset/coco.py: load COCO data to the dataset interface
- data.py: prepare data for training & inference
- common.py: common data preparation utilities
- utils/: third-party helper functions
- eval.py: evaluation utilities
- viz.py: visualization utilities
It's easy to train on your own data, by calling
`DatasetRegistry.register(name, lambda: YourDatasetSplit())`
and modifying `cfg.DATA.*` accordingly. Afterwards, "name" can be used in `cfg.DATA.TRAIN`.

`YourDatasetSplit` can be:

- `COCODetection`, if your data is already in COCO format. In this case, you need to modify `dataset/coco.py` to change the class names and the id mapping.
- Your own class, if your data is not in COCO format. You need to write a subclass of `DatasetSplit`, similar to `COCODetection`. In this class you'll implement the logic to load your dataset and evaluate predictions. The documentation is in the docstring of `DatasetSplit`. A minimal sketch of such a subclass is shown below.

See BALLOON.md for an example of fine-tuning on a different dataset.
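Below is a minimal sketch of such a subclass and its registration. It assumes the `DatasetSplit` interface (a `training_roidbs()` method returning per-image dicts) and the metadata registration used elsewhere in this example; the class name, paths, category names, and roidb keys here are illustrative, so check the docstrings in `dataset/dataset.py` for the authoritative contract.

```python
import numpy as np

from dataset import DatasetRegistry, DatasetSplit  # interface defined in dataset/dataset.py


class MyDatasetSplit(DatasetSplit):
    """One split (e.g. "train" or "val") of a hypothetical custom dataset."""

    def __init__(self, base_dir, split):
        self.base_dir = base_dir
        self.split = split

    def training_roidbs(self):
        # Return one dict per image. The keys below follow the roidb format
        # documented in the DatasetSplit docstring; values here are dummies.
        return [{
            "file_name": "/path/to/image1.jpg",
            "boxes": np.array([[10, 10, 100, 120]], dtype=np.float32),  # (N, 4) XYXY floats
            "class": np.array([1], dtype=np.int32),                     # 1-based category ids
            "is_crowd": np.array([0], dtype=np.int8),
        }]


for split in ["train", "val"]:
    name = "my_dataset_" + split
    # The default-argument lambda captures the current value of `split`.
    DatasetRegistry.register(name, lambda s=split: MyDatasetSplit("/path/to/data", s))
    # Class names (with "BG" at index 0) are registered as metadata:
    DatasetRegistry.register_metadata(name, "class_names", ["BG", "cat", "dog"])
```

After registration, the names (e.g. "my_dataset_train") can be listed in `cfg.DATA.TRAIN` as described above.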
- You can easily add more augmentations such as rotation, but be careful how a box should be augmented. The code currently always uses the minimal axis-aligned bounding box of the 4 corners, which is probably not the optimal way (see the sketch after this list). A TODO is to generate bounding boxes from segmentation, so more augmentations can be naturally supported.
- Floating-point boxes are defined like this: *(figure omitted)*
- We use ROIAlign, and `tf.image.crop_and_resize` is NOT ROIAlign (a sketch of the coordinate correction needed to approximate ROIAlign with it is given after this list).
- We currently only support a single image per GPU in this example.
- Because only a single image is used per GPU, BatchNorm statistics are supposed to be frozen during fine-tuning.
- An alternative to freezing BatchNorm is to sync BatchNorm statistics across GPUs (the `BACKBONE.NORM=SyncBN` option). Another alternative to BatchNorm is GroupNorm (`BACKBONE.NORM=GN`), which has better performance.
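For the augmentation note above, here is a sketch of the "minimal axis-aligned bounding box of the 4 corners" fallback, assuming float `(x1, y1, x2, y2)` boxes and a 2x3 affine matrix; the function name and signature are illustrative, not the repo's API.

```python
import numpy as np


def transform_box_by_corners(box, affine):
    """Map an (x1, y1, x2, y2) float box through a 2x3 affine matrix and return
    the minimal axis-aligned box containing the 4 transformed corners.
    For rotations this over-estimates the object's extent, which is why
    generating boxes from segmentation would be preferable."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=np.float32)
    ones = np.ones((4, 1), dtype=np.float32)
    new_corners = np.concatenate([corners, ones], axis=1) @ affine.T  # (4, 2)
    return np.array([new_corners[:, 0].min(), new_corners[:, 1].min(),
                     new_corners[:, 0].max(), new_corners[:, 1].max()])


# Example: a 30-degree rotation around the origin.
theta = np.deg2rad(30)
affine = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0]], dtype=np.float32)
print(transform_box_by_corners([10.0, 10.0, 50.0, 30.0], affine))
```

For the ROIAlign note: `tf.image.crop_and_resize` normalizes box coordinates by `(size - 1)` and aligns its first/last samples with the box corners, whereas ROIAlign samples at the centers of equally-sized bins of a float-coordinate box. The sketch below is an illustration of the coordinate correction that makes `crop_and_resize` sample where ROIAlign would (one sample per bin); it is not the repo's code, whose box-related symbolic functions live in `modeling/model_box.py`.

```python
import tensorflow as tf


def roialign_boxes(boxes, image_shape, crop_size):
    """Convert float (x1, y1, x2, y2) boxes, where integer coordinates lie on
    pixel boundaries, into the normalized (y1, x1, y2, x2) boxes expected by
    tf.image.crop_and_resize, so that it samples at the bin centers ROIAlign
    would use (one sample per output bin)."""
    x1, y1, x2, y2 = tf.split(boxes, 4, axis=1)
    img_h = tf.cast(image_shape[0], tf.float32)
    img_w = tf.cast(image_shape[1], tf.float32)

    bin_w = (x2 - x1) / crop_size          # width of one output bin, in pixels
    bin_h = (y2 - y1) / crop_size

    # Shift the corners to the first bin center, convert from boundary-based
    # float coordinates to the pixel-center coordinates used by
    # crop_and_resize (the -0.5), and normalize by (size - 1).
    nx1 = (x1 + bin_w / 2 - 0.5) / (img_w - 1)
    ny1 = (y1 + bin_h / 2 - 0.5) / (img_h - 1)
    nw = bin_w * (crop_size - 1) / (img_w - 1)
    nh = bin_h * (crop_size - 1) / (img_h - 1)
    return tf.concat([ny1, nx1, ny1 + nh, nx1 + nw], axis=1)


# Usage (NHWC images; box_ind maps each box to its image in the batch):
# crops = tf.image.crop_and_resize(
#     images, roialign_boxes(boxes, tf.shape(images)[1:3], 14), box_ind, [14, 14])
# ROIAlign typically averages several samples per bin; one common trick is to
# crop at twice the target resolution and then 2x2 average-pool.
```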
Training throughput (larger is better) of standard R50-FPN Mask R-CNN, on 8 V100s:
Implementation | Throughput (img/s) |
---|---|
Detectron2 | 62 |
mmdetection | 53 |
maskrcnn-benchmark | 53 |
tensorpack | 50 |
Detectron | 19 |
matterport/Mask_RCNN | 14 |
- This implementation does not use specialized CUDA ops (e.g. ROIAlign), and does not batch images. Therefore it might be slower than other highly-optimized implementations. For details of the benchmark, see detectron2 benchmarks.
- If CuDNN warmup is on, the training will start very slowly and take about 10k steps (or more if scale augmentation is used) to reach maximum speed. As a result, the ETA is also inaccurate at the beginning. CuDNN warmup is enabled by default when no scale augmentation is used.
- After warmup, the training speed will slowly decrease due to more accurate proposals.
- The code should have around 85%~90% GPU utilization on one V100. Scalability isn't very meaningful since the amount of computation each GPU performs is data-dependent. If all images have the same spatial size (in which case the per-GPU computation is still different), then an 85%~90% scaling efficiency is observed when using 8 V100s and `HorovodTrainer`.
- To reduce RAM usage on the host: (1) make sure you're using the "spawn" method as set in `train.py` (a sketch is given after this list); (2) reduce `buffer_size` or `NUM_WORKERS` in `data.py` (which may negatively impact your throughput). The training only needs <10G RAM if `NUM_WORKERS=0`.
- Inference is unoptimized. Tensorpack is a training interface: it produces the trained weights in a standard format, but it does not help you with optimized inference. In fact, the current implementation uses some slow numpy operations in inference (in `eval.py:_paste_mask`).
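Regarding the "spawn" note above, this is the standard way such a start method is selected (a sketch; the actual call in `train.py` may differ):

```python
import multiprocessing as mp

if __name__ == "__main__":
    # "spawn" starts fresh worker processes instead of fork()ing the large
    # parent process, avoiding the copy-on-write memory blow-up that forked
    # data-loading workers would otherwise cause on the host.
    mp.set_start_method("spawn")
```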
Possible Future Speed Enhancements:
- Support batch>1 per GPU. Batching with inconsistent shapes is non-trivial to implement in TensorFlow.
- Use dedicated CUDA ops (e.g. ROIAlign or `tf.image.generate_bounding_box_proposals`).
TensorFlow ≥ 1.6 supports most common features in this R-CNN implementation. However, each version of TensorFlow has bugs that I either reported or fixed, and this implementation touches many of those bugs. Therefore, not every version of TF ≥ 1.6 supports every feature in this implementation.
- TF < 1.6: Nothing works due to lack of support for empty tensors (PR) and `FrozenBN` training (PR).
- TF < 1.10: `SyncBN` with NCCL will fail (PR).
- TF 1.11 & 1.12: multithread inference will fail (issue). Latest tensorpack will apply a workaround.
- TF 1.13: MKL inference will fail (issue).
- TF > 1.12: Horovod training will fail (issue). Latest tensorpack will apply a workaround.
- TF > 1.14: NCCL produces wrong gradients (issue). Latest tensorpack will avoid using NCCL.
This implementation contains workarounds for some of these TF bugs.
However, note that the workarounds need to check your TF version via `tf.VERSION`,
and may not detect bugs properly if your TF version is not an official release
(e.g., if you use a nightly build). A minimal sketch of such a version check is shown below.
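As an illustration of that caveat, a version gate of this kind (a sketch, not tensorpack's actual check) only behaves as intended when `tf.VERSION` is a clean release string:

```python
import tensorflow as tf
from distutils.version import LooseVersion

# Official releases report e.g. "1.13.1", but nightly builds report strings
# like "1.14.1-dev20190505", whose actual bug set does not necessarily match
# what a release-number comparison assumes -- hence the caveat above.
if LooseVersion(tf.VERSION) < LooseVersion("1.10"):
    print("This TF version is known to fail with SyncBN over NCCL.")
```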