Official PyTorch implementation of iFormer, published at ICLR 2025.
## Abstract
We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, *i.e.*, ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios. The source code and trained models will be available soon.
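To give a rough intuition for the "efficient modulation" idea mentioned above, here is a minimal, illustrative PyTorch sketch: instead of forming an N × N softmax attention map, a pooled global context modulates the value features elementwise. This is **not** the paper's exact mobile modulation attention; all class and variable names below are made up for illustration.

```python
import torch
import torch.nn as nn

class ModulationAttentionSketch(nn.Module):
    """Illustrative only: global context modulates values elementwise,
    avoiding the N x N attention map of standard MHA."""

    def __init__(self, dim):
        super().__init__()
        self.ctx = nn.Linear(dim, dim)   # projects tokens before global pooling
        self.v = nn.Linear(dim, dim)     # value projection
        self.proj = nn.Linear(dim, dim)  # output projection

    def forward(self, x):                # x: (batch, tokens, dim)
        context = self.ctx(x).mean(dim=1, keepdim=True)  # global summary, (B, 1, C)
        v = self.v(x)                                    # (B, N, C)
        return self.proj(v * torch.sigmoid(context))     # elementwise modulation
```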
iFormer is Pareto-optimal compared to existing methods on ImageNet-1k. The latency is measured on an iPhone 13.
- [2025/1/23] Our paper has been accepted to ICLR 2025.
- [2024/12/1] All models have been released.
Model | Params(M) | GMACs | Latency(ms) | Top-1(%) | Ckpt. | Core ML | Log |
---|---|---|---|---|---|---|---|
iFormer-T | 2.9 | 0.53 | 0.60 | 74.1 | 300e | 300e | 300e |
iFormer-S | 6.5 | 1.09 | 0.85 | 78.8 | 300e | 300e | 300e |
iFormer-M | 8.9 | 1.64 | 1.10 | 80.4 / 81.1 | 300e / 300e distill | 300e / 300e distill | 300e / 300e distill |
iFormer-L | 14.7 | 2.63 | 1.60 | 81.9 / 82.7 | 300e / 300e distill | 300e / 300e distill | 300e / 300e distill |
iFormer-L2 | 24.5 | 4.50 | 2.30 | 83.9 | 300e distill | 300e distill | 300e distill |
iFormer-H | 99.0 | 15.5 | - | 84.8 | 300e | 300e | 300e |
- iFormer-L2 is trained with distillation for 450 epochs.
```bash
git clone [email protected]:ChuanyangZheng/iFormer.git
cd iFormer
pip install -r requirements.txt
```
We use the standard ImageNet dataset; you can download it from http://image-net.org/.
- The file structure should look like this:

```
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
```
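This is the standard `ImageFolder` layout, so you can sanity-check your local copy with torchvision (assumed installed; adjust the path) before training:

```python
from torchvision.datasets import ImageFolder

# Indexing succeeds only if the directory layout above is correct.
train_set = ImageFolder("imagenet/train")
val_set = ImageFolder("imagenet/val")
print(f"{len(train_set.classes)} classes, {len(train_set)} train / {len(val_set)} val images")
```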
Train iFormer-M on ImageNet-1k with 8 GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
main.py \
--cfg-path configs/iFormer_m.yaml
```
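On recent PyTorch releases, `torch.distributed.launch` is deprecated in favor of `torchrun`. An equivalent invocation would look like the following, though we have not verified it against this repo's argument parsing:

```bash
# Assumption: main.py handles the LOCAL_RANK environment variable set by torchrun.
torchrun --nproc_per_node=8 main.py --cfg-path configs/iFormer_m.yaml
```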
Evaluate iFormer-M on ImageNet-1k:

```bash
python -m torch.distributed.launch --nproc_per_node=1 \
main.py \
--model iFormer_m \
--input_size 224 \
--num_workers 16 \
--layer_scale_init_value 0 \
--finetune $ckpt_path \
--eval true \
--data_path $data_path
```
This should give:

```
* Acc@1 80.420 Acc@5 95.336 loss 1.010
```
### Distillation
Evaluate the distilled iFormer-M:

```bash
python -m torch.distributed.launch --nproc_per_node=1 \
main.py \
--model iFormer_m \
--input_size 224 \
--num_workers 16 \
--layer_scale_init_value 0 \
--distillation_type hard \
--finetune $ckpt_path \
--eval true \
--data_path $data_path
```
This should give:

```
* Acc@1 81.068 Acc@5 95.466 loss 0.746
```
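The `--distillation_type hard` flag suggests DeiT-style hard-label distillation, where the teacher's argmax predictions supervise the student through a second cross-entropy term. A minimal sketch of that loss (the repo's exact implementation may differ):

```python
import torch.nn.functional as F

def hard_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    # The teacher's argmax predictions act as hard pseudo-labels.
    teacher_targets = teacher_logits.argmax(dim=1)
    return (1 - alpha) * F.cross_entropy(student_logits, labels) \
        + alpha * F.cross_entropy(student_logits, teacher_targets)
```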
### Layer scale
Evaluate iFormer-H, which is trained with layer scale:

```bash
python -m torch.distributed.launch --nproc_per_node=1 \
main.py \
--model iFormer_h \
--input_size 224 \
--num_workers 16 \
--layer_scale_init_value 1e-6 \
--finetune $ckpt_path \
--eval true \
--data_path $data_path
```
This should give:

```
* Acc@1 84.820 Acc@5 97.058 loss 0.843
```
- These evaluation options must match the ones used during training.
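For reference, layer scale (popularized by CaiT) is simply a learnable per-channel scale applied to a residual branch, initialized to a small constant; `--layer_scale_init_value 1e-6` above matches that initialization. A minimal PyTorch sketch, not necessarily this repo's exact implementation:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scaling of a residual branch."""

    def __init__(self, dim, init_value=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x  # scales each channel of the branch output
```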
## Get the FLOPs and parameters
```bash
python flops.py
```
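`flops.py` reports both numbers. If you only need a quick parameter count for a model you have already instantiated, a plain PyTorch fallback looks like this (sketch; `model` is any `nn.Module`):

```python
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```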
## Compile your model with Core ML Tools (coremltools)
```bash
python export_coreml.py --model=iFormer_m --resolution=224
```
For downstream tasks, we compile the backbone at a larger resolution of 512 × 512.
```bash
python export_coreml.py --model=iFormer_m --resolution=512
```
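For reference, a typical coremltools export of a PyTorch model follows the pattern below. This is only a sketch of what such a script usually does, not the contents of `export_coreml.py`; the stand-in model is a placeholder for the actual iFormer backbone.

```python
import torch
import torch.nn as nn
import coremltools as ct

# Stand-in model; substitute the iFormer backbone built from this repo.
model = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()).eval()

example = torch.rand(1, 3, 224, 224)               # match --resolution
traced = torch.jit.trace(model, example)           # TorchScript graph
mlmodel = ct.convert(traced, inputs=[ct.TensorType(name="input", shape=example.shape)])
mlmodel.save("model.mlpackage")                    # open in Xcode to benchmark
```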
Benchmark the compiled model with Xcode (version 15.4) on an iPhone 13 (iOS 17.7) to obtain the latency numbers reported above.
- You can use a different software version, but we have observed that different iOS versions may affect latency measurements.
## Object Detection on COCO
## Semantic Segmentation on ADE20K
The image classification code is partly built on ConvNeXt, timm, and RepViT.
Object detection and instance segmentation models are trained with the MMDetection toolkit.
Semantic segmentation models are trained with the MMSegmentation toolkit.
We sincerely appreciate their elegant implementations!
If you find this repository helpful, please consider citing:
```bibtex
@article{zheng2025iformer,
  title={iFormer: Integrating ConvNet and Transformer for Mobile Application},
  author={Zheng, Chuanyang},
  journal={arXiv preprint arXiv:2501.15369},
  year={2025}
}
```