iFormer: Integrating ConvNet and Transformer for Mobile Application [ICLR 2025]

Official PyTorch implementation of iFormer, published at ICLR 2025.


Abstract

We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios. The source code and trained models will be available soon.
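
For intuition only, the sketch below shows a generic modulation-style block in the spirit the abstract describes: a value branch gated elementwise by a convolutional context branch, so no softmax attention map is ever materialized. It illustrates the general technique, not iFormer's actual mobile modulation attention module.

import torch
import torch.nn as nn

class ModulationBlock(nn.Module):
    # Generic modulation: gate a 1x1-conv value branch with a depthwise-conv
    # context branch, then project. No attention matrix is materialized.
    # This is an illustrative stand-in, not the paper's exact design.
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.v = nn.Conv2d(dim, dim, 1)
        self.ctx = nn.Conv2d(dim, dim, kernel_size,
                             padding=kernel_size // 2, groups=dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.ctx(x) * self.v(x))

x = torch.randn(1, 64, 56, 56)
print(ModulationBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])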


iFormer is Pareto-optimal compared to existing methods on ImageNet-1k. The latency is measured on an iPhone 13.


What's New 💥💥💥

  • [2025/1/23] Our paper has been accepted to ICLR 2025.
  • [2024/12/1] All models have been released.

Main Results on ImageNet with Pretrained Models

Model       Params (M)   GMACs   Latency (ms)   Top-1 (%)     Ckpt.                  Core ML                Log
iFormer-T   2.9          0.53    0.60           74.1          300e                   300e                   300e
iFormer-S   6.5          1.09    0.85           78.8          300e                   300e                   300e
iFormer-M   8.9          1.64    1.10           80.4 / 81.1   300e / 300e distill    300e / 300e distill    300e / 300e distill
iFormer-L   14.7         2.63    1.60           81.9 / 82.7   300e / 300e distill    300e / 300e distill    300e / 300e distill
iFormer-L2  24.5         4.50    2.30           83.9          300e distill           300e distill           300e distill
iFormer-H   99.0         15.5    -              84.8          300e                   300e                   300e
  • iFormer-L2 is trained with distillation for 450 epochs.

Getting Started

Requirements

git clone git@github.com:ChuanyangZheng/iFormer.git
cd iFormer
pip install -r requirements.txt

Data Preparation

We use the standard ImageNet dataset; you can download it from http://image-net.org/.

  • The file structure should look like this (a quick sanity check follows after the tree):
    imagenet
    ├── train
    │   ├── class1
    │   │   ├── img1.jpeg
    │   │   └── ...
    │   ├── class2
    │   │   ├── img3.jpeg
    │   │   └── ...
    │   └── ...
    └── val
        ├── class1
        │   ├── img4.jpeg
        │   └── ...
        ├── class2
        │   ├── img6.jpeg
        │   └── ...
        └── ...
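
A quick way to sanity-check this layout is torchvision's ImageFolder (a sketch; it assumes torchvision is installed and is not the repo's own data loader):

# Sanity-check the ImageNet directory layout with torchvision's ImageFolder.
# ImageNet-1k should report 1000 classes in both splits.
from torchvision import datasets

train_set = datasets.ImageFolder("imagenet/train")
val_set = datasets.ImageFolder("imagenet/val")
print(f"train: {len(train_set)} images, {len(train_set.classes)} classes")
print(f"val:   {len(val_set)} images, {len(val_set.classes)} classes")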
    

Training

python -m torch.distributed.launch --nproc_per_node=8  \
    main.py \
    --cfg-path configs/iFormer_m.yaml
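
Note that torch.distributed.launch is deprecated in recent PyTorch releases. Assuming main.py reads the process rank from the environment (as ConvNeXt/timm-style launchers typically do), the equivalent torchrun invocation would be:

torchrun --nproc_per_node=8 \
    main.py \
    --cfg-path configs/iFormer_m.yaml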

Evaluation

python -m torch.distributed.launch --nproc_per_node=1  \
    main.py  \
    --model iFormer_m \
    --input_size 224 \
    --num_workers 16 \
    --layer_scale_init_value 0 \
    --finetune $ckpt_path \
    --eval true \
    --data_path $data_path

This should give

* Acc@1 80.420 Acc@5 95.336 loss 1.010

Distillation

python -m torch.distributed.launch --nproc_per_node=1  \
    main.py  \
    --model iFormer_m \
    --input_size 224 \
    --num_workers 16 \
    --layer_scale_init_value 0 \
    --distillation_type hard \
    --finetune $ckpt_path \
    --eval true \
    --data_path $data_path

This should give

* Acc@1 81.068 Acc@5 95.466 loss 0.746

Layer scale

python -m torch.distributed.launch --nproc_per_node=1  \
    main.py  \
    --model iFormer_h \
    --input_size 224 \
    --num_workers 16 \
    --layer_scale_init_value 1e-6 \
    --finetune $ckpt_path \
    --eval true \
    --data_path $data_path

This should give

* Acc@1 84.820 Acc@5 97.058 loss 0.843
  • Evaluation flags such as --layer_scale_init_value must match the configuration used during training.
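
For context, layer scale (introduced in CaiT and used by ConvNeXt) multiplies each residual branch by a learnable per-channel scale initialized to a small constant; --layer_scale_init_value sets that constant, and in ConvNeXt-style code a value of 0 typically disables the scale. A minimal sketch:

import torch
import torch.nn as nn

class LayerScaleResidual(nn.Module):
    # Residual wrapper with layer scale: y = x + gamma * f(x),
    # where gamma is per-channel and initialized to a small constant.
    def __init__(self, dim: int, fn: nn.Module, init_value: float = 1e-6):
        super().__init__()
        self.fn = fn
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):  # x: (B, N, dim)
        return x + self.gamma * self.fn(x)

block = LayerScaleResidual(64, nn.Linear(64, 64))
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])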

Get the FLOPs and parameters

python flops.py
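
To count MACs and parameters for a different model or input size, a minimal sketch with fvcore follows (an assumption: flops.py may use a different counter; the torchvision resnet18 is a stand-in for an iFormer model):

import torch
from fvcore.nn import FlopCountAnalysis, parameter_count
from torchvision.models import resnet18  # stand-in; swap in an iFormer model

model = resnet18().eval()
x = torch.randn(1, 3, 224, 224)
flops = FlopCountAnalysis(model, x)
print(f"GMACs: {flops.total() / 1e9:.2f}")  # fvcore counts fused multiply-adds
print(f"Params (M): {parameter_count(model)[''] / 1e6:.2f}")  # '' = total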

Latency Measurement

Compile your model with Core ML Tools (coremltools):

python export_coreml.py --model=iFormer_m --resolution=224

For downstream tasks, we compile the backbone at a higher resolution of 512 x 512.

python export_coreml.py --model=iFormer_m --resolution=512
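
For reference, the usual PyTorch-to-Core ML path traces the model to TorchScript and converts it with coremltools, roughly as below (a sketch; export_coreml.py may differ in details, and mobilenet_v3_small is a stand-in model):

import torch
import coremltools as ct
from torchvision.models import mobilenet_v3_small  # stand-in model

model = mobilenet_v3_small().eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)  # conversion requires TorchScript

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=example.shape)],
    convert_to="mlprogram",  # ML Program backend (.mlpackage)
)
mlmodel.save("model.mlpackage")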

Benchmark the compiled model using Xcode (version 15.4) on an iPhone 13 (iOS 17.7) to reproduce the latency numbers reported above.

  • You may benchmark with a different software version, but we have noticed that different iOS versions can affect latency measurements.

Downstream Tasks

Object Detection on COCO
Semantic Segmentation on ADE20K

Acknowledgement

The image classification code is built in part on ConvNeXt, timm, and RepViT.

Object detection and instance segmentation models are trained with the MMDetection toolkit.

Semantic segmentation models are trained with the MMSegmentation toolkit.

We sincerely appreciate their elegant implementations!

Citation

If you find this repository helpful, please consider citing:

@article{zheng2025iformer,
  title={iFormer: Integrating ConvNet and Transformer for Mobile Application},
  author={Zheng, Chuanyang},
  journal={arXiv preprint arXiv:2501.15369},
  year={2025}
}
