Zehan Wang1* · Ziang Zhang1* · Tianyu Pang2 · Chao Du2 · Hengshuang Zhao3 · Zhou Zhao1
1Zhejiang University 2SEA AI Lab 3HKU
*Equal Contribution
Orient Anything is a robust image-based object orientation estimation model. Trained on 2M rendered, labeled images, it achieves strong zero-shot generalization to in-the-wild images.
- 2024-12-24: Paper, Project Page, Code, Models, and Demo are released.
We provide three models of varying scales for robust object orientation estimation in images:
| Model | Params | Checkpoint |
|---|---|---|
| Orient-Anything-Small | 23.3 M | Download |
| Orient-Anything-Base | 87.8 M | Download |
| Orient-Anything-Large | 305 M | Download |
```bash
pip install -r requirements.txt
```
Start the Gradio demo by running:
```bash
python app.py
```
then open the GUI page (default: http://127.0.0.1:7860) in a web browser,
or try it in our Hugging Face Space.
```python
from paths import *
from vision_tower import DINOv2_MLP
from transformers import AutoImageProcessor
import torch
from PIL import Image
import torch.nn.functional as F
from utils import *
from inference import *
from huggingface_hub import hf_hub_download

# Download the Orient-Anything-Large checkpoint from the Hugging Face Hub
ckpt_path = hf_hub_download(repo_id="Viglong/Orient-Anything", filename="croplargeEX2/dino_weight.pt", repo_type="model", cache_dir='./', resume_download=True)
print(ckpt_path)

save_path = './'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# DINOv2 backbone + MLP head; the output covers 360 azimuth bins,
# 180 polar bins, 180 rotation bins, and 2 confidence logits
dino = DINOv2_MLP(
    dino_mode   = 'large',
    in_dim      = 1024,
    out_dim     = 360 + 180 + 180 + 2,
    evaluate    = True,
    mask_dino   = False,
    frozen_back = False
)
dino.eval()
print('model created')
dino.load_state_dict(torch.load(ckpt_path, map_location='cpu'))
dino = dino.to(device)
print('weights loaded')

# Preprocessor matching the DINOv2-Large backbone (DINO_LARGE is defined in paths.py)
val_preprocess = AutoImageProcessor.from_pretrained(DINO_LARGE, cache_dir='./')

image_path = '/path/to/image'
origin_image = Image.open(image_path).convert('RGB')

# Predict the three orientation angles and a confidence score
angles     = get_3angle(origin_image, dino, val_preprocess, device)
azimuth    = float(angles[0])
polar      = float(angles[1])
rotation   = float(angles[2])
confidence = float(angles[3])
```
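The three angles follow the output head above: azimuth spans 360°, while polar and rotation each span 180°. As a purely illustrative helper (not part of this repository), the sketch below converts azimuth and polar into a 3D unit vector for the object's facing direction, assuming angles in degrees and a right-handed, z-up coordinate frame:

```python
import math

def facing_direction(azimuth_deg: float, polar_deg: float):
    """Convert predicted azimuth/polar angles (degrees) into a unit
    direction vector, assuming a right-handed, z-up frame. Hypothetical
    helper for illustration; not part of the Orient Anything codebase."""
    az = math.radians(azimuth_deg)
    po = math.radians(polar_deg)
    return (
        math.cos(po) * math.cos(az),  # x
        math.cos(po) * math.sin(az),  # y
        math.sin(po),                 # z
    )
```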
To avoid ambiguity, our model only supports input images that contain a single object. For everyday images that usually contain multiple objects, a good approach is to isolate each object with Grounding DINO and predict its orientation separately.
[ToDo]
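A minimal sketch of such a multi-object pipeline, where the `boxes` argument stands in for the output of any detector (e.g., Grounding DINO); this helper is a hypothetical illustration that reuses `get_3angle()` from the inference snippet above, not repository code:

```python
def predict_orientations(image, boxes, model, preprocess, device):
    """Predict the orientation of each detected object separately.

    `image` is a PIL image; `boxes` is assumed to be a list of
    (x0, y0, x1, y1) pixel tuples from any object detector, e.g.
    Grounding DINO. Hypothetical helper for illustration only.
    """
    results = []
    for box in boxes:
        crop = image.crop(box)  # isolate a single object to avoid ambiguity
        angles = get_3angle(crop, model, preprocess, device)
        results.append({
            'box': box,
            'azimuth': float(angles[0]),
            'polar': float(angles[1]),
            'rotation': float(angles[2]),
            'confidence': float(angles[3]),
        })
    return results
```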
To further enhance the robustness of the model, we propose a test-time ensemble strategy: the input image is randomly cropped into several variants, and the orientations predicted for these variants are voted to produce the final result. This strategy is implemented in the functions `get_3angle_infer_aug()` and `get_crop_images()`, as shown in the sketch below.
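For example, assuming `get_3angle_infer_aug()` shares the signature of `get_3angle()` used in the inference snippet above (an assumption, not a documented API):

```python
# Drop-in replacement for get_3angle(): random crops are produced by
# get_crop_images() internally and the per-crop predictions are voted.
angles = get_3angle_infer_aug(origin_image, dino, val_preprocess, device)
azimuth, polar, rotation, confidence = (float(a) for a in angles)
```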
If you find this project useful, please consider citing:
```bibtex
@article{orient_anything,
  title={Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models},
  author={Wang, Zehan and Zhang, Ziang and Pang, Tianyu and Du, Chao and Zhao, Hengshuang and Zhao, Zhou},
  journal={arXiv:2412.18605},
  year={2024}
}
```
Thanks to the following open-source projects: Grounded-Segment-Anything, render-py.