Add support for Apple's Depth-Pro #34583
base: main
Conversation
I have implemented the foundational components of the model and manually loaded the weights to verify that the architecture matches the original design and produces consistent output. Below is a concise sketch of the class hierarchy; I would greatly appreciate your feedback or any suggestions for improvements.
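As a hedged sketch, using the class names exported by the merged transformers implementation (the PR-time draft may have differed), the top level looks like:

```python
# Hedged sketch: class names as exported by the merged transformers
# implementation; internal module names may differ from this PR's draft.
from transformers import (
    DepthProConfig,              # hyperparameters, incl. multi-scale settings
    DepthProImageProcessor,      # resizing / rescaling / normalization
    DepthProModel,               # multi-scale DINOv2-based encoder backbone
    DepthProForDepthEstimation,  # DPT-style fusion decoder + depth/FOV heads
)
```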
I have a couple of questions:
cc @pcuenca as well!
Hi @geetu040! Thanks for working on this model! Regarding model outputs: they should be rewritten if you want to add a new argument or write better docs. In case of intermediate outputs, you can store them in `hidden_states`.
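For illustration, a minimal toy sketch of that convention — collecting intermediate features into the standard `hidden_states` tuple instead of inventing new output fields (this is not the actual DepthPro code):

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Toy encoder showing the hidden_states collection pattern."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(3))

    def forward(self, x, output_hidden_states=False):
        all_hidden_states = () if output_hidden_states else None
        for layer in self.layers:
            x = layer(x)
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (x,)
        return x, all_hidden_states

encoder = TinyEncoder()
out, hidden = encoder(torch.randn(2, 8), output_hidden_states=True)
print(len(hidden))  # 3 intermediate feature tensors
```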
@qubvel @pcuenca Thanks, I have updated the code for `hidden_states`. I still need an opinion on the field-of-view output. The existing output class is:

```python
class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
```

Q1: Do I create a new class that extends `DepthEstimatorOutput` with a `fov` field?
Thanks @geetu040!

Q1:

```python
class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None
```

This output can be returned in both cases.

Q2: Yeah, this can be a parameter of the config, but it should also be an argument when initializing the model.

Please, let me know if you have more questions!
This needs to be done during
OK, got it! Then it should be done with the config, and anyone can load a model as follows:

```python
model = DepthProForDepthEstimation(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation(checkpoint, fov_model=False)
```

With such initialization
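For what it's worth, the config-driven variant might look like the sketch below; `use_fov_model` is the flag name in the merged implementation and may not match the `fov_model` argument sketched above:

```python
from transformers import DepthProConfig, DepthProForDepthEstimation

# Assumed flag name (use_fov_model); toggles construction of the FOV head.
config = DepthProConfig(use_fov_model=True)
model = DepthProForDepthEstimation(config)
```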
I was wondering: can we also give users the option to decide which scales to use? For example, a user could specify custom scales in the config.

@qubvel I have looked into how this can be implemented; it is doable, and I can easily make this option available, which I would prefer. But I have to ask you as well: do you think this option should be given to users?
Hi @geetu040, we try to avoid overcomplicating the code with lots of parameters; the general rule is to get rid of different code paths and unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.
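For concreteness, a hedged sketch of what that user-facing option could look like (the `scaled_images_ratios` field name is assumed from the merged config):

```python
from transformers import DepthProConfig, DepthProForDepthEstimation

# Assumed config field: ratios at which the input image is downsampled
# before patching (defaults in the merged model are [0.25, 0.5, 1]).
config = DepthProConfig(scaled_images_ratios=[0.25, 0.5, 1.0])
model = DepthProForDepthEstimation(config)
```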
Hi @qubvel, I have a question about the image processor. Following the source code here causes the two outputs to be slightly different from each other. Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not very sure how to do the latter, because of how the original pipeline pre-processes the image.

Here are the outputs.

Difference in outputs: there is a slight difference, which happens because of how the image is pre-processed before being given to the model.

- Source code results: (image)
- HF code results: (image)
- Difference in output image: visually, there is no difference between the two images.
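As an aside, one hedged way to quantify such a "slight difference" (the tensor names here are hypothetical stand-ins for the two depth maps):

```python
import torch

# Hypothetical stand-ins for the two depth maps being compared.
original_depth = torch.rand(1, 768, 768)
hf_depth = original_depth + 1e-5 * torch.randn_like(original_depth)

diff = (original_depth - hf_depth).abs()
print(f"max abs diff:  {diff.max().item():.2e}")
print(f"mean abs diff: {diff.mean().item():.2e}")
print("allclose at atol=1e-4:", torch.allclose(original_depth, hf_depth, atol=1e-4))
```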
Also, how does the weight conversion work? I have created the script for weight conversion, but when and by whom do the converted weights get uploaded to Hugging Face? I would need these converted weights for the examples in the docstrings.
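Typically the contributor runs the conversion script locally and uploads the result to their own account first; a hedged sketch of that upload step (the local path is hypothetical, and the repo id is the one used later in this thread):

```python
from transformers import DepthProForDepthEstimation, DepthProImageProcessor

# "./converted-depth-pro" is a hypothetical local output dir of the
# conversion script; the repo id is the one used later in this thread.
model = DepthProForDepthEstimation.from_pretrained("./converted-depth-pro")
processor = DepthProImageProcessor.from_pretrained("./converted-depth-pro")
model.push_to_hub("geetu040/depth-pro-hf")
processor.push_to_hub("geetu040/depth-pro-hf")
```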
Thanks for applying the changes! 🤗 We're all set to proceed with the model. Before merging, there's just one thing left: we usually transfer the checkpoint to the organization that released the original model, in this case Apple, but only if you're okay with that. If you're fine with it, we can go ahead with the transfer. However, before that we'll need to:

Let us know how you'd like to proceed! 😊
I am okay with that.
I have updated the hub repository name: https://huggingface.co/geetu040/depth-pro-hf |
And thank you, @qubvel, for all the detailed reviews.
Thanks, can you please also update `repo_id` to "apple" everywhere? Like "apple/depth-pro-hf".
updated
@qubvel,

Yes, let's add
I've updated the test and it's working fine now.
Thanks! Working on checkpoint transfer.
run-slow: depth_pro |
This comment contains run-slow, running the specified jobs: models: ['models/depth_pro']
@qubvel Shouldn't the checkpoint be

@geetu040 I followed the same pattern for the model family. Let me check if there's anything we can do.

Otherwise I can change the checkpoint in the code, and it would also need to be changed in the model card.
https://huggingface.co/apple/depth-pro-hf now redirects to

But good point about the model card; I'll change it so it's less confusing. Edit: updated.
sure, updated!
run-slow: depth_pro |
This comment contains run-slow, running the specified jobs: models: ['models/depth_pro']
@qubvel, the tests did not run in the last two attempts. Error log:

Looks like something is wrong with the workflow itself.
@geetu040 yes, something is wrong with CI; waiting for the team to fix it.
What does this PR do?
Fixes #34020
This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation, built on a multi-scale vision transformer optimized for dense prediction. The input image is downsampled at several scales; at each scale it is split into patches, which are processed by a ViT-based (DINOv2) patch encoder with weights shared across scales. The patch features are merged into feature maps, upsampled, and fused via a DPT decoder.
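A hedged usage sketch (checkpoint id as discussed in this thread; `example.jpg` is a placeholder image):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

processor = AutoImageProcessor.from_pretrained("apple/depth-pro-hf")
model = AutoModelForDepthEstimation.from_pretrained("apple/depth-pro-hf")

image = Image.open("example.jpg")  # placeholder input
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

depth = outputs.predicted_depth  # metric depth map
# Per the discussion above, the DepthPro-specific output class also carries
# a field-of-view estimate alongside the standard depth fields.
```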
Relevant Links
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@amyeroberts, @qubvel