Does the input tensor need to be resized to (224, 224)? #237
Hey @kitaharasetusna 👋

This project was meant to bring as much flexibility as possible, so it does not introduce a bias towards any image size. However, the models you're using may do so: if you use a resnet18 trained on (224, 224) inputs, passing (32, 32) inputs is unlikely to produce good results.

To give you a better understanding of the inner workings of a resnet18, using TorchScan:

```python
from torchscan import summary
from torchvision.models import resnet18

model = resnet18().eval()
# For a (224, 224) input, layer4's feature maps come out as (512, 7, 7).
summary(model, (3, 224, 224), max_depth=2)
```
Depending on the layer you want to extract the CAM from, usually the last conv layer (here in `layer4`), you need an unflattened tensor. As the summary shows, before the pooling layer, the smallest feature map still has a (7, 7) spatial size, which can easily be upscaled for visualization (see the sketch just below).
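For illustration, here is a minimal sketch of that upscaling step using bilinear interpolation; the `cam` tensor is a hypothetical stand-in for a (7, 7) activation map extracted from `layer4`:

```python
import torch
import torch.nn.functional as F

# Hypothetical CAM from layer4 for a (224, 224) input, with batch and
# channel dimensions added so that interpolate() accepts it.
cam = torch.rand(1, 1, 7, 7)

# Upscale to the input resolution for visualization.
upscaled = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
print(upscaled.shape)  # torch.Size([1, 1, 224, 224])
```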
Now compare with a (32, 32) input:

```python
# For a (32, 32) input, layer4's feature maps collapse to (512, 1, 1).
summary(model, (3, 32, 32), max_depth=2)
```

Here, because the input passes through size-reducing layers, the `layer4` feature map ends up flattened to (1, 1). That can't be upscaled in any useful way, since it would only produce a uniform map. However, `layer3` still has a (2, 2) spatial size.

So, in short: you don't strictly have to resize to (224, 224), but the input should both match what your model was trained on and be large enough for the layer you extract the CAM from to keep a usable spatial size. A sketch of how that looks with this library follows.
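As a rough, non-authoritative sketch of typical usage on a (224, 224) input (assuming a recent torchcam release where extractors live in `torchcam.methods`; older versions exposed them under `torchcam.cams`):

```python
import torch
from torchvision.models import resnet18
from torchcam.methods import GradCAM  # torchcam.cams in older releases

# Random weights are enough to demonstrate shapes; load pretrained or your
# own weights for a meaningful CAM.
model = resnet18().eval()

# Hook the last conv block, which keeps a (7, 7) spatial size for (224, 224) inputs.
cam_extractor = GradCAM(model, target_layer="layer4")

# Dummy (224, 224) input standing in for a properly preprocessed image.
input_tensor = torch.rand(1, 3, 224, 224)

out = model(input_tensor)
# Retrieve the CAM for the top predicted class.
activation_map = cam_extractor(out.squeeze(0).argmax().item(), out)
print(activation_map[0].shape)  # spatial size matches layer4's (7, 7) feature map
```

The resulting map can then be upscaled or overlaid on the original image (e.g. with `torchcam.utils.overlay_mask`) precisely because it still has a non-trivial spatial size.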
`grad_cam(model=model, image_tensor=...)` isn't using this library, so I can't help over there 😅

Let me know if you have any additional questions, take care ✌️