cuDNN Error while training -map #7153

Closed
niemiaszek opened this issue Dec 20, 2020 · 21 comments

@niemiaszek

I got cuDNN Error: CUDNN_STATUS_BAD_PARAM in darknet/src/convolutional_kernels.cu : () : line: 533. It's the same issue as the one in the pjreddie repo, where there is an in-depth description of my setup.

@achen353

achen353 commented Jan 10, 2021

I had a similar issue when training yolov4-tiny on a custom dataset of 4 classes as instructed in the README:
[screenshot of the error output]

I'm using Debian 10 Linux on a Tesla P100 on GCP with:

  • NVIDIA-SMI: 450.51.06
  • Driver Version: 450.51.06
  • CUDA Version: 11.0
  • cuDNN Version: 8.0.4

I've tried the solutions mentioned in #6836, but none of them worked. Training always crashed at the same iteration.

@achen353

I updated CUDA to 11.2 with driver 460.27.04 and it still didn't work.

@achen353

achen353 commented Jan 11, 2021

I was able to fix the bug with a new VM instance installed with driver version 418.87.01, CUDA 10.1, and cuDNN 7.6.5. I've seen people either downgrade from CUDA 11.x to 10.x or upgrade from 9.x to 10.x. @niemiaszek, maybe try downgrading your CUDA.

@AlexeyAB
Owner

So can we confirm that this issue occurs only with CUDA 11, and that it works well with CUDA 10?

Also, check that you are using the cuDNN version corresponding to your CUDA version. There are separate cuDNN builds for CUDA 10 and for CUDA 11.
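
For anyone unsure which versions are actually installed, a quick check (a sketch assuming a standard /usr/local/cuda install; the cudnn_version.h header only exists in cuDNN 8.x, older releases keep these defines in cudnn.h):

# print the CUDA toolkit, driver, and cuDNN versions
nvcc --version
nvidia-smi
grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn_version.h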

@achen353

achen353 commented Jan 11, 2021

Yes, I used compatible versions of cuDNN when upgrading/downgrading my CUDA.

And for this specific issue (when calculating mAP with the -map flag enabled):

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 545

Yes, I think this issue occurs mainly with CUDA 11 + cuDNN 8.x

The same error is mentioned in #6836 with:

  • GTX-1080 Ti + CUDA 11.1 + cuDNN 8.0.4.30 + NVIDIA driver 455.23.05 + Ubuntu 18.04
  • GTX-2080 Ti + CUDA 11 + cuDNN 8.0.4 + NVIDIA driver unknown + Ubuntu 18.04
  • Tesla K80 + CUDA 11.1 + cuDNN: 8.0.4 + NVIDIA driver unknown + Ubuntu 16.04

But I've also seen the same problem with older CUDA here:

  • GTX-2080 Ti + CUDA 9.0 + cuDNN 7.6 + NVIDIA driver unknown + System unknown (fixed by upgrading to CUDA 10.0)

@wudashuo

wudashuo commented Mar 2, 2021

I got exactly the same error when training with -map; removing -map made the problem go away, and there was also no problem using ./darknet detector map ... to test the weights afterwards.
So my workaround is removing -map while training, modifying src/detector.c to save weights every 1000 or 2000 iterations, and testing the weights with ./darknet detector map (a sketch of such a loop is below). It seems like the only possible way for now if I don't downgrade my CUDA and cuDNN.
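
For anyone automating that workaround, a minimal sketch (assuming hypothetical obj.data and yolo.cfg file names, and checkpoints saved to the default backup/ directory):

#!/bin/bash
# run the standalone map command on every saved checkpoint and log the results
for w in backup/yolo_*.weights; do
    echo "=== mAP for $w ===" | tee -a map_log.txt
    ./darknet detector map obj.data yolo.cfg "$w" 2>&1 | tee -a map_log.txt
done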

@AlexeyAB Hi Alexey, I want to ask: is there any difference in how mAP is calculated between darknet detector map and darknet detector train -map? They both use the validate_detector_map() function, so why does the error only occur during training?

Device: GTX 1650, 1650 SUPER, 1660 SUPER
NVIDIA-SMI: 455.23.05
Driver Version: 455.23.05
CUDA Version: 11.1
cuDNN Version: 8.0.5.39 for CUDA11.1
OS: Ubuntu 20.04

@SpongeBab

SpongeBab commented Mar 24, 2021

@AlexeyAB
Hi, guys!
I'm hitting the error too.
This is my output:

(base) xiaopeng@xiaopeng-HP-Z800-Workstation:~/下载/darknet$ ./darknet detector test cfg/coco.data cfg/yolov4-csp.cfg yolov4.weights 
 CUDA-version: 11000 (11000), cuDNN: 8.1.1, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV version: 4.4.0
 0 : compute_capability = 610, cudnn_half = 0, GPU: GeForce GTX 1070 
net.optimized_memory = 0 
mini_batch = 1, batch = 8, time_steps = 1, train = 0 
   layer   filters  size/strd(dil)      input                output
   0 Create CUDA-stream - 0 
 Create cudnn-handle 0 
Illegal instruction (core dumped)

I then tried installing OpenCV 3.4.10 and rebuilding with make clean and make. It didn't help.

(pytorch) xiaopeng@xiaopeng-HP-Z800-Workstation:~/下载/darknet$ ./darknet detector demo cfg/coco.data  cfg/yolov4-csp.cfg
 CUDA-version: 11000 (11000), cuDNN: 8.1.1, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV version: 3.4.10
Demo
 0 : compute_capability = 610, cudnn_half = 0, GPU: GeForce GTX 1070 
net.optimized_memory = 0 
mini_batch = 1, batch = 8, time_steps = 1, train = 0 
   layer   filters  size/strd(dil)      input                output
   0 Create CUDA-stream - 0 

 cuDNN status Error in: file: ./src/dark_cuda.c : () : line: 174 : build time: Mar 24 2021 - 14:51:27 

 cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success
darknet: ./src/utils.c:331: error: Assertion `0' failed.
Aborted (core dumped)
 

It happened when I ran the test command.
Device: GTX 1070,
NVIDIA-SMI 450.102.04
Driver Version: 450.102.04
CUDA Version: 11.0
cuDNN Version: 8.1.1.33 for CUDA11.0
OS: Ubuntu 20.04

@SpongeBab

So can we confirm that this issue occurs only with CUDA 11, and that it works well with CUDA 10?

Also, check that you are using the cuDNN version corresponding to your CUDA version. There are separate cuDNN builds for CUDA 10 and for CUDA 11.
@AlexeyAB
I can confirm that darknet does not support CUDA 11.x.
#7531

@mixxen

mixxen commented Jun 14, 2021

Same error here. NVIDIA RTX 30-series cards require CUDA 11.

@mixxen

mixxen commented Jun 15, 2021

I may have found a workaround. I set CUDA_VISIBLE_DEVICES=0 and set max_batches=9000 in yolo.cfg. I'm not sure which setting allowed training to work with -map. Bash script:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
darknet detector train config.txt yolo.cfg darknet53.conv.74 -dont_show -map 2>&1 | tee train_log.txt

@andrewssobral

I had the same problem; I just removed -map from ./darknet detector train and it works fine.

@AlexeyAB
Owner

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
darknet detector train config.txt yolo.cfg darknet53.conv.74 -dont_show -map 2>&1 | tee train_log.txt

Yes, try using export CUDA_VISIBLE_DEVICES=0.
As I remember, some cuDNN versions consume more GPU-0 memory if you have several GPUs, even if you only use GPU-0.

@JJ840

JJ840 commented Jul 24, 2021

Got me too: RTX 2070, CUDA 11.4.
(next mAP calculation at 1000 iterations)
1000: 0.619399, 0.568391 avg loss, 0.002610 rate, 4.941000 seconds, 64000 images, 17.686189 hours left
Resizing to initial size: 416 x 416 try to allocate additional workspace_size = 44.60 MB
CUDA allocate done!

calculation mAP (mean average precision)...
Detection layer: 139 - type = 28
Detection layer: 150 - type = 28
Detection layer: 161 - type = 28
4
cuDNN status Error in: file: c:\src\darknet\src\convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 555 : build time: Jul 22 2021 - 16:25:28

cuDNN Error: CUDNN_STATUS_BAD_PARAM

Edit: removing -map; hopefully that will work. Mildly annoying, though :/
Edit 2: yep, it managed to pass the 1000-iteration mark where it failed before. Thanks.

@pablogago11

@AlexeyAB Hi Alexey! First of all, thanks for all your work, it's amazing!
We tried your solution below, but we got the same error when evaluating the mAP at 1000 iterations.

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
darknet detector train config.txt yolo.cfg darknet53.conv.74 -dont_show -map 2>&1 | tee train_log.txt

Yes, try using export CUDA_VISIBLE_DEVICES=0. As I remember, some cuDNN versions consume more GPU-0 memory if you have several GPUs, even if you only use GPU-0.

Our environment settings are:

  • CUDA version = 11.02.0
  • cuDNN = 8.1.1
  • OpenCV = 3.2.0

Our GPU is a GeForce RTX 2080 Ti with:

  • compute_capability = 750
  • cudnn_half = 0
  • net.optimized_memory = 0
  • mini_batch = 1, batch = 16, time_steps = 1, train = 0

As many other users reported, the cuDNN error in src/convolutional_kernels.cu : () : line: 533 only appears while training with the -map flag. Evaluating the mAP using the command ./darknet detector map succeeds.

Does anyone have any solution other than downgrading the CUDA version or manually alternating training and validation (saving every 1000 iterations + evaluating mAP + launching training again)? It is important for us to have automatic training with mAP evaluation.

Many thanks for your attention and suggestions!

@iamrajee

iamrajee commented Nov 4, 2021

@pablogago11 I'm stuck with the same issue. My specifications are CUDA 11.4 and cuDNN 8.2.4 on Ubuntu 20.04.

@khsily

khsily commented Nov 10, 2021

In my case, changing subdivisions to 64 solved the problem.

batch=64
subdivisions=64  # changed from 16 to 64
width=416
height=416

Tested on:

  • GeForce RTX 2070S
  • cuda 11.1
  • cudnn 8.2.1
  • Ubuntu 18.04

Working result: [screenshot]

@nhphuong91

Thanks to @khsily's suggestion, I suspect this issue also has something to do with batch and subdivisions (similar to an out-of-memory error).
Initially, my training stopped at iteration 900. After running export CUDA_VISIBLE_DEVICES=0, it stopped at iteration 1000.
Finally, I reduced the batch size (i.e. increased subdivisions) and it runs OK, even though a lot of GPU memory is wasted 😅
Hopefully this helps!

@LSGL-LLW

LSGL-LLW commented Dec 9, 2022

In my case, changing subdivisions to 64 solved the problem.

batch=64
subdivisions=64  # changed from 16 to 64
width=416
height=416

Tested on:

  • GeForce RTX 2070S
  • cuda 11.1
  • cudnn 8.2.1
  • Ubuntu 18.04

Working result:

Yes, this problem also occurred when I used the darknet yolov7-tiny.cfg. My configuration is CUDA 10.2 and cuDNN 8.0.5. I did not try to change cuDNN, but I set batch=64 and subdivisions=64. In the end, the mAP calculation passed at 1000 iterations.
I think we can try adjusting the batch and subdivisions values up or down.

@lrf19991230

I had the same problem.
Based on my experience and the discussion here, I think the reason may be that calculating the mAP on the validation set requires a large amount of GPU memory, and if there is not enough memory left over during training, this error occurs when the mAP is calculated.
Setting batch=64 and subdivisions=64 reduces the amount of memory used during training, and then it works.

@jemrlee

jemrlee commented May 3, 2023

I had the same problem and fixed it by installing the correct CUDA and cuDNN versions and modifying the Makefile.
I use an RTX 4080, but the Makefile didn't contain compute_89 arch args, so I added this to the Makefile:

# GeForce RTX4090
ARCH= -gencode arch=compute_89,code=[sm_89,compute_89]

and finally it works fine.

Maybe this issue depends on the GPU architecture.
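
For reference, a minimal rebuild sketch after editing the Makefile (assuming GPU=1 and CUDNN=1 are already enabled there):

# rebuild darknet from scratch so the new ARCH flags take effect
make clean
make -j"$(nproc)"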

@nbandaru1h

nbandaru1h commented Apr 2, 2024

I was able to solve this issue by reducing the training input resolution in the .cfg file, from 576 x 576 to 416 x 416 (a sketch of the change is below).
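
For example, in the [net] section of the .cfg file (a sketch; darknet expects width and height to be multiples of 32):

# reduced from 576 x 576
width=416
height=416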
