cuDNN Error while training -map #7153

Closed
niemiaszek opened this issue Dec 20, 2020 · 21 comments

@niemiaszek

I got cuDNN Error: CUDNN_STATUS_BAD_PARAM in darknet/src/convolutional_kernels.cu : () : line: 533. It's the same issue as the one in the pjreddie repo, where there is an in-depth description of my setup.

@achen353

achen353 commented Jan 10, 2021

I had a similar issue when training yolov4-tiny on a custom dataset of 4 classes as instructed in the README:
[screenshot of the error output]

I'm using Debian 10 Linux on a Tesla P100 on GCP with:

  • NVIDIA-SMI: 450.51.06
  • Driver Version: 450.51.06
  • CUDA Version: 11.0
  • cuDNN Version: 8.0.4

I've tried the solutions mentioned in #6836, but none of them worked. Training always crashed at the same iteration.

@achen353

I updated CUDA to 11.2 with driver 460.27.04 and it still didn't work.

@achen353

achen353 commented Jan 11, 2021

I was able to fix the bug with a new VM instance installed with driver version 418.87.01, CUDA 10.1, and cuDNN 7.6.5. I've seen people either downgrade from CUDA 11.x to 10.x or upgrade from 9.x to 10.x. @niemiaszek, maybe try downgrading your CUDA.

@AlexeyAB
Owner

So can we confirm that this issue occurs only with CUDA 11, and that it works well with CUDA 10?

Also, check that you are using the cuDNN version corresponding to your CUDA version. There are separate cuDNN builds for CUDA 10 and for CUDA 11.
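
For anyone unsure which versions are actually installed, a quick check (a sketch assuming a standard /usr/local/cuda install; the cudnn_version.h header only exists in cuDNN 8.x, older releases keep these defines in cudnn.h):

# print the CUDA toolkit, driver, and cuDNN versions
nvcc --version
nvidia-smi
grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn_version.h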

@achen353

achen353 commented Jan 11, 2021

Yes, I used compatible versions of cuDNN when upgrading/downgrading my CUDA.

And for this specific issue (when calculating mAP with the -map flag enabled):

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 545

Yes, I think this issue occurs mainly with CUDA 11 + cuDNN 8.x

The same error is mentioned in #6836 with:

  • GTX-1080 Ti + CUDA 11.1 + cuDNN 8.0.4.30 + NVIDIA driver 455.23.05 + Ubuntu 18.04
  • GTX-2080 Ti + CUDA 11 + cuDNN 8.0.4 + NVIDIA driver unknown + Ubuntu 18.04
  • Tesla K80 + CUDA 11.1 + cuDNN: 8.0.4 + NVIDIA driver unknown + Ubuntu 16.04

But I've also seen the same problem with older CUDA here:

  • GTX-2080 Ti + CUDA 9.0 + cuDNN 7.6 + NVIDIA driver unknown + System unknown (fixed by upgrading to CUDA 10.0)

@wudashuo

wudashuo commented Mar 2, 2021

I got exactly the same error when training with -map; removing -map made the problem go away, and there was also no problem using ./darknet detector map ... to test the weights afterwards.
So my workaround is removing -map while training, modifying src/detector.c to save weights every 1000 or 2000 iterations, and testing the weights with ./darknet detector map (a sketch of such a loop is below). It seems like the only possible way for now if I don't downgrade my CUDA and cuDNN.
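
For anyone automating that workaround, a minimal sketch (assuming hypothetical obj.data and yolo.cfg file names, and checkpoints saved to the default backup/ directory):

#!/bin/bash
# run the standalone map command on every saved checkpoint and log the results
for w in backup/yolo_*.weights; do
    echo "=== mAP for $w ===" | tee -a map_log.txt
    ./darknet detector map obj.data yolo.cfg "$w" 2>&1 | tee -a map_log.txt
done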

@AlexeyAB Hi Alexey, I want to ask: is there any difference in how mAP is calculated between darknet detector map and darknet detector train -map? They both use the validate_detector_map() function, so why does the error only occur during training?

Device: GTX 1650, 1650 SUPER, 1660 SUPER
NVIDIA-SMI: 455.23.05
Driver Version: 455.23.05
CUDA Version: 11.1
cuDNN Version: 8.0.5.39 for CUDA11.1
OS: Ubuntu 20.04

@SpongeBab

SpongeBab commented Mar 24, 2021

@AlexeyAB
Hi, guys!
I'm hitting the error too.
This is my output:

(base) xiaopeng@xiaopeng-HP-Z800-Workstation:~/下载/darknet$ ./darknet detector test cfg/coco.data cfg/yolov4-csp.cfg yolov4.weights 
 CUDA-version: 11000 (11000), cuDNN: 8.1.1, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV version: 4.4.0
 0 : compute_capability = 610, cudnn_half = 0, GPU: GeForce GTX 1070 
net.optimized_memory = 0 
mini_batch = 1, batch = 8, time_steps = 1, train = 0 
   layer   filters  size/strd(dil)      input                output
   0 Create CUDA-stream - 0 
 Create cudnn-handle 0 
Illegal instruction (core dumped)

I then tried installing OpenCV 3.4.10 and rebuilding with make clean and make. It didn't help.

(pytorch) xiaopeng@xiaopeng-HP-Z800-Workstation:~/下载/darknet$ ./darknet detector demo cfg/coco.data  cfg/yolov4-csp.cfg
 CUDA-version: 11000 (11000), cuDNN: 8.1.1, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV version: 3.4.10
Demo
 0 : compute_capability = 610, cudnn_half = 0, GPU: GeForce GTX 1070 
net.optimized_memory = 0 
mini_batch = 1, batch = 8, time_steps = 1, train = 0 
   layer   filters  size/strd(dil)      input                output
   0 Create CUDA-stream - 0 

 cuDNN status Error in: file: ./src/dark_cuda.c : () : line: 174 : build time: Mar 24 2021 - 14:51:27 

 cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success
darknet: ./src/utils.c:331: error: Assertion `0' failed.
Aborted (core dumped)
 

It happened when I ran the test command.
Device: GTX 1070,
NVIDIA-SMI 450.102.04
Driver Version: 450.102.04
CUDA Version: 11.0
cuDNN Version: 8.1.1.33 for CUDA11.0
OS: Ubuntu 20.04

@SpongeBab

So can we confirm that this issue occurs only with CUDA 11, and that it works well with CUDA 10?

Also, check that you are using the cuDNN version corresponding to your CUDA version. There are separate cuDNN builds for CUDA 10 and for CUDA 11.
@AlexeyAB
I can confirm that darknet does not support CUDA 11.x.
#7531

@mixxen

mixxen commented Jun 14, 2021

Same error here. NVIDIA RTX 30-series cards require CUDA 11.

@mixxen

mixxen commented Jun 15, 2021

I may have found a workaround. I set CUDA_VISIBLE_DEVICES=0 and set max_batches=9000 in yolo.cfg. I'm not sure which setting allowed training to work with -map. Bash script:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
darknet detector train config.txt yolo.cfg darknet53.conv.74 -dont_show -map 2>&1 | tee train_log.txt

@andrewssobral

I had the same problem; I just removed -map from ./darknet detector train and it works fine.

@AlexeyAB
Owner

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
darknet detector train config.txt yolo.cfg darknet53.conv.74 -dont_show -map 2>&1 | tee train_log.txt

Yes, try using export CUDA_VISIBLE_DEVICES=0.
As I remember, some cuDNN versions consume more GPU-0 memory if you have several GPUs, even if you only use GPU-0.

@JJ840

JJ840 commented Jul 24, 2021

Got me too: RTX 2070, CUDA 11.4.
(next mAP calculation at 1000 iterations)
1000: 0.619399, 0.568391 avg loss, 0.002610 rate, 4.941000 seconds, 64000 images, 17.686189 hours left
Resizing to initial size: 416 x 416 try to allocate additional workspace_size = 44.60 MB
CUDA allocate done!

calculation mAP (mean average precision)...
Detection layer: 139 - type = 28
Detection layer: 150 - type = 28
Detection layer: 161 - type = 28
4
cuDNN status Error in: file: c:\src\darknet\src\convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 555 : build time: Jul 22 2021 - 16:25:28

cuDNN Error: CUDNN_STATUS_BAD_PARAM

Edit: removing -map; hopefully that will work. Mildly annoying, though :/
Edit 2: yep, it managed to pass the 1000-iteration mark where it failed before. Thanks.

@pablogago11

@AlexeyAB Hi Alexey! First of all, thanks for all your work, it's amazing!
We tried your solution below, but we got the same error when evaluating the mAP at 1000 iterations.

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
darknet detector train config.txt yolo.cfg darknet53.conv.74 -dont_show -map 2>&1 | tee train_log.txt

Yes, try using export CUDA_VISIBLE_DEVICES=0. As I remember, some cuDNN versions consume more GPU-0 memory if you have several GPUs, even if you only use GPU-0.

Our environment settings are:

  • CUDA version = 11.02.0
  • cuDNN = 8.1.1
  • OpenCV = 3.2.0

Our GPU is a GeForce RTX 2080 Ti with:

  • compute_capability = 750
  • cudnn_half = 0
  • net.optimized_memory = 0
  • mini_batch = 1, batch = 16, time_steps = 1, train = 0

As many other users reported, the cuDNN error in src/convolutional_kernels.cu : () : line: 533 only appears while training with the -map flag. Evaluating the mAP using the command ./darknet detector map succeeds.

Does anyone have any solution other than downgrading the CUDA version or manually alternating training and validation (saving every 1000 iterations + evaluating mAP + launching training again)? It is important for us to have automatic training with mAP evaluation.

Many thanks for your attention and suggestions!

@iamrajee

iamrajee commented Nov 4, 2021

@pablogago11 I'm stuck with the same issue. My specifications are CUDA 11.4 and cuDNN 8.2.4 on Ubuntu 20.04.

@khsily

khsily commented Nov 10, 2021

In my case, changing subdivisions to 64 solved the problem.

batch=64
subdivisions=64  # changed from 16 to 64
width=416
height=416

Tested on:

  • GeForce RTX 2070S
  • cuda 11.1
  • cudnn 8.2.1
  • Ubuntu 18.04

Working result: [screenshot]

@nhphuong91

Thanks to @khsily's suggestion, I suspect this issue also has something to do with batch and subdivisions (similar to an out-of-memory error).
Initially, my training stopped at iteration 900. After running export CUDA_VISIBLE_DEVICES=0, it stopped at iteration 1000.
Finally, I reduced the batch size (i.e. increased subdivisions) and it runs OK, even though a lot of GPU memory is wasted 😅
Hopefully this helps!

@LSGL-LLW

LSGL-LLW commented Dec 9, 2022

In my case, changing subdivisions to 64 solved the problem.

batch=64
subdivisions=64  # changed from 16 to 64
width=416
height=416

Tested on:

  • GeForce RTX 2070S
  • cuda 11.1
  • cudnn 8.2.1
  • Ubuntu 18.04

Working result:

Yes, this problem also occurred when I used the darknet yolov7-tiny.cfg. My configuration is CUDA 10.2 and cuDNN 8.0.5. I did not try to change cuDNN, but I set batch=64 and subdivisions=64. In the end, the mAP calculation passed at 1000 iterations.
I think we can try adjusting the batch and subdivisions values up or down.

@lrf19991230

I had the same problem.
Based on my experience and the discussion here, I think the reason may be that calculating the mAP on the validation set requires a large amount of GPU memory, and if there is not enough memory left over during training, this error occurs when the mAP is calculated.
Setting batch=64 and subdivisions=64 reduces the amount of memory used during training, and then it works.

@jemrlee

jemrlee commented May 3, 2023

I had the same problem and fixed it by installing the correct CUDA and cuDNN versions and modifying the Makefile.
I use an RTX 4080, but the Makefile didn't contain compute_89 arch args, so I added this to the Makefile:

# GeForce RTX4090
ARCH= -gencode arch=compute_89,code=[sm_89,compute_89]

and finally it works fine.

Maybe this issue depends on the GPU architecture.
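
For reference, a minimal rebuild sketch after editing the Makefile (assuming GPU=1 and CUDNN=1 are already enabled there):

# rebuild darknet from scratch so the new ARCH flags take effect
make clean
make -j"$(nproc)"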

@nbandaru1h

nbandaru1h commented Apr 2, 2024

I was able to solve this issue by reducing the training input resolution in the .cfg file, from 576 x 576 to 416 x 416 (a sketch of the change is below).
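
For example, in the [net] section of the .cfg file (a sketch; darknet expects width and height to be multiples of 32):

# reduced from 576 x 576
width=416
height=416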
