loss: nan when training custom data set #6

Open
rezha130 opened this issue Jun 23, 2018 · 3 comments

rezha130 commented Jun 23, 2018

Hi @BelBES

I tried several batch sizes (8, 16, 32, 64, 128, 256), but training always ends with loss: nan in every epoch on my custom data set.

python train.py --data-path datatrain --test-init True --test-epoch 10 --output-dir snapshot --abc 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/. --batch-size 8

Test phase
acc: 0.0000; avg_ed: 0.0000: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 18.10it/s]
acc: 0.0	acc_best: 0; avg_ed: 18.428571428571427
epoch: 0; iter: 1998; lr: 1.0000000000000002e-06; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.69it/s]
epoch: 1; iter: 3998; lr: 1.0000000000000004e-10; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.74it/s]
epoch: 2; iter: 5998; lr: 1.0000000000000006e-14; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:43<00:00, 45.84it/s]

I am using PyTorch 0.4, Python 3.6, a GTX 1080 Ti, and Ubuntu 16.04.

Can you help me solve this problem?

Kind regards

bes-dev (Owner) commented Jun 23, 2018

Hi,

Can you provide a small reproducer for this bug?

rezha130 (Author) commented Jun 23, 2018

Sorry @BelBES, could you please explain what you mean by a "small reproducer"?

FYI, this is the structure of my custom data set:

datatrain
---- data
-------- folderA/img_filename_0.jpg
...
-------- folderB/img_filename_1.jpg
---- desc.json

And this is the structure of my custom desc.json:

{
    "abc": "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/.",
    "train": [
        {
            "text": "text_on_image0",
            "name": "folderA/img_filename_0.jpg"
        },
        ...
        {
            "text": "text_on_image1",
            "name": "folderB/img_filename_1.jpg"
        }
    ],
    "test": [
        {
            "text": "text_on_image3",
            "name": "folderC/img_filename_3.jpg"
        },
        ...
        {
            "text": "text_on_image4",
            "name": "folderD/img_filename_4.jpg"
        }
    ]
}

In text_data.py, I used this line at line 32:
img = cv2.imread(os.path.join(self.data_path, "data", name))

But I still get the same loss: nan issue. Please help.
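
In case it helps, here is a quick sanity check over desc.json (my own rough sketch, not code from the repo; it assumes desc.json sits directly in datatrain/ and the images live under datatrain/data/). It verifies that every listed image can be read and that every label character appears in the abc alphabet, since unreadable images or out-of-alphabet characters seem like plausible causes of a nan CTC loss:

import json
import os

import cv2

# Rough dataset sanity check (not from the repo): verify that every image
# listed in desc.json can be read and that every label character is in "abc".
data_path = "datatrain"
with open(os.path.join(data_path, "desc.json")) as f:
    desc = json.load(f)

abc = set(desc["abc"])
for split in ("train", "test"):
    for sample in desc[split]:
        img = cv2.imread(os.path.join(data_path, "data", sample["name"]))
        if img is None:
            print("unreadable image:", sample["name"])
        bad_chars = [c for c in sample["text"] if c not in abc]
        if bad_chars:
            print("chars not in abc:", bad_chars, "in", sample["name"])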

rezha130 (Author) commented Jun 23, 2018

When I tried to debug with cuda = False (on the CPU) on my dev laptop, this is what the debugger shows for loss.data[0], the value that ends up as loss: nan:

[0]:<Tensor>
_backward_hooks:None
_base:<Tensor, len() = 1>
_cdata:140460563260592
_grad:None
_grad_fn:None
_version:0
data:<Tensor>
device:device(type='cpu')
dtype:torch.float32
grad:None
grad_fn:None
is_cuda:False
is_leaf:True
is_sparse:False
layout:torch.strided
name:None
output_nr:0

Note: I set cuda = False on my CPU dev laptop, but cuda = True on the GPU server above.
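
As a next debugging step, I could add a guard like this inside the training loop (a hypothetical sketch, not code from train.py) to stop at the first batch that produces a non-finite loss, so that specific batch can be inspected:

import math

# Hypothetical guard for the training loop (not from the repo's train.py):
# stop at the first batch whose loss is non-finite so it can be inspected.
loss_value = float(loss.data[0])  # the same value shown as "loss" in the progress bar
if math.isnan(loss_value) or math.isinf(loss_value):
    print("non-finite loss in this batch; dump the batch images/texts here")
    raise SystemExit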
