loss: nan when training custom data set #6

Open
rezha130 opened this issue Jun 23, 2018 · 3 comments

rezha130 commented Jun 23, 2018

Hi @BelBES

I tried several batch sizes (8, 16, 32, 64, 128, 256), but training always ends with loss: nan in every epoch on my custom data set.

python train.py --data-path datatrain --test-init True --test-epoch 10 --output-dir snapshot --abc 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/. --batch-size 8

Test phase
acc: 0.0000; avg_ed: 0.0000: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 18.10it/s]
acc: 0.0	acc_best: 0; avg_ed: 18.428571428571427
epoch: 0; iter: 1998; lr: 1.0000000000000002e-06; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.69it/s]
epoch: 1; iter: 3998; lr: 1.0000000000000004e-10; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.74it/s]
epoch: 2; iter: 5998; lr: 1.0000000000000006e-14; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:43<00:00, 45.84it/s]

I am using PyTorch 0.4, Python 3.6, a GTX 1080 Ti, and Ubuntu 16.04.

Can you help me solve this problem?

Kind regards

bes-dev (Owner) commented Jun 23, 2018

Hi,

Can you provide a small reproducer for this bug?

rezha130 (Author) commented Jun 23, 2018

Sorry @BelBES, could you please explain what you mean by a "small reproducer"?

FYI, this is the structure of my custom data set:

datatrain
---- data
-------- folderA/img_filename_0.jpg
...
-------- folderB/img_filename_1.jpg
---- desc.json

And this is the structure of my custom desc.json:

{
    "abc": "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/.",
    "train": [
        {
            "text": "text_on_image0",
            "name": "folderA/img_filename_0.jpg"
        },
        ...
        {
            "text": "text_on_image1",
            "name": "folderB/img_filename_1.jpg"
        }
    ],
    "test": [
        {
            "text": "text_on_image3",
            "name": "folderC/img_filename_3.jpg"
        },
        ...
        {
            "text": "text_on_image4",
            "name": "folderD/img_filename_4.jpg"
        }
    ]
}

In text_data.py, I used this line at line 32:
img = cv2.imread(os.path.join(self.data_path, "data", name))

But I still get the same loss: nan issue. Please help.
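
In case it helps, here is a quick sanity check over desc.json (my own rough sketch, not code from the repo; it assumes desc.json sits directly in datatrain/ and the images live under datatrain/data/). It verifies that every listed image can be read and that every label character appears in the abc alphabet, since unreadable images or out-of-alphabet characters seem like plausible causes of a nan CTC loss:

import json
import os

import cv2

# Rough dataset sanity check (not from the repo): verify that every image
# listed in desc.json can be read and that every label character is in "abc".
data_path = "datatrain"
with open(os.path.join(data_path, "desc.json")) as f:
    desc = json.load(f)

abc = set(desc["abc"])
for split in ("train", "test"):
    for sample in desc[split]:
        img = cv2.imread(os.path.join(data_path, "data", sample["name"]))
        if img is None:
            print("unreadable image:", sample["name"])
        bad_chars = [c for c in sample["text"] if c not in abc]
        if bad_chars:
            print("chars not in abc:", bad_chars, "in", sample["name"])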

rezha130 (Author) commented Jun 23, 2018

When I tried to debug with cuda = False (on the CPU) on my dev laptop, this is what the debugger shows for loss.data[0], the value that ends up as loss: nan:

[0]:<Tensor>
_backward_hooks:None
_base:<Tensor, len() = 1>
_cdata:140460563260592
_grad:None
_grad_fn:None
_version:0
data:<Tensor>
device:device(type='cpu')
dtype:torch.float32
grad:None
grad_fn:None
is_cuda:False
is_leaf:True
is_sparse:False
layout:torch.strided
name:None
output_nr:0

Note: I set cuda = False on my CPU dev laptop, but cuda = True on the GPU server above.
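
As a next debugging step, I could add a guard like this inside the training loop (a hypothetical sketch, not code from train.py) to stop at the first batch that produces a non-finite loss, so that specific batch can be inspected:

import math

# Hypothetical guard for the training loop (not from the repo's train.py):
# stop at the first batch whose loss is non-finite so it can be inspected.
loss_value = float(loss.data[0])  # the same value shown as "loss" in the progress bar
if math.isnan(loss_value) or math.isinf(loss_value):
    print("non-finite loss in this batch; dump the batch images/texts here")
    raise SystemExit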
