loss: nan, acc: 0.0 #125

rookie0607 · 2024-07-31T08:59:12Z

System Info

none

Information

The official example scripts
My own modified scripts

🐛 Describe the bug

Dear developers: I'm running the script https://github.com/X-LANCE/SLAM-LLM/blob/main/examples/asr_librispeech/scripts/finetune_whisper_large_linear_vicuna_7b.sh. The only change from the original is that I replaced the LLM with Qwen2-1.5b. Training fails with loss: nan, acc: 0.0, as shown below.

Training Epoch: 2/15, step 486/487 completed (loss: nan, acc: 0.0): 100%|██████████████████████| 487/487 [05:44<00:00,  1.41it/s]
Training Epoch: 2/15, step 486/487 completed (loss: nan, acc: 0.0): 100%|██████████████████████| 487/487 [05:44<00:00,  1.41it/s]
Training Epoch: 2/15, step 486/487 completed (loss: nan, acc: 0.0): 100%|██████████████████████| 487/487 [05:44<00:00,  1.41it/s]
Training Epoch: 2/15, step 486/487 completed (loss: nan, acc: 0.0): 100%|██████████████████████| 487/487 [05:44<00:00,  1.41it/s]

How can I continue the experiment? @ddlBoJack

Error logs

1

Expected behavior

1


fclearner · 2024-08-01T09:11:38Z

Did the loss suddenly blow up at some particular step?

rookie0607 · 2024-08-01T09:15:02Z

Turning off fp16 fixed it. The test results are fairly poor, though.

fclearner · 2024-08-01T09:16:48Z

Add more data.

yy835055664 · 2024-11-14T03:56:13Z

Hi, I've run into this problem too, but use_fp16 in TrainConfig already defaults to False. Is that the setting you meant by "turning off fp16"?

yy835055664 · 2024-11-14T04:00:51Z

Hi, in mala_asr_slidespeech I replaced the LLM with qwen2.5, and training also shows loss: nan, with acc around 0.3.

Is there a good way to fix this?

fclearner · 2024-11-14T04:03:48Z

Qwen was trained in bf16, so running it in fp16 is prone to numerical overflow. More specifically, certain operators are probably the ones that overflow.
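The overflow described in the comment above can be illustrated without any deep learning framework. The sketch below is only an illustration, not code from SLAM-LLM: `to_fp16` and `to_bf16` are hypothetical helpers, and bf16 is emulated by truncating a float32 to its top 16 bits (real hardware typically rounds instead). It shows that a value just above the fp16 maximum of 65504 becomes `inf` in fp16 but stays finite in bf16, and that arithmetic on `inf` then produces `nan`.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; return +/-inf if it overflows."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; hardware would give inf.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: truncation is used here for simplicity instead of rounding.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    activation = 70000.0  # slightly above the fp16 maximum of 65504
    print("fp16:", to_fp16(activation))  # inf
    print("bf16:", to_bf16(activation))  # 69632.0 (coarser, but finite)
    # Once an inf appears, ordinary operations such as inf - inf yield nan.
    print("fp16 inf - inf:", to_fp16(activation) - to_fp16(activation))
```

bf16 keeps the 8-bit exponent of float32 and gives up mantissa bits, which is why it loses precision here but not range.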

yy835055664 · 2024-11-14T05:40:50Z

Thanks. I have another question: with llm=llama3-chinese, training on aishell and testing on dev and test, the results are poor on both:
test:

The dev results look the same.

Do you know how to fix this?

fclearner · 2024-11-14T05:44:11Z

Check your prompt, and see whether there is a problem with the eos token.
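One way to act on the eos suggestion above is to inspect a few training targets and confirm that the last supervised label is the eos id. The sketch below is a generic check under stated assumptions, not SLAM-LLM code: `ends_with_eos` is a hypothetical helper, the token ids are made up, and `-100` is assumed to be the label value that masks prompt and audio positions out of the loss (the usual PyTorch `ignore_index` default).

```python
from typing import Sequence


def ends_with_eos(
    label_ids: Sequence[int], eos_token_id: int, ignore_index: int = -100
) -> bool:
    """Return True if the last supervised label is the EOS token id."""
    # Drop masked positions (prompt, audio placeholders, padding).
    supervised = [t for t in label_ids if t != ignore_index]
    return bool(supervised) and supervised[-1] == eos_token_id


if __name__ == "__main__":
    eos_id = 2  # hypothetical id; read the real one from your tokenizer
    good = [-100, -100, 11, 12, eos_id]
    bad = [-100, -100, 11, 12]  # transcript with no EOS appended
    print(ends_with_eos(good, eos_id), ends_with_eos(bad, eos_id))
```

If targets never end with eos, the model is not trained to stop, which usually shows up as repeated or runaway text at decoding time.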

yy835055664 · 2024-11-14T06:52:52Z

I'm using the default prompt: "Transcribe speech to text. "
These are the pos_token and eos_token of llama3-chinese.

Is the prompt the cause? If so, how is the prompt usually adjusted?

fclearner · 2024-11-14T07:29:40Z

See this: #128

yy835055664 · 2024-11-18T04:03:10Z

Hi, with use_fp16=false, training is very slow. If I want to use bf16 instead, what do I need to change?

fclearner · 2024-11-18T04:55:23Z

First, your GPU has to support it. Second, DeepSpeed can be configured to use bf16. Also keep an eye on the tensor dtypes in the data pipeline.
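As a rough illustration of the DeepSpeed part of that advice, the sketch below builds a minimal config with bf16 turned on and fp16 turned off. The `fp16` and `bf16` sections and `train_micro_batch_size_per_gpu` are standard DeepSpeed config keys, but `make_bf16_ds_config` is a hypothetical helper, the batch size is a placeholder, and how the resulting file is passed to a given SLAM-LLM recipe is not covered in this thread. For the hardware requirement, PyTorch's `torch.cuda.is_bf16_supported()` can be called on the training machine.

```python
import json


def make_bf16_ds_config(micro_batch_size: int = 4) -> dict:
    """Build a minimal DeepSpeed config dict that trains in bf16, not fp16."""
    return {
        # Placeholder value; set it to match your recipe.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        # Keep fp16 off to avoid the overflow discussed earlier in the thread.
        "fp16": {"enabled": False},
        # Needs a GPU with bf16 support.
        "bf16": {"enabled": True},
    }


if __name__ == "__main__":
    # Print the config as JSON so it can be saved to a file.
    print(json.dumps(make_bf16_ds_config(), indent=2))
```

Inputs fed to the model (for example audio features) would also need to be cast to the same dtype, which is the "tensor dtypes in the data pipeline" point above.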

yy835055664 · 2024-11-18T05:50:43Z

Thanks. If use_fp16=false, which data type is used for training by default?

rookie0607 closed this as completed Aug 19, 2024
