Sudden nan values from the loss during LoRA training #186

Open
MH-Python opened this issue Dec 4, 2024 · 0 comments
Thank you for the nice compact work.
We have recently started to face an ambiguous error causing the loss to become NaN during training. After enabling anomaly detection with "torch.autograd.set_detect_anomaly(True)", we got this:

UserWarning: Error detected in MmBackward0. Traceback of forward call that caused the error:
...stacktrace...
.venv/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 569, in forward
result = result + lora_B(lora_A(dropout(x))) * scaling

...stacktrace...
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'MmBackward0' returned nan values in its 1th output.

Could it be caused by some numerical instability (nan or inf)?
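For reference, below is a minimal sketch of how the anomaly detection above can be combined with explicit finiteness checks to narrow down where the NaN first appears (in the loss, in the LoRA weights, or in their gradients). The names `model`, `batch`, and `compute_loss` are placeholders for the actual training setup; only the torch calls are real APIs.

```python
import torch

def debug_step(model, batch, compute_loss):
    # Flags the backward op that produced NaN/Inf (as in the traceback above).
    torch.autograd.set_detect_anomaly(True)

    loss = compute_loss(model, batch)  # placeholder for the real loss computation
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss before backward: {loss.item()}")

    loss.backward()

    # Check the weights and gradients (e.g. of the LoRA adapters) for NaN/Inf.
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite weight in {name}")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite grad in {name}")
```

Running one such step right before the failure point can show whether the instability originates in the inputs/loss (forward) or only appears in the gradients (backward).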
