Train backwards fails with Error: CUDA error 999 'unknown error' #1002
The last GPU machine image update was back in April. Theoretically, this could be related to #982, which affected the
Also, something went wrong with the restarted task: https://firefox-ci-tc.services.mozilla.com/tasks/cxEF3FOvRLOjJJSv7n8hXQ/runs/1/logs/public/logs/live.log
Also, it seems the "Rerun" action no longer works on this task.
I assume you're talking about "Curand error 203"? marian-nmt/marian-dev#666 (comment) suggests it might be caused by a CUDA version mismatch, which seems like it could be rooted in the same thing as the 999 error.
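For what it's worth, here is a minimal standalone diagnostic sketch (not part of our pipeline; the flow and file name are only illustrative) that prints the driver and runtime CUDA versions and forces a device init, which is where a driver/runtime mismatch or an unhealthy GPU would typically surface as error 999 / "unknown error":

```c
// check_cuda.cu (hypothetical name): standalone sketch, not part of the training pipeline.
// Prints the CUDA driver and runtime versions and attempts a trivial device init.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int driverVersion = 0, runtimeVersion = 0;

    // Version supported by the installed driver (what the machine image provides).
    cudaDriverGetVersion(&driverVersion);
    // Version of the CUDA runtime this binary was built against.
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("CUDA driver version:  %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("CUDA runtime version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    // A trivial allocation forces CUDA context creation; on a broken
    // driver/runtime combination this is where cudaErrorUnknown tends to appear.
    void *p = NULL;
    cudaError_t err = cudaMalloc(&p, 1);
    if (err != cudaSuccess) {
        printf("Device init failed: %d (%s)\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    cudaFree(p);
    printf("Device init OK\n");
    return 0;
}
```

Compiling and running something like this on one of the GPU workers (e.g. `nvcc check_cuda.cu -o check_cuda`, assuming nvcc is available there) would tell us whether the runtime the training binaries were built against still matches the driver on the current machine image.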
Did you get a specific error here? Is it still occurring? This is normal when trying to rerun a task that is already marked as Successful, so if that's what we're talking about, the root cause is the tasks getting marked as successful when they've actually failed.
Got it, thanks for the clarification. So, Rerun works as expected and I filed a separate issue about OpusTrainer not aborting training: #1015 |
https://firefox-ci-tc.services.mozilla.com/tasks/cxEF3FOvRLOjJJSv7n8hXQ/runs/0/logs/public/logs/live.log
It trained for some time and then failed with CUDA errors. @ZJaume can help clarify the timeline. It doesn't look like a preemption. I see that it was cancelled later. @bhearsum did we update the machine images lately?