
Train backwards fails with Error: CUDA error 999 'unknown error' #1002

Open
eu9ene opened this issue Jan 21, 2025 · 6 comments
Labels
bug Something is broken or not correct

Comments

@eu9ene
Collaborator

eu9ene commented Jan 21, 2025

https://firefox-ci-tc.services.mozilla.com/tasks/cxEF3FOvRLOjJJSv7n8hXQ/runs/0/logs/public/logs/live.log

It trained for some time and then failed with CUDA errors. @ZJaume might be able to help clarify the timeline. It doesn't look like a preemption; I see that the task was cancelled later. @bhearsum, did we update the machine images lately?

[task 2025-01-21T16:24:56.313Z] [2025-01-21 16:24:56] [valid] Ep. 1 : Up. 50000 : ce-mean-words : 2.62238 : new best
[task 2025-01-21T16:25:04.482Z] [2025-01-21 16:25:04] Saving model weights and runtime parameters to /home/ubuntu/tasks/task_173745859288754/artifacts/model.npz.best-bleu-detok.npz
[task 2025-01-21T16:25:07.208Z] [2025-01-21 16:25:07] [valid] Ep. 1 : Up. 50000 : bleu-detok : 26.4729 : new best
[taskcluster 2025-01-21T16:29:14.968Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/cxEF3FOvRLOjJJSv7n8hXQ/0/on/us-central1-b/6805228967535197264/until/1737478154.931
[task 2025-01-21T16:30:34.200Z] [2025-01-21 16:30:34] Ep. 1 : Up. 51000 : Sen. 77,303,338 : Cost 1.41794586 : Time 370.55s : 97447.45 words/s : gNorm 0.5275
[task 2025-01-21T16:30:37.188Z] [2025-01-21 16:30:37] Error: CUDA error 999 'unknown error' - /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54: cudaStreamSynchronize(0)
[task 2025-01-21T16:30:37.188Z] [2025-01-21 16:30:37] Error: CUDA error 999 'unknown error' - /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54: cudaStreamSynchronize(0)
[task 2025-01-21T16:30:37.188Z] [2025-01-21 16:30:37] Error: CUDA error 999 'unknown error' - /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54: cudaStreamSynchronize(0)
[task 2025-01-21T16:30:37.188Z] [2025-01-21 16:30:37] Error: CUDA error 999 'unknown error' - /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54: cudaStreamSynchronize(0)
[task 2025-01-21T16:30:37.188Z] [2025-01-21 16:30:37] Error: Aborted from void marian::gpu::fill(marian::Ptr<marian::Backend>, T*, T*, T) [with T = float; marian::Ptr<marian::Backend> = std::shared_ptr<marian::Backend>] in /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54
[task 2025-01-21T16:30:37.188Z] Aborted from void marian::gpu::fill(marian::Ptr<marian::Backend>, T*, T*, T) [with T = float; marian::Ptr<marian::Backend> = std::shared_ptr<marian::Backend>] in /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54
[task 2025-01-21T16:30:37.188Z] Aborted from void marian::gpu::fill(marian::Ptr<marian::Backend>, T*, T*, T) [with T = float; marian::Ptr<marian::Backend> = std::shared_ptr<marian::Backend>] in /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54
[task 2025-01-21T16:30:37.188Z] Aborted from void marian::gpu::fill(marian::Ptr<marian::Backend>, T*, T*, T) [with T = float; marian::Ptr<marian::Backend> = std::shared_ptr<marian::Backend>] in /builds/worker/fetches/marian-source/src/tensors/gpu/algorithm.cu:54
[task 2025-01-21T16:30:37.191Z] 
[task 2025-01-21T16:30:37.191Z] [CALL STACK]
[task 2025-01-21T16:30:37.191Z] [0x5c6a841048cc]    void marian::gpu::  fill  <float>(std::shared_ptr<marian::Backend>,  float*,  float*,  float) + 0x68c
[task 2025-01-21T16:30:37.191Z] [0x5c6a837683bd]    void marian::TensorBase::  set  <float>(float)     + 0x25d
[task 2025-01-21T16:30:37.191Z] [0x5c6a839ad002]    marian::ExpressionGraph::  backward  (bool,  float) + 0x472
[task 2025-01-21T16:30:37.191Z] [0x5c6a83d76cd8]                                                       + 0xb3ccd8
[task 2025-01-21T16:30:37.191Z] [0x5c6a83d8357c]    marian::ThreadPool::enqueue<std::function<bool (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<bool (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}::  operator()  () const + 0x5c
[task 2025-01-21T16:30:37.191Z] [0x5c6a83d844a6]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<bool>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<bool (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<bool (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator<int>,bool ()>::_M_run()::{lambda()#1},bool>>::  _M_invoke  (std::_Any_data const&) + 0x36
[task 2025-01-21T16:30:37.191Z] [0x5c6a836c8e9d]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[task 2025-01-21T16:30:37.191Z] [0x71a82ba99ee8]                                                       + 0x99ee8
[task 2025-01-21T16:30:37.191Z] [0x5c6a83d7c165]    std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<bool (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<bool (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x115
[task 2025-01-21T16:30:37.191Z] [0x5c6a836d0135]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1a5
[task 2025-01-21T16:30:37.191Z] [0x71a82bedc253]                                                       + 0xdc253
[task 2025-01-21T16:30:37.191Z] [0x71a82ba94ac3]                                                       + 0x94ac3
[task 2025-01-21T16:30:37.191Z] [0x71a82bb26850]                                                       + 0x126850
[task 2025-01-21T16:30:37.191Z] 
eu9ene added the bug label on Jan 21, 2025
@bhearsum
Collaborator

The last GPU machine image update was back in April.

Theoretically, this could be related to #982, which affected the cuda-toolkit toolchain. If these tasks were accidentally using the cuda 11 files that were previously in this toolchain, that might explain a behaviour change.
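If it helps to confirm that, here is a minimal standalone sketch (hypothetical file name, not something that exists in the repo or the machine image) that prints the CUDA runtime version a binary was linked against and the maximum version the installed driver supports. A runtime newer than the driver would be consistent with a toolchain mix-up of this kind.

// check_cuda_versions.cu -- diagnostic sketch only, not part of the pipeline.
// Build with: nvcc check_cuda_versions.cu -o check_cuda_versions
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int runtimeVersion = 0, driverVersion = 0, deviceCount = 0;

  // Version of the CUDA runtime this binary was compiled/linked against,
  // encoded as 1000*major + 10*minor (e.g. 12020 for CUDA 12.2).
  cudaRuntimeGetVersion(&runtimeVersion);

  // Latest CUDA version the installed driver supports, same encoding.
  cudaDriverGetVersion(&driverVersion);

  // A mismatch (runtime newer than driver) tends to surface as opaque
  // failures such as cudaErrorUnknown (999) at the first synchronizing call.
  cudaError_t err = cudaGetDeviceCount(&deviceCount);

  printf("runtime %d, driver %d, devices %d, status %s\n",
         runtimeVersion, driverVersion, deviceCount, cudaGetErrorString(err));
  return 0;
}

If the same check run inside the task image reports a runtime version higher than the driver version, that would point at the cuda-toolkit toolchain rather than the hardware.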

@eu9ene
Collaborator Author

eu9ene commented Jan 21, 2025

Also, something went wrong with the restarted task: https://firefox-ci-tc.services.mozilla.com/tasks/cxEF3FOvRLOjJJSv7n8hXQ/runs/1/logs/public/logs/live.log

[task 2025-01-21T17:10:44.928Z] [2025-01-21 17:10:38] Using synchronous SGD
[task 2025-01-21T17:10:44.975Z] [tracking INFO] Reading Marian command line arguments.
[task 2025-01-21T17:10:44.978Z] [tracking INFO] Reading datasets statistics from OpusTrainer configuration.
[task 2025-01-21T17:12:18.382Z] [tracking WARNING] Training has been resumed but resume option has been set to False, skipping publication.
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] [comm] Compiled without MPI support. Running as a single process on translations-1-b-linux-v100-gpu-4-300g-cpybqe5esbiqj3baniotpw
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] Synced seed 1737479438
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] [data] Loading SentencePiece vocabulary from file /home/ubuntu/tasks/task_173747900851730/artifacts/vocab.spm
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] [data] Setting vocabulary size for input 0 to 32,000
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] [data] Loading SentencePiece vocabulary from file /home/ubuntu/tasks/task_173747900851730/artifacts/vocab.spm
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] [data] Setting vocabulary size for input 1 to 32,000
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] [batching] Collecting statistics for batch fitting with step size 10
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2025-01-21T17:12:18.382Z] [2025-01-21 17:10:38] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2025-01-21T17:12:18.382Z] 
[task 2025-01-21T17:12:18.382Z] [CALL STACK]
[task 2025-01-21T17:12:18.382Z] [0x5b04de9bb1ff]    marian::CurandRandomGenerator::  CurandRandomGenerator  (unsigned long,  marian::DeviceId) + 0x83f
[task 2025-01-21T17:12:18.382Z] [0x5b04de9bb899]    marian::  createRandomGenerator  (unsigned long,  marian::DeviceId) + 0x69
[task 2025-01-21T17:12:18.382Z] [0x5b04de9b4f90]    marian::  BackendByDeviceId  (marian::DeviceId,  unsigned long) + 0xa0
[task 2025-01-21T17:12:18.382Z] [0x5b04de3812f0]    marian::ExpressionGraph::  setDevice  (marian::DeviceId,  std::shared_ptr<marian::Device>) + 0x80
[task 2025-01-21T17:12:18.382Z] [0x5b04de76fbf5]    marian::GraphGroup::  initGraphsAndOpts  ()        + 0x1e5
[task 2025-01-21T17:12:18.382Z] [0x5b04de770f80]    marian::GraphGroup::  GraphGroup  (std::shared_ptr<marian::Options>,  std::shared_ptr<marian::IMPIWrapper>) + 0x570
[task 2025-01-21T17:12:18.382Z] [0x5b04de748103]    marian::SyncGraphGroup::  SyncGraphGroup  (std::shared_ptr<marian::Options>,  std::shared_ptr<marian::IMPIWrapper>) + 0x83
[task 2025-01-21T17:12:18.382Z] [0x5b04de173f93]    marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x1b53
[task 2025-01-21T17:12:18.382Z] [0x5b04de099a4c]    mainTrainer  (int,  char**)                        + 0x15c
[task 2025-01-21T17:12:18.382Z] [0x7e1a63629d90]                                                       + 0x29d90
[task 2025-01-21T17:12:18.382Z] [0x7e1a63629e40]    __libc_start_main                                  + 0x80
[task 2025-01-21T17:12:18.382Z] [0x5b04de0944a5]    _start                                             + 0x25
[task 2025-01-21T17:12:18.382Z] 
[taskcluster 2025-01-21T17:20:32.638Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/cxEF3FOvRLOjJJSv7n8hXQ/1/on/us-central1-b/6805228967535197264/until/1737481232.601
[task 2025-01-21T17:30:31.459Z] [2025-01-21 17:30:31] [Trainer] [INFO] Starting stage: "train"
[task 2025-01-21T17:30:31.459Z] [2025-01-21 17:30:31] [Trainer] [INFO] Running until dataset "original" for 10 epoch(s)
[task 2025-01-21T17:30:32.833Z] [2025-01-21 17:30:32] [Trainer] [INFO] trainer stopped reading input
[taskcluster 2025-01-21T17:37:32.696Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/cxEF3FOvRLOjJJSv7n8hXQ/1/on/us-central1-b/6805228967535197264/until/1737482252.666
[task 2025-01-21T17:43:15.559Z] [tracking WARNING] No TrainingEpoch entry, skipping.
[task 2025-01-21T17:43:15.559Z] [tracking WARNING] No ValidationEpoch entry, skipping.
[task 2025-01-21T17:43:15.559Z] [tracking INFO] Successfully parsed 279 lines
[task 2025-01-21T17:43:15.559Z] [tracking INFO] Found 0 training entries
[task 2025-01-21T17:43:15.559Z] [tracking INFO] Found 0 validation entries
[fetches 2025-01-21T17:43:16.714Z] removing /home/ubuntu/tasks/task_173747900851730/fetches
[fetches 2025-01-21T17:43:20.606Z] finished
[taskcluster 2025-01-21T17:43:20.619Z]    Exit Code: 0
[taskcluster 2025-01-21T17:43:20.619Z]    User Time: 14m6.363483s

@eu9ene
Collaborator Author

eu9ene commented Jan 21, 2025

Also, it seems the "Rerun" action no longer works on this task.

@bhearsum
Collaborator

Also, something went wrong with the restarted task: https://firefox-ci-tc.services.mozilla.com/tasks/cxEF3FOvRLOjJJSv7n8hXQ/runs/1/logs/public/logs/live.log

I assume you're talking about "Curand error 203"? marian-nmt/marian-dev#666 (comment) suggests it might be caused by a cuda version mismatch, which seems like it could be rooted in the same thing as the 999 error.
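If it's useful to separate a cuRAND problem from the rest of the stack, a tiny standalone init (a sketch only, assuming it's built with the same toolkit as the marian binary) should reproduce the 203 on the same machine if this really is a version/driver mismatch:

// curand_init_check.cu -- hypothetical repro sketch, not part of the repo.
// Build with: nvcc curand_init_check.cu -lcurand -o curand_init_check
#include <cstdio>
#include <curand.h>

int main() {
  curandGenerator_t gen;

  // The same call marian makes in src/tensors/rand.cpp:74; a return value of
  // 203 corresponds to CURAND_STATUS_INITIALIZATION_FAILED.
  curandStatus_t status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
  printf("curandCreateGenerator returned %d\n", (int)status);

  if (status == CURAND_STATUS_SUCCESS) {
    curandDestroyGenerator(gen);
  }
  return 0;
}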

@bhearsum
Collaborator

Also, it seems the "Rerun" action no longer works on this task.

Did you get a specific error here? Is it still occurring?

This is normal when trying to rerun a task that is already marked as Successful, so if that's what we're talking about, the root cause is the tasks being marked as successful when they have actually failed.

@eu9ene
Collaborator Author

eu9ene commented Jan 27, 2025

Also, it seems the "Rerun" action no longer works on this task.

Did you get a specific error here? Is it still occurring?

This is normal when trying to rerun a task that is already marked as Successful, so if that's what we're talking about, the root cause is the tasks getting marked as success when they've failed.

Got it, thanks for the clarification. So Rerun works as expected, and I filed a separate issue about OpusTrainer not aborting training: #1015
