Error with Score-P and TensorFlow #112
Comments
Hi Anara, thanks for reporting again on GitHub. As already stated: please rebuild Score-P after all the modules you need are loaded, to avoid conflicting MPI versions. Best, Andreas
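One way to check for conflicting MPI versions is to look at which MPI shared libraries actually end up mapped into the running Python process. The following is only a minimal sketch, assuming a Linux system that provides `/proc/self/maps`; the helper names `find_mpi_libs` and `mapped_mpi_libs` and the `libmpi` file-name prefix are illustrative assumptions, not part of Score-P or its Python bindings.

```python
import os


def find_mpi_libs(maps_text):
    """Return the sorted, unique paths of mapped files whose name starts with 'libmpi'."""
    libs = set()
    for line in maps_text.splitlines():
        # A maps line backed by a file has six fields; the last one is the path.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if os.path.basename(path).startswith("libmpi"):
            libs.add(path)
    return sorted(libs)


def mapped_mpi_libs(maps_file="/proc/self/maps"):
    """Inspect the current process (Linux only); returns [] if the file is missing."""
    if not os.path.exists(maps_file):
        return []
    with open(maps_file) as f:
        return find_mpi_libs(f.read())
```

Calling `mapped_mpi_libs()` after importing the MPI-using modules in the traced script would list the MPI libraries in use; paths from more than one MPI installation would hint at the kind of mismatch described above.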
Hello Andreas, I just rebuilt Score-P with the required modules loaded, and I still get the same errors.
Could you please try installing the latest Python bindings? I did a small change to the But to be honest, I currently have no clue what's wrong here ... Best, Andreas
Here is a discussion about a similar problem: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-September/007126.html As mentioned, without Score-P the application behaves as expected with LD_PRELOAD unset. I'm currently testing an MPI implementation other than MVAPICH and will let you know how it goes. Best regards,
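To compare runs with and without LD_PRELOAD in a repeatable way, the command can be launched from a small wrapper that removes the variable from the child environment. This is a rough sketch using only the Python standard library; the helper names `env_without` and `run_without_preload` are hypothetical, and the actual job command (srun, Score-P options, script name) is not shown here and would have to be filled in.

```python
import os
import subprocess
import sys


def env_without(name, environ=None):
    """Return a copy of the environment with the variable `name` removed."""
    env = dict(os.environ if environ is None else environ)
    env.pop(name, None)
    return env


def run_without_preload(cmd):
    """Run `cmd` (a list of arguments) with LD_PRELOAD unset and return its exit code.

    On POSIX, a negative return code -N means the child was killed by signal N,
    so a segmentation fault shows up as -11.
    """
    return subprocess.run(cmd, env=env_without("LD_PRELOAD")).returncode
```

For example, `run_without_preload([sys.executable, "my_script.py"])` would start a placeholder script `my_script.py` in an environment where LD_PRELOAD is not set, leaving the parent shell's environment untouched.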
Hey,
this seems to be the most promising approach. Unsetting the Best, Andreas
Thinking about this issue over the last few days, I had another idea that might help with debugging:
Best, Andreas |
Thank you Andreas, The original machine has been under maintenance since November. I am using another machine with the OpenMPI implementation, and it seems to work so far, though I still need to run more tests. I'll run a test with Best,
As long as it works for you, no hurry ;-). Best, Andreas
Dear team,
I'm getting the following error when I run Score-P with a module for tracing Python scripts:
2020-10-20 09:24:14.149317: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.00M (1048576 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 921.8K (943872 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149366: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 829.8K (849664 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149373: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 747.0K (764928 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149380: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 672.5K (688640 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
The error file grows very quickly, and I end up killing the job.
I use a custom Score-P build. The details of the environment setup are in the attached job script, and the error output is attached too.
Without Score-P, the application runs as expected, even without setting LD_PRELOAD for MPI.
When I run Score-P with LD_PRELOAD set, I get the following error instead:
[Score-P] src/adapters/mpi/SCOREP_Mpi_Env.c:230: Warning: MPI environment initialization request and provided level exceed MPI_THREAD_FUNNELED!
2020-10-19 10:56:13.384533: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494285000 Hz
[rc0003:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: rc0003: task 0: Segmentation fault
Would appreciate any feedback on this issue.
Thanks in advance!
Anara
err_example.txt
job-example.txt