Error with Score-P and TensorFlow #112

Open
anarazh opened this issue Oct 28, 2020 · 8 comments

Comments

@anarazh

anarazh commented Oct 28, 2020

Dear team,

I'm getting the following error when I run Score-P with a module for tracing Python scripts:


2020-10-20 09:24:14.149317: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.00M (1048576 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 921.8K (943872 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149366: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 829.8K (849664 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149373: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 747.0K (764928 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149380: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 672.5K (688640 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context


The error file grows very quickly, and I end up killing the job.
I use a custom Score-P build. The details of the environment setup are in the attached job script, and the error output is attached too.
Without Score-P, the application runs as expected, even without specifying LD_PRELOAD for MPI.

When I run Score-P with the LD_PRELOAD set, I get the following error instead:


[Score-P] src/adapters/mpi/SCOREP_Mpi_Env.c:230: Warning: MPI environment initialization request and provided level exceed MPI_THREAD_FUNNELED!
2020-10-19 10:56:13.384533: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494285000 Hz
[rc0003:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: rc0003: task 0: Segmentation fault


Would appreciate any feedback on this issue.
Thanks in advance!

Anara
err_example.txt
job-example.txt

@AndreasGocht
Collaborator

Hi Anara,

thanks for reporting this again on GitHub. As already stated: please rebuild Score-P after all the modules you need are loaded, to avoid conflicting MPI versions.

Best,

Andreas
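
One way to check for the conflicting MPI versions mentioned above is to list the MPI shared libraries that are actually mapped into the running Python process. The following is only a sketch, assuming Linux (it reads /proc/self/maps); the function name mpi_libraries and the "file name contains mpi" filter are illustrative assumptions, not part of Score-P or its Python bindings.

```python
import os
import re


def mpi_libraries(maps_text):
    """Return sorted, de-duplicated paths of mapped shared objects whose
    file name contains 'mpi' (a heuristic, not an exhaustive MPI detector)."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # /proc/<pid>/maps has six columns; the last one is the path, if any.
        if len(fields) < 6:
            continue
        path = fields[5].strip()
        name = os.path.basename(path)
        if re.search(r"\.so(\.|$)", name) and "mpi" in name.lower():
            paths.add(path)
    return sorted(paths)


# Linux only: inspect the current process, ideally after importing the
# modules of interest so that their libraries are already loaded.
if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as handle:
        for lib in mpi_libraries(handle.read()):
            print(lib)
```

If more than one MPI installation shows up in the output, Score-P and the application were probably linked against different MPI libraries.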

@anarazh
Author

anarazh commented Oct 30, 2020

Hello Andreas,

I just rebuilt Score-P with the required modules loaded, and I still get the same errors.
Running a dummy Python script with the Python bindings works as usual.

@AndreasGocht
Collaborator

Could you please try to install the latest Python bindings? I made a small change to the LD_PRELOAD order.

But to be honest, I currently have no clue what's wrong here ...

Best,

Andreas

@anarazh
Author

anarazh commented Nov 9, 2020

Here is a discussion about a similar problem: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-September/007126.html
and here https://apps.fz-juelich.de/jsc/hps/jureca/known-issues.html#segmentation-faults-with-mvapich2

As mentioned, without Score-P the application behaves as expected with LD_PRELOAD unset.

I'm currently testing an MPI implementation other than MVAPICH and will let you know how it goes.

Best regards,
Anara

@AndreasGocht
Collaborator

Hey,

"I'm currently testing an MPI implementation other than MVAPICH and will let you know how it goes."

This seems the most promising approach. Unsetting LD_PRELOAD would make it very difficult to trace the MPI communication.

Best,

Andreas

@AndreasGocht
Collaborator

Thinking about this issue over the last few days, I had another idea that might help with debugging:

LD_DEBUG can be used to debug library issues (http://www.bnikolic.co.uk/blog/linux-ld-debug.html). Setting LD_DEBUG=all will show all library-related information. Can you set it before calling your application, once with and once without the Score-P Python bindings?

Best,

Andreas
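
A minimal sketch of the LD_DEBUG comparison suggested above, assuming a glibc-based Linux system. The helper ld_debug_run, the script name train.py and the output prefixes are hypothetical. LD_DEBUG_OUTPUT is a glibc loader variable that redirects the debug output to a file named prefix.pid. The --mpp=mpi option is assumed here for an MPI application, and the srun launcher from the job script is left out.

```python
import os
import subprocess
import sys


def ld_debug_run(command, output_prefix, categories="all", base_env=None):
    """Build (command, env) for running `command` with glibc loader debugging.

    LD_DEBUG selects what the dynamic loader reports ("all" is verbose,
    "libs" only shows the library search); LD_DEBUG_OUTPUT makes the loader
    write to <output_prefix>.<pid> instead of stderr.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_DEBUG"] = categories
    env["LD_DEBUG_OUTPUT"] = output_prefix
    return list(command), env


# Hypothetical script name; replace it with the real application.
SCRIPT = "train.py"

RUNS = {
    "plain": [sys.executable, SCRIPT],
    # The Score-P Python bindings are started via `python -m scorep`;
    # --mpp=mpi is an assumption for an MPI-parallel application.
    "scorep": [sys.executable, "-m", "scorep", "--mpp=mpi", SCRIPT],
}

if __name__ == "__main__" and os.path.exists(SCRIPT):
    for label, command in RUNS.items():
        cmd, env = ld_debug_run(command, "ld_debug_" + label)
        subprocess.run(cmd, env=env, check=False)
```

Comparing the two sets of output files shows which libmpi and CUDA libraries the loader resolves in each case, and in which order.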

@anarazh
Author

anarazh commented Dec 2, 2020

Thank you Andreas,

The original machine has been under maintenance since November. I am now using another machine with an OpenMPI implementation, and it seems to work so far, though I still need to run more tests. I'll run a test with LD_DEBUG once I get access to the original machine, which should be in early 2021 but may be sooner.

Best,
Anara

@AndreasGocht
Collaborator

As long as it works for you, no hurry ;-).

Best,

Andreas


2 participants