This repository has been archived by the owner on Dec 9, 2024. It is now read-only.

Segmentation violation when running multithreaded in CMSSW #311

Closed
VourMa opened this issue Jul 28, 2023 · 2 comments · Fixed by #316

Comments

@VourMa
Contributor

VourMa commented Jul 28, 2023

If one follows the instructions on how to integrate LST in CMSSW and runs step3 with more than one thread/stream, a segmentation violation occurs:

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Fri Jul 28 12:58:39 PDT 2023
Thread 6 (Thread 0x7fc0b623a700 (LWP 2501614) "cmsRun"):
#0  0x00007fc196191d96 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fc196191e88 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fc1847f4812 in ?? () from /lib64/libcuda.so.1
#3  0x00007fc184804b98 in ?? () from /lib64/libcuda.so.1
#4  0x00007fc1961891cf in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc195df5dd3 in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fc11e0f1700 (LWP 2501363) "cmsRun"):
#0  0x00007fc195eb6658 in nanosleep () from /lib64/libc.so.6
#1  0x00007fc195eb655e in sleep () from /lib64/libc.so.6
#2  0x00007fc18f0ee360 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc0edcfc292 in __gnu_cxx::__aligned_membuf<std::pair<unsigned int const, float> >::_M_ptr (this=0x7fc09be363a0) at /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/ext/aligned_buffer.h:77
#5  0x00007fc0edcfc116 in std::_Rb_tree_node<std::pair<unsigned int const, float> >::_M_valptr (this=0x7fc09be36380) at /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/bits/stl_tree.h:239
#6  0x00007fc0edcfbc73 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, float>, std::_Select1st<std::pair<unsigned int const, float> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, float> > >::_S_key (__x=0x7fc09be36380) at /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/bits/stl_tree.h:785
#7  0x00007fc0edcfbe36 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, float>, std::_Select1st<std::pair<unsigned int const, float> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, float> > >::_M_lower_bound (this=0x7fc09c223c50, __x=0x7fc09be36380, __y=0x7fc09b1450d0, __k=@0x7fc11e0ea0dc: 442241098) at /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/bits/stl_tree.h:1935
#8  0x00007fc0edcfbb7d in std::_Rb_tree<unsigned int, std::pair<unsigned int const, float>, std::_Select1st<std::pair<unsigned int const, float> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, float> > >::lower_bound (this=0x7fc09c223c50, __k=@0x7fc11e0ea0dc: 442241098) at /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/bits/stl_tree.h:1277
#9  0x00007fc0edcfb865 in std::map<unsigned int, float, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, float> > >::lower_bound (this=0x7fc09c223c50, __x=@0x7fc11e0ea0dc: 442241098) at /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/bits/stl_map.h:1259
#10 0x00007fc0edcfb5cc in std::map<unsigned int, float, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, float> > >::operator[] (this=0x7fc09c223c50, __k=@0x7fc11e0ea0dc: 442241098) at /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/bits/stl_map.h:497
#11 0x00007fc0edd10ed1 in SDL::loadModulesFromFile (modulesInGPU=..., nModules=@0x7fc0ee5c67b8: 26401, nLowerModules=@0x7fc0ee5c67ba: 13200, pixelMapping=..., stream=0x0, moduleMetaDataFilePath=0x7fc09a8dca00 "/home/users/evourlio/LSTinCMSSW/cgpu-1/CMSSW_13_0_0_pre4/src/../../../TrackLooper/data/centroid_CMSSW_12_2_0_pre2.txt") at Module.cu:350
#12 0x00007fc0edcff622 in SDL::initModules (moduleMetaDataFilePath=0x7fc09a8dca00 "/home/users/evourlio/LSTinCMSSW/cgpu-1/CMSSW_13_0_0_pre4/src/../../../TrackLooper/data/centroid_CMSSW_12_2_0_pre2.txt") at Event.cu:490
#13 0x00007fc0edceff84 in SDL::LST::eventSetup (this=0x7fc101f59918) at LST.cc:10
#14 0x00007fc0fe03bc0a in alpaka_cuda_async::LSTProducer::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) () from /home/users/evourlio/LSTinCMSSW/cgpu-1/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginRecoTrackerLSTPluginsPortableCudaAsync.so
#15 0x00007fc198a39298 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#16 0x00007fc198a3779a in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007fc198a0ace9 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00007fc198a0ae7e in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00007fc19896d114 in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x00007fc198b689f9 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreConcurrency.so
#21 0x00007fc197042304 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fc193dbd300, waiter=..., this=0x7fc193ed7e80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#22 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fc193ed7e80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#23 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/arena.cpp:137
#24 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/market.cpp:599
#25 0x00007fc1970444c6 in tbb::detail::r1::rml::private_worker::run (this=0x7fc19172a100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#26 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fc19172a100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#27 0x00007fc1961891cf in start_thread () from /lib64/libpthread.so.0
#28 0x00007fc195df5dd3 in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fc14ffff700 (LWP 2501340) "cuda-EvtHandlr"):
#0  0x00007fc195ee0ac1 in poll () from /lib64/libc.so.6
#1  0x00007fc184809b89 in ?? () from /lib64/libcuda.so.1
#2  0x00007fc1848b0d7b in ?? () from /lib64/libcuda.so.1
#3  0x00007fc184804b98 in ?? () from /lib64/libcuda.so.1
#4  0x00007fc1961891cf in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc195df5dd3 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fc16142b700 (LWP 2501339) "cuda-EvtHandlr"):
#0  0x00007fc195ee0ac1 in poll () from /lib64/libc.so.6
#1  0x00007fc184809b89 in ?? () from /lib64/libcuda.so.1
#2  0x00007fc1848b0d7b in ?? () from /lib64/libcuda.so.1
#3  0x00007fc184804b98 in ?? () from /lib64/libcuda.so.1
#4  0x00007fc1961891cf in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc195df5dd3 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fc167cdb700 (LWP 2501323) "cmsRun"):
#0  0x00007fc196193662 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fc18f0ee517 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007fc18f0ef0ca in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007fc1968179b4 in std::execute_native_thread_routine (__p=0x7fc192c60590) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#4  0x00007fc1961891cf in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc195df5dd3 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fc195318640 (LWP 2501182) "cmsRun"):
#0  0x00007fc195ee0ac1 in poll () from /lib64/libc.so.6
#1  0x00007fc18f0ee80f in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007fc18f0ef19c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007fc18f0f1b1b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fc0edd03cac in SDL::Event::createPixelQuintuplets (this=0x7ffc89a705c0) at Event.cu:1360
#6  0x00007fc0edcebf43 in SDL::LST::run (this=0x7fc101f59318, stream=<optimized out>, verbose=<optimized out>, see_px=..., see_py=..., see_pz=..., see_dxy=..., see_dz=..., see_ptErr=..., see_etaErr=..., see_stateTrajGlbX=..., see_stateTrajGlbY=..., see_stateTrajGlbZ=..., see_stateTrajGlbPx=..., see_stateTrajGlbPy=..., see_stateTrajGlbPz=..., see_q=..., see_hitIdx=..., ph2_detId=..., ph2_x=..., ph2_y=..., ph2_z=...) at LST.cc:140
#7  0x00007fc0fe03c93a in alpaka_cuda_async::LSTProducer::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) () from /home/users/evourlio/LSTinCMSSW/cgpu-1/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/pluginRecoTrackerLSTPluginsPortableCudaAsync.so
#8  0x00007fc198a39298 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9  0x00007fc198a3779a in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007fc198a0ace9 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007fc198a0ae7e in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007fc19896d114 in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007fc198b689f9 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreConcurrency.so
#14 0x00007fc1970499cd in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fc100e5b200, this=0x7fc193ed7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fc193ed7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fc1988ec40d in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00007fc1988d4211 in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00007fc1988e0dc6 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x000000000040a1bd in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007fc197037847 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_0_pre4-el8_amd64_gcc11/build/CMSSW_13_0_0_pre4-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-0282c02a966e31ef3a1f3b1a4ea0f8fa/tbb-v2021.8.0/src/tbb/arena.cpp:694
#22 0x000000000040b009 in main::{lambda()#1}::operator()() const ()
#23 0x000000000040971c in main ()

Current Modules:

Module: alpaka_cuda_async::LSTProducer:lstProducer (crashed)
Module: alpaka_cuda_async::LSTProducer:lstProducer

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)
@slava77
Contributor

slava77 commented Jul 28, 2023

@dan131riley

@slava77
Contributor

slava77 commented Jul 28, 2023

For the multithreading case, I think the reason is already known: loadModulesFromFile is executed per event and writes to non-const globals.

IIUC, there are more uses of globals in other places; chances are they will crash as well once the very slow loadModulesFromFile method is made safe.
More details are in #287, item 3, and later in that discussion.
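
To illustrate the failure mode described above, here is a minimal sketch of one common way to make a shared lookup table safe across streams: fill it exactly once with std::call_once, then read it only through const lookups. The names ModuleTable, loadOnce, value and runStreams are hypothetical and the data is made up; this is not the actual SDL code nor necessarily the fix adopted for this issue. The relevant detail from the stack trace is that std::map::operator[] can insert, so calling it on a global map from several threads at once is a data race.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for shared module metadata; not the actual SDL globals.
class ModuleTable {
public:
  // Fill the table exactly once, no matter how many threads call this.
  void loadOnce() {
    std::call_once(loaded_, [this] {
      for (unsigned int detId = 0; detId < 1000; ++detId) {
        // operator[] may insert, so it is only used inside the one-time setup.
        values_[detId] = 0.5f * static_cast<float>(detId);
      }
    });
  }

  // Read-only lookup: find() never modifies the map, unlike operator[].
  float value(unsigned int detId) const {
    auto it = values_.find(detId);
    return it != values_.end() ? it->second : -1.0f;
  }

  std::size_t size() const { return values_.size(); }

private:
  std::once_flag loaded_;
  std::map<unsigned int, float> values_;
};

// Simulate several streams that each call the setup step for every "event".
inline std::size_t runStreams(ModuleTable& table, unsigned int nThreads) {
  std::vector<std::thread> workers;
  for (unsigned int t = 0; t < nThreads; ++t) {
    workers.emplace_back([&table] {
      for (int event = 0; event < 100; ++event) {
        table.loadOnce();        // cheap after the first successful call
        (void)table.value(42);   // concurrent const reads are safe
      }
    });
  }
  for (auto& w : workers) w.join();
  return table.size();
}
```

In this sketch the per-event call to loadOnce() costs little after the first event, and no thread writes to the map once setup has completed. Any other mutable globals would need the same treatment, or would need to become per-stream state.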

VourMa linked a pull request Aug 24, 2023 that will close this issue