testCaProvider deadlock #165
Closed · mdavidsaver opened this issue Jun 10, 2020 · 25 comments · Fixed by #167

@mdavidsaver (Member)

I happened to time it right to catch a hung AppVeyor run for Base 7.0. I found testCaProvider.exe still running after ~45 min. I first tried killing caRepeater, with no apparent effect. The CI run completed shortly after I then killed the test itself; that last step caused some new output to appear in the log.

https://ci.appveyor.com/project/mdavidsaver/epics-base/build/job/rf9ojmbx2qmj0rpn

perl -CSD testCaProvider.t -tap > testCaProvider.tap
epicsThread::~epicsThread(): "winadc" blocking for thread "channelConnectThread" to exit
was epicsThread object destroyed before thread exit ?
epicsThread::~epicsThread(): "winadc" blocking for thread "channelConnectThread" to exit
was epicsThread object destroyed before thread exit ?
...
@mdavidsaver (Member, Author)

I don't think this is exactly the same as #163, since I think only one provider instance is created. It looks like an issue with the same code though.

@anjohnson (Member)

Thanks.

From that error message I suspect I know what the issue is here: deleting an epicsThread object from a different thread after the child has already called exitWait() or returned. I already have a fix and tests on the 3.15 branch, but it needs some work to merge up properly due to conflicts with the epicsThreadJoin() changes in 7.0. I will bump that upwards in priority, and also work on #163.

@mdavidsaver (Member, Author)

@anjohnson Is your fix imminent? If not, I'm going to temporarily disable testCaProvider from running on Windows. Having, e.g., 3 AppVeyor jobs time out during one run is annoying.

diff --git a/testCa/Makefile b/testCa/Makefile
index 1b1ded0f..a575235a 100644
--- a/testCa/Makefile
+++ b/testCa/Makefile
@@ -11,7 +11,9 @@ PROD_SYS_LIBS_WIN32 += netapi32 ws2_32
 
 TESTPROD_HOST += testCaProvider
 testCaProvider_SRCS += testCaProvider.cpp
+ifeq (,$(filter win%,$(T_A)))
 TESTS += testCaProvider
+endif
 ifdef BASE_3_16
   testCaProvider_SRCS += testIoc_registerRecordDeviceDriver.cpp
   REGRDDFLAGS = -l

@anjohnson (Member)

Sorry, somehow I hadn't internalized that this bug was what was hanging the Appveyor builds; I will work on this next (tomorrow).

@anjohnson (Member)

I may have a fix, or we may have a Heisenbug.

I aborted my first AppVeyor build job of this commit after it spent an hour building, which could have been testCaProvider.exe hanging, but I had no way of knowing that for sure.

I switched to using a .appveyor.yml file in a gist to allow me to remote into it (and also to cut out the older compilers, which didn't seem to fail testCaProvider anyway). The resulting build didn't hang any jobs, although it did have 2 other test failures the first time the tests were run.

Both of those test programs succeeded when they were re-run during the make test-results phase.

So I switched back to the standard built-in .appveyor.yml file in case enabling RDP was causing the problem, and this time none of the jobs are hanging, although we are still getting tests failing:

Note that the above builds were all against the same git commit. I will be re-doing that commit before the final merge, as I lost the connection to the 3.15 commit it was derived from and I'm improving the API doxy-description in the header, but the code changes there are what I'm currently proposing to fix this issue. Comments, ideas, or any other explanations for my first job hanging?

@mdavidsaver (Member, Author)

There are (I think) multiple issues at play, w.r.t. testpvalink in the first instance. This looks like one of the mysterious crashes on exit. I've seen various tests do this, including epicsMathTest. I'm hoping to troubleshoot this with my ci-core-dumper tool.

1..15
Warning: IOC is booting with TOP = "C:/projects/epics-base/modules/pva2pva"
          but was built with TOP = "C:/projects/epics-base"
############################################################################
## EPICS R7.0.4.1-DEV
## Rev. cf42f0910c2101177487-dirty
############################################################################
# ==== testGet ====
ok  1 - dbGetField("target:i.VAL", 7) -> 42 == 42
...
ok 15 - dbGetField("src:o2.VAL", 7) -> 14 == 14

> none of the jobs are hanging

Yup... That's why these issues have persisted so long.

@anjohnson (Member)

If there are no objections I'll work on merging those epicsThread class fixes into 3.15 and up to 7.0 (along with a few other benign changes from sonar/cppcheck that Ralph applied there).

@mdavidsaver (Member, Author)

What changes? Am I forgetting something?

@anjohnson (Member)

The changes to epicsThread.cpp that the words "this commit" linked to in the second paragraph of my longer post earlier today. The main code changes might be easier to understand if you first look at the equivalent commit on my 3.15 branch, where the fix isn't mixed up with removing the unnecessary epicsThread::joined handling from the 7.0 branch (the epicsThread class already handled joining).

@mdavidsaver (Member, Author) commented Jun 23, 2020

> my longer post earlier today

This got past me. Looking at it now, please do not push.

The thread join is necessary for me, please do not remove it. Having epicsThread actually join was what motivated me to do that work. If nothing else, it cuts down on false positives from valgrind (which I run regularly).

What situation are you trying to avoid?

@anjohnson (Member)

The C++ epicsThread API was designed with two different ways for the epicsThread object to be destroyed. In one, the thread returns from its run() method when it has finished its work, and its parent is responsible for deleting the epicsThread object. The order in which these two are initiated doesn't matter, but the object destructor waits for the run() method to have returned before it will return (the parent can use exitWait() with a timeout if it needs to avoid blocking and only do the delete when that indicates the thread has exited). The older C API was designed so that when the thread returns all its resources are immediately freed (which is what VxWorks did), so in the C++ implementation only the epicsThread object needs to be tidied up in the C++ destructor.

A C++ thread may also delete its own epicsThread object before the run() method returns. It does that when its parent cannot wait for it to return, so the parent cannot call the epicsThread destructor. This avoids leaking the epicsThread object for 'fire-and-forget' type operations.

In the thread-join branch the code was changed so the C epicsThreadCreate() routine is called with the joinable flag set. In the first case above this unfortunately keeps the child thread alive after run() has returned, until the parent calls exitWait() or deletes the epicsThread object, so returning from run() no longer immediately frees the OS resources such as the thread's stack. If the epicsThread object is never deleted (say it gets started by an iocsh command and finishes sometime later), those child threads hang around forever.

A thread started with the C API can cancel its own joinable flag by calling epicsThreadMustJoin(). The equivalent in the C++ API would be for the thread to call exitWait(), which is what this change implements.
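
A condensed sketch of the two patterns; Worker and FireAndForget are made-up names for illustration, not classes from Base:

// Sketch only: Worker and FireAndForget are hypothetical classes.
#include <epicsThread.h>

struct Worker : public epicsThreadRunable {
    epicsThread thread;
    Worker() : thread(*this, "worker",
            epicsThreadGetStackSize(epicsThreadStackSmall))
        { thread.start(); }
    virtual void run() {
        // Case 1: do the work and return. The parent later does
        // 'delete worker', and ~epicsThread() waits until run()
        // has returned before it returns.
    }
};

struct FireAndForget : public epicsThreadRunable {
    epicsThread *thread;
    FireAndForget() : thread(new epicsThread(*this, "fnf",
            epicsThreadGetStackSize(epicsThreadStackSmall)))
        { thread->start(); }
    virtual void run() {
        // Case 2: the parent never waits, so the thread tidies up
        // its own epicsThread object before run() returns.
        thread->exitWait();  // cancel the joinable flag (this change)
        delete thread;
    }
};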

Can your need for a join not be implemented by the parent deleting the epicsThread object, causing it to wait?

@mdavidsaver (Member, Author)

How does this result in testCaProvider hanging?

> Can your need for a join not be implemented by the parent deleting the epicsThread object, causing it to wait?

Not without calling pthread_join(). This is where valgrind lifetime tracking hooks in.

> The C++ epicsThread API was designed with two different ways for the epicsThread object to be destroyed

I'm dubious about this statement. This seems like apologizing for a bug.

@mdavidsaver (Member, Author)

> the thread returns from its run() method when it has finished its work, and its parent is responsible for deleting the epicsThread object.

> A C++ thread may also delete its own epicsThread object before the run() method returns

Hang on, in both of these cases ~epicsThread() is called, and thus exitWait(10.0). All of the code paths where exitWait()==true should pass through epicsThreadMustJoin(). If this isn't the case, it is a bug.

@mdavidsaver (Member, Author)

To be clear, a leak will occur if the exitWait(10.0) in the destructor times out and returns false. This is a bug by design.

@anjohnson (Member) commented Jun 24, 2020

Either I've misunderstood you, or you've missed the fact that the exitWait(10.0) in the destructor is in the condition of a while loop, not an if statement. There is no leak on timeout; the destructor just won't return at all until the thread's run() method has returned to epicsThreadCallEntryPoint() and the terminated flag is set.

That's how the join functionality of the epicsThread class delays the parent thread.
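
For reference, the shape of that destructor, paraphrased (not the verbatim epicsThread.cpp source):

// Paraphrase of the destructor's join logic, not the verbatim source.
epicsThread::~epicsThread()
{
    while (!exitWait(10.0)) {
        char name[64];
        epicsThreadGetName(epicsThreadGetIdSelf(), name, sizeof(name));
        fprintf(stderr, "epicsThread::~epicsThread(): \"%s\" blocking"
            " for thread \"%s\" to exit\n", name, getName());
        fprintf(stderr, "was epicsThread object destroyed"
            " before thread exit ?\n");
    }
    // Only reached once run() has returned to epicsThreadCallEntryPoint()
    // and the terminated flag has been set.
}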

@mdavidsaver (Member, Author)

You're right, I was seeing an if. So, I don't see any leak.

It would be helpful if you could explain what you see w.r.t. testCaProvider (or CaProvider more generally) that leads you to conclude there is a bug in epicsThread.

@anjohnson (Member)

I will work on that. Can you add a link to the exact job log where you caught the errors, please? The link in the issue description above only takes me to your latest build log on AppVeyor, and I can't find the relevant one from "about 15 days ago".

@mdavidsaver (Member, Author)

> latest build log on Appveyor

Right, that links to the job but not the build, which isn't useful.

https://ci.appveyor.com/project/mdavidsaver/epics-base/builds/33400081/job/rf9ojmbx2qmj0rpn#L10750

@anjohnson (Member)

Thanks, it wasn't useful to you but it told me what I wanted to see.

Does the valgrind lifetime tracking that you're wanting to keep still work for threads that call epicsThreadMustJoin() themselves? Does it matter when they call it?

@mdavidsaver (Member, Author) commented Jun 24, 2020

A thread which joins itself actually triggers a message about EDEADLK.

The point is to detect cases where one thread joins another, so that subsequent access by the joining thread is known not to race with any access by the joined thread. As a counter-example, epicsThreadTest is written in a way which allows a wait()+destroy vs. trigger() race (a condensed sketch of the pattern follows the log below). This message would disappear if the worker thread were actually joined, as valgrind could then prove that no race was possible.

 Possible data race during read of size 1 at 0x50A5BC8 by thread #1
 Locks held: none
    at 0x48396DC: my_memcmp (hg_intercepts.c:211)
    by 0x4839A2D: mutex_destroy_WRK (hg_intercepts.c:850)
    by 0x488F4B9: epicsEventDestroy (osdEvent.c:72)
    by 0x4885EDA: epicsEvent::~epicsEvent() (epicsEvent.cpp:51)
    by 0x10AADB: (anonymous namespace)::testJoining() (epicsThreadTest.cpp:156)
    by 0x10AD1A: main (epicsThreadTest.cpp:229)
 
 This conflicts with a previous write of size 4 by thread #8
 Locks held: none
    at 0x4DEAEBA: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:50)
    by 0x483A236: mutex_unlock_WRK (hg_intercepts.c:1097)
    by 0x488F600: epicsEventTrigger (osdEvent.c:90)
    by 0x4885EF9: epicsEvent::trigger() (epicsEvent.cpp:56)
    by 0x10ADC1: epicsEvent::signal() (epicsEvent.h:95)
    by 0x10A976: (anonymous namespace)::joinTests(void*) (epicsThreadTest.cpp:149)
    by 0x488D171: start_routine (osdThread.c:410)
    by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
  Address 0x50a5bc8 is 8 bytes inside a block of size 96 alloc'd
    at 0x48367CF: malloc (vg_replace_malloc.c:299)
    by 0x488F392: epicsEventCreate (osdEvent.c:49)
    by 0x4885E6F: epicsEvent::epicsEvent(epicsEventInitialState) (epicsEvent.cpp:42)
    by 0x10A9E0: (anonymous namespace)::testJoining() (epicsThreadTest.cpp:156)
    by 0x10AD1A: main (epicsThreadTest.cpp:229)
  Block was alloc'd by thread #1
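
The flagged pattern, condensed; this is an illustration of the shape of the race, not the actual epicsThreadTest code:

// Illustration of the wait()+destroy vs. trigger() race;
// hypothetical code, not the actual epicsThreadTest source.
#include <epicsThread.h>
#include <epicsEvent.h>

static void worker(void *raw)
{
    epicsEvent *done = (epicsEvent *) raw;
    done->signal();    // thread B: may still be inside signal() ...
}

static void testRace(void)
{
    epicsEvent done;
    epicsThreadCreate("worker", epicsThreadPriorityMedium,
        epicsThreadGetStackSize(epicsThreadStackSmall), &worker, &done);
    done.wait();       // thread A: wakes as soon as signal() releases it,
                       // then destroys 'done' on return, possibly while
                       // thread B is still inside signal(). An actual
                       // join here would let valgrind prove no overlap.
}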

@anjohnson (Member)

> A thread which joins itself actually triggers a message about EDEADLK.

I never mentioned calling pthread_join(); I was really asking about non-joinable threads. After looking at the docs for valgrind I see that they can't be checked.

It seems to me that the only way to let valgrind properly handle the cases where an epicsThread has no parent that can join it would be to create some kind of reaper thread (whose name should be "Dēāṯẖ" BTW, see Terry Pratchett; or you could call it an undertaker since it handles cleanup after death, but that's boring) that waits on a message queue. With valgrind attached, when a thread returns to start_routine() with a joinable flag of zero, that routine (or a registered atThreadExit routine) should send the reaper thread a message asking "join me". In this mode POSIX threads would never need to actually be detached at all; they would only have their joinable flag manipulated.

This solves the valgrind tracking problem for detached threads using both the C and C++ thread APIs. How and when the reaper thread would get cleaned up is an exercise I leave to you to ponder.
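
A rough sketch of that reaper, using an epicsMessageQueue carrying the IDs of threads to be joined; all of the names below are hypothetical, none of this exists in Base:

// Hypothetical sketch only; none of these names exist in Base.
#include <epicsThread.h>
#include <epicsMessageQueue.h>

static epicsMessageQueue reaperQueue(64, sizeof(epicsThreadId));

static void reaperMain(void *)
{
    for (;;) {
        epicsThreadId victim;
        reaperQueue.receive(&victim, sizeof(victim));
        // The exiting thread kept its joinable flag set, so joining
        // it here gives valgrind a real pthread_join() to observe.
        epicsThreadMustJoin(victim);
    }
}

// Called from start_routine() (or a registered atThreadExit routine)
// by a thread whose joinable flag is zero, just before it exits:
static void reapMe(void)
{
    epicsThreadId self = epicsThreadGetIdSelf();
    reaperQueue.send(&self, sizeof(self));
}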

I'm working on #163 now BTW.

@anjohnson (Member)

Okay, here's my analysis. I originally thought there would be a connection to the epicsThread class because of the messages you turned up, although that isn't the main cause of the problem.

Those messages appear after the CAChannelProvider destructor has supposedly shut down the ChannelConnectThread, while the epicsThread object is being destroyed. For them to be printed, the thread must still be running (the run() method hasn't returned) when the epicsThread destructor is called.

The CAChannelProvider destructor calls the ChannelConnectThread::stop() method:

void ChannelConnectThread::stop()
{
    {
        Lock xx(mutex);
        isStop = true;
    }
    waitForCommand.signal();
    waitForStop.wait();
}

It looks like the loop inside the run() method is the problem:

void ChannelConnectThread::run()
{
    while(true)
    {
         waitForCommand.wait();
         while(true) {
             bool more = false;
             NotifyChannelRequester* notifyChannelRequester(NULL);
             {
                 Lock lock(mutex);
                 if(!notifyChannelQueue.empty())
                 {
                      more = true;
                      // ...
                 }
             }
             if(!more) break;
             // ...
         }
         if(isStop) {
             waitForStop.signal();
             break;
         }
    }
}

The bug is most likely that isStop is being accessed towards the end of that loop without owning the mutex. I haven't proven this yet but I think it's likely. The same applies to the other 3 threads.
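
If that's the cause, the fix is presumably along these lines: an untested sketch of ChannelConnectThread::run() with the queue-draining inner loop elided, taking the mutex before reading isStop:

void ChannelConnectThread::run()
{
    while(true)
    {
        waitForCommand.wait();
        // ... drain notifyChannelQueue as before ...
        bool stopping;
        {
            Lock xx(mutex);    // own the mutex while reading isStop
            stopping = isStop;
        }
        if(stopping) {
            waitForStop.signal();
            break;
        }
    }
}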

@mdavidsaver (Member, Author)

That would fit.

 Possible data race during write of size 1 at 0x596A008 by thread #1
 Locks held: 1, at address 0x596A0D0
    at 0x4B49734: epics::pvAccess::ca::ChannelConnectThread::stop() (channelConnectThread.cpp:61)
    by 0x4B54116: epics::pvAccess::ca::CAChannelProvider::~CAChannelProvider() (caProvider.cpp:83)
    by 0x4B541EB: epics::pvAccess::ca::CAChannelProvider::~CAChannelProvider() (caProvider.cpp:89)
    by 0x4B58CF5: std::_Sp_counted_ptr<epics::pvAccess::ca::CAChannelProvider*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:377)
    by 0x12140F: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:155)
    by 0x120144: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:728)
    by 0x11F479: std::__shared_ptr<epics::pvAccess::ChannelProvider, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (shared_ptr_base.h:1167)
    by 0x11F495: std::shared_ptr<epics::pvAccess::ChannelProvider>::~shared_ptr() (shared_ptr.h:103)
    by 0x11E628: main (testCaProvider.cpp:787)
 
 This conflicts with a previous read of size 1 by thread #19
 Locks held: none
    at 0x4B499EC: epics::pvAccess::ca::ChannelConnectThread::run() (channelConnectThread.cpp:108)
    by 0x4FCAD27: epicsThreadCallEntryPoint (epicsThread.cpp:94)
    by 0x4FD4171: start_routine (osdThread.c:410)
    by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
    by 0x552DFA2: start_thread (pthread_create.c:486)
    by 0x545E4CE: clone (clone.S:95)
  Address 0x596a008 is 8 bytes inside a block of size 136 alloc'd
    at 0x4836E3F: operator new(unsigned long) (vg_replace_malloc.c:334)
    by 0x4B49394: epics::pvAccess::ca::ChannelConnectThread::get() (channelConnectThread.cpp:29)
    by 0x4B53E5F: epics::pvAccess::ca::CAChannelProvider::CAChannelProvider(std::shared_ptr<epics::pvAccess::Configuration> const&) (caProvider.cpp:46)
    by 0x4B5862E: epics::pvAccess::SimpleChannelProviderFactory<epics::pvAccess::ca::CAChannelProvider>::sharedInstance() (pvAccess.h:1374)
    by 0x498D380: epics::pvAccess::ChannelProviderRegistry::getProvider(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (ChannelAccessFactory.cpp:48)
    by 0x11D88B: main (testCaProvider.cpp:787)

valgrind --tool=helgrind flags other, earlier races as well; e.g. this one looks like a wait()+destroy vs. signal() race.

 Possible data race during write of size 8 at 0x596EA88 by thread #1
 Locks held: none
    at 0x4CC51C1: epics::pvData::Event::~Event() (event.cpp:35)
    by 0x123B35: TestChannel::~TestChannel() (testCaProvider.cpp:65)
    by 0x123B9F: TestChannel::~TestChannel() (testCaProvider.cpp:65)
... 
 This conflicts with a previous read of size 8 by thread #19
 Locks held: none
    at 0x4CC5226: epics::pvData::Event::signal() (event.cpp:46)
    by 0x119BB0: TestChannel::channelStateChange(std::shared_ptr<epics::pvAccess::Channel> const&, epics::pvAccess::Channel::ConnectionState) (testCaProvider.cpp:115)
    by 0x4B59382: epics::pvAccess::ca::CAChannel::notifyClient() (caChannel.cpp:113)
    by 0x4B499D5: epics::pvAccess::ca::ChannelConnectThread::run() (channelConnectThread.cpp:105)
...

@mdavidsaver (Member, Author)

I caught another hung test and tried to debug it. Unfortunately, it seems that the Visual Studio license on the AppVeyor 2019 VM has expired. I only had 10 minutes left, which wasn't enough time to figure out another approach.

@mdavidsaver (Member, Author)

> There are (I think) multiple issues at play, w.r.t. testpvalink in the first instance. This looks like one of the mysterious crashes on exit.

An example of epicsTypesTest crashing on exit, which is impressive given how little it does. Downloading epicsTypesTest.tap I can see that all 10 cases were 'ok'. Unfortunately, this was not a branch with ci-core-dumper installed.

https://ci.appveyor.com/project/mdavidsaver/epics-base/builds/33808590/job/117avxxjc6w5qx0x#L10369

@anjohnson linked a pull request (#167) on Jun 30, 2020 that will close this issue