testCaProvider deadlock #165
Closed · mdavidsaver opened this issue Jun 10, 2020 · 25 comments · Fixed by #167

@mdavidsaver (Member)

I happened to time it right to catch a hung AppVeyor run for Base 7.0. I found testCaProvider.exe still running after ~45 min. I first tried killing caRepeater, with no apparent effect. The CI run completed shortly after I then killed the test itself; that last step caused some new output to appear in the log.

https://ci.appveyor.com/project/mdavidsaver/epics-base/build/job/rf9ojmbx2qmj0rpn

perl -CSD testCaProvider.t -tap > testCaProvider.tap
epicsThread::~epicsThread(): "winadc" blocking for thread "channelConnectThread" to exit
was epicsThread object destroyed before thread exit ?
epicsThread::~epicsThread(): "winadc" blocking for thread "channelConnectThread" to exit
was epicsThread object destroyed before thread exit ?
...
@mdavidsaver (Member, Author)

I don't think this is exactly the same as #163, since I think only one provider instance is created. It looks like an issue with the same code though.

@anjohnson (Member)

Thanks.

From that error message I suspect I know what the issue is here: deleting an epicsThread object from a different thread after the child has already called exitWait() or returned. I already have a fix and tests on the 3.15 branch, but it needs some work to merge up properly due to conflicts with the epicsThreadJoin() changes in 7.0. I will bump that upwards in priority, and also work on #163.

@mdavidsaver (Member, Author)

@anjohnson Is your fix imminent? If not, I'm going to temporarily disable testCaProvider from running on Windows. Having, e.g., 3 AppVeyor jobs time out during one run is annoying.

diff --git a/testCa/Makefile b/testCa/Makefile
index 1b1ded0f..a575235a 100644
--- a/testCa/Makefile
+++ b/testCa/Makefile
@@ -11,7 +11,9 @@ PROD_SYS_LIBS_WIN32 += netapi32 ws2_32
 
 TESTPROD_HOST += testCaProvider
 testCaProvider_SRCS += testCaProvider.cpp
+ifeq (,$(filter win%,$(T_A)))
 TESTS += testCaProvider
+endif
 ifdef BASE_3_16
   testCaProvider_SRCS += testIoc_registerRecordDeviceDriver.cpp
   REGRDDFLAGS = -l

@anjohnson (Member)

Sorry, somehow I hadn't internalized that this bug was what was hanging the Appveyor builds; I will work on this next (tomorrow).

@anjohnson (Member)

I may have a fix, or we may have a Heisenbug.

I aborted my first AppVeyor build job of this commit after it spent an hour building, which could have been testCaProvider.exe hanging, but I had no way of knowing that for sure.

I switched to using a .appveyor.yml file in a gist to allow me to remote into it (and also to cut out the older compilers, which didn't seem to fail testCaProvider anyway). The resulting build didn't hang any jobs, although it did have 2 other test failures the first time the tests were run.

Both of those test programs succeeded when they were re-run during the make test-results phase.

So I switched back to the standard built-in .appveyor.yml file in case enabling RDP was causing the problem, and this time none of the jobs are hanging, although we are still getting tests failing:

Note that the above builds were all against the same git commit. I will be re-doing that commit before the final merge, as I lost the connection to the 3.15 commit it was derived from and I'm improving the API doxy-description in the header, but the code changes there are what I'm currently proposing to fix this issue. Comments, ideas, or any other explanations for my first job hanging?

@mdavidsaver (Member, Author)

There are (I think) multiple issues at play, w.r.t. testpvalink in the first instance. This looks like one of the mysterious crashes on exit. I've seen various tests do this, including epicsMathTest. I'm hoping to troubleshoot this with my ci-core-dumper tool.

1..15
Warning: IOC is booting with TOP = "C:/projects/epics-base/modules/pva2pva"
          but was built with TOP = "C:/projects/epics-base"
############################################################################
## EPICS R7.0.4.1-DEV
## Rev. cf42f0910c2101177487-dirty
############################################################################
# ==== testGet ====
ok  1 - dbGetField("target:i.VAL", 7) -> 42 == 42
...
ok 15 - dbGetField("src:o2.VAL", 7) -> 14 == 14

> none of the jobs are hanging

Yup... That's why these issues have persisted so long.

@anjohnson (Member)

If there are no objections I'll work on merging those epicsThread class fixes into 3.15 and up to 7.0 (along with a few other benign changes from sonar/cppcheck that Ralph applied there).

@mdavidsaver (Member, Author)

What changes? Am I forgetting something?

@anjohnson (Member)

The changes to epicsThread.cpp that the words "this commit" linked to in the second paragraph of my longer post earlier today. The main code changes might be easier to understand if you first look at the equivalent commit on my 3.15 branch, where the fix isn't mixed up with removing the unnecessary epicsThread::joined handling from the 7.0 branch (the epicsThread class already handled joining).

@mdavidsaver (Member, Author) commented Jun 23, 2020

> my longer post earlier today

This got past me. Looking at it now, please do not push.

The thread join is necessary for me, please do not remove it. Having epicsThread actually join was what motivated me to do that work. If nothing else, it cuts down on false positives from valgrind (which I run regularly).

What situation are you trying to avoid?

@anjohnson (Member)

The C++ epicsThread API was designed with two different ways for the epicsThread object to be destroyed. In one, the thread returns from its run() method when it has finished its work, and its parent is responsible for deleting the epicsThread object. The order in which these two are initiated doesn't matter, but the object destructor waits for the run() method to have returned before it will return (the parent can use exitWait() with a timeout if it needs to avoid blocking and only do the delete when that indicates the thread has exited). The older C API was designed so that when the thread returns all its resources are immediately freed (which is what VxWorks did), so in the C++ implementation only the epicsThread object needs to be tidied up in the C++ destructor.

A C++ thread may also delete its own epicsThread object before the run() method returns. It does that when its parent cannot wait for it to return, so the parent cannot call the epicsThread destructor. This avoids leaking the epicsThread object for 'fire-and-forget' type operations.

In the thread-join branch the code was changed so the C epicsThreadCreate() routine is called with the joinable flag set. In the first case above this unfortunately keeps the child thread alive after run() has returned, until the parent calls exitWait() or deletes the epicsThread object, so returning from run() no longer immediately frees the OS resources such as the thread's stack. If the epicsThread object is never deleted (say it gets started by an iocsh command and finishes sometime later), those child threads hang around forever.

A thread started with the C API can cancel its own joinable flag by calling epicsThreadMustJoin(). The equivalent in the C++ API would be for the thread to call exitWait(), which is what this change implements.
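
A condensed sketch of the two patterns; Worker and FireAndForget are made-up names for illustration, not classes from Base:

// Sketch only: Worker and FireAndForget are hypothetical classes.
#include <epicsThread.h>

struct Worker : public epicsThreadRunable {
    epicsThread thread;
    Worker() : thread(*this, "worker",
            epicsThreadGetStackSize(epicsThreadStackSmall))
        { thread.start(); }
    virtual void run() {
        // Case 1: do the work and return. The parent later does
        // 'delete worker', and ~epicsThread() waits until run()
        // has returned before it returns.
    }
};

struct FireAndForget : public epicsThreadRunable {
    epicsThread *thread;
    FireAndForget() : thread(new epicsThread(*this, "fnf",
            epicsThreadGetStackSize(epicsThreadStackSmall)))
        { thread->start(); }
    virtual void run() {
        // Case 2: the parent never waits, so the thread tidies up
        // its own epicsThread object before run() returns.
        thread->exitWait();  // cancel the joinable flag (this change)
        delete thread;
    }
};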

Can your need for a join not be implemented by the parent deleting the epicsThread object, causing it to wait?

@mdavidsaver (Member, Author)

How does this result in testCaProvider hanging?

> Can your need for a join not be implemented by the parent deleting the epicsThread object, causing it to wait?

Not without calling pthread_join(). This is where valgrind lifetime tracking hooks in.

> The C++ epicsThread API was designed with two different ways for the epicsThread object to be destroyed

I'm dubious about this statement. This seems like apologizing for a bug.

@mdavidsaver (Member, Author)

> the thread returns from its run() method when it has finished its work, and its parent is responsible for deleting the epicsThread object.

> A C++ thread may also delete its own epicsThread object before the run() method returns

Hang on, in both of these cases ~epicsThread() is called, and thus exitWait(10.0). All of the code paths where exitWait()==true should pass through epicsThreadMustJoin(). If this isn't the case, it is a bug.

@mdavidsaver (Member, Author)

To be clear, a leak will occur if the exitWait(10.0) in the destructor times out and returns false. This is a bug by design.

@anjohnson (Member) commented Jun 24, 2020

Either I've misunderstood you, or you've missed the fact that the exitWait(10.0) in the destructor is in the condition of a while loop, not an if statement. There is no leak on timeout; the destructor just won't return at all until the thread's run() method has returned to epicsThreadCallEntryPoint() and the terminated flag is set.

That's how the join functionality of the epicsThread class delays the parent thread.
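
For reference, the shape of that destructor, paraphrased (not the verbatim epicsThread.cpp source):

// Paraphrase of the destructor's join logic, not the verbatim source.
epicsThread::~epicsThread()
{
    while (!exitWait(10.0)) {
        char name[64];
        epicsThreadGetName(epicsThreadGetIdSelf(), name, sizeof(name));
        fprintf(stderr, "epicsThread::~epicsThread(): \"%s\" blocking"
            " for thread \"%s\" to exit\n", name, getName());
        fprintf(stderr, "was epicsThread object destroyed"
            " before thread exit ?\n");
    }
    // Only reached once run() has returned to epicsThreadCallEntryPoint()
    // and the terminated flag has been set.
}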

@mdavidsaver (Member, Author)

You're right, I was seeing an if. So, I don't see any leak.

It would be helpful if you could explain what you see w.r.t. testCaProvider (or CaProvider more generally) that leads you to conclude there is a bug in epicsThread.

@anjohnson (Member)

I will work on that. Can you add a link to the exact job log where you caught the errors, please? The link in the issue description above only takes me to your latest build log on AppVeyor, and I can't find the relevant one from "about 15 days ago".

@mdavidsaver (Member, Author)

> latest build log on Appveyor

Right, that links to the job but not the build, which isn't useful.

https://ci.appveyor.com/project/mdavidsaver/epics-base/builds/33400081/job/rf9ojmbx2qmj0rpn#L10750

@anjohnson (Member)

Thanks, it wasn't useful to you but it told me what I wanted to see.

Does the valgrind lifetime tracking that you're wanting to keep still work for threads that call epicsThreadMustJoin() themselves? Does it matter when they call it?

@mdavidsaver (Member, Author) commented Jun 24, 2020

A thread which joins itself actually triggers a message about EDEADLK.

The point is to detect cases where one thread joins another, so that subsequent access by the joining thread is known not to race with any access by the joined thread. As a counter-example, epicsThreadTest is written in a way which allows a wait()+destroy vs. trigger() race (a condensed sketch of the pattern follows the log below). This message would disappear if the worker thread were actually joined, as valgrind could then prove that no race was possible.

 Possible data race during read of size 1 at 0x50A5BC8 by thread #1
 Locks held: none
    at 0x48396DC: my_memcmp (hg_intercepts.c:211)
    by 0x4839A2D: mutex_destroy_WRK (hg_intercepts.c:850)
    by 0x488F4B9: epicsEventDestroy (osdEvent.c:72)
    by 0x4885EDA: epicsEvent::~epicsEvent() (epicsEvent.cpp:51)
    by 0x10AADB: (anonymous namespace)::testJoining() (epicsThreadTest.cpp:156)
    by 0x10AD1A: main (epicsThreadTest.cpp:229)
 
 This conflicts with a previous write of size 4 by thread #8
 Locks held: none
    at 0x4DEAEBA: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:50)
    by 0x483A236: mutex_unlock_WRK (hg_intercepts.c:1097)
    by 0x488F600: epicsEventTrigger (osdEvent.c:90)
    by 0x4885EF9: epicsEvent::trigger() (epicsEvent.cpp:56)
    by 0x10ADC1: epicsEvent::signal() (epicsEvent.h:95)
    by 0x10A976: (anonymous namespace)::joinTests(void*) (epicsThreadTest.cpp:149)
    by 0x488D171: start_routine (osdThread.c:410)
    by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
  Address 0x50a5bc8 is 8 bytes inside a block of size 96 alloc'd
    at 0x48367CF: malloc (vg_replace_malloc.c:299)
    by 0x488F392: epicsEventCreate (osdEvent.c:49)
    by 0x4885E6F: epicsEvent::epicsEvent(epicsEventInitialState) (epicsEvent.cpp:42)
    by 0x10A9E0: (anonymous namespace)::testJoining() (epicsThreadTest.cpp:156)
    by 0x10AD1A: main (epicsThreadTest.cpp:229)
  Block was alloc'd by thread #1
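
The flagged pattern, condensed; this is an illustration of the shape of the race, not the actual epicsThreadTest code:

// Illustration of the wait()+destroy vs. trigger() race;
// hypothetical code, not the actual epicsThreadTest source.
#include <epicsThread.h>
#include <epicsEvent.h>

static void worker(void *raw)
{
    epicsEvent *done = (epicsEvent *) raw;
    done->signal();    // thread B: may still be inside signal() ...
}

static void testRace(void)
{
    epicsEvent done;
    epicsThreadCreate("worker", epicsThreadPriorityMedium,
        epicsThreadGetStackSize(epicsThreadStackSmall), &worker, &done);
    done.wait();       // thread A: wakes as soon as signal() releases it,
                       // then destroys 'done' on return, possibly while
                       // thread B is still inside signal(). An actual
                       // join here would let valgrind prove no overlap.
}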

@anjohnson (Member)

> A thread which joins itself actually triggers a message about EDEADLK.

I never mentioned calling pthread_join(); I was really asking about non-joinable threads. After looking at the docs for valgrind I see that they can't be checked.

It seems to me that the only way to let valgrind properly handle the cases where an epicsThread has no parent that can join it would be to create some kind of reaper thread (whose name should be "Dēāṯẖ" BTW, see Terry Pratchett; or you could call it an undertaker since it handles cleanup after death, but that's boring) that waits on a message queue. With valgrind attached, when a thread returns to start_routine() with a joinable flag of zero, that routine (or a registered atThreadExit routine) should send the reaper thread a message asking "join me". In this mode POSIX threads would never need to actually be detached at all; they would only have their joinable flag manipulated.

This solves the valgrind tracking problem for detached threads using both the C and C++ thread APIs. How and when the reaper thread would get cleaned up is an exercise I leave to you to ponder.
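
A rough sketch of that reaper, using an epicsMessageQueue carrying the IDs of threads to be joined; all of the names below are hypothetical, none of this exists in Base:

// Hypothetical sketch only; none of these names exist in Base.
#include <epicsThread.h>
#include <epicsMessageQueue.h>

static epicsMessageQueue reaperQueue(64, sizeof(epicsThreadId));

static void reaperMain(void *)
{
    for (;;) {
        epicsThreadId victim;
        reaperQueue.receive(&victim, sizeof(victim));
        // The exiting thread kept its joinable flag set, so joining
        // it here gives valgrind a real pthread_join() to observe.
        epicsThreadMustJoin(victim);
    }
}

// Called from start_routine() (or a registered atThreadExit routine)
// by a thread whose joinable flag is zero, just before it exits:
static void reapMe(void)
{
    epicsThreadId self = epicsThreadGetIdSelf();
    reaperQueue.send(&self, sizeof(self));
}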

I'm working on #163 now BTW.

@anjohnson (Member)

Okay, here's my analysis. I originally thought there would be a connection to the epicsThread class because of the messages you turned up, although that isn't the main cause of the problem.

Those messages appear after the CAChannelProvider destructor has supposedly shut down the ChannelConnectThread, while the epicsThread object is being destroyed. For them to be printed, the thread must still be running (the run() method hasn't returned) when the epicsThread destructor is called.

The CAChannelProvider destructor calls the ChannelConnectThread::stop() method:

void ChannelConnectThread::stop()
{
    {
        Lock xx(mutex);
        isStop = true;
    }
    waitForCommand.signal();
    waitForStop.wait();
}

It looks like the loop inside the run() method is the problem:

void ChannelConnectThread::run()
{
    while(true)
    {
         waitForCommand.wait();
         while(true) {
             bool more = false;
             NotifyChannelRequester* notifyChannelRequester(NULL);
             {
                 Lock lock(mutex);
                 if(!notifyChannelQueue.empty())
                 {
                      more = true;
                      // ...
                 }
             }
             if(!more) break;
             // ...
         }
         if(isStop) {
             waitForStop.signal();
             break;
         }
    }
}

The bug is most likely that isStop is being accessed towards the end of that loop without owning the mutex. I haven't proven this yet but I think it's likely. The same applies to the other 3 threads.
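
If that's the cause, the fix is presumably along these lines: an untested sketch of ChannelConnectThread::run() with the queue-draining inner loop elided, taking the mutex before reading isStop:

void ChannelConnectThread::run()
{
    while(true)
    {
        waitForCommand.wait();
        // ... drain notifyChannelQueue as before ...
        bool stopping;
        {
            Lock xx(mutex);    // own the mutex while reading isStop
            stopping = isStop;
        }
        if(stopping) {
            waitForStop.signal();
            break;
        }
    }
}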

@mdavidsaver (Member, Author)

That would fit.

 Possible data race during write of size 1 at 0x596A008 by thread #1
 Locks held: 1, at address 0x596A0D0
    at 0x4B49734: epics::pvAccess::ca::ChannelConnectThread::stop() (channelConnectThread.cpp:61)
    by 0x4B54116: epics::pvAccess::ca::CAChannelProvider::~CAChannelProvider() (caProvider.cpp:83)
    by 0x4B541EB: epics::pvAccess::ca::CAChannelProvider::~CAChannelProvider() (caProvider.cpp:89)
    by 0x4B58CF5: std::_Sp_counted_ptr<epics::pvAccess::ca::CAChannelProvider*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:377)
    by 0x12140F: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:155)
    by 0x120144: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:728)
    by 0x11F479: std::__shared_ptr<epics::pvAccess::ChannelProvider, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (shared_ptr_base.h:1167)
    by 0x11F495: std::shared_ptr<epics::pvAccess::ChannelProvider>::~shared_ptr() (shared_ptr.h:103)
    by 0x11E628: main (testCaProvider.cpp:787)
 
 This conflicts with a previous read of size 1 by thread #19
 Locks held: none
    at 0x4B499EC: epics::pvAccess::ca::ChannelConnectThread::run() (channelConnectThread.cpp:108)
    by 0x4FCAD27: epicsThreadCallEntryPoint (epicsThread.cpp:94)
    by 0x4FD4171: start_routine (osdThread.c:410)
    by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
    by 0x552DFA2: start_thread (pthread_create.c:486)
    by 0x545E4CE: clone (clone.S:95)
  Address 0x596a008 is 8 bytes inside a block of size 136 alloc'd
    at 0x4836E3F: operator new(unsigned long) (vg_replace_malloc.c:334)
    by 0x4B49394: epics::pvAccess::ca::ChannelConnectThread::get() (channelConnectThread.cpp:29)
    by 0x4B53E5F: epics::pvAccess::ca::CAChannelProvider::CAChannelProvider(std::shared_ptr<epics::pvAccess::Configuration> const&) (caProvider.cpp:46)
    by 0x4B5862E: epics::pvAccess::SimpleChannelProviderFactory<epics::pvAccess::ca::CAChannelProvider>::sharedInstance() (pvAccess.h:1374)
    by 0x498D380: epics::pvAccess::ChannelProviderRegistry::getProvider(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (ChannelAccessFactory.cpp:48)
    by 0x11D88B: main (testCaProvider.cpp:787)

valgrind --tool=helgrind flags other, earlier races as well; e.g. this one looks like a wait()+destroy vs. signal() race.

 Possible data race during write of size 8 at 0x596EA88 by thread #1
 Locks held: none
    at 0x4CC51C1: epics::pvData::Event::~Event() (event.cpp:35)
    by 0x123B35: TestChannel::~TestChannel() (testCaProvider.cpp:65)
    by 0x123B9F: TestChannel::~TestChannel() (testCaProvider.cpp:65)
... 
 This conflicts with a previous read of size 8 by thread #19
 Locks held: none
    at 0x4CC5226: epics::pvData::Event::signal() (event.cpp:46)
    by 0x119BB0: TestChannel::channelStateChange(std::shared_ptr<epics::pvAccess::Channel> const&, epics::pvAccess::Channel::ConnectionState) (testCaProvider.cpp:115)
    by 0x4B59382: epics::pvAccess::ca::CAChannel::notifyClient() (caChannel.cpp:113)
    by 0x4B499D5: epics::pvAccess::ca::ChannelConnectThread::run() (channelConnectThread.cpp:105)
...

@mdavidsaver (Member, Author)

I caught another hung test and tried to debug it. Unfortunately, it seems that the Visual Studio license on the AppVeyor 2019 VM has expired. I only had 10 minutes left, which wasn't enough time to figure out another approach.

@mdavidsaver (Member, Author)

> There are (I think) multiple issues at play, w.r.t. testpvalink in the first instance. This looks like one of the mysterious crashes on exit.

An example of epicsTypesTest crashing on exit, which is impressive given how little it does. Downloading epicsTypesTest.tap I can see that all 10 cases were 'ok'. Unfortunately, this was not a branch with ci-core-dumper installed.

https://ci.appveyor.com/project/mdavidsaver/epics-base/builds/33808590/job/117avxxjc6w5qx0x#L10369

@anjohnson linked a pull request (#167) on Jun 30, 2020 that will close this issue