testCaProvider deadlock #165
Comments
I don't think this is exactly the same as #163, since I think only one provider instance is created. It looks like an issue with the same code though. |
Thanks. From that error message I suspect I know what the issue may be here, caused by deleting an epicsThread object in a different thread after the child has already called exitWait() or returned. I already have a fix and tests for the 3.15 branch but it needs some work to properly merge up due to conflicts with the epicsThreadJoin() changes in 7.0. I will bump that upwards in priority, and also work on #163 |
@anjohnson Is your fix imminent? If not, I'm going to temporarily disable testCaProvider from running on Windows. Having e.g. 3 appveyor jobs time out during one run is annoying.
diff --git a/testCa/Makefile b/testCa/Makefile
index 1b1ded0f..a575235a 100644
--- a/testCa/Makefile
+++ b/testCa/Makefile
@@ -11,7 +11,9 @@ PROD_SYS_LIBS_WIN32 += netapi32 ws2_32
TESTPROD_HOST += testCaProvider
testCaProvider_SRCS += testCaProvider.cpp
+ifneq (,$(filter win%,$(T_A)))
TESTS += testCaProvider
+endif
ifdef BASE_3_16
testCaProvider_SRCS += testIoc_registerRecordDeviceDriver.cpp
REGRDDFLAGS = -l |
Sorry, somehow I hadn't internalized that this bug was what was hanging the Appveyor builds; I will work on this next (tomorrow). |
I may have a fix, or we may have a Heisenbug. I aborted my first Appveyor build job of this commit after it spent an hour building, which could have been testCaProvider.exe hanging but I had no way of knowing that for sure. I switched to using a
Both of those test programs succeeded when they were re-run during the
So I switched back to the standard built-in
Note that the above builds were all against the same git commit. I will be re-doing that before the final merge as I lost the connection to the 3.15 commit that it was derived from and I'm improving the API doxy-description in the header, but the code changes there are what I'm currently proposing to fix this issue.
Comments, ideas, any other explanations for my first job hanging? |
There are (I think) multiple issues at play. wrt. testpvalink in the first instance. This looks like one of the mysterious crash on exits. I've seen various tests do this, include
Yup... That's why these issues have persisted so long. |
If there are no objections I'll work on merging those epicsThreadClass fixes into 3.15 and up to 7.0 (along with a few other benign changes from sonar/cppcheck that Ralph applied there). |
What changes? Am I forgetting something? |
The changes to epicsThread.cpp that the words this commit linked to in the second paragraph of my longer post earlier today. The main code changes might be easier to understand if you first look at the equivalent commit on my 3.15 branch where the fix isn't mixed up with removing the unnecessary |
This got past me. Looking at it now, please do not push. The thread join is necessary for me, please do not remove it. Having epicsThread actually join was what motivated me to do that work. If nothing else, it cuts down on false positives from valgrind (which I run regularly). What situation are you trying to avoid? |
The C++ epicsThread API was designed with two different ways for the epicsThread object to be destroyed. In one, the thread returns from its
A C++ thread may also delete its own epicsThread object before the run() method returns. It does that when its parent cannot wait for it to return, so the parent cannot call the epicsThread destructor. This avoids leaking the epicsThread object for 'fire-and-forget' type operations.
In the thread-join branch the code was changed so the C epicsThreadCreate() routine is called with the joinable flag set. In the first case above this unfortunately keeps the child thread alive after run() has returned, until the parent calls exitWait() or deletes the epicsThread object, so returning from run() no longer immediately frees the OS resources such as the thread's stack. If the epicsThread object is never deleted (say it gets started by an iocsh command and finishes sometime later), those child threads hang around forever.
A thread started with the C API can cancel its own joinable flag by calling epicsThreadMustJoin(). The equivalent in the C++ API would be for the thread to call exitWait(), which is what this change implements.
Can your need for a join not be implemented by the parent deleting the epicsThread object, causing it to wait? |
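For readers without an EPICS build at hand, the difference between a joinable thread that has already returned and one that has been detached can be sketched with the standard library. This is only an analogy: it uses std::thread, not epicsThread, and the two function names are made up for the illustration. The first function shows that a finished thread stays pending until someone joins it; the second shows that a detached thread needs no join at all.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: true if a std::thread is still waiting to be joined
// after its body has finished running.
bool joinable_after_body_returns() {
    std::promise<void> done;
    std::future<void> finished = done.get_future();
    // The promise is moved into the thread so it outlives this stack frame's use of it.
    std::thread child([p = std::move(done)]() mutable { p.set_value(); });
    finished.wait();                  // the body has signalled that it is finishing
    bool pending = child.joinable();  // still true: nobody has joined it yet
    child.join();                     // only now is the thread fully released
    assert(!child.joinable());
    return pending;
}

// Hypothetical helper: true if a detached std::thread no longer needs a join.
bool detached_needs_no_join() {
    std::promise<void> done;
    std::future<void> finished = done.get_future();
    std::thread child([p = std::move(done)]() mutable { p.set_value(); });
    child.detach();                   // the runtime releases it when the body returns
    finished.wait();
    return !child.joinable();
}
```

Both helpers return true. The first corresponds loosely to the joinable case described above, where resources stay allocated until a parent joins; the second to the fire-and-forget case.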
How does this result in testCaProvider hanging?
Not without calling
I'm dubious about this statement. This seems like apologizing for a bug. |
Hang on, in both of these cases |
To be clear, a leak will occur if the |
Either I've misunderstood you, or you've missed the fact that the That's how the join functionality of the epicsThread class delays the parent thread. |
You're right, I was seeing an It would be helpful to me if you could explain what you see wrt. testCaProvider (or CaProvider more generally) which leads you to conclude that there is a bug in epicsThread? |
I will work on that. Can you add a link to the exact job log where you caught the errors please; the link in the issue description above only takes me to your latest build log on Appveyor and I can't find the relevant one from "about 15 days ago". |
Right, this links to the job, but not the build. Which isn't useful. https://ci.appveyor.com/project/mdavidsaver/epics-base/builds/33400081/job/rf9ojmbx2qmj0rpn#L10750 |
Thanks, it wasn't to you but it told me what I wanted to see. Does the valgrind lifetime tracking that you're wanting to keep still work for threads that call |
A thread which joins itself actually triggers a message about EDEADLK. The point is to detect cases where one thread joins another so that subsequent access by the joining thread is known not to race with any access by the joined thread. As a counter-example, epicsThreadTest is written in a way which allows a wait()+destroy vs. trigger() race. This message would disappear if the worker thread were actually joined, as valgrind could prove that no race was possible.
|
I never mentioned calling
It seems to me that the only way to let valgrind properly handle the cases where an epicsThread has no parent that can join it would be to create some kind of reaper thread (whose name should be "Dēāṯẖ" BTW, see Terry Pratchett, or you could call it an undertaker since it handles the cleanup after death, but that's boring) that waits on a message queue. With valgrind attached when a thread returns to
This solves the valgrind tracking problem for detached threads using both the C and C++ thread APIs. How/when the reaper thread would get cleaned up is an exercise I leave to you to ponder.
I'm working on #163 now BTW. |
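A rough sketch of the reaper idea, written against the C++ standard library rather than the EPICS APIs. Everything here is an assumption made for illustration: the Reaper class, its spawn() method and run_demo() are hypothetical names, and because the description of what a finishing thread does is cut off above, having the worker post its own thread id to the queue is a guess at the intent. The shutdown policy (the destructor waits until every worker has been joined) is just the simplest one to write down, not a proposal.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical reaper: joins fire-and-forget workers once they have finished.
class Reaper {
public:
    Reaper() : reaper_([this] { loop(); }) {}

    // Assumed shutdown policy: wait for all workers to be reaped, then stop.
    ~Reaper() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        reaper_.join();
    }

    // Start a worker that no parent will join.
    void spawn(std::function<void()> body) {
        // The lock is held until the handle is stored, so the worker cannot
        // announce its exit before the reaper is able to find it.
        std::lock_guard<std::mutex> guard(mutex_);
        std::thread worker([this, body] {
            body();
            std::lock_guard<std::mutex> g(mutex_);
            finished_.push(std::this_thread::get_id());  // "I'm done" message
            wake_.notify_one();
        });
        std::thread::id id = worker.get_id();
        live_.emplace(id, std::move(worker));
    }

private:
    void loop() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] {
                return !finished_.empty() || (stopping_ && live_.empty());
            });
            if (finished_.empty()) return;  // stopping and nothing left to reap
            auto it = live_.find(finished_.front());
            finished_.pop();
            assert(it != live_.end());
            std::thread dead = std::move(it->second);
            live_.erase(it);
            lock.unlock();
            dead.join();  // a real join that a race detector can observe
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::map<std::thread::id, std::thread> live_;
    std::queue<std::thread::id> finished_;
    bool stopping_ = false;
    std::thread reaper_;  // declared last so the members above exist first
};

// Hypothetical usage: returns how many worker bodies ran before shutdown.
int run_demo(int workers) {
    std::atomic<int> ran{0};
    {
        Reaper reaper;
        for (int i = 0; i < workers; ++i)
            reaper.spawn([&ran] { ++ran; });
    }  // ~Reaper returns only after every worker has been joined
    return ran.load();
}
```

Calling run_demo(4) returns 4, since the destructor does not return until all four workers have been joined. Whether blocking at shutdown like this is acceptable is exactly the open question left at the end of the comment above.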
Okay, here's my analysis. I originally thought there would be a connection to the
Those messages appear after the
The
void ChannelConnectThread::stop()
{
    {
        Lock xx(mutex);
        isStop = true;
    }
    waitForCommand.signal();
    waitForStop.wait();
}
It looks like the loop inside the run() method is the problem:
void ChannelConnectThread::run()
{
    while(true)
    {
        waitForCommand.wait();
        while(true) {
            bool more = false;
            NotifyChannelRequester* notifyChannelRequester(NULL);
            {
                Lock lock(mutex);
                if(!notifyChannelQueue.empty())
                {
                    more = true;
                    // ...
                }
            }
            if(!more) break;
            // ...
        }
        if(isStop) {
            waitForStop.signal();
            break;
        }
    }
}
The bug is most likely that |
That would fit.
|
I caught another hung test and tried to debug. Unfortunately, it seems that the visual studio license on the appveyor 2019 VM is expired. I only had 10 minutes left, which wasn't enough time to figure out another approach. |
An example of epicsTypesTest crashing on exit. Which is impressive given how little it does. Downloading https://ci.appveyor.com/project/mdavidsaver/epics-base/builds/33808590/job/117avxxjc6w5qx0x#L10369 |
I happened to time it right to catch a hung appveyor run for Base 7.0. I found testCaProvider.exe still running after ~45 min. I first tried killing caRepeater, with apparent effect. The CI run completed shortly after I then killed the test. The last step caused some new output to appear in the log. https://ci.appveyor.com/project/mdavidsaver/epics-base/build/job/rf9ojmbx2qmj0rpn