-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent hang on exit while omero.client keepalive is active #424
Conversation
Definitely! 😄
👍
|
Having looked at this in a lot more detail I'm pretty confident this is just a bug fix. There is more detailed explanation here:
The summary is that our code that is in question here was written in the Python 2.4 era and sometime around Python 2.6.5 the semantics of how The only other place I see |
For reference, current uses of
|
This code was originally added to address https://trac.openmicroscopy.org/ome/ticket/3260 where an exception was being raised: Traceback (most recent call last): File "/usr/lib/python2.4/logging/__init__.py", line 737, in emit self.stream.write(fs % msg) ValueError: I/O operation on closed file Due to the many changes in interpreter behaviour since Python 2.4, including those surrounding sys.exit in the ~2.6.5 era [^1], as well as improvements to the logging module itself this is no longer an issue. In the referenced Python issue 9501, the related code changes were originally committed in Subversion r84282, Mercurial changeset 42c70f17c2fede6116108568a75b0eec08b3c73a, Git commit 1ddd51fc71c6a6d255eee8e7e54d1bf7e05c5841. Found by tracing the Mercurial changeset via `Misc/svnmap.txt` in the cython repository. At a future date `omero.util.concurrency.AtExitEvent` and `omero.util.concurrency.get_event()` should likely be deprecated and their use discontinued. They attempt to handle issues that no longer exist and add complexity and maintenance overhead to our runtime. See also: * https://bugs.python.org/issue6333 * https://bugs.python.org/issue13807 [^1]: https://stackoverflow.com/questions/3713360/python-2-6-x-theading-signals-atexit-fail-on-some-versions/3765160#3765160
Detailed rationale for the logging changes in 0f8fe1c. Future deprecation, refactoring and cleanup of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both commits make sense
Without this PR, calling
Updating to this branch fixes that behaviour (clean exit with 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I audited the usage of the omero.util.Resources
API across the OME code:
- in
omero.clients
,BaseClient
object has a private__resources
field which gets filled with one (or theoretically multiple)Entry
objects which will periodically running akeepAlive()
call unTables.py
starts aomero.tables.TablesI
service server-side. This service is a subclass ofomero.util.Servant
which means it is anomero.util.SimpleServant
managing some resources. EachTableI
object retrieved via thegetTable
API is registered as a resource
-runProcessor.py
similarly starts theomero.processor.ProcessorI
service server-side which is also subclass ofomero.util.Servant
. Two types of resources are registered in this service:- the session kept alive via the
omero.processor.UseSessionHolder
object - the
omero.processor.ProcessI
(or subclass) launched by the service
- the session kept alive via the
- in omero-dropbox,
MonitorClientI
registers theomero.util.ServerContext
as a resource to be able to retrieve the active session
No regression in the existing unit and integration tests have been flagged since this PR got included. As shown above, in many scenarios the resources objects are used to maintain an active connection . Client-side, the keepAlive
task should definitely be terminated alongside the main thread which created the session and there are no active resources to cleanup.
Server-side, all usages of resources to maintain an active session should likely be terminated if the process itself is restarted.
I'll perform a round of functional testing trying and review the cleanup conditions for the table/processor use cases. Otherwise, I agree this should likely be released as a bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran some additional tests using the Processor service and the following minimal script which is waiting for 2 min and performing an action
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from omero.gateway import BlitzGateway
from omero.rtypes import rstring
import omero.scripts as scripts
import time
if __name__ == "__main__":
client = scripts.client(
'Example.py', """This script ...""",
version="0.1",
)
try:
conn = BlitzGateway(client_obj=client)
time.sleep(120)
conn.getObject("Image",12003)
client.setOutput("Message", rstring("Success"))
finally:
client.closeSession()
The Processor service was stopped:
- using
omero admin ice "server stop Processor-0"
orkill <process_pid>
- after the execution of the script or during the execution of the script
- with OMERO.py 5.19.4 and with OMERO.py built with this PR included
In all scenarios above, the Processor was successfully stopped with the following sequence of logs:
2024-09-04 10:24:17,574 INFO [ omero.util.Server] (MainThread) Cleanup
2024-09-04 10:24:17,574 INFO [ omero.util.Resources] (Thread-2 ) Halted
2024-09-04 10:24:17,574 INFO [ omero.processor.ProcessorI] (MainThread) Cleaning up
# Several events about killing session, changing job status...
# The nature of these depending on the sequence of events and whether active processes are running
2024-09-04 10:24:17,579 INFO [ omero.processor.ProcessorI] (MainThread) Done
2024-09-04 10:24:17,586 INFO [ omero.util.Server] (MainThread) Stopped
The omero.util.Resources
log statement indicate the Task.run
method is returning which is the expectation. The processor was successfully auto-restarted in all cases and new scripts were executable.
The only minor RFE that arose from this testing is that when interrupting the Processor while a script was running, an active OMERO.web session trying to poll the results of the script will throw the generic 500 page with the following stack trace
Traceback (most recent call last):
File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 55, in inner
response = get_response(request)
File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/django/core/handlers/base.py", line 197, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/decorators.py", line 538, in wrapped
retval = f(request, *args, **kwargs)
File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/decorators.py", line 597, in wrapper
context = f(request, *args, **kwargs)
File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/webclient/views.py", line 3864, in activities
cb = ScriptsCallback(conn.c, proc)
File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omero/callbacks.py", line 72, in __init__
process.registerCallback(self.prx)
File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omero_Scripts_ice.py", line 1115, in registerCallback
return _M_omero.grid.Process._op_registerCallback.invoke(self, ((cb, ), _ctx))
Ice.ObjectNotExistException: exception ::Ice::ObjectNotExistException
{
id =
{
name = D515E601-3969-445A-BB3D-382778797953
category =
}
facet =
operation = registerCallback
}
<WSGIRequest: GET '/webclient/activities/?_=1725446734744'>
and refreshing the window still displays the 500 so I had to log out/log in again. I assume this is possibly something that could be handled more gracefully at the OMERO.web level, handling the underlying server exception.
Unless someone thinks of another testing scenario to evaluate, I would be inclined to get this released as OMERO.py 5.19.5 and start consuming it in various places to ensure it does not cause any unexpected regression
Released in 5.19.5 |
When using
omero.client
best practice appears to be to have everything in atry
block and explicitly callclient.closeSession
to shut everything down nicely, however users don't always remember to do this.If a user has called
client.enableKeepAlive(x)
then exiting without callingcloseSession
currently seems to cause Python to hang instead of exiting.Minimal example:
At least on my end, executing this script will throw the exception but fail to exit the interpreter.
Having investigated this, I believe that the task thread within the
omero.util.Resources
class fails to shut down when the main thread exits. The result is that, while the interpreter is waiting for all threads to close so that it can exit, the keepalive thread keeps spinning. It's currently set up to wait for an explicit close signal from the main thread, but this obviously never comes if that thread has stopped. You therefore get a hang.As a quick fix, making these worker threads into daemon threads ensures that they die if the main thread exits, thus fixing the problem. I'm not sure how acceptable this is as the Resources class seems to be used for other repeating processes, but I couldn't find a scenario where these should also continue even in the absence of the main thread. If that's not the case then you could instead build some sort of "is the main thread alive?" check into the task.
It might actually be worth a broader review of omero-py's cleanup/shutdown routines. It looks like much of that code is rather old and so modern Python could have more effective ways to manage connections.