
Aries dangling long-running process when used through the UP #133

Open
nicola-gigante opened this issue Apr 26, 2024 · 10 comments

Comments


nicola-gigante commented Apr 26, 2024

I'm using the Aries engine through the UP framework, and to test what I'm developing I kill my code most of the time (Ctrl-C, so SIGINT, nothing too violent).

In the last few days I noticed my laptop (Apple MBP 14" M1 Pro) was draining its battery very quickly and heating up quite a lot. I was worried it had some hardware problem, until I noticed an up-aries_macos_arm64 process running in the background and constantly using up to 300% CPU.

This process was probably dangling from one of the instances of Aries launched by the UP during testing of my code, and was not killed properly when the Python process stopped.

Unfortunately, I tried but could not reproduce what happened. I'm filing the issue anyway because it may be easy to spot for somebody who knows the source and how the service process is supposed to be killed in anomalous situations.

The Python code itself is also difficult to post, but it encodes a high-level problem into a hierarchical problem and then launches Aries through an AnytimePlanner instance in a rather standard way.

Edit: the Python package versions I'm using are unified-planning==1.0.0.289.dev1 and up-aries==0.3.2.

Let me know if I can do anything to help you debug this.


arbimo commented Apr 26, 2024

Hi Nicola,

when used through the UP, a process running Aries (the executable you mentioned) is started for each planning request.
The life cycle of the process is entirely managed by the _Server class of up_aries.
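In outline, the pattern looks roughly like this (a sketch with illustrative names, not the actual up_aries source):

```python
import subprocess

class _Server:
    """Sketch: owns one planner subprocess for a single planning request."""

    def __init__(self, executable: str):
        # Start the planner executable when the wrapper is created.
        self._process = subprocess.Popen([executable])

    def __del__(self):
        # Kill the planner when the wrapper is garbage collected.
        if self._process.poll() is None:  # still running
            self._process.kill()
```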

It is extremely simple: it creates the process on creation and kills it when garbage collected. I am on Linux and I have been very pleasantly surprised by how reliable this has been. I don't remember seeing the process escape the watch recently (and on my machine that is easy to spot, as the fans go crazy after a few minutes of such CPU load).

This is typically the kind of setting where small differences between operating systems may cause subtle bugs. Having no access to macOS machines, I cannot really test it or help directly :/

Just to be clear: is this something that happens on a regular basis, or did it happen just once?


nicola-gigante commented Apr 26, 2024

Hi @arbimo, thanks for the quick answer!

It only happened once in the last few days; I'll double-check in the future to see whether it happens again.

I've looked at the code and indeed it is quite straightforward.
So the process is killed by Popen.kill() when the object is garbage collected, but what I'm doing is terminating the Python process externally (Ctrl+C on the terminal sends SIGINT). Is Python guaranteed to run all the object destructors before terminating on a signal? If not, that would explain what happened.
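(As far as I can tell it is not: __del__ is not guaranteed to run at interpreter shutdown, and a SIGTERM with the default handler kills the process without any Python-level cleanup. An atexit hook is a bit more explicit, although it too only fires on a reasonably clean shutdown; a sketch, with an illustrative executable name:)

```python
import atexit
import subprocess

proc = subprocess.Popen(["up-aries_macos_arm64"])  # illustrative

def _cleanup():
    # Runs on normal interpreter shutdown, including after an unhandled
    # KeyboardInterrupt (Ctrl-C), but not if the process is SIGKILLed or
    # receives SIGTERM with the default handler installed.
    if proc.poll() is None:
        proc.kill()

atexit.register(_cleanup)
```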

I have three possible solutions in mind:

  1. The safest would be to make the child process be killed when the parent dies. Doing this in a cross-platform way seems to be surprisingly complex (see e.g. https://stackoverflow.com/questions/284325);
  2. with some options of subprocess you can create a "process group" when launching the subprocess. This would make the child receive all the signals sent to the parent (just like shells do with the commands they run), so my Ctrl-C would terminate the child as well;
  3. lastly, since it seems you are using a socket to communicate between the processes, one could implement some form of watchdog that periodically checks whether the connection is still up and exits otherwise.

Option 2 seems to me the easiest and most effective (theoretically, it's only an additional option to Popen).

Let me know if you need anything.


nicola-gigante commented Apr 27, 2024

I've checked the docs more carefully: the option to pass to the Popen creationflags argument is CREATE_NEW_PROCESS_GROUP (see the subprocess documentation).
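For reference, the call would look roughly like this (executable names illustrative). Note that CREATE_NEW_PROCESS_GROUP is Windows-only; the closest POSIX analogue in subprocess is start_new_session:

```python
import subprocess
import sys

if sys.platform == "win32":
    # Windows: put the child in its own process group.
    proc = subprocess.Popen(
        ["up-aries_windows.exe"],
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
    )
else:
    # POSIX: start the child in a new session, which also gives it a
    # process group of its own.
    proc = subprocess.Popen(["up-aries_macos_arm64"], start_new_session=True)
```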

I can prepare a PR if you are interested.


arbimo commented Apr 29, 2024

I think (but am unsure) that things are a bit more complex than that.

Essentially, what we want to enforce is that the created subprocess may not outlive the parent one. This is something for which I did not find a proper cross-platform solution (which was really surprising to me).

What you propose is a bit different, I think. Creating a process group would allow sending signals to the group; it does not mean that all signals are shared with the child. E.g. if a SIGKILL/SIGTERM is sent to the parent process, it would not be sent to the planner (unless you explicitly send it to the group).

I think that Ctrl-C does send SIGINT to the group, but I don't think it is desirable to have the planner receive it. Indeed, the parent process may choose to handle the SIGINT signal and decide to ignore it. It cannot do that if the solver also receives it and terminates prematurely.

Also, I don't think SIGINT is our main problem, as it does appear to be handled properly (otherwise you would have a zombie process for every interrupted invocation of the planner).
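To make the distinction concrete: signalling the group has to be done explicitly, along these lines on POSIX (a sketch only, not our code):

```python
import os
import signal
import subprocess

# Start the child in its own process group (POSIX).
proc = subprocess.Popen(["up-aries_macos_arm64"], start_new_session=True)

# proc.terminate() signals only the child process itself; to reach the
# whole group, the group must be targeted explicitly:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
```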

Perhaps it would be good if the problem could be better characterized and scoped, as it is still a bit unclear under which circumstances this may happen.

nicola-gigante commented:

Yes, I agree, the process group was not a good idea.

Looking around, there are some Linux-specific solutions (e.g. https://stackoverflow.com/a/36945270/3206471), but I cannot find anything for macOS, let alone Windows.
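If I read that answer correctly, the Linux trick is prctl(PR_SET_PDEATHSIG, ...), which asks the kernel to deliver a signal to the child when its parent dies; a sketch via ctypes, with an illustrative executable name:

```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # constant from <sys/prctl.h>

def _die_with_parent():
    # Runs in the child between fork() and exec(); Linux-only.
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM)

proc = subprocess.Popen(["up-aries_linux_amd64"], preexec_fn=_die_with_parent)
```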

Maybe a more proactive approach is needed, where the two processes periodically exchange beacons: if the parent dies and stops replying to the beacons, the child quits itself.
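Even without a full heartbeat protocol, on POSIX the child could periodically check whether it has been re-parented. The real server is not Python, so this is just to convey the idea:

```python
import os
import threading
import time

def watch_parent(poll_seconds: float = 2.0) -> None:
    # On POSIX, a child is re-parented when its parent dies, so a change
    # in the parent PID means the original parent is gone.
    original = os.getppid()
    while True:
        if os.getppid() != original:
            os._exit(0)  # parent is gone: shut the planner down
        time.sleep(poll_seconds)

# Run the watchdog in the background of the (hypothetical) server process.
threading.Thread(target=watch_parent, daemon=True).start()
```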


arbimo commented Apr 29, 2024

Better handling on the server side could go quite a long way, namely:

  • better monitoring the status of the connection and releasing CPU resources when the connection is closed; and
  • having a specialized version of the server that accepts a single request and terminates on completion/termination of that request (a sketch of this second idea follows).
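Roughly, on the Python side it could look like this (a sketch only: the real up-aries server is written in Rust, and the servicer and method names from the .proto may differ):

```python
import threading
from concurrent import futures

import grpc

done = threading.Event()

class SingleRequestPlanner:
    # Would implement the servicer generated from unified_planning.proto.
    def planOneShot(self, request, context):
        reply = solve(request)  # placeholder for the actual planner call
        done.set()              # the single request has been served
        return reply

server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
# ... register SingleRequestPlanner via the generated add_*_to_server ...
server.add_insecure_port("127.0.0.1:2222")
server.start()
done.wait()             # block until the one request completes
server.stop(grace=1.0)  # then shut the server down
```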

nicola-gigante commented:

I apologize for the basic question, but I need to understand your architecture better: is this client/server part specific to the UP integration, or is it exposed by Aries generally?


arbimo commented Apr 29, 2024

It is specific to the UP integration. The up-aries-XXXX executable essentially starts a gRPC server whose interface is defined in the UP library: https://github.com/aiplan4eu/unified-planning/blob/1bea6799cbf1217ca6d45540b9ce68a1e0eb2106/unified_planning/grpc/unified_planning.proto#L877

However, even though the interface is defined in the UP, all the client and server code (notably the way the server is launched and connected to) lives entirely on the Aries side.
The model proposed for up-aries (one process per request) was chosen for isolation and better handling of resources. But you can also use a single long-running remote server to handle your requests, which only requires changing the address of the server.
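In that setup the client simply opens its gRPC channel to the remote address instead of to a freshly spawned local process (host, port, and stub names illustrative; the stub class is generated from unified_planning.proto):

```python
import grpc

# Point the channel at a long-running remote planning server instead of
# a locally spawned up-aries process.
channel = grpc.insecure_channel("planning.example.org:2222")
# stub = UnifiedPlanningStub(channel)   # from the generated proto bindings
# result = stub.planOneShot(request)
```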

nicola-gigante commented:

Cool. So this means you have complete control of the client code as well, which is on the UP side, right?

nicola-gigante commented:

Btw, digging deeper into this problem scares me a bit, given the complexity of some details of Unix process behavior...

See here for someone claiming to have a portable solution to the problem: https://groups.google.com/g/comp.unix.programmer/c/CVATHnIVNv0
