Clean up processes on start, but wait on shutdown #7185

Merged
merged 12 commits into from
Jan 24, 2025

davidfowl
Member

Trying to narrow down what might be causing #7098 (comment). Just saw a flaky test #7184 on a PR that had 2 containers hanging around and 6 networks:

+ docker container ls --all
CONTAINER ID   IMAGE                                        COMMAND                  CREATED         STATUS         PORTS                       NAMES
f7075871afca   mcr.microsoft.com/mssql/server:2022-latest   "/opt/mssql/bin/perm…"   2 minutes ago   Up 2 minutes   127.0.0.1:32773->1433/tcp   resource-qfczqtcf-44975b
1135e744ae8c   mcr.microsoft.com/mssql/server:2022-latest   "/opt/mssql/bin/perm…"   3 minutes ago   Up 3 minutes   127.0.0.1:32772->1433/tcp   sqlserver-jfhfxdez-73fe4195
+ docker volume ls
DRIVER    VOLUME NAME
+ docker network ls
NETWORK ID     NAME                                DRIVER    SCOPE
fb2665b475fc   bridge                              bridge    local
581e22aa3346   default-aspire-network-1kehlug2b8   bridge    local
be06bf54b5e1   default-aspire-network-8t4322bof0   bridge    local
40fe071b8ca3   default-aspire-network-elqgk6njg4   bridge    local
651ed8f1543e   default-aspire-network-fb93578n1o   bridge    local
c68e2fb49e65   default-aspire-network-kitbd1l6c0   bridge    local
f4b8495d6db3   default-aspire-network-pgrkmjt3g0   bridge    local
d4e6971983fe   host                                host      local
c5c27b532323   none                                null      local
+ pgrep -lf dotnet-tests|dcp.exe|dcpctrl.exe
+ awk {print ; system("kill -9 "$1)}
+ exit 1
['Aspire.Hosting.SqlServer.Tests' END OF WORK ITEM LOG: Command exited with 1]

It's unclear whether dcp would have cleaned up some of these, because the test infrastructure kills it ungracefully. Instead, clean up at the start of the test. After the test runs, we wait up to 60 seconds for dcp to quit and fail if it has not exited by then.
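
The "wait up to 60 seconds, then fail" step can be sketched as a small polling loop. This is an illustrative Python sketch only, not the code in this PR: `wait_for_exit`, `process_matches`, and the polling interval are hypothetical names and values, and the `pgrep -f` check assumes a Linux agent like the one in the log above.

```python
import subprocess
import time
from typing import Callable


def process_matches(pattern: str) -> bool:
    """Hypothetical helper: True if `pgrep -f` finds a process matching pattern."""
    # pgrep exits with 0 when at least one process matched, 1 when none did.
    result = subprocess.run(["pgrep", "-f", pattern], capture_output=True)
    return result.returncode == 0


def wait_for_exit(is_running: Callable[[], bool],
                  timeout_s: float = 60.0,
                  poll_s: float = 1.0) -> bool:
    """Poll is_running() until it returns False or timeout_s has elapsed.

    Returns True if the process exited in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A test teardown could then call something like `wait_for_exit(lambda: process_matches("dcp"))` and fail the run when it returns `False`, instead of killing the process immediately.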

@davidfowl
Member Author

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@davidfowl
Member Author

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@davidfowl
Member Author

@karolz-ms is there any way to associate this command:

start-apiserver --monitor 12233 --detach --kubeconfig "/datadisks/disk1/work/ADFF09A7/t/aspire.YAFf8k/kubeconfig"

with the logs that come out of dcp? I have all of the dcp logs being extracted now, but it's hard to correlate them.

@davidfowl
Member Author

Also I'm seeing this repeated in the dcp logs:

{"level":"debug","ts":"2025-01-23T05:37:11.087Z","logger":"dcpctrl.ContainerOrchestrator","msg":"Running Docker command","ContainerRuntime":"","Command":"/usr/bin/docker network rm --force 76dddaa083d5b207e8187fc1f9af0974b27a842c1fa9b78e0c999a4c833c671b"}
{"level":"debug","ts":"2025-01-23T05:37:11.090Z","logger":"dcpctrl.os-executor","msg":"starting waiting for process to exit","pid":28066}
{"level":"debug","ts":"2025-01-23T05:37:11.139Z","logger":"dcpctrl.os-executor","msg":"process wait ended","pid":28066,"Error":"exit status 125"}
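
To gauge how often this repeats, the extracted logs could be tallied by error string. The sketch below is a rough Python illustration; it assumes every log line is a single JSON object with the `msg` and `Error` fields seen in the excerpt above, and `count_failed_waits` is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable


def count_failed_waits(lines: Iterable[str]) -> Counter:
    """Count "process wait ended" entries that carry an Error, keyed by error text."""
    failures: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(entry, dict):
            continue
        # Assumption: a failed wait is logged with a non-empty "Error" field.
        if entry.get("msg") == "process wait ended" and entry.get("Error"):
            failures[entry["Error"]] += 1
    return failures
```

Passing an open log file to `count_failed_waits` returns a `Counter`, so `most_common()` lists the most frequent exit errors first.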

@davidfowl
Member Author

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@davidfowl
Member Author

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@karolz-ms
Member

@karolz-ms is there any way to associate this command:

start-apiserver --monitor 12233 --detach --kubeconfig "/datadisks/disk1/work/ADFF09A7/t/aspire.YAFf8k/kubeconfig"

with the logs that come out of dcp? I have all of the dcp logs being extracted now, but it's hard to correlate them.

That path to the kubeconfig file should show up in the logs from the dcpctrl process, although I think Aspire tests share the same session folder for everything, so that might not help.
What probably will help is that the argument to the --monitor flag is the PID of the dcp process that started the dcpctrl process, and that PID should also appear in the dcpctrl logs.
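
One way to use that PID for correlation is to filter the extracted dcpctrl logs for entries that mention it. The following is a hedged Python sketch: it assumes the logs are JSON lines, and because the field that carries the monitor PID is not named in this thread, it scans every value rather than a specific key. `lines_for_pid` and `mentions_pid` are hypothetical names.

```python
import json
import re
from typing import Any, Iterable, List


def mentions_pid(value: Any, pid: int) -> bool:
    """True if pid appears in value as a number or as a standalone digit run."""
    if isinstance(value, bool):
        return False  # bool is a subclass of int; never treat it as a PID
    if isinstance(value, int):
        return value == pid
    if isinstance(value, str):
        # Match the PID only when it is not part of a longer number.
        return re.search(rf"(?<!\d){pid}(?!\d)", value) is not None
    if isinstance(value, dict):
        return any(mentions_pid(v, pid) for v in value.values())
    if isinstance(value, list):
        return any(mentions_pid(v, pid) for v in value)
    return False


def lines_for_pid(lines: Iterable[str], pid: int) -> List[str]:
    """Return the JSON log lines whose values mention the given PID."""
    matched = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if mentions_pid(entry, pid):
            matched.append(line.rstrip("\n"))
    return matched
```

For the command quoted above, the call would be `lines_for_pid(log_file, 12233)`. Matching on any value is deliberately loose, so a few unrelated hits are possible and the result is a starting point for manual inspection.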

@davidfowl davidfowl merged commit ff4368e into main Jan 24, 2025
9 checks passed
@davidfowl davidfowl deleted the davidfowl/cleanup-on-start-again branch January 24, 2025 01:19