Container stops intermittently - Disruptive behaviour #512

Open
parul157 opened this issue Jul 25, 2024 · 11 comments

@parul157

parul157 commented Jul 25, 2024

Hello,

We are using ShinyProxy on EKS infrastructure, and intermittently the pods stop and have to be restarted to get them back up again. In some cases we have noticed that the issue is more prominent when the app is in use by 10 or more people at once, and in some cases it won't even work after a restart/refresh.

shinyproxy v3.0.1
EKS version 1.27
Below is the error we receive:

2024-07-24 11:27:20.728 ERROR 1 --- [   XNIO-1 I/O-1] io.undertow.proxy    : UT005028: Proxy request to /proxy_endpoint/00d3d5d6-da46-4d61-8260-62948726874d/websocket/ failed

java.io.IOException: UT001000: Connection closed
	at io.undertow.client.http.HttpClientConnection$ClientReadListener.handleEvent(HttpClientConnection.java:600) ~[undertow-core-2.2.21.Final.jar!/:2.2.21.Final]
	at io.undertow.client.http.HttpClientConnection$ClientReadListener.handleEvent(HttpClientConnection.java:535) ~[undertow-core-2.2.21.Final.jar!/:2.2.21.Final]
	at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:92) ~[xnio-api-3.8.8.Final.jar!/:3.8.8.Final]
	at org.xnio.conduits.ReadReadyHandler$ChannelListenerHandler.readReady(ReadReadyHandler.java:66) ~[xnio-api-3.8.8.Final.jar!/:3.8.8.Final]
	at org.xnio.nio.NioSocketConduit.handleReady(NioSocketConduit.java:89) ~[xnio-nio-3.8.8.Final.jar!/:3.8.8.Final]
	at org.xnio.nio.WorkerThread.run(WorkerThread.java:591) ~[xnio-nio-3.8.8.Final.jar!/:3.8.8.Final]

We also tried upgrading to the latest version, 3.1.1, but the issue remains. Below is the error we get:

2024-07-25T14:49:36+05:30 java.io.IOException: UT001033: Invalid connection state
2024-07-25T14:49:36+05:30 	at io.undertow.client.http.HttpClientConnection.sendRequest(HttpClientConnection.java:352) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.server.handlers.proxy.ProxyHandler$ProxyAction.run(ProxyHandler.java:598) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.util.SameThreadExecutor.execute(SameThreadExecutor.java:35) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.server.HttpServerExchange.dispatch(HttpServerExchange.java:844) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.server.handlers.proxy.ProxyHandler$ProxyClientHandler.completed(ProxyHandler.java:348) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.server.handlers.proxy.ProxyHandler$ProxyClientHandler.completed(ProxyHandler.java:322) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.server.handlers.proxy.SimpleProxyClientProvider.getConnection(SimpleProxyClientProvider.java:70) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at eu.openanalytics.containerproxy.util.ProxyMappingManager$1.getConnection(ProxyMappingManager.java:180) ~[containerproxy-1.1.1.jar!/:1.1.1]
2024-07-25T14:49:36+05:30 	at io.undertow.server.handlers.proxy.ProxyHandler$ProxyClientHandler.failed(ProxyHandler.java:361) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.server.handlers.proxy.ProxyHandler.handleFailure(ProxyHandler.java:703) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.server.handlers.proxy.ProxyHandler$ResponseCallback.failed(ProxyHandler.java:770) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.client.http.HttpClientExchange.setFailed(HttpClientExchange.java:158) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.client.http.HttpClientConnection$ClientReadListener.handleEvent(HttpClientConnection.java:600) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at io.undertow.client.http.HttpClientConnection$ClientReadListener.handleEvent(HttpClientConnection.java:535) ~[undertow-core-2.3.13.Final.jar!/:2.3.13.Final]
2024-07-25T14:49:36+05:30 	at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:92) ~[xnio-api-3.8.8.Final.jar!/:3.8.8.Final]
2024-07-25T14:49:36+05:30 	at org.xnio.conduits.ReadReadyHandler$ChannelListenerHandler.readReady(ReadReadyHandler.java:66) ~[xnio-api-3.8.8.Final.jar!/:3.8.8.Final]
2024-07-25T14:49:36+05:30 	at org.xnio.nio.NioSocketConduit.handleReady(NioSocketConduit.java:89) ~[xnio-nio-3.8.8.Final.jar!/:3.8.8.Final]
2024-07-25T14:49:36+05:30 	at org.xnio.nio.WorkerThread.run(WorkerThread.java:591) ~[xnio-nio-3.8.8.Final.jar!/:3.8.8.Final]

The application template configuration we use:

  spring:
    session:
      store-type: redis
      redis:
        configure-action: none
    redis:
      host: {{ .Values.shinyproxy.redis.host }}
      database: {{ .Values.shinyproxy.redis.database }}
      ssl: {{ .Values.shinyproxy.redis.ssl }}
  proxy:
    store-mode: Redis
    stop-proxies-on-shutdown: false
    default-webSocket-reconnection-mode: {{ .Values.shinyproxy.webSocketReconnection }}
    title: {{ .Values.global.title }}
    logo-url: {{ .Values.shinyApp.logoUrl }}
    port: {{ .Values.shinyproxy.targetPort }}
    template-path: xxxxx
    authentication: simple
    hide-navbar: true
    landing-page: xxxxxx
    heartbeat-rate: {{ .Values.shinyApp.heartbeat.rate }}
    heartbeat-timeout: {{ .Values.shinyApp.heartbeat.timeout }}
    oauth2:
      resource-id: xxxxxx
      jwks-url: xxxxxx
      username-attribute: xxxxxx
    container-backend: kubernetes
    kubernetes:
      internal-networking: xxxx
      namespace: xxxx
      node-selector: xxxx
      pod-wait-time: xxxx
    specs:
    - id: {{ .Values.global.webappname }}
      display-name: {{ .Values.global.title }}
      container-image: {{ .Values.shinyApp.image.url | quote }}
      labels:
        app.kubernetes.io/name: xxxxxx
        app.kubernetes.io/part-of: xxxxxx
        app.kubernetes.io/managed-by: xxxxxx
      container-memory-request: xxxxxx
      container-memory-limit: xxxxxx
      container-cpu-limit: xxxxxx
      container-cpu-request: xxxxxx
  logging:
    file:
      shinyproxy.log
@parul157 parul157 changed the title container stops intermittently Container stops intermittently - Disruptive behaviour Jul 26, 2024
@LEDfan
Member

LEDfan commented Jul 29, 2024

Hi, there can be multiple reasons why the pods are stopped. Since you mention it happens more frequently when multiple users are active, I suspect it's caused by a lack of resources. I see you are already assigning memory and CPU requests/limits, which is great. However, are you using the same value for container-memory-request and container-memory-limit? If not, this could cause the pod to get OOM-killed, even if the pod is using less than the specified container-memory-limit.
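
As an illustration of the point above, a minimal sketch of a spec with identical requests and limits might look like the following. The app id, image, and all values are placeholders rather than settings from this thread; the property names are the ones used in the configuration posted above.

```yaml
proxy:
  specs:
    - id: my-app                                    # hypothetical app id
      container-image: example/my-shiny-app:latest  # hypothetical image
      # Request and limit are set to the same value, so the scheduler
      # reserves exactly the memory the container is allowed to use.
      container-memory-request: 2Gi   # placeholder; use the quantity format your backend expects
      container-memory-limit: 2Gi
      container-cpu-request: 1        # placeholder
      container-cpu-limit: 1
```

With differing values, a pod is scheduled based on the smaller request, so a node can end up overcommitted and the pod may be killed under memory pressure before it reaches its own limit.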

In addition, I advise having a look at the events of the pod when the app is stopped. E.g. you could run the following command and then try to reproduce the problem (e.g. by starting multiple instances of your app: https://shinyproxy.io/documentation/ui/#using-multiple-instances-of-an-app):

kubectl get events -n shinyproxy -w

@saurabh0402

Hi @LEDfan,

  • Yes, we use the same value for container-memory-request and container-memory-limit.
  • We have tried assigning much bigger values as well, and we still get the same result. We are pretty sure it is not because of OOM.
  • We are already running multiple pods for the app and even for ShinyProxy.

Here are a few things we noticed since we raised the issue:

  • The application pod gets killed and deleted instantly when we see the error in the ShinyProxy pods. Even adding pre-stop hooks to prevent the pod from getting deleted does nothing; the pod gets deleted anyway.
    • We tried checking the pod's events, and there isn't anything substantial in there.
  • Another thing we noticed is that it's not just about multiple users. Even if a single user refreshes the app a few times, the issue occurs.

@LEDfan
Member

LEDfan commented Aug 5, 2024

Thanks for the additional information. Could you check whether your ShinyProxy logs contain a line similar to "Proxy unreachable/crashed, stopping it now, failed request"?
In that case the pod is killed by ShinyProxy when the request fails. Can you check the network tools in the browser console to see whether a request fails?
Finally, which ShinyProxy version are you using? A fix was made in 3.1.1 that solves a similar issue.

@saurabh0402

We don't see the "Proxy unreachable/crashed, stopping it now, failed request" error, but we have noticed the following error whenever the pod has been killed:

2024-07-24 11:27:20.728 ERROR 1 --- [   XNIO-1 I/O-1] io.undertow.proxy    : UT005028: Proxy request to /proxy_endpoint/00d3d5d6-da46-4d61-8260-62948726874d/websocket/ failed

java.io.IOException: UT001000: Connection closed
	at io.undertow.client.http.HttpClientConnection$ClientReadListener.handleEvent(HttpClientConnection.java:600) ~[undertow-core-2.2.21.Final.jar!/:2.2.21.Final]
	at io.undertow.client.http.HttpClientConnection$ClientReadListener.handleEvent(HttpClientConnection.java:535) ~[undertow-core-2.2.21.Final.jar!/:2.2.21.Final]
	at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:92) ~[xnio-api-3.8.8.Final.jar!/:3.8.8.Final]
	at org.xnio.conduits.ReadReadyHandler$ChannelListenerHandler.readReady(ReadReadyHandler.java:66) ~[xnio-api-3.8.8.Final.jar!/:3.8.8.Final]
	at org.xnio.nio.NioSocketConduit.handleReady(NioSocketConduit.java:89) ~[xnio-nio-3.8.8.Final.jar!/:3.8.8.Final]
	at org.xnio.nio.WorkerThread.run(WorkerThread.java:591) ~[xnio-nio-3.8.8.Final.jar!/:3.8.8.Final]

When this error occurs, there are a few failed API requests in the network tab and the screen then displays "This app has been stopped, you can now close this tab."

We are running ShinyProxy v3.0.2, but we have also tried upgrading to v3.1.1 and faced similar issues there.

@LEDfan
Member

LEDfan commented Sep 17, 2024

Hi @saurabh0402, did you perhaps discover the cause of your issue already? We have not yet seen a similar issue, which makes it difficult to give additional suggestions.

It seems to me that ShinyProxy is killing the pod, but then I would expect the "Proxy unreachable/crashed, stopping it now, failed request" message to be logged. Therefore I suggest that you try to find out what process is killing the pod; it should be possible to find this in the audit logs of EKS. If it's being killed by ShinyProxy, I could create a build of ShinyProxy that contains more logging.

@saurabh0402

Hi, sadly we weren't able to find the cause. We did try looking into the pod events as well but couldn't find anything conclusive.
It would be really helpful if you could create a build with more logging. 🙏🏽

@saurabh0402

saurabh0402 commented Sep 18, 2024

@LEDfan here are the logs for the ShinyProxy pod with the log level set to DEBUG. Just opening the app caused the pod to be deleted. Looking at the logs, it seems like the Shiny app returned a 503 error for one of the requests, after which ShinyProxy killed the pod and started returning 410 to subsequent requests.
Can you please take a look at the logs and see if they give any hint about the issue?
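
The thread does not show how the DEBUG level was set. One common way, assuming the standard Spring Boot `logging.level` properties in ShinyProxy's `application.yml`, is sketched below. The package names are taken from the stack traces above and may need adjusting for your setup.

```yaml
logging:
  level:
    # ShinyProxy's proxy and container-handling code (package seen in the stack trace)
    eu.openanalytics.containerproxy: DEBUG
    # Undertow, the embedded server that logs the UT005028 / UT001000 errors
    io.undertow: DEBUG
```

DEBUG output can include tokens and other sensitive values, so logs may need to be redacted before they are shared publicly.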

I also saw this somewhere:

Shiny apps can only handle one R session per user, and if multiple users are trying to access the app simultaneously, it may reach its concurrency limits, causing 503 errors.

The 503 errors seem to occur when multiple requests arrive at once. Could this be causing the issue?

@parul157
Author

Hi @LEDfan, can someone please check whether the above logs help in figuring out the issue with ShinyProxy? It has been a major blocker for us for a long time now.

@LEDfan
Member

LEDfan commented Oct 28, 2024

Hi @saurabh0402 @parul157, I would remove the log file here, as it seems to contain some sensitive information.

From the log file I see that you are using Istio, and I'm wondering whether this could be causing some problems with the connections. Nevertheless, I made some adjustments to the way ShinyProxy handles crashes, which should improve the behavior when requests fail because of network issues. Can you test using the image openanalytics/shinyproxy-snapshot:3.2.0-SNAPSHOT-20241008.093924?
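
For testing, the snapshot image could be swapped into the ShinyProxy Deployment roughly as follows. This is only a sketch: the container name and the surrounding manifest structure are assumptions, and a Helm-based setup like the one in this thread would set the image through its chart values instead.

```yaml
# Fragment of a Kubernetes Deployment for ShinyProxy (container name is hypothetical)
spec:
  template:
    spec:
      containers:
        - name: shinyproxy
          # Snapshot build mentioned above; pinned by its full tag
          image: openanalytics/shinyproxy-snapshot:3.2.0-SNAPSHOT-20241008.093924
          imagePullPolicy: IfNotPresent
```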

@saurabh0402

Hi @LEDfan, sorry for the late reply. I have removed the log file, though the tokens it had were very short-lived, so we are fine there.
We will try out the image and update here.

@saurabh0402

Hey @LEDfan, we tried it out today. Here's what we observed:

  • We don't see "This app has been stopped, you can now close this tab" anymore, and the app is loaded even if there are connection issues inside ShinyProxy.
  • When a request fails, an incomplete app might get loaded. For example, if CSS or JS files fail to load because of connection issues inside ShinyProxy, though the app is loaded, it can have missing UI elements. A refresh is needed to fix this.
    • This, for sure, is a much better behavior than earlier but isn't ideal.

Looking at the behavior, we had the following questions:

  • We had checked the Istio log earlier and did not see any logs suggesting that Istio is dropping connections. Are there any ways to check why the connection issues are happening?
  • Is there a retry strategy that ShinyProxy uses to connect to the app and load requests from it? If not, can that be added?
  • How should we go ahead with using the given image? Will the changes be merged into a stable release or do we have to keep using this image only?
