Describe the bug
Trident PVCs mount normally on a worker node at first, but after some time, for a reason we have not identified, they can no longer be mounted on that node. Pods that try to mount a Trident PVC there fail with the error message: "context deadline exceeded"
The exact same PVC can still be mounted on other worker nodes. The issue affects all Trident PVCs, both existing ones and ones created after the issue started. Restarting the trident-node pod on the affected worker node does not fix the issue.
Mounting the NetApp shares manually on the affected node works completely fine.
Environment
Provide accurate information about the environment to help us reproduce the issue.
Trident version: 24.02
Trident installation flags used: Trident Operator with default values
To Reproduce
Unknown
Expected behavior
Trident PVCs should be able to be mounted at all times.
Additional context
The cluster on which this problem occurs runs all of our GitLab Runner build jobs. Dozens of build jobs run on it simultaneously, and multiple jobs that start at the same time want to mount the same Trident PVCs.
Attached is the log of the trident-node pod on the node before we terminated and started a new one. trident-node-linux-t5f54.txt
We are also starting to see this. We recently changed the SVM name in our TridentBackendConfig and things were running OK. We then upgraded to OpenShift 4.16.18, and as it restarted pods, several of them hit the same "context deadline exceeded" message and won't mount.
Not sure if it is related to the 4.16 upgrade, to the fact that we updated the backend, or unrelated entirely.
Trident v24.06.1 via Helm Chart
Hi @rusLukasRath, we need to investigate further to identify the root cause. Can you please open a NetApp support ticket so they can help gather the logs and information required to investigate further?
Hi @sjpeeris, we had a case open for about 2 months. We closed the ticket because support was not really helpful. We discovered that our issue seems to be related to memory pressure on our worker nodes. It seems that the Trident CSI driver loses its connection to the csi.sock whenever the worker runs out of memory. We have not seen this with other CSI drivers running in our cluster.
Since increasing the memory of our workers we have not seen this issue anymore. At least not from what we have been alerted on.
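For anyone who wants to check for the same correlation on an affected node, here is a minimal sketch in Python. It reads `MemAvailable` from `/proc/meminfo` and tests whether a Unix socket accepts a connection within a timeout. The socket path in `CSI_SOCK` is an assumption about the default Trident node plugin location and should be verified on your node; the helper names `mem_available_kib` and `socket_accepts` are made up for this example. It only shows that something is listening on the socket, not that the CSI gRPC service behind it is healthy.

```python
import socket
from pathlib import Path

# ASSUMPTION: default location of the Trident node plugin socket.
# Verify the actual path on your node before relying on this.
CSI_SOCK = "/var/lib/kubelet/plugins/csi.trident.netapp.io/csi.sock"


def mem_available_kib(meminfo_text):
    """Return the MemAvailable value (KiB) from /proc/meminfo content, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def socket_accepts(path, timeout=2.0):
    """Return True if a Unix socket at `path` accepts a connection in time."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return True
    except OSError:  # covers missing path, refused connection, and timeout
        return False
    finally:
        s.close()


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        print("MemAvailable KiB:", mem_available_kib(meminfo.read_text()))
    print("csi.sock accepts connections:", socket_accepts(CSI_SOCK))
```

Run it on the worker node itself (or from a pod with the kubelet plugin directory mounted), once while mounts work and once while they fail, and compare the two outputs.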