PVCs fail to mount on a node but it previously worked - context deadline exceeded #922

Open
rusLukasRath opened this issue Aug 30, 2024 · 3 comments

Comments

@rusLukasRath

Describe the bug
Trident PVCs initially mounted normally on the worker node, but after some time, for a reason we have not identified, they can no longer be mounted on that node. Pods that try to mount a Trident PVC fail with the error message: "context deadline exceeded"

%pn_2024-08-30_11-34-41

The exact same PVC can still be mounted on other worker nodes. The issue affects all Trident PVCs, both existing ones and ones created after the issue started. Restarting the trident-node pod on the affected worker node does not fix the issue.

Mounting the NetApp shares manually on the affected node works completely fine.

WindowsTerminal_2024-08-30_10-35-49
WindowsTerminal_2024-08-30_10-40-36
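
For anyone comparing a healthy node with the affected one, a quick way to see which NFS shares are actually mounted is to read /proc/mounts. The following is a minimal Python sketch, assuming the ontap-nas shares show up with an `nfs` or `nfs4` filesystem type; the function name is made up for illustration.

```python
def nfs_mounts(mounts_text):
    """Return (device, mountpoint, fstype) for each NFS entry in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mountpoint, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found.append((fields[0], fields[1], fields[2]))
    return found
```

On the node itself this could be called as `nfs_mounts(open("/proc/mounts").read())`, once after a manual mount and once after a failed pod start, to compare the two results.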

Environment

Provide accurate information about the environment to help us reproduce the issue.

  • Trident version: 24.02
  • Trident installation flags used: Trident Operator with default values
  • Container runtime: CRI-O 1.28.8-2.rhaos4.15.gitfcaab07.el9
  • Kubernetes version: v1.28.11
  • Kubernetes orchestrator: OpenShift v4.15.23
  • Kubernetes enabled feature gates:
  • OS: Red Hat Core OS with RHEL 9.2
  • NetApp backend types: ontap-nas & ontap-nas-economy
  • Other:

To Reproduce

Unknown

Expected behavior

Trident PVCs should be mountable at all times.

Additional context

The cluster on which this problem occurs runs all of our GitLab Runner build jobs. Dozens of build jobs run simultaneously on this cluster, and multiple jobs that want to mount the same Trident PVCs often start at the same time.

Attached is the log of the trident-node pod on the node before we terminated and started a new one.
trident-node-linux-t5f54.txt
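
To narrow down when the failures began, the attached log can be scanned for the error string. This is a rough Python sketch that makes no assumption about the log format beyond it being plain text with one entry per line; the helper name is hypothetical.

```python
def matching_lines(lines, needle="context deadline exceeded"):
    """Return (line_number, text) for every line containing needle, ignoring case."""
    needle = needle.lower()
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line.lower()
    ]
```

For example, `matching_lines(open("trident-node-linux-t5f54.txt"))` would list every affected line with its position in the file.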

@MallocArray

We are also starting to see this. We recently changed the SVM name in our TridentBackendConfig, and things were running fine. We then upgraded to OpenShift 4.16.18, and as it restarted pods, several hit the same "context deadline exceeded" message and won't mount.

Not sure if it is related to the 4.16 upgrade, to the fact that we updated the backend, or to neither.
Trident v24.06.1 via Helm chart

@sjpeeris
Collaborator

Hi @rusLukasRath, we need to investigate further to identify the root cause. Can you please open a NetApp support ticket, so they can help gather the required logs and information?

@rusLukasRath
Author

> Hi @rusLukasRath, we need to investigate further to identify the root cause. Can you please open a NetApp support ticket, so they can help gather the required logs and information?

Hi @sjpeeris, we had a case open for about 2 months. We closed the ticket because support was not really helpful. We discovered that our issue seems to be related to memory pressure on our worker nodes. It seems that the Trident CSI driver loses its connection to the csi.sock whenever the worker runs out of memory. We have not seen this with other CSI drivers running in our cluster.
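
A simple way to check from the node whether anything is still listening on the socket is to attempt a plain connection with a timeout. The Python sketch below does only that: it does not speak gRPC or issue a CSI call, so a successful connect is a necessary but not sufficient sign of health. The function name is made up, and the path to csi.sock has to be supplied for your own node.

```python
import socket


def socket_accepts(path, timeout_s=2.0):
    """Return True if the Unix domain socket at path accepts a connection in time."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout_s)
    try:
        s.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection, and a timeout.
        return False
    finally:
        s.close()
```

Running this periodically while the node is under memory load would show whether the socket stops accepting connections at the same time the mounts start failing.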

Since increasing the memory of our workers, we have not seen this issue anymore, at least not judging by the alerts we have received.
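
For others who want an early warning before a node gets into this state, the available memory can be read from /proc/meminfo. This is a minimal Python sketch; the 10% threshold is an arbitrary assumption chosen for illustration, not a value from Trident or OpenShift, and the function names are hypothetical.

```python
def mem_available_fraction(meminfo_text):
    """Return MemAvailable / MemTotal parsed from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # sizes are reported in kB
    return values["MemAvailable"] / values["MemTotal"]


def under_pressure(meminfo_text, min_fraction=0.10):
    """Return True if less than min_fraction of total memory is available."""
    return mem_available_fraction(meminfo_text) < min_fraction
```

On a worker this could be fed with `open("/proc/meminfo").read()` and the result sent to whatever alerting is already in place.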
