[BUG]: Resiliency Module #1766

Open
OA72280 opened this issue Feb 20, 2025 · 4 comments
Labels: area/csm-resiliency, type/bug

Comments

OA72280 commented Feb 20, 2025

Bug Description

The podmon module taints all nodes as unschedulable when the PowerFlex API gateway becomes unreachable.

When the API gateway becomes available again, podmon does not remove the taints from the worker nodes.

Logs

N/A

Screenshots

N/A

Additional Environment Information

N/A

Steps to Reproduce

N/A

Expected Behavior

Once the API gateway is available again, podmon should remove the taints from the worker nodes.

CSM Driver(s)

1.5.0

Installation Type

Operator

Container Storage Modules Enabled

Resiliency

Container Orchestrator

N/A

Operating System

N/A

OA72280 added the needs-triage and type/bug labels Feb 20, 2025

csmbot (Collaborator) commented Feb 20, 2025

@OA72280: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at [email protected].

atye self-assigned this Feb 20, 2025
atye added the area/csm-resiliency label and removed the needs-triage label Feb 20, 2025

atye (Contributor) commented Feb 20, 2025

Hi @OA72280. Could you provide the logs from the driver installation? At a minimum, we need the logs from the driver and podmon containers in each pod. You may email them to me directly rather than attaching them here.

atye (Contributor) commented Feb 21, 2025

Per an internal conversation, the logs are not available because they have already been rolled over.

alikdell (Contributor) commented Feb 21, 2025

@OA72280 Are the taints still present on the nodes? If so, the current logs should state why Resiliency is unable to remove them. When Resiliency fails to remove a taint, it logs the reason clearly.
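
For anyone checking whether taints are still present, here is a minimal sketch that lists the taints on each node from the JSON that `kubectl get nodes -o json` produces. It is not taken from the Resiliency code. The function name `tainted_nodes`, the `key_filter` parameter, and the sample taint key `example.com/podmon-hypothetical` are all assumptions made for illustration; substitute the taint key you actually see on your nodes.

```python
def tainted_nodes(node_list, key_filter=None):
    """Return {node name: [taint strings]} for nodes that carry taints.

    node_list  -- a dict shaped like the output of `kubectl get nodes -o json`
    key_filter -- optional substring; only taints whose key contains it are kept
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        # spec.taints is absent when a node has no taints
        taints = node.get("spec", {}).get("taints") or []
        matched = []
        for taint in taints:
            if key_filter is not None and key_filter not in taint["key"]:
                continue
            value = taint.get("value")
            label = f"{taint['key']}={value}" if value else taint["key"]
            matched.append(f"{label}:{taint['effect']}")
        if matched:
            result[name] = matched
    return result


# Hand-written sample data. The podmon-style key below is hypothetical.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "worker-1"},
            "spec": {
                "taints": [
                    {"key": "example.com/podmon-hypothetical", "effect": "NoSchedule"}
                ]
            },
        },
        {"metadata": {"name": "worker-2"}, "spec": {}},
        {
            "metadata": {"name": "cp-1"},
            "spec": {
                "taints": [
                    {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}
                ]
            },
        },
    ]
}

if __name__ == "__main__":
    for name, taints in tainted_nodes(SAMPLE, key_filter="podmon").items():
        print(name, taints)
```

To run it against a real cluster, save the node list with `kubectl get nodes -o json > nodes.json`, load the file with `json.load`, and pass the result to `tainted_nodes`. If any worker nodes still appear in the output after the API gateway has recovered, the podmon container logs from those nodes are the ones to collect.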
