fix(backend): stop heartbeat status updates for ScheduledWorkflows. Fixes #8757 #11363

Merged: 3 commits, Nov 26, 2024

Contributor

@demarna1 demarna1 commented Nov 7, 2024

Goal

Fix high ETCD usage of Kubeflow ScheduledWorkflows. Closes #8757

Context

Every time the ScheduledWorkflow controller syncs a SWF resource, it updates the Last Heartbeat Time and Last Transition Time to the current time in the status block.

Status:
  Conditions:
    Last Heartbeat Time:   2024-11-07T11:16:33Z
    Last Transition Time:  2024-11-07T11:16:33Z
    Message:               The schedule is disabled.
    Reason:                Disabled
    Status:                True
    Type:                  Disabled

These heartbeat updates result in an infinite reconciliation loop:

  • SWF is added to the controller work queue.
  • Controller processes the SWF and updates the status's LastProbeTime (shown above as Last Heartbeat Time) and LastTransitionTime to the current time.
  • Object is rewritten to ETCD and the resourceVersion is updated.
  • Shared informer detects that the resourceVersion has changed.
  • Controller event handler re-adds the SWF to the work queue.

This reconciliation loop repeats every 10 seconds for every SWF resource on the cluster. The interval is 10s rather than 1s because the controller has a default queue backoff of 10s, so events are always queued for at least 10s.
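
The loop can be illustrated with a toy model. This is only a sketch, not the controller's actual code: `ToyStore`, `run_controller`, and `heartbeat_status` are hypothetical names, and the store simply assumes that a write which changes the object bumps the resourceVersion and triggers a requeue.

```python
import copy


class ToyStore:
    """Hypothetical stand-in for the API server/ETCD in this sketch."""

    def __init__(self, obj):
        self.obj = copy.deepcopy(obj)
        self.resource_version = 1
        self.writes = 0

    def update_status(self, status):
        """Persist the status; return True if the object actually changed."""
        if status == self.obj["status"]:
            return False  # identical object: nothing is written
        self.obj["status"] = copy.deepcopy(status)
        self.resource_version += 1  # a real change bumps the resourceVersion
        self.writes += 1
        return True


def run_controller(store, build_status, max_syncs=10, backoff_seconds=10):
    """Drain the work queue, stopping after max_syncs so the demo terminates."""
    queue = ["swf"]  # the SWF is added to the work queue
    clock = 0
    syncs = 0
    while queue and syncs < max_syncs:
        queue.pop()
        clock += backoff_seconds  # assumed 10s queue backoff
        syncs += 1
        changed = store.update_status(build_status(clock))
        if changed:
            # The informer sees a new resourceVersion and the event
            # handler re-adds the SWF to the queue.
            queue.append("swf")
    return syncs


def heartbeat_status(now):
    """Status that stamps the current time on every sync (pre-fix behaviour)."""
    return {"type": "Disabled", "status": "True",
            "lastHeartbeatTime": now, "lastTransitionTime": now}


store = ToyStore({"status": {}})
syncs = run_controller(store, heartbeat_status)
print(syncs, store.writes)  # every sync produces a write, so the queue never drains
```

In this model the loop only stops because of the `max_syncs` cap; each sync changes the status, so each sync also causes another write and another requeue.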

Description of the fix

The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 for now in order to fix the ETCD performance issues (which for us have resulted in ETCD outages). Because these fields stay constant, a sync no longer changes the object, so reconciliation completes and the writes to ETCD stop. The schedules continue to function as before, and verbose logging is significantly reduced in several pods. A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
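
A minimal sketch of why constant timestamps end the loop, under stated assumptions: `disabled_condition`, `status_changed`, and `ZERO_TIME` are hypothetical names for illustration, and the epoch value merely stands in for a zero-valued timestamp. It does not reproduce the actual change in this PR.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for a zero-valued timestamp (an assumption for this sketch).
ZERO_TIME = datetime.fromtimestamp(0, tz=timezone.utc)


def disabled_condition(now, heartbeat):
    """Build the 'Disabled' condition, with or without heartbeat timestamps."""
    stamp = now if heartbeat else ZERO_TIME
    return {
        "type": "Disabled",
        "status": "True",
        "reason": "Disabled",
        "message": "The schedule is disabled.",
        "lastHeartbeatTime": stamp,
        "lastTransitionTime": stamp,
    }


def status_changed(stored, desired):
    """A write is only needed when the desired status differs from the stored one."""
    return stored != desired


first_sync = datetime(2024, 11, 7, 11, 16, 33, tzinfo=timezone.utc)
next_sync = first_sync + timedelta(seconds=10)

# With heartbeats, two consecutive syncs never agree, so every sync writes.
before_fix = status_changed(disabled_condition(first_sync, True),
                            disabled_condition(next_sync, True))

# With constant timestamps, the second sync produces an identical status.
after_fix = status_changed(disabled_condition(first_sync, False),
                           disabled_condition(next_sync, False))

print(before_fix, after_fix)  # True False
```

An alternative design would keep the timestamps but update them only when the condition itself changes; the sketch above covers only the constant-value approach described here.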

ETCD performance before & after

I measured ETCD bytes written for all resources on our cluster over a 10-minute span. Once this fix was deployed, we saw a dramatic decrease in ETCD usage (see chart below).

The chart roughly agrees with the back-of-the-napkin math:

  • The average size of our SWF objects is 270 KB.
  • The controller rewrites the object every 10 seconds (6x/min).
  • Bytes written to ETCD per minute = 270 KB x 6/min = 1.6 MB/min per SWF.
  • Our cluster had 54 SWFs at the time of the analysis.
  • ETCD write throughput is 54 x 1.6 MB/min = 86 MB/min = 430 MB every 5 min.
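
The same estimate as a short script, for anyone who wants to plug in their own numbers. The variable names are illustrative, and it assumes 1 MB = 1000 KB. Without rounding at each step, the five-minute figure comes out slightly higher than the rounded 430 MB above.

```python
# Inputs taken from the estimate above.
SWF_SIZE_KB = 270       # average SWF object size
WRITES_PER_MIN = 6      # one rewrite every 10 seconds
SWF_COUNT = 54          # SWFs on the cluster at the time of the analysis

# Assumes 1 MB = 1000 KB.
per_swf_mb_per_min = SWF_SIZE_KB * WRITES_PER_MIN / 1000
cluster_mb_per_min = per_swf_mb_per_min * SWF_COUNT
cluster_mb_per_5min = cluster_mb_per_min * 5

print(f"per SWF:  {per_swf_mb_per_min:.1f} MB/min")    # 1.6 MB/min
print(f"cluster:  {cluster_mb_per_min:.1f} MB/min")    # 87.5 MB/min
print(f"cluster:  {cluster_mb_per_5min:.1f} MB/5 min")  # 437.4 MB/5 min
```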

[Chart: etcd]


Hi @demarna1. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@droctothorpe
Contributor

/ok-to-test

Contributor

@hbelmiro hbelmiro left a comment

> The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 (...)
> A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).

Any reason for not removing them right now?

@hbelmiro
Contributor

hbelmiro commented Nov 8, 2024

Also @demarna1, can you please link the PR to the issue?

@demarna1
Contributor Author

demarna1 commented Nov 8, 2024

@hbelmiro linked the PR to the issue.

> The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 (...)
> A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
>
> Any reason for not removing them right now?

My first priority is addressing the ETCD performance issue and I didn't want a CRD change to delay it. But I see no reason we can't remove them and I'd be happy to do that in a follow-on PR!

@droctothorpe
Contributor

@hbelmiro do you happen to know why the ok-to-test label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.

@hbelmiro
Contributor

hbelmiro commented Nov 8, 2024

> @hbelmiro do you happen to know why the ok-to-test label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.

@droctothorpe I don't know :(
It seems like something has changed in the repo's permissions.
The following used to work for first-time contributors.

/rerun-all
/ok-to-test

@droctothorpe
Contributor

Thanks, @hbelmiro! @HumairAK @zijianjoy do you happen to know if this change was intentional? It's out of sync with this documentation about membership privileges.

@droctothorpe
Contributor

Bump.

@demarna1
Contributor Author

@HumairAK can you re-run CI?

@hbelmiro
Contributor

@demarna1 can you please check the failing tests?

@demarna1
Contributor Author

@hbelmiro I checked, but the failures don't appear to be related to my change. It looks like a timeout of some sort. Can we try re-running?

@github-actions github-actions bot added the ci-passed label (All CI tests on a pull request have passed) Nov 26, 2024
@HumairAK
Collaborator

> ETCD write throughput is 54 x 1.6 MB/min = 86 MB/min = 430 MB every 5 min.

Oh. My. God. 🤦🏾

Awesome work, folks! I agree we should either drop these fields or update them only when actual non-status-related changes occur. Can we get a follow-up issue?

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Nov 26, 2024
@HumairAK HumairAK added this to the KFP 2.4.0 milestone Nov 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 9ccec4c into kubeflow:master Nov 26, 2024
16 checks passed
@demarna1 demarna1 deleted the stop-heartbeat branch November 26, 2024 18:58
Labels
approved, ci-passed (All CI tests on a pull request have passed), lgtm, ok-to-test, size/XS
Development

Successfully merging this pull request may close these issues.

[backend] Performance issue: ScheduledWorkflow is taking significant amount of etcd storage
4 participants