-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(backend): stop heartbeat status updates for ScheduledWorkflows. Fixes #8757 #11363
Conversation
…roller. Fixes kubeflow#8757 Signed-off-by: demarna1 <[email protected]>
Hi @demarna1. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 (...)
A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
Any reason for not removing them right now?
Also @demarna1, can you please link the PR to the issue? |
@hbelmiro linked the PR to the issue.
My first priority is addressing the ETCD performance issue and I didn't want a CRD change to delay it. But I see no reason we can't remove them and I'd be happy to do that in a follow-on PR! |
@hbelmiro do you happen to know why the |
@droctothorpe I don't know :( /rerun-all |
Thanks, @hbelmiro! @HumairAK @zijianjoy do you happen to know if this change was intentional? It's out of sync with this documentation about membership privileges. |
Bump. |
Signed-off-by: demarna1 <[email protected]>
@HumairAK can you re-run CI? |
@demarna1 can you please check the failing tests? |
@hbelmiro I checked but doesn't appear to be related to my change. It looks like a timeout of some sort. Can we try re-running? |
Oh. My. God. 🤦🏾 Awesome work folks! I agree we should either drop these fields, or only update these fields when actual non-status related updates occur. Can we get a follow up issue? /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: HumairAK The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Goal
Fix high ETCD usage of Kubeflow ScheduledWorkflows. Closes #8757
Context
Every time the ScheduledWorkflow controller syncs a SWF resource, it updates the Last Heartbeat Time and Last Transition Time to the current time in the status block.
These heartbeat updates result in an infinite reconciliation loop:
Description of the fix
The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 for now in order to fix the ETCD performance issues (which for us has resulted in ETCD outages). By keeping these fields constant, the object can be reconciled and the writes to ETCD stop. The schedules continue to function as before. Verbose logging is significantly reduced in several pods. A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
ETCD performance before & after
I measured ETCD bytes written for all resources on our cluster over a 10 minute time span. Once this fix was instituted, we saw a dramatic decrease in ETCD usage (see chart below).
The chart roughly agrees with the back-of-the-napkin math:
Checklist: