This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Control Plane Stabilization & Auto-Scaling to 500 Nodes #79

Open
8 tasks
vlerenc opened this issue Sep 24, 2020 · 1 comment
Labels
area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related area/robustness Robustness, reliability, resilience related component/hvpa HVPA kind/epic Large multi-story topic kind/roadmap Roadmap BLI lifecycle/rotten Nobody worked on this for 12 months (final aging stage) topology/shoot Affects Shoot clusters
Comments

vlerenc commented Sep 24, 2020

Feature (What you would like to be added):
For the time being, we collect our findings about unstable control planes and what we need to do about them here (only a few cover HVPA directly):

  • VPA has many shortcomings, e.g. it cannot deal properly with spikes and it recommends requests below current usage, which is one of the main reasons we cannot scale to larger clusters, see ☂️ Improve VPA Recommendations gardener/autoscaler#47
  • Once a single OOMKilled pod triggers new VPA recommendations, HVPA rolls them out to all replicas, recreating all of them and thereby terminating all connections and putting stress on the system to reinitialise (e.g. taking down ETCD)
  • We use HVPA to mitigate glaring issues with VPA and also for horizontal and vertical pod auto-scaling on the same metric, but possibly we should switch to request-based horizontal autoscaling once we have improved VPA and can drop HVPA completely
  • Once the control plane of a large cluster fails, it can no longer recover by itself: the components restart in a vicious cycle, and nodes need to be onboarded in a controlled way, for which standard Kubernetes provides no solution yet (batched/staged node onboarding, so that the starting control plane is not overloaded again and again)
  • Clustered ETCD is required to make the cluster more resilient and to not let it die in a downward spiral should we update ETCD or should something happen to the instance we run
  • CoreDNS is not stable; we see unbalanced load patterns that we must address by means of node-local DNS or better vertical pod autoscaling (as horizontal pod autoscaling is pretty much pointless)
  • Calico Typha is recommended to be used together with the cluster-proportional auto-scaler, but that's more a community bandaid as it only scales based on the number of nodes, whatever the size/load, so it boils down again to a better VPA to get that problem under control
  • Our monitoring/logging stacks have a fixed size (also to control the costs), but while we do not want to "pay" for excessive logging, the sizing should be more reasonable and match the basic cluster needs for the control plane and the Kubelets
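
To illustrate the Calico Typha point above, here is a minimal sketch of purely node-proportional scaling. It is a simplification, not the cluster-proportional autoscaler's actual code: the function name `proportional_replicas`, the ratio of 100 nodes per replica, and the pod counts are all hypothetical values chosen for illustration. It shows that the replica count depends on the node count alone, so two 500-node clusters with very different load end up with the same number of replicas.

```python
import math


def proportional_replicas(nodes, nodes_per_replica, min_replicas=1, max_replicas=None):
    """Replica count derived purely from the node count (illustrative helper)."""
    if nodes_per_replica <= 0:
        raise ValueError("nodes_per_replica must be positive")
    replicas = math.ceil(nodes / nodes_per_replica)
    if max_replicas is not None:
        replicas = min(replicas, max_replicas)
    return max(replicas, min_replicas)


# Hypothetical clusters: same node count, very different load.
idle_cluster = {"nodes": 500, "pods": 2_000}
busy_cluster = {"nodes": 500, "pods": 40_000}

for cluster in (idle_cluster, busy_cluster):
    # The pod count (a stand-in for load) never enters the calculation.
    print(cluster["pods"], proportional_replicas(cluster["nodes"], nodes_per_replica=100))
```

Both clusters get the same replica count here, which is why a usage-based signal (i.e. a better VPA) is needed to size such components properly.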

Motivation (Why is this needed?):

  • Stable control plane, even if spikes or load tests stress it
  • Support for large clusters of 500 nodes (or more)
@vlerenc vlerenc added area/robustness Robustness, reliability, resilience related kind/epic Large multi-story topic topology/shoot Affects Shoot clusters component/hvpa HVPA area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related roadmap/external labels Sep 24, 2020
@vlerenc vlerenc added this to the 2021-Q1 milestone Sep 24, 2020
@vlerenc vlerenc changed the title ☂️ Control Plane Stabilization ☂️ Control Plane Stabilization & Auto-Scaling Oct 17, 2020
@vlerenc vlerenc changed the title ☂️ Control Plane Stabilization & Auto-Scaling ☂️ Control Plane Stabilization & Auto-Scaling to 500 Nodes Oct 17, 2020
@vlerenc vlerenc changed the title ☂️ Control Plane Stabilization & Auto-Scaling to 500 Nodes Control Plane Stabilization & Auto-Scaling to 500 Nodes Oct 18, 2020
@vasu1124

Please plan for a public blog article that describes this unique feature of Gardener at a high level, so that it can be used to attract interest and establish thought leadership (internally as well as externally).

@amshuman-kr amshuman-kr modified the milestones: 2021-Q1, 2021-Q3 Jun 9, 2021
@shreyas-s-rao shreyas-s-rao modified the milestones: 2021-Q3, 2022-Q2 Oct 6, 2021
@gardener-attic gardener-attic deleted a comment from gardener-robot Jan 28, 2022
@gardener-attic gardener-attic deleted a comment from gardener-robot Jan 28, 2022
@gardener-attic gardener-attic deleted a comment from gardener-robot Jan 28, 2022
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jul 28, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 24, 2023
@gardener-robot gardener-robot added kind/roadmap Roadmap BLI and removed roadmap/cloud labels Mar 23, 2023

5 participants