Control Plane Stabilization & Auto-Scaling to 500 Nodes #79

vlerenc · 2020-09-24T14:39:17Z

Feature (What you would like to be added):
For the time being, we collect our findings about unstable control planes and what we need to do about them here (only a few cover HVPA directly):

VPA has many shortcomings, e.g. it cannot deal properly with spikes and it recommends requests below current usage, which is one of the main reasons we cannot scale to larger clusters; see ☂️ Improve VPA Recommendations gardener/autoscaler#47
Once a single OOMKilled pod triggers new VPA recommendations, HVPA rolls them out to all replicas, recreating all of them, which terminates all connections and puts stress on the system as it reinitialises (e.g. taking down ETCD)
We use HVPA to mitigate glaring issues with VPA and also for horizontal and vertical pod auto-scaling on the same metric, but possibly we should switch to request-based horizontal autoscaling once we have improved VPA and can drop HVPA completely
Once a large cluster control plane fails, it can no longer recover by itself: the components restart in a vicious cycle, and nodes need to be onboarded in a controlled way, for which standard Kubernetes provides no solution yet (batched/staged node onboarding, so that the starting control plane is not overloaded again and again)
Clustered ETCD is required to make the cluster more resilient and to keep it from dying in a downward spiral when we update ETCD or something happens to the instance we run
CoreDNS is not stable: we see unbalanced load patterns that we must address by means of node-local DNS or better vertical pod autoscaling (as horizontal pod autoscaling is pretty much pointless)
Calico Typha is recommended to be used together with the cluster-proportional autoscaler, but that is more of a community band-aid, as it scales only on the number of nodes, whatever their size/load, so it again boils down to a better VPA to get that problem under control
Our monitoring/logging stacks have a fixed size (also to control the costs), but while we do not want to "pay" for excessive logging, the sizing should be more reasonable and match the basic cluster needs for the control plane and the Kubelets
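
To illustrate the Calico Typha point above, here is a minimal sketch of node-count-based scaling, modelled on the "linear" mode of the cluster-proportional autoscaler as we understand it. The function name and the parameter values are assumptions chosen for illustration, not settings taken from our clusters. Note that load does not appear anywhere in the calculation.

```python
import math
from typing import Optional


def linear_replicas(
    nodes: int,
    nodes_per_replica: int,
    min_replicas: int = 1,
    max_replicas: Optional[int] = None,
) -> int:
    """Replica count derived from the node count alone (hypothetical helper).

    Sketch of a linear, node-proportional rule:
    replicas = ceil(nodes / nodes_per_replica), clamped to [min, max].
    """
    if nodes_per_replica < 1:
        raise ValueError("nodes_per_replica must be >= 1")
    replicas = math.ceil(nodes / nodes_per_replica)
    if max_replicas is not None:
        replicas = min(replicas, max_replicas)
    # The lower bound is applied last, so at least min_replicas are returned.
    return max(replicas, min_replicas)


if __name__ == "__main__":
    # Two clusters with the same node count get the same replica count,
    # no matter how busy the nodes actually are.
    for nodes in (10, 100, 500):
        print(nodes, linear_replicas(nodes, nodes_per_replica=50))
```

A rule like this cannot react to a few very large or very busy nodes, which is why the sizing question comes back to vertical autoscaling.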

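The batched/staged node onboarding mentioned above could look roughly like the following sketch. It is purely illustrative: standard Kubernetes offers no such mechanism, and the function name, batch sizes and growth factor are hypothetical. The idea is to admit a small batch of nodes first and grow the batch size only gradually, with the caller waiting for the control plane to settle between batches.

```python
from typing import Iterator, List, Sequence


def onboarding_batches(
    nodes: Sequence[str],
    initial: int = 10,
    factor: float = 2.0,
    max_batch: int = 100,
) -> Iterator[List[str]]:
    """Yield nodes in growing batches (hypothetical staging scheme).

    The batch size starts at `initial`, is multiplied by `factor` after
    each batch, and never exceeds `max_batch`.
    """
    if initial < 1 or max_batch < 1 or factor < 1:
        raise ValueError("initial, max_batch and factor must all be >= 1")
    size = min(initial, max_batch)
    start = 0
    while start < len(nodes):
        yield list(nodes[start:start + size])
        start += size
        size = min(max_batch, int(size * factor))


if __name__ == "__main__":
    all_nodes = [f"node-{i}" for i in range(500)]
    for batch in onboarding_batches(all_nodes):
        # A real implementation would wait here until the control plane
        # reports healthy before admitting the next batch.
        print(len(batch))
```

With the defaults above, 500 nodes are admitted in batches of 10, 20, 40, 80, 100, 100, 100 and 50.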
Motivation (Why is this needed?):

Stable control plane, even if spikes or load tests stress it
Support for large clusters of 500 nodes (or more)

vasu1124 · 2020-11-11T10:10:05Z

Please plan for a public blog article that describes this unique feature of Gardener at a high level, so that it can be used to attract interest and establish thought leadership (internally as well as externally)

vlerenc added this to the 2021-Q1 milestone Sep 24, 2020

amshuman-kr mentioned this issue Oct 5, 2020

etcd-druid Pod is not scaled up despite being OOMKilled gardener/gardener#2890

Closed

gardener-robot added roadmap/cloud-sap and removed roadmap/external labels Oct 16, 2020

vlerenc added roadmap/team-internal and removed roadmap/cloud-sap labels Oct 17, 2020

vlerenc changed the title ~~☂️ Control Plane Stabilization~~ ☂️ Control Plane Stabilization & Auto-Scaling Oct 17, 2020

vlerenc changed the title ~~☂️ Control Plane Stabilization & Auto-Scaling~~ ☂️ Control Plane Stabilization & Auto-Scaling to 500 Nodes Oct 17, 2020

vlerenc added the roadmap/cloud-sap label Oct 17, 2020

gardener-robot removed the roadmap/team-internal label Oct 17, 2020

vlerenc changed the title ~~☂️ Control Plane Stabilization & Auto-Scaling to 500 Nodes~~ Control Plane Stabilization & Auto-Scaling to 500 Nodes Oct 18, 2020

gardener-robot added roadmap/cloud and removed roadmap/cloud-sap labels May 21, 2021

amshuman-kr modified the milestones: 2021-Q1, 2021-Q3 Jun 9, 2021

shreyas-s-rao modified the milestones: 2021-Q3, 2022-Q2 Oct 6, 2021

gardener-attic deleted a comment from gardener-robot Jan 28, 2022

gardener-robot added the lifecycle/stale label ("Nobody worked on this for 6 months (will further age)") Jul 28, 2022

gardener-robot added the lifecycle/rotten label ("Nobody worked on this for 12 months (final aging stage)") and removed the lifecycle/stale label Jan 24, 2023

gardener-robot added the kind/roadmap label ("Roadmap BLI") and removed the roadmap/cloud label Mar 23, 2023
