
Feat/cluster gitops #47

Open: wants to merge 5 commits into base: main
82 changes: 82 additions & 0 deletions Design-Docs/Cluster-GitOps.md
@@ -0,0 +1,82 @@
# Proposal for GitOps Style IaC for Cluster Management

Kurt Garloff, v0.1, 2022-02-18

## Motivation and Goals

Using the Kubernetes Cluster API (capi), we get a k8s-style declarative
way to describe the workload clusters that should be running and can manage
their lifecycle: creation, changes, rolling upgrades and cleanup can all be
performed with it. The OpenStack provider (capo) has the basic integration
to manage the networks, virtual machines and load balancers. For full automation,
a few more pieces have been developed in the SCS
[k8s-cluster-api-provider](https://github.com/SovereignCloudStack/k8s-cluster-api-provider/)
repository: registering the images, setting up extra security groups for a
non-Calico CNI, creating anti-affinity server groups, deploying the OCCM
and cinder CSI integration, and optionally some more services (such as Flux,
cert-manager, nginx ingress, ...).

This follows similar ideas as described on
<https://www.weave.works/blog/gitops-and-cluster-api-master-of-masters>.

Most of the simpler cluster setups can be done without ever touching the cluster-template.yaml
file -- a dozen adjustments in clusterctl.yaml provide a reasonable amount
of flexibility. In SCS's k8s-cluster-api-provider setup, the standard settings from
the capo templates have been extended by settings that let you choose cilium
as an alternative CNI provider, enable anti-affinity, control the OCCM and CSI deployment
and enable the extra services.
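
For illustration, such a clusterctl.yaml could look roughly like the sketch below.
The upstream capo variables (machine counts, flavors, Kubernetes version) are real
template inputs; the names of the SCS-specific switches are assumptions here and may
differ from the actual variables in the k8s-cluster-api-provider repository.

```yaml
# clusterctl.yaml -- illustrative sketch, not the authoritative variable list
CONTROL_PLANE_MACHINE_COUNT: 3
WORKER_MACHINE_COUNT: 3
KUBERNETES_VERSION: v1.22.6
OPENSTACK_CONTROL_PLANE_MACHINE_FLAVOR: SCS-2V:4:20
OPENSTACK_NODE_MACHINE_FLAVOR: SCS-4V:8:20
# SCS extensions beyond the upstream capo template (names assumed for illustration):
USE_CILIUM: "true"             # cilium instead of calico as the CNI
USE_ANTI_AFFINITY: "true"      # create anti-affinity server groups
DEPLOY_OCCM: "true"            # OpenStack cloud controller manager
DEPLOY_CINDER_CSI: "true"      # cinder CSI driver
DEPLOY_METRICS: "true"         # metrics service
DEPLOY_NGINX_INGRESS: "false"  # optional extra services
DEPLOY_CERT_MANAGER: "false"
DEPLOY_FLUX: "false"
```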

To implement a simple GitOps-style management for a set of clusters, we would
basically create a reconciliation loop on the capi management node, which
gets the enhanced configuration files from git and creates, changes and
destroys clusters according to the settings there. The mechanism should
distinguish between base settings and overlays (kustomization style) to
modify and extend the settings, and should allow an opt-in mechanism for
more detailed adjustments by also overriding the cluster-templates
(kustomization style again).
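
As a sketch of the base/overlay idea (directory layout, file names and the use of
a configMapGenerator to carry the settings are illustrative assumptions, not a
fixed convention):

```yaml
# clusters/base/kustomization.yaml -- defaults shared by all clusters
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - cluster.yaml               # rendered cluster-template / capi manifests
configMapGenerator:
  - name: cluster-settings     # carries the clusterctl.yaml-style settings
    literals:
      - USE_CILIUM=true
      - DEPLOY_METRICS=true
---
# clusters/overlays/prod1/kustomization.yaml -- per-cluster overrides
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namePrefix: prod1-
resources:
  - ../../base
patches:
  - path: worker-count.yaml    # e.g. raise the worker count only for this cluster
```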

Cluster-admin credentials need to be provided to the owners of the cluster.
The current thinking is to have a public key included in the git repo;
the cluster-admin credentials for the created cluster would be encrypted
with this public key and can then be published. (We would still use https
to ensure that these are the genuine credentials.) Only the owner(s) of
the private key can decrypt the credentials.
Comment on lines +37 to +42 (Member):
https://github.com/bitnami-labs/sealed-secrets may be worth looking at for this

Reply (Contributor):
Same for SOPS. Flux also has builtin support for it: https://fluxcd.io/docs/guides/mozilla-sops/
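
To illustrate the SOPS route suggested here (a sketch that assumes age keys; the
recipient string below is a placeholder, not a real key), a `.sops.yaml` in the git
repo could pin the cluster owner's public key, so that the encrypted kubeconfig
files published by the management node can only be decrypted by that owner:

```yaml
# .sops.yaml -- illustrative sketch: encrypt the published cluster-admin
# kubeconfigs for the cluster owner's age public key (placeholder below)
creation_rules:
  - path_regex: .*-kubeconfig\.enc\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age1placeholder0000000000000000000000000000000000000000000000
```

Flux's builtin SOPS support referenced above covers the opposite direction as well:
secrets kept encrypted in git can be decrypted by Flux (spec.decryption.provider: sops
in a Kustomization) before they are applied to the management cluster.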


## Implementation thoughts

The reconciliation loop would roughly look like this:
1. Get the latest clusters from git (via a regular check or an event) -- see the Flux-based sketch after this list
Comment (Member):
This should be done by something like fluxcd, shouldn't it?

1. Per cluster
Comment (Member):
I'm not sure what (and why) exactly is needed here beyond what already happens when one applies a cluster resource.

In my mind, the workflow would look like: "install fluxcd, register git repository, put cluster manifests in there, done", so this looks very complex. Am I missing anything?

   1. Ensure we have the image available, register it if needed
   1. Other pre-flight sanity checks (quota, syntax, flavors, inconsistencies, ...)
   1. For a new cluster
      1. Optionally create a new project; if so, share the image with it
      1. Create two application credentials (one for capo, one for OCCM/CSI)
      1. Create cilium security group
      1. Create anti-affinity server groups (if not disabled)
   1. Adjust settings in the cluster-template (cluster-name, sec groups, affinity, ...)
   1. Process with clusterctl
   1. Submit to capo
   1. Wait for control plane readiness
   1. For a new cluster: extract the cluster-admin creds and encrypt them with the pubkey
   1. Deploy CNI (calico or cilium) -- avoid switching unless forced
   1. Deploy OCCM
   1. Deploy cinder CSI
   1. Deploy metrics service (if not disabled), otherwise remove
   1. Sanity checks
   1. For all other optional services (nginx, flux, cert-manager, harbor, ...):
      1. Deploy service if enabled, otherwise remove (if deployed before)
      1. Sanity checks
   1. Optional CI tests
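
If Flux were used for step 1 (fetching the cluster definitions from git), as
suggested in the review comments above, the source registration could look roughly
like this; repository URL, paths and intervals are illustrative, and the per-cluster
pre-flight and post-creation steps from the list would still need a controller or
job wrapped around the plain apply:

```yaml
# GitRepository: Flux polls the repo holding the enhanced cluster configuration
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/cluster-config   # illustrative URL
  ref:
    branch: main
---
# Kustomization: apply the kustomize-rendered cluster manifests to the
# capi management cluster
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: clusters
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/overlays
  prune: true
  sourceRef:
    kind: GitRepository
    name: cluster-config
```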

## Open questions

* Can this be integrated into capo or do we really need to create a loop around it?

* On errors in this loop, we would move to the next cluster. However, how do we handle error reporting?

Comment:

On fluxcd: this happens independently for every cluster at the same time (unless you set a separate branch, tag (semver parsing included) or git repo for each cluster) ...

fluxcd has metrics; we use Prometheus to collect them and fire alerts if something does not work.

How do we avoid operators having to log in to get the capo logs?

* Can we integrate this with the helm charts work?

* Can we implement this using flux?

* How does this relate exactly to <https://github.com/weaveworks/weave-gitops>?