[WIP] Updates to fluence to restore functionality and update builds #61
Conversation
Also tweak some docstrings for fluence. Signed-off-by: vsoch <[email protected]>
Problem: the local upstream might get out of date. Solution: provide an easy make update target to pull from it. Signed-off-by: vsoch <[email protected]>
There are too many edge cases / too much complexity and behavior that I do not understand to pursue having the pod group information cached with fluence. For now I am nuking it and testing the initial design as a sanity check. Signed-off-by: vsoch <[email protected]>
Problem: the podgroups with second precision have interleaving. Solution: try to create an internal representation with better precision. This looks promising with early testing, but I need to consider the edge cases and how to clean up the groups; otherwise a pod group might be re-created later and still be in the cache. Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
We install with the helm manifests, and the old fluence manifests might be confusing (they have changed). This commit removes the old manifests, and also changes some of the fmt.Print logging to use klog so it is easier to parse. Signed-off-by: vsoch <[email protected]>
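For reference, a minimal sketch of the kind of logging switch described here; the call sites, messages, and verbosity levels are illustrative, not the actual fluence code:

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	// klog registers its flags (e.g. -v for verbosity) on the standard flag set.
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// Before: fmt.Printf("[Fluence] asked to schedule group %s\n", group)
	// After: leveled, structured logging that is easier to filter and parse.
	group := "example-group"
	klog.Infof("[Fluence] asked to schedule group %s", group)
	klog.V(2).InfoS("prefilter", "group", group, "size", 4)
}
```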
This adds a prototype support for an extra helm flag that dually enables adding an extra grpc set of endpoints, and then the configs (ingress and service) necessary to expose them. I next need to figure out how to interact with grpc from a local client, likely built from the same codebase and grpc spec. This is super cool!! Signed-off-by: vsoch <[email protected]>
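As a rough sketch of what that local client could look like, assuming a Go client built from the same protobuf spec; the endpoint address, import path, and service/method names below are placeholders, not the actual fluence API:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	// pb "github.com/flux-framework/fluence/pkg/fluence-grpc" // placeholder path for generated stubs
)

func main() {
	// Dial the externally exposed endpoint (address and port are illustrative;
	// a real deployment behind the ingress would use TLS credentials instead).
	conn, err := grpc.Dial("localhost:4242",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// With the generated stubs compiled in, calls would look roughly like:
	//   client := pb.NewFluxionServiceClient(conn)
	//   resp, err := client.Match(ctx, &pb.MatchRequest{...})
	log.Printf("connected to %s", conn.Target())
}
```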
Problem: we want to be able to persist PodGroup if upstream removes it. Solution: build our own controller image, which also allows us to tweak it to enhance fluence. This commit also renames the helm install to "fluence" to ease the developer workflow. Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Problem: we want to try a design where a mutating admission webhook handles receiving pods and creating the PodGroup from labels. We are choosing a mutating webhook with the expectation that, at some point, we might be able to change the size (min/max/desired) either for the PodGroup or for some other watcher on jobs. Note that this is an empty skeleton - the webhook is added and running but basically does nothing. I am also fixing a bug I noticed while running kind: fluence was assigning work to the control plane. I think there may have once been logic (a commented-out worker label) anticipating a check for the control plane, but it looks like on production clusters we do not always have access, and it was never finished. Note that this addition does not guarantee the design will work; it is just one step. Since the helm charts are manually generated for the scheduler-plugin (as far as I can tell), this took me almost 6 hours to figure out and get working. I am really starting to think there is no skill behind software engineering beyond absolute patience. Signed-off-by: vsoch <[email protected]>
Problem: we need every pod object coming into the cluster to be part of a group. Solution: this change adds logic to the mutating webhook to add the labels that indicate the group name and size. We can eventually add flexibility here. I also realized that we can easily watch for job objects first and add the group size/name to the pod template; this is much more efficient than adding to each individual pod that is part of a larger job. With this approach I was able to create a fluence-scheduled pod and then see my labels added! It does not do anything beyond that. I am also adding a nice script that makes it easy to build, load, and install fluence freshly, otherwise you will use all your internet data for the month in like, two days. Do not do that :P Signed-off-by: vsoch <[email protected]>
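A condensed sketch of what a label-adding mutating webhook like this looks like with controller-runtime; the label keys, type name, and defaults are illustrative assumptions, not necessarily what fluence uses:

```go
package webhook

import (
	"context"
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// podGroupMutator adds group name/size labels to every incoming pod.
// How the decoder is injected depends on the controller-runtime version.
type podGroupMutator struct {
	decoder *admission.Decoder
}

func (m *podGroupMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
	pod := &corev1.Pod{}
	if err := m.decoder.Decode(req, pod); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	// Only fill in defaults if the user (or a job-level mutation) has not set them.
	if _, ok := pod.Labels["fluence.group-name"]; !ok {
		pod.Labels["fluence.group-name"] = pod.Name
	}
	if _, ok := pod.Labels["fluence.group-size"]; !ok {
		pod.Labels["fluence.group-size"] = "1"
	}
	patched, err := json.Marshal(pod)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	return admission.PatchResponseFromRaw(req.Object.Raw, patched)
}
```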
Ah, this is really interesting - the PodGroup reconciler was already watching pods, so there was never a need to manually create the PodGroup - given the label, it would have been created for us. The subtle difference between then and now is that we want everything scheduled by fluence to have a group, so given our added webhook that adds the label, we would only hit this case if a Pod came through that was scheduled by the default scheduler (because it wouldn't be using the controller alongside fluence, but the controller would still be watching all Pods in the cluster). But maybe there is a use case for that, for combining across objects with pods, or asking for a different size than the number of pods in some set. Will think more about it tomorrow / later today. 💤 Ah, I don't think that's right (I was looking at my updated code) - in the original it would hit here and not be found, so it does need something to create it.
Problem: we want the labels (size and name) that are explicitly set to lead to the creation of the pod group, so the user does not need to create it. This is done by way of a watcher on pods, which triggers after the webhook that ensures every pod (in a job or as a single pod) has the proper label. We likely want to do this for other abstractions that hold pods as well, because it ensures that no matter how the pods enter pending, we have the correct size and name. The only case where a pod can come in without the label is when it was not scheduled by fluence. The user is directed not to do this, but it is not impossible (e.g., fluence actually sees itself show up here). So after this addition we have the full steps to add the labels and create the pod group, and the next step is (finally) to integrate this into fluence (and remove the old abstraction that stores it in memory). Signed-off-by: vsoch <[email protected]>
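Roughly, the watcher described above amounts to something like the following inside the PodGroup reconciler. This is a sketch under assumptions (label keys, the upstream scheduler-plugins PodGroup types, and a controller-runtime client), not the actual controller code:

```go
package controllers

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	// Import path may differ by scheduler-plugins version.
	schedv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// PodGroupReconciler is a stand-in for the actual reconciler type.
type PodGroupReconciler struct {
	client.Client
}

// ensurePodGroup creates the PodGroup named by a pending pod's group label if
// it does not already exist. Label keys here are illustrative.
func (r *PodGroupReconciler) ensurePodGroup(ctx context.Context, pod *corev1.Pod) error {
	name, ok := pod.Labels["fluence.group-name"]
	if !ok || pod.Status.Phase != corev1.PodPending {
		// No label means the pod was not admitted through the fluence webhook.
		return nil
	}
	size, err := strconv.Atoi(pod.Labels["fluence.group-size"])
	if err != nil || size < 1 {
		size = 1
	}
	existing := &schedv1alpha1.PodGroup{}
	key := types.NamespacedName{Name: name, Namespace: pod.Namespace}
	if err := r.Get(ctx, key, existing); err == nil {
		return nil // already created
	} else if !apierrors.IsNotFound(err) {
		return err
	}
	pg := &schedv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: pod.Namespace},
		Spec:       schedv1alpha1.PodGroupSpec{MinMember: int32(size)},
	}
	return r.Create(ctx, pg)
}
```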
See last commit message for recent update: 000baac. The PR now supports the full pipeline of a webhook adding labels, a watcher on pod to get them and trigger a PodGroup reconcile, and creation of the PodGroup. In a diagram:

apply ->
  podgroup webhook ->
  add labels (size/name) ->
  watch on pod in PodGroup reconciler ->
  if Pending -> PodGroup

Here is an example PodGroup created by a Job that has completions/parallelism set to 3 - we see that the size reflects that. The PodGroup is then created and enters the reconcile process like any other CRD. Note that I also decided to keep / use the original label for the name - I didn't see a strong reason to change it, and was worried about potential bugs that having two might cause in parts of the code we don't reproduce / aren't familiar with. At this point we should have enough to (finally) integrate this back into fluence; the changes now are that PodGroup creation (size detection) is automated, the timestamp is updated to microseconds, and we have better ownership / control over how the group and associated metadata are handled (good for future ideas, I think). Still no guarantee any of this will work, but I can say that we haven't failed yet.
Problem: fluence should only be storing state of the job id and the presence of a group name in a map to indicate node assignment. Solution: update the code here. Note that this is not working yet; I am pushing / opening the PR so as not to lose the work (and will update accordingly, using this PR to test). Signed-off-by: vsoch <[email protected]>
[WIP] fluence: refactor to use new PodGroup
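For context, the state described in this commit is essentially a small map keyed by group name; a minimal sketch (type and method names are assumptions, not the actual fluence types):

```go
package fluence

import "sync"

// groupJobs tracks only what the commit above describes: the fluxion job id
// per group name, whose presence indicates nodes were already assigned.
type groupJobs struct {
	mu     sync.Mutex
	jobIDs map[string]uint64
}

func newGroupJobs() *groupJobs {
	return &groupJobs{jobIDs: map[string]uint64{}}
}

// Assigned reports whether a group already has a fluxion job (node assignment).
func (g *groupJobs) Assigned(group string) (uint64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	id, ok := g.jobIDs[group]
	return id, ok
}

// Set records the job id for a group; Forget removes it after cancel/cleanup.
func (g *groupJobs) Set(group string, id uint64) { g.mu.Lock(); g.jobIDs[group] = id; g.mu.Unlock() }
func (g *groupJobs) Forget(group string)         { g.mu.Lock(); delete(g.jobIDs, group); g.mu.Unlock() }
```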
Problem: since the PodGroup controller creates the PodGroup, it should delete it as well. Solution: ideally I wanted to attach an owner reference, meaning that the top-level job (that also owns the pod) would be owner of the PodGroup. But that does not seem to take - either because the controller is the owner or because the field is read-only for k8s. For the time being, I decided to delete the PodGroup when the group is determined to be Finished/Failed, which happens when that number of pods equals or exceeds the MinimumSize. Granted that MinimumSize == size, I think this should be OK with fluence, and we might need to consider other approaches if/when the min size is smaller than the total size (because fluence might still see a pod in the queue and try to schedule again). I think what we might do in that case is just update the MinSize for the group, so if fluence schedules again it will be for the smaller size. But I am not sure about that either! TBA. The important thing now is that the pod group cleans itself up! Signed-off-by: vsoch <[email protected]>
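Continuing the reconciler sketch above (same assumed types and imports; the Succeeded/Failed status field names are also assumptions), the cleanup rule described here might look roughly like:

```go
// maybeCleanup deletes the PodGroup once the count of finished (succeeded or
// failed) pods reaches MinMember, per the strategy described above.
func (r *PodGroupReconciler) maybeCleanup(ctx context.Context, pg *schedv1alpha1.PodGroup) error {
	finished := pg.Status.Succeeded + pg.Status.Failed
	if finished >= pg.Spec.MinMember {
		// Ignore NotFound in case another reconcile already removed it.
		return client.IgnoreNotFound(r.Delete(ctx, pg))
	}
	return nil
}
```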
Problem: we need to be able to run deployments, stateful/replica sets, and have them handled by fluence. Solution: allow the webhook to create pod groups for them. In the case that they are not targeted for fluence (any abstraction) and get into PreFilter, allow creation of a FauxPodGroup that will simply schedule one job for the pod. We do this twice - in PreFilter and in the events for update/delete. Signed-off-by: vsoch <[email protected]>
feat: add support for other abstractions
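The "faux" group idea boils down to treating an unlabeled pod as its own group of size one; a sketch under the same assumed types as above (FauxPodGroup is named in the commit, but this construction is illustrative):

```go
// fauxPodGroup builds a single-pod group for pods that reach PreFilter without
// a real PodGroup (i.e. abstractions not targeted at fluence). The idea is to
// give fluence a uniform unit to schedule rather than a special case.
func fauxPodGroup(pod *corev1.Pod) *schedv1alpha1.PodGroup {
	return &schedv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
		},
		Spec: schedv1alpha1.PodGroupSpec{MinMember: 1},
	}
}
```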
Problem: I noticed in testing that the time only had granularity down to the second. Solution: it appears that when we create the PodGroup from the reconciler watch, the metadata (beyond name and namespace) does not stick. I am not sure why, but the labels are still retrievable from the pods (via the mutating webhook) afterward. So instead, we need to get the size and creation timestamp at the first hit in reconcile, which (given how that works) should still somewhat honor the order. I did try adding the timestamp to a label, but it got hairy really quickly (kept me up about 3 hours longer than I intended!). The good news is that I now see the microseconds in the Schedule Start Time, so we should be almost ready to test this on a GCP cluster. I also had lots of time waiting for the containers to rebuild, so I made a diagram of how it is currently working. I have some concerns about the internal state of fluxion (my kind cluster stopped working after some hours and I do not know why), but we can address them later. We mostly need to see if there are jobs that are being forgotten, etc. Signed-off-by: vsoch <[email protected]>
bug: the metav1.MicroTime was not being set
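A sketch of the fix named in the commit title, again continuing the reconciler sketch above: stamp a microsecond-precision start time the first time the group is seen in reconcile, since metadata set at creation did not stick. The ScheduleStartTime field name (and its MicroTime type in this fork) is an assumption:

```go
// Inside Reconcile: on the first pass over a new group, record a
// microsecond-precision schedule start time so that groups created within the
// same second still order deterministically.
if pg.Status.ScheduleStartTime.IsZero() {
	pg.Status.ScheduleStartTime = metav1.NowMicro()
	if err := r.Status().Update(ctx, pg); err != nil {
		return ctrl.Result{}, err
	}
}
```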
Problem: the design description did not correspond with the numbers. Solution: fix them up, and also fix some bugs in the controller and fluence that assume we have pods / a pod group (we do not always). Signed-off-by: vsoch <[email protected]>
docs: update to design description
I am making small changes as I test on GKE and EKS. My first tests on GKE had me creating / deleting jobs, and I think the state of fluence (fluxion) got out of sync with the jobs, meaning that fluxion thought jobs were running that were not and then was unable to allocate new ones. To adjust for that we can add back in the cancel response, but this will only work given that fluence has not lost memory of the job id. We likely need an approach that can either save the jobids to the state data (that could be reloaded) or a way to inspect jobs explicitly and purge, OR (better) a way to look up a job not based on the id, but based on the group id (the command in the jobspec). That way, regardless of a jobid, we could lose all of our state and still find the old (stale) job to delete. With a fresh state and larger cluster I am able to run jobs on GKE, but they are enormously slow - lammps size 2 2 2 is taking over 20 minutes. This is not the fault of fluence - GKE networking sucks. To keep debugging I likely need to move over to AWS with EFA, of course that introduces more things to figure out like EFA, etc. Signed-off-by: vsoch <[email protected]>
testing: gke then eks
This PR will likely be open for a few days while I carefully work through needed changes. I want to try and commit in a meaningful way (despite my previous commits being a bit messy I think) and have the diffs here to inspect. In particular, I plan to do the following:
- Docker Builds
  - ghcr.io/flux-framework/fluence-controller
- Miscellaneous
- Resources
  - Services
  - PodGroup
  - Controller
Update: removed testing cases, because the deployment / statefulset etc. cleanup strategies need further thinking and discussion (unlikely cases, so not a priority).
This is a WIP, will change the title if/when ready for review, and add points as I work on / find them. The criteria (for me) for this to be ready will be resolving the clogging use case, and consistently working for the larger test case on GCP.