[WIP] Updates to fluence to restore functionality and update builds #61
Conversation
Also tweak some docstrings for fluence. Signed-off-by: vsoch <[email protected]>
Problem: the local upstream might get out of date. Solution: provide an easy make update target to pull from it. Signed-off-by: vsoch <[email protected]>
There are too many edge cases / too much complexity and behavior that I do not understand to pursue having the pod group information cached with fluence. For now I am nuking it and testing the initial design as a sanity check. Signed-off-by: vsoch <[email protected]>
Problem: the podgroups with second precision have interleaving. Solution: try to create an internal representation with better precision. This looks promising with early testing, but I need to consider the edge cases and how to clean up the groups; otherwise a pod group might be re-created later and still be in the cache. Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
We install with the helm manifests, and the old fluence manifests might be confusing (they have changed). This commit removes the old manifests, and also changes some of the fmt.Print logging to use klog so it is easier to parse. Signed-off-by: vsoch <[email protected]>
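For reference, a minimal sketch of the kind of logging switch described here; the call sites, messages, and verbosity levels are illustrative, not the actual fluence code:

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	// klog registers its flags (e.g. -v for verbosity) on the standard flag set.
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// Before: fmt.Printf("[Fluence] asked to schedule group %s\n", group)
	// After: leveled, structured logging that is easier to filter and parse.
	group := "example-group"
	klog.Infof("[Fluence] asked to schedule group %s", group)
	klog.V(2).InfoS("prefilter", "group", group, "size", 4)
}
```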
This adds a prototype support for an extra helm flag that dually enables adding an extra grpc set of endpoints, and then the configs (ingress and service) necessary to expose them. I next need to figure out how to interact with grpc from a local client, likely built from the same codebase and grpc spec. This is super cool!! Signed-off-by: vsoch <[email protected]>
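As a rough sketch of what that local client could look like, assuming a Go client built from the same protobuf spec; the endpoint address, import path, and service/method names below are placeholders, not the actual fluence API:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	// pb "github.com/flux-framework/fluence/pkg/fluence-grpc" // placeholder path for generated stubs
)

func main() {
	// Dial the externally exposed endpoint (address and port are illustrative;
	// a real deployment behind the ingress would use TLS credentials instead).
	conn, err := grpc.Dial("localhost:4242",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// With the generated stubs compiled in, calls would look roughly like:
	//   client := pb.NewFluxionServiceClient(conn)
	//   resp, err := client.Match(ctx, &pb.MatchRequest{...})
	log.Printf("connected to %s", conn.Target())
}
```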
Problem: we want to be able to persist PodGroup if upstream removes it. Solution: build our own controller image, which also allows us to tweak it to enhance fluence. This commit also renames the helm install to "fluence" to ease the developer workflow. Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Problem: we want to try a design where a mutating admission webhook handles receiving pods and creating the PodGroup from labels. We are choosing a mutating webhook with the expectation that, at some point, we might be able to change the size (min/max/desired) either for the PodGroup or for some other watcher on jobs. Note that this is an empty skeleton - the webhook is added and running but basically does nothing. I am also fixing a bug I noticed while running kind: fluence was assigning work to the control plane. I think there may have once been logic (a commented-out worker label) anticipating a check for the control plane, but it looks like on production clusters we do not always have access, and it was never finished. Note that this addition does not guarantee the design will work; it is just one step. Since the helm charts are manually generated for the scheduler-plugin (as far as I can tell), this took me almost 6 hours to figure out and get working. I am really starting to think there is no skill behind software engineering beyond absolute patience. Signed-off-by: vsoch <[email protected]>
Problem: we need every pod object coming into the cluster to be part of a group. Solution: this change adds logic to the mutating webhook to add the labels that indicate the group name and size. We can eventually add flexibility here. I also realized that we can easily watch for job objects first and add the group size/name to the pod template; this is much more efficient than adding to each individual pod that is part of a larger job. With this approach I was able to create a fluence-scheduled pod and then see my labels added! It does not do anything beyond that. I am also adding a nice script that makes it easy to build, load, and install fluence freshly, otherwise you will use all your internet data for the month in like, two days. Do not do that :P Signed-off-by: vsoch <[email protected]>
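A condensed sketch of what a label-adding mutating webhook like this looks like with controller-runtime; the label keys, type name, and defaults are illustrative assumptions, not necessarily what fluence uses:

```go
package webhook

import (
	"context"
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// podGroupMutator adds group name/size labels to every incoming pod.
// How the decoder is injected depends on the controller-runtime version.
type podGroupMutator struct {
	decoder *admission.Decoder
}

func (m *podGroupMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
	pod := &corev1.Pod{}
	if err := m.decoder.Decode(req, pod); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	// Only fill in defaults if the user (or a job-level mutation) has not set them.
	if _, ok := pod.Labels["fluence.group-name"]; !ok {
		pod.Labels["fluence.group-name"] = pod.Name
	}
	if _, ok := pod.Labels["fluence.group-size"]; !ok {
		pod.Labels["fluence.group-size"] = "1"
	}
	patched, err := json.Marshal(pod)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	return admission.PatchResponseFromRaw(req.Object.Raw, patched)
}
```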
Ah, this is really interesting - the PodGroup reconciler was already watching pods, so there was never a need to manually create the PodGroup - given the label, it would have been created for us. The subtle difference between then and now is that we want everything scheduled by fluence to have a group, so given our added webhook that adds the label, we would only hit this case if a Pod came through that was scheduled by the default scheduler (because it wouldn't be using the controller alongside fluence, but the controller would still be watching all Pods in the cluster). But maybe there is a use case for that, for combining across objects with pods, or asking for a different size than the number of pods in some set. Will think more about it tomorrow / later today. 💤 Ah, I don't think that's right (I was looking at my updated code) - in the original it would hit here and not be found, so it does need something to create it.
Problem: we want the labels (size and name) that are explicitly set to lead to the creation of the pod group, so the user does not need to create it. This is done by way of a watcher on pods, which triggers after the webhook that ensures every pod (in a job or as a single pod) has the proper label. We likely want to do this for other abstractions that hold pods as well, because it ensures that no matter how the pods enter pending, we have the correct size and name. The only case where a pod can come in without the label is when it was not scheduled by fluence. The user is directed not to do this, but it is not impossible (e.g., fluence actually sees itself show up here). So after this addition we have the full steps to add the labels and create the pod group, and the next step is (finally) to integrate this into fluence (and remove the old abstraction that stores it in memory). Signed-off-by: vsoch <[email protected]>
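Roughly, the watcher described above amounts to something like the following inside the PodGroup reconciler. This is a sketch under assumptions (label keys, the upstream scheduler-plugins PodGroup types, and a controller-runtime client), not the actual controller code:

```go
package controllers

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	// Import path may differ by scheduler-plugins version.
	schedv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// PodGroupReconciler is a stand-in for the actual reconciler type.
type PodGroupReconciler struct {
	client.Client
}

// ensurePodGroup creates the PodGroup named by a pending pod's group label if
// it does not already exist. Label keys here are illustrative.
func (r *PodGroupReconciler) ensurePodGroup(ctx context.Context, pod *corev1.Pod) error {
	name, ok := pod.Labels["fluence.group-name"]
	if !ok || pod.Status.Phase != corev1.PodPending {
		// No label means the pod was not admitted through the fluence webhook.
		return nil
	}
	size, err := strconv.Atoi(pod.Labels["fluence.group-size"])
	if err != nil || size < 1 {
		size = 1
	}
	existing := &schedv1alpha1.PodGroup{}
	key := types.NamespacedName{Name: name, Namespace: pod.Namespace}
	if err := r.Get(ctx, key, existing); err == nil {
		return nil // already created
	} else if !apierrors.IsNotFound(err) {
		return err
	}
	pg := &schedv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: pod.Namespace},
		Spec:       schedv1alpha1.PodGroupSpec{MinMember: int32(size)},
	}
	return r.Create(ctx, pg)
}
```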
See last commit message for recent update: 000baac. The PR now supports the full pipeline of a webhook adding labels, a watcher on pod to get them and trigger a PodGroup reconcile, and creation of the PodGroup. In a diagram:

apply ->
  podgroup webhook ->
  add labels (size/name) ->
  watch on pod in PodGroup reconciler ->
  if Pending -> PodGroup

Here is an example PodGroup created by a Job that has completions/parallelism set to 3 - we see that the size reflects that. The PodGroup is then created and enters the reconcile process like any other CRD. Note that I also decided to keep / use the original label for the name - I didn't see a strong reason to change it, and was worried about potential bugs that having two might cause in parts of the code we don't reproduce / aren't familiar with. At this point we should have enough to (finally) integrate this back into fluence; the changes now are that PodGroup creation (size detection) is automated, the timestamp is updated to microseconds, and we have better ownership / control over how the group and associated metadata are handled (good for future ideas, I think). Still no guarantee any of this will work, but I can say that we haven't failed yet.
Problem: fluence should only be storing state of the job id and the presence of a group name in a map to indicate node assignment. Solution: update the code here. Note that this is not working yet; I am pushing / opening the PR so as not to lose the work (and will update accordingly, using this PR to test). Signed-off-by: vsoch <[email protected]>
[WIP] fluence: refactor to use new PodGroup
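For context, the state described in this commit is essentially a small map keyed by group name; a minimal sketch (type and method names are assumptions, not the actual fluence types):

```go
package fluence

import "sync"

// groupJobs tracks only what the commit above describes: the fluxion job id
// per group name, whose presence indicates nodes were already assigned.
type groupJobs struct {
	mu     sync.Mutex
	jobIDs map[string]uint64
}

func newGroupJobs() *groupJobs {
	return &groupJobs{jobIDs: map[string]uint64{}}
}

// Assigned reports whether a group already has a fluxion job (node assignment).
func (g *groupJobs) Assigned(group string) (uint64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	id, ok := g.jobIDs[group]
	return id, ok
}

// Set records the job id for a group; Forget removes it after cancel/cleanup.
func (g *groupJobs) Set(group string, id uint64) { g.mu.Lock(); g.jobIDs[group] = id; g.mu.Unlock() }
func (g *groupJobs) Forget(group string)         { g.mu.Lock(); delete(g.jobIDs, group); g.mu.Unlock() }
```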
Problem: since the PodGroup controller creates the PodGroup, it should delete it as well. Solution: ideally I wanted to attach an owner reference, meaning that the top-level job (that also owns the pod) would be owner of the PodGroup. But that does not seem to take - either because the controller is the owner or because the field is read-only for k8s. For the time being, I decided to delete the PodGroup when the group is determined to be Finished/Failed, which happens when that number of pods equals or exceeds the MinimumSize. Granted that MinimumSize == size, I think this should be OK with fluence, and we might need to consider other approaches if/when the min size is smaller than the total size (because fluence might still see a pod in the queue and try to schedule again). I think what we might do in that case is just update the MinSize for the group, so if fluence schedules again it will be for the smaller size. But I am not sure about that either! TBA. The important thing now is that the pod group cleans itself up! Signed-off-by: vsoch <[email protected]>
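Continuing the reconciler sketch above (same assumed types and imports; the Succeeded/Failed status field names are also assumptions), the cleanup rule described here might look roughly like:

```go
// maybeCleanup deletes the PodGroup once the count of finished (succeeded or
// failed) pods reaches MinMember, per the strategy described above.
func (r *PodGroupReconciler) maybeCleanup(ctx context.Context, pg *schedv1alpha1.PodGroup) error {
	finished := pg.Status.Succeeded + pg.Status.Failed
	if finished >= pg.Spec.MinMember {
		// Ignore NotFound in case another reconcile already removed it.
		return client.IgnoreNotFound(r.Delete(ctx, pg))
	}
	return nil
}
```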
Problem: we need to be able to run deployments, stateful/replica sets, and have them handled by fluence. Solution: allow the webhook to create pod groups for them. In the case that they are not targeted for fluence (any abstraction) and get into PreFilter, allow creation of a FauxPodGroup that will simply schedule one job for the pod. We do this twice - in PreFilter and in the events for update/delete. Signed-off-by: vsoch <[email protected]>
feat: add support for other abstractions
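The "faux" group idea boils down to treating an unlabeled pod as its own group of size one; a sketch under the same assumed types as above (FauxPodGroup is named in the commit, but this construction is illustrative):

```go
// fauxPodGroup builds a single-pod group for pods that reach PreFilter without
// a real PodGroup (i.e. abstractions not targeted at fluence). The idea is to
// give fluence a uniform unit to schedule rather than a special case.
func fauxPodGroup(pod *corev1.Pod) *schedv1alpha1.PodGroup {
	return &schedv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
		},
		Spec: schedv1alpha1.PodGroupSpec{MinMember: 1},
	}
}
```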
Problem: I noticed in testing that the time only had granularity down to the second. Solution: it appears that when we create the PodGroup from the reconciler watch, the metadata (beyond name and namespace) does not stick. I am not sure why, but the labels are still retrievable from the pods (via the mutating webhook) afterward. So instead, we need to get the size and creation timestamp at the first hit in reconcile, which (given how that works) should still somewhat honor the order. I did try adding the timestamp to a label, but it got hairy really quickly (kept me up about 3 hours longer than I intended!). The good news is that I now see the microseconds in the Schedule Start Time, so we should be almost ready to test this on a GCP cluster. I also had lots of time waiting for the containers to rebuild, so I made a diagram of how it is currently working. I have some concerns about the internal state of fluxion (my kind cluster stopped working after some hours and I do not know why), but we can address them later. We mostly need to see if there are jobs that are being forgotten, etc. Signed-off-by: vsoch <[email protected]>
bug: the metav1.MicroTime was not being set
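A sketch of the fix named in the commit title, again continuing the reconciler sketch above: stamp a microsecond-precision start time the first time the group is seen in reconcile, since metadata set at creation did not stick. The ScheduleStartTime field name (and its MicroTime type in this fork) is an assumption:

```go
// Inside Reconcile: on the first pass over a new group, record a
// microsecond-precision schedule start time so that groups created within the
// same second still order deterministically.
if pg.Status.ScheduleStartTime.IsZero() {
	pg.Status.ScheduleStartTime = metav1.NowMicro()
	if err := r.Status().Update(ctx, pg); err != nil {
		return ctrl.Result{}, err
	}
}
```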
Problem: the design description did not correspond with the numbers. Solution: fix them up, and also fix some bugs in the controller and fluence that assume we have pods / a pod group (we do not always). Signed-off-by: vsoch <[email protected]>
docs: update to design description
I am making small changes as I test on GKE and EKS. My first tests on GKE had me creating / deleting jobs, and I think the state of fluence (fluxion) got out of sync with the jobs, meaning that fluxion thought jobs were running that were not and then was unable to allocate new ones. To adjust for that we can add back in the cancel response, but this will only work given that fluence has not lost memory of the job id. We likely need an approach that can either save the jobids to the state data (that could be reloaded) or a way to inspect jobs explicitly and purge, OR (better) a way to look up a job not based on the id, but based on the group id (the command in the jobspec). That way, regardless of a jobid, we could lose all of our state and still find the old (stale) job to delete. With a fresh state and larger cluster I am able to run jobs on GKE, but they are enormously slow - lammps size 2 2 2 is taking over 20 minutes. This is not the fault of fluence - GKE networking sucks. To keep debugging I likely need to move over to AWS with EFA, of course that introduces more things to figure out like EFA, etc. Signed-off-by: vsoch <[email protected]>
testing: gke then eks
This PR will likely be open for a few days while I carefully work through needed changes. I want to try and commit in a meaningful way (despite my previous commits being a bit messy I think) and have the diffs here to inspect. In particular, I plan to do the following:
- Docker Builds
  - ghcr.io/flux-framework/fluence-controller
- Miscellaneous
- Resources
  - Services
  - PodGroup
  - Controller
Update: removed testing cases, because the deployment / statefulset etc. cleanup strategies need further thinking and discussion (unlikely cases, so not a priority).
This is a WIP, will change the title if/when ready for review, and add points as I work on / find them. The criteria (for me) for this to be ready will be resolving the clogging use case, and consistently working for the larger test case on GCP.