-
Notifications
You must be signed in to change notification settings - Fork 727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Adding validation webhook for v2 trainjob #2307
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
892a40b
to
f1a06c4
Compare
ce983eb
to
736a759
Compare
Pull Request Test Coverage Report for Build 11784298214Details
💛 - Coveralls |
f85da83
to
ba32e68
Compare
ba32e68
to
20136ef
Compare
20136ef
to
0aa9ee0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking this, and moving this forward.
And Sorry for the delay.
Namespace: new.Namespace, | ||
Name: new.Spec.RuntimeRef.Name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you ever seen the isseus when we use the old object names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we get new
object here and not old
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am validating updated object instead of the existing one
@@ -140,3 +143,115 @@ func (j *JobSet) ReconcilerBuilders() []runtime.ReconcilerBuilder { | |||
}, | |||
} | |||
} | |||
|
|||
func (j *JobSet) Validate(oldObj, newObj *kubeflowv2.TrainJob, runtimeInfo *runtime.Info) (admission.Warnings, field.ErrorList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that there are some conflicts between @andreyvelich PR and this.
@akshaychitneni Could you consult with @andreyvelich, then which PRs should we merge into the main, first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebased with @andreyvelich's changes
@@ -31,7 +31,7 @@ func Setup(mgr ctrl.Manager, runtimes map[string]runtime.Runtime) (string, error | |||
return kubeflowv2.TrainingRuntimeKind, err | |||
} | |||
if err := setupWebhookForTrainJob(mgr, runtimes); err != nil { | |||
return "TrainJob", err | |||
return kubeflowv2.TrainJobKind, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! This is what I imagined architechture in my KubeflowJobPipeline framework design phase.
failedCtrlName, err := controllerv2.SetupControllers(mgr, runtimes) | ||
gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred(), "controller", failedCtrlName) | ||
gomega.ExpectWithOffset(1, failedCtrlName).To(gomega.BeEmpty()) | ||
if startControllers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you ever seen any issues like null pointer when we start the controllers for webhook testing, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have seen but we might not need to start the controllers just to validate create/update requests and leave to reconciler tests to cover reconciliation
0aa9ee0
to
1b675c5
Compare
0e12bb4
to
a3ea261
Compare
f4d1430
to
a93ffb7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this effort @akshaychitneni!
I left initial comments.
// JobExporter is the Job name for the exporter. | ||
JobExporter string = "exporter" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can we implement the validation for exporter in the future once we design it as part of: #2245 ?
We should discuss whether we want to use sidecar container or another ReplicatedJob for model checkpointing.
cc @saileshd1402 @akshaychitneni @tenzen-y
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akshaychitneni Please can you remove the values from your PR that we will not use for now (e.g. JobExporter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
return r.framework.RunComponentBuilderPlugins(ctx, jobSetTemplate.DeepCopy(), info, trainJob) | ||
} | ||
|
||
func (r *TrainingRuntime) runtimeInfo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be part of Runtime interface: https://github.com/kubeflow/training-operator/blob/a93ffb7125c3899519058ba43fa1d5419b498e85/pkg/runtime.v2/interface.go#L32
And should we name this API more explicit (e.g. getRuntimeInfo()
or initializeRuntimeInfo()
) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be part of trainingRuntime as it depends on config from trainJob/trainingRuntume resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but the Info object will be used for every runtime that we register with our manager.
What is the main motivation to create this helper function to construct the Info object for the TrainingRuntime ?
Namespace: new.Namespace, | ||
Name: new.Spec.RuntimeRef.Name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we get new
object here and not old
?
numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode") | ||
if runtimeInfo.MLPolicy.MPI != nil { | ||
if _, err := strconv.Atoi(*newJobObj.Spec.Trainer.NumProcPerNode); err != nil { | ||
allErrs = append(allErrs, field.Invalid(numProcPerNodePath, newJobObj.Spec.Trainer.NumProcPerNode, "should have an int value")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, is this value compatible with the k8s API conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md ?
numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode") | ||
if runtimeInfo.RuntimePolicy.MLPolicy.Torch != nil && newObj.Spec.Trainer.NumProcPerNode != nil { | ||
allowedStringValList := []string{"auto", "cpu", "gpu"} | ||
numProcPerNode := *newObj.Spec.Trainer.NumProcPerNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akshaychitneni @tenzen-y Can't we use CEL for that validation since we just validate values for .nProcPerNode
parameter ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have included CEL validation for this path in trainingRuntimes #2313 but CEL can't be added here for trainJob config as it is requires referenced trainingRuntime config to validate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see. Do we mean that in TrainJob NumProcPerNode can be different depends on the runtimeRef ?
E.g. for MPI we accept only int
values, but for Torch we accept auto
, cpu
, gpu
, and int
values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this already reloved, right? In that case, let's rely on the CEL validation.
return nil, nil | ||
} | ||
|
||
if newObj.Spec.ModelConfig != nil && newObj.Spec.ModelConfig.Input != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, for now we should check the initContainers in JobSet, as I mentioned here: https://github.com/kubeflow/training-operator/blob/master/pkg/runtime.v2/framework/plugins/jobset/builder.go#L87-L89
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am checking the initContainers here https://github.com/kubeflow/training-operator/pull/2307/files#diff-935da6e0f990201db2f6ddf15c768526f70993d5a2408814013e96e3fedd5ebfR165. The condition here is only to check presence to initializer job if input modelconfig or dataset config is present in the trainJob
gomega.Expect(k8sClient.DeleteAllOf(ctx, &kubeflowv2.TrainJob{}, client.InNamespace(ns.Name))).To(gomega.Succeed()) | ||
}) | ||
|
||
ginkgo.When("Creating TrainJob", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @akshaychitneni What is right way to test our validations with integration or unit tests ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think integration tests might be helpful in this case as functioning of trainjob webhook relies on dependent objects like trainingRuntime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are useful. Basically, we add UTs for all testing cases including all edge case to UTs so that we can easily identify root cause under any problems. The integration tests have objectives to verify if the entire webhook mechanism works correct.
if we rely only on integration (or E2E) tests, it's challenging to identify the root cause and debug.
a93ffb7
to
241c4f1
Compare
fixing runtime Signed-off-by: Akshay Chitneni <[email protected]>
241c4f1
to
cb8c6c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left very initial comments. I will revisit here after UTs are implemented.
// JobExporter is the Job name for the exporter. | ||
JobExporter string = "exporter" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
@@ -20,6 +20,7 @@ import ( | |||
"context" | |||
"errors" | |||
"fmt" | |||
"k8s.io/utils/ptr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to second group.
func RuntimeRefToGroupKind(runtimeRef kubeflowv2.RuntimeRef) schema.GroupKind { | ||
return schema.GroupKind{ | ||
Group: ptr.Deref(runtimeRef.APIGroup, ""), | ||
Kind: ptr.Deref(runtimeRef.Kind, ""), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this function to https://github.com/kubeflow/trainer/blob/3f3a8d341e3a9244107591b0916260d14b9a0a79/pkg/runtime/core/registry.go?
It would be better to avoid util package as much as possible: https://go.dev/blog/package-names#bad-package-names
func RuntimeRefToGroupKind(runtimeRef kubeflowv2.RuntimeRef) schema.GroupKind { | ||
return schema.GroupKind{ | ||
Group: ptr.Deref(runtimeRef.APIGroup, ""), | ||
Kind: ptr.Deref(runtimeRef.Kind, ""), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func RuntimeRefToGroupKind(runtimeRef kubeflowv2.RuntimeRef) schema.GroupKind { | |
return schema.GroupKind{ | |
Group: ptr.Deref(runtimeRef.APIGroup, ""), | |
Kind: ptr.Deref(runtimeRef.Kind, ""), | |
} | |
} | |
func RuntimeRefToRuntimeRegistryKey(runtimeRef kubeflowv2.RuntimeRef) string { | |
return schema.GroupKind{ | |
Group: ptr.Deref(runtimeRef.APIGroup, ""), | |
Kind: ptr.Deref(runtimeRef.Kind, ""), | |
}.String() | |
} |
Additionally, could we make more specific helper since this objective is for runtime registry?
numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode") | ||
if runtimeInfo.RuntimePolicy.MLPolicy.Torch != nil && newObj.Spec.Trainer.NumProcPerNode != nil { | ||
allowedStringValList := []string{"auto", "cpu", "gpu"} | ||
numProcPerNode := *newObj.Spec.Trainer.NumProcPerNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this already reloved, right? In that case, let's rely on the CEL validation.
"errors" | ||
kubeflowv2 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v2alpha1" | ||
"k8s.io/apimachinery/pkg/runtime/schema" | ||
"k8s.io/utils/ptr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"errors" | |
kubeflowv2 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v2alpha1" | |
"k8s.io/apimachinery/pkg/runtime/schema" | |
"k8s.io/utils/ptr" | |
"errors" | |
"k8s.io/apimachinery/pkg/runtime/schema" | |
"k8s.io/utils/ptr" | |
trainer "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v2alpha1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add unit testings?
jobSetTemplate := jobsetv1alpha2.JobSet{ | ||
Spec: trainingRuntime.Spec.Template.Spec, | ||
} | ||
return r.framework.RunCustomValidationPlugins(jobSetTemplate.DeepCopy(), info, old, new) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm prefer current approach.
Ideally, combined all information to runtimeInfo, then use that each plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add UTs?
gomega.Expect(k8sClient.DeleteAllOf(ctx, &kubeflowv2.TrainJob{}, client.InNamespace(ns.Name))).To(gomega.Succeed()) | ||
}) | ||
|
||
ginkgo.When("Creating TrainJob", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are useful. Basically, we add UTs for all testing cases including all edge case to UTs so that we can easily identify root cause under any problems. The integration tests have objectives to verify if the entire webhook mechanism works correct.
if we rely only on integration (or E2E) tests, it's challenging to identify the root cause and debug.
Adds validation webhook for v2 trainjob.
Relates to #2209
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes # #2209
Checklist: