KEP-2170: Adding validation webhook for v2 trainjob #2307

akshaychitneni · 2024-10-24T21:34:20Z

Adds validation webhook for v2 trainjob.
Relates to #2209

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2209

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2024-10-24T21:34:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

akshaychitneni · 2024-10-24T21:35:28Z

cc @tenzen-y @andreyvelich

coveralls · 2024-10-25T17:12:16Z

Pull Request Test Coverage Report for Build 11784298214

Details

6 of 6 (100.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 11758410179:	0.0%
Covered Lines:	78
Relevant Lines:	78

💛 - Coveralls

tenzen-y

Thank you for taking this, and moving this forward.
And Sorry for the delay.

pkg/controller.v2/trainjob_controller.go

tenzen-y · 2024-10-28T19:06:14Z

pkg/runtime.v2/core/clustertrainingruntime.go

+		Namespace: new.Namespace,
+		Name:      new.Spec.RuntimeRef.Name,


Have you seen any issues when we use the old object's names?

Why do we get new object here and not old ?

Here I am validating updated object instead of the existing one
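The reply above can be illustrated with a minimal sketch: on an update, the runtime is looked up from the incoming (new) object, since that is the state that would be stored if the request were admitted. The types `TrainJob` and `RuntimeRef`, the `knownRuntimes` table, and `validateUpdate` are simplified stand-ins made up for this illustration, not the code from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// RuntimeRef and TrainJob are simplified stand-ins for the real API types.
type RuntimeRef struct{ Name string }

type TrainJob struct {
	Namespace  string
	RuntimeRef RuntimeRef
}

// knownRuntimes is a hypothetical lookup table keyed by "namespace/name".
var knownRuntimes = map[string]bool{"default/torch-distributed": true}

// validateUpdate checks the updated object, because that is the state that
// would be persisted if the request were admitted. oldJob is accepted only
// to mirror a webhook-style signature; this sketch does not inspect it.
func validateUpdate(oldJob, newJob *TrainJob) error {
	if newJob == nil {
		return errors.New("new object must not be nil")
	}
	key := newJob.Namespace + "/" + newJob.RuntimeRef.Name
	if !knownRuntimes[key] {
		return fmt.Errorf("runtime %q not found", key)
	}
	return nil
}

func main() {
	oldJob := &TrainJob{Namespace: "default", RuntimeRef: RuntimeRef{Name: "torch-distributed"}}
	newJob := &TrainJob{Namespace: "default", RuntimeRef: RuntimeRef{Name: "missing"}}
	fmt.Println(validateUpdate(oldJob, oldJob))
	fmt.Println(validateUpdate(oldJob, newJob))
}
```

Validating only the old object would let an update that points at a nonexistent runtime pass, which is the case this sketch rejects.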

pkg/runtime.v2/core/trainingruntime.go

pkg/runtime.v2/framework/plugins/jobset/jobset.go

tenzen-y · 2024-10-28T19:12:31Z

pkg/runtime.v2/framework/plugins/jobset/jobset.go

@@ -140,3 +143,115 @@ func (j *JobSet) ReconcilerBuilders() []runtime.ReconcilerBuilder {
 		},
 	}
 }
+
+func (j *JobSet) Validate(oldObj, newObj *kubeflowv2.TrainJob, runtimeInfo *runtime.Info) (admission.Warnings, field.ErrorList) {


It seems that there are some conflicts between @andreyvelich PR and this.
@akshaychitneni Could you consult with @andreyvelich, then which PRs should we merge into the main, first.

I rebased with @andreyvelich's changes

tenzen-y · 2024-10-28T19:20:28Z

pkg/webhook.v2/setup.go

@@ -31,7 +31,7 @@ func Setup(mgr ctrl.Manager, runtimes map[string]runtime.Runtime) (string, error
 		return kubeflowv2.TrainingRuntimeKind, err
 	}
 	if err := setupWebhookForTrainJob(mgr, runtimes); err != nil {
-		return "TrainJob", err
+		return kubeflowv2.TrainJobKind, err


Nice catch.

pkg/webhook.v2/trainjob_webhook.go

tenzen-y · 2024-10-28T19:24:35Z

pkg/webhook.v2/trainjob_webhook.go

Cool! This is the architecture I imagined in my KubeflowJobPipeline framework design phase.

tenzen-y · 2024-10-28T19:25:43Z

test/integration/framework/framework.go

-	failedCtrlName, err := controllerv2.SetupControllers(mgr, runtimes)
-	gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred(), "controller", failedCtrlName)
-	gomega.ExpectWithOffset(1, failedCtrlName).To(gomega.BeEmpty())
+	if startControllers {


Have you seen any issues, such as nil pointer errors, when we start the controllers for webhook testing?

I think I have, but we may not need to start the controllers just to validate create/update requests; we can leave reconciliation to the reconciler tests.

andreyvelich

Thank you for this effort @akshaychitneni!
I left initial comments.

andreyvelich · 2024-11-07T15:30:39Z

pkg/constants/constants.go

+	// JobExporter is the Job name for the exporter.
+	JobExporter string = "exporter"
+


Please can we implement the validation for exporter in the future once we design it as part of: #2245 ?
We should discuss whether we want to use sidecar container or another ReplicatedJob for model checkpointing.
cc @saileshd1402 @akshaychitneni @tenzen-y

Ack. Makes sense

@akshaychitneni Please can you remove the values from your PR that we will not use for now (e.g. JobExporter).

pkg/constants/constants.go

pkg/util.v2/runtime/runtime.go

andreyvelich · 2024-11-07T15:38:13Z

pkg/runtime.v2/core/trainingruntime.go

+	return r.framework.RunComponentBuilderPlugins(ctx, jobSetTemplate.DeepCopy(), info, trainJob)
+}
+
+func (r *TrainingRuntime) runtimeInfo(


Should this be part of Runtime interface: https://github.com/kubeflow/training-operator/blob/a93ffb7125c3899519058ba43fa1d5419b498e85/pkg/runtime.v2/interface.go#L32
And should we name this API more explicitly (e.g. getRuntimeInfo() or initializeRuntimeInfo())?

I think it should be part of trainingRuntime, as it depends on config from the trainJob/trainingRuntime resources.

Yeah, but the Info object will be used for every runtime that we register with our manager.
What is the main motivation to create this helper function to construct the Info object for the TrainingRuntime ?

andreyvelich · 2024-11-07T15:40:00Z

pkg/runtime.v2/core/clustertrainingruntime.go

+		Namespace: new.Namespace,
+		Name:      new.Spec.RuntimeRef.Name,


Why do we get new object here and not old ?

andreyvelich · 2024-11-07T15:56:26Z

pkg/runtime.v2/framework/plugins/mpi/mpi.go

+		numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode")
+		if runtimeInfo.MLPolicy.MPI != nil {
+			if _, err := strconv.Atoi(*newJobObj.Spec.Trainer.NumProcPerNode); err != nil {
+				allErrs = append(allErrs, field.Invalid(numProcPerNodePath, newJobObj.Spec.Trainer.NumProcPerNode, "should have an int value"))


I think so, is this value compatible with the k8s API conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md ?

andreyvelich · 2024-11-07T15:59:26Z

pkg/runtime.v2/framework/plugins/torch/torch.go

+		numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode")
+		if runtimeInfo.RuntimePolicy.MLPolicy.Torch != nil && newObj.Spec.Trainer.NumProcPerNode != nil {
+			allowedStringValList := []string{"auto", "cpu", "gpu"}
+			numProcPerNode := *newObj.Spec.Trainer.NumProcPerNode


@akshaychitneni @tenzen-y Can't we use CEL for that validation since we just validate values for .nProcPerNode parameter ?

I have included CEL validation for this path in trainingRuntimes in #2313, but CEL can't be added here for the trainJob config because it requires the referenced trainingRuntime config to validate.

Hmm, I see. Do we mean that NumProcPerNode in TrainJob can differ depending on the runtimeRef?
E.g. for MPI we accept only int values, but for Torch we accept auto, cpu, gpu, and int values.

This is already resolved, right? In that case, let's rely on the CEL validation.
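The runtime-dependent rule discussed in this thread (MPI accepts only int values; Torch accepts auto, cpu, gpu, and int values) can be sketched as below. This is only an illustration: `validateNumProcPerNode` and the string `runtimeKind` argument are hypothetical, whereas the code in the PR inspects the runtime's MLPolicy.

```go
package main

import (
	"fmt"
	"strconv"
)

// validateNumProcPerNode is a hypothetical helper. runtimeKind is "mpi" or
// "torch" in this sketch; the real plugins look at the runtime's MLPolicy.
func validateNumProcPerNode(runtimeKind, value string) error {
	_, convErr := strconv.Atoi(value)
	isInt := convErr == nil

	switch runtimeKind {
	case "mpi":
		// MPI accepts only integer values.
		if !isInt {
			return fmt.Errorf("numProcPerNode %q: should have an int value", value)
		}
	case "torch":
		// Torch additionally accepts a fixed set of keywords.
		switch value {
		case "auto", "cpu", "gpu":
			return nil
		}
		if !isInt {
			return fmt.Errorf("numProcPerNode %q: should be auto, cpu, gpu, or an int value", value)
		}
	default:
		return fmt.Errorf("unknown runtime kind %q", runtimeKind)
	}
	return nil
}

func main() {
	for _, tc := range []struct{ kind, value string }{
		{"mpi", "4"}, {"mpi", "auto"}, {"torch", "gpu"}, {"torch", "many"},
	} {
		fmt.Printf("%s %q -> %v\n", tc.kind, tc.value, validateNumProcPerNode(tc.kind, tc.value))
	}
}
```

Because the accepted values depend on which runtime the TrainJob references, a check of this shape needs the runtime at validation time, which is the reason given above for not expressing it as CEL on the TrainJob alone.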

pkg/runtime.v2/framework/plugins/torch/torch.go

andreyvelich · 2024-11-07T16:07:01Z

pkg/runtime.v2/framework/plugins/jobset/jobset.go

+		return nil, nil
+	}
+
+	if newObj.Spec.ModelConfig != nil && newObj.Spec.ModelConfig.Input != nil {


I think, for now we should check the initContainers in JobSet, as I mentioned here: https://github.com/kubeflow/training-operator/blob/master/pkg/runtime.v2/framework/plugins/jobset/builder.go#L87-L89

I am checking the initContainers here: https://github.com/kubeflow/training-operator/pull/2307/files#diff-935da6e0f990201db2f6ddf15c768526f70993d5a2408814013e96e3fedd5ebfR165. The condition here only checks for the presence of the initializer job when an input modelconfig or dataset config is present in the trainJob.
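The presence check described in the reply above might look roughly like this. It is a sketch under assumptions: the `TrainJobSpec` fields, the job name `"initializer"`, and `validateInitializer` are placeholders chosen for illustration, not identifiers confirmed by the PR.

```go
package main

import "fmt"

// TrainJobSpec is a simplified stand-in; the boolean fields are assumptions
// that summarize "model input config is set" and "dataset config is set".
type TrainJobSpec struct {
	HasModelInput    bool
	HasDatasetConfig bool
}

// initializerJobName is an assumed name for the initializer replicated job.
const initializerJobName = "initializer"

// validateInitializer returns an error when the TrainJob requests model or
// dataset initialization but the runtime template has no initializer job.
func validateInitializer(spec TrainJobSpec, replicatedJobNames []string) error {
	if !spec.HasModelInput && !spec.HasDatasetConfig {
		return nil // nothing to initialize, so no initializer job is required
	}
	for _, name := range replicatedJobNames {
		if name == initializerJobName {
			return nil
		}
	}
	return fmt.Errorf("trainJob sets model or dataset config, but the runtime has no %q job", initializerJobName)
}

func main() {
	spec := TrainJobSpec{HasModelInput: true}
	fmt.Println(validateInitializer(spec, []string{"trainer-node"}))
	fmt.Println(validateInitializer(spec, []string{"initializer", "trainer-node"}))
}
```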

andreyvelich · 2024-11-07T16:09:46Z

test/integration/webhook.v2/trainjob_test.go

+		gomega.Expect(k8sClient.DeleteAllOf(ctx, &kubeflowv2.TrainJob{}, client.InNamespace(ns.Name))).To(gomega.Succeed())
+	})
+
+	ginkgo.When("Creating TrainJob", func() {


@tenzen-y @akshaychitneni What is the right way to test our validations: integration tests or unit tests?

I think integration tests might be helpful in this case, since the trainjob webhook relies on dependent objects like trainingRuntime.

Both are useful. Basically, we add UTs covering all test cases, including every edge case, so that we can easily identify the root cause of any problem. The integration tests aim to verify that the entire webhook mechanism works correctly.

If we rely only on integration (or E2E) tests, it's challenging to identify the root cause and debug.

fixing runtime Signed-off-by: Akshay Chitneni <[email protected]>

tenzen-y

I left some initial comments. I will revisit this after the UTs are implemented.

tenzen-y · 2025-02-14T18:27:10Z

pkg/constants/constants.go

+	// JobExporter is the Job name for the exporter.
+	JobExporter string = "exporter"
+


tenzen-y · 2025-02-14T18:27:26Z

pkg/controller.v2/trainjob_controller.go

@@ -20,6 +20,7 @@ import (
 	"context"
 	"errors"
 	"fmt"
+	"k8s.io/utils/ptr"


Move this to second group.

tenzen-y · 2025-02-14T18:30:56Z

pkg/runtime.v2/util/runtime.go

+func RuntimeRefToGroupKind(runtimeRef kubeflowv2.RuntimeRef) schema.GroupKind {
+	return schema.GroupKind{
+		Group: ptr.Deref(runtimeRef.APIGroup, ""),
+		Kind:  ptr.Deref(runtimeRef.Kind, ""),
+	}
+}


Could we move this function to https://github.com/kubeflow/trainer/blob/3f3a8d341e3a9244107591b0916260d14b9a0a79/pkg/runtime/core/registry.go?

It would be better to avoid util package as much as possible: https://go.dev/blog/package-names#bad-package-names

tenzen-y · 2025-02-14T18:32:19Z

pkg/runtime.v2/util/runtime.go

+func RuntimeRefToGroupKind(runtimeRef kubeflowv2.RuntimeRef) schema.GroupKind {
+	return schema.GroupKind{
+		Group: ptr.Deref(runtimeRef.APIGroup, ""),
+		Kind:  ptr.Deref(runtimeRef.Kind, ""),
+	}
+}


Suggested change

-func RuntimeRefToGroupKind(runtimeRef kubeflowv2.RuntimeRef) schema.GroupKind {
-	return schema.GroupKind{
-		Group: ptr.Deref(runtimeRef.APIGroup, ""),
-		Kind:  ptr.Deref(runtimeRef.Kind, ""),
-	}
-}
+func RuntimeRefToRuntimeRegistryKey(runtimeRef kubeflowv2.RuntimeRef) string {
+	return schema.GroupKind{
+		Group: ptr.Deref(runtimeRef.APIGroup, ""),
+		Kind:  ptr.Deref(runtimeRef.Kind, ""),
+	}.String()
+}

Additionally, could we make this a more specific helper, since its purpose is the runtime registry?
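To show what a registry key produced by the suggested helper would look like, here is a dependency-free sketch. `GroupKind` and `deref` are local stand-ins for `schema.GroupKind` (k8s.io/apimachinery) and `ptr.Deref` (k8s.io/utils/ptr); the `String` method is written to mirror the upstream "Kind.Group" formatting as I understand it, and the group and kind values in `main` are illustrative only.

```go
package main

import "fmt"

// GroupKind is a local stand-in for schema.GroupKind.
type GroupKind struct{ Group, Kind string }

// String yields "Kind.Group", or just "Kind" when the group is empty.
func (gk GroupKind) String() string {
	if gk.Group == "" {
		return gk.Kind
	}
	return gk.Kind + "." + gk.Group
}

// RuntimeRef is a simplified stand-in for the TrainJob runtime reference.
type RuntimeRef struct {
	APIGroup *string
	Kind     *string
	Name     string
}

// deref is a local stand-in for ptr.Deref: it returns *p, or def if p is nil.
func deref[T any](p *T, def T) T {
	if p != nil {
		return *p
	}
	return def
}

// runtimeRefToRuntimeRegistryKey builds the string key used to look up a
// runtime implementation in the registry.
func runtimeRefToRuntimeRegistryKey(ref RuntimeRef) string {
	return GroupKind{
		Group: deref(ref.APIGroup, ""),
		Kind:  deref(ref.Kind, ""),
	}.String()
}

func main() {
	group, kind := "kubeflow.org", "ClusterTrainingRuntime"
	ref := RuntimeRef{APIGroup: &group, Kind: &kind, Name: "torch-distributed"}
	fmt.Println(runtimeRefToRuntimeRegistryKey(ref))
}
```

Returning the string key directly keeps callers from having to know how the registry formats its keys.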

tenzen-y · 2025-02-14T18:41:15Z

pkg/runtime.v2/framework/plugins/torch/torch.go

+		numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode")
+		if runtimeInfo.RuntimePolicy.MLPolicy.Torch != nil && newObj.Spec.Trainer.NumProcPerNode != nil {
+			allowedStringValList := []string{"auto", "cpu", "gpu"}
+			numProcPerNode := *newObj.Spec.Trainer.NumProcPerNode


This is already resolved, right? In that case, let's rely on the CEL validation.

tenzen-y · 2025-02-14T18:42:53Z

pkg/runtime.v2/util/runtime.go

+	"errors"
+	kubeflowv2 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v2alpha1"
+	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/utils/ptr"


Suggested change

-	"errors"
-	kubeflowv2 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v2alpha1"
-	"k8s.io/apimachinery/pkg/runtime/schema"
-	"k8s.io/utils/ptr"
+	"errors"
+	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/utils/ptr"
+	trainer "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v2alpha1"

tenzen-y · 2025-02-14T18:43:19Z

pkg/runtime.v2/framework/plugins/torch/torch.go

Could you add unit tests?

tenzen-y · 2025-02-14T18:57:19Z

pkg/runtime.v2/core/trainingruntime.go

+	jobSetTemplate := jobsetv1alpha2.JobSet{
+		Spec: trainingRuntime.Spec.Template.Spec,
+	}
+	return r.framework.RunCustomValidationPlugins(jobSetTemplate.DeepCopy(), info, old, new)


I prefer the current approach.
Ideally, we would combine all information into runtimeInfo and then use it in each plugin.
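The idea of collecting everything into one info object and handing it to each plugin can be sketched as follows. All names here (`Info`, `ValidationPlugin`, `runCustomValidationPlugins`, and the two example plugins with their rules) are hypothetical simplifications for illustration; the real framework uses `runtime.Info`, `admission.Warnings`, and `field.ErrorList`.

```go
package main

import (
	"errors"
	"fmt"
)

// Info is a stand-in for the runtime info object: it carries everything the
// plugins need, so no plugin has to fetch runtime data on its own.
type Info struct {
	MLPolicy       string
	ReplicatedJobs []string
}

// TrainJob is a simplified stand-in for the API type.
type TrainJob struct {
	NumProcPerNode string
}

// ValidationPlugin is a hypothetical plugin interface.
type ValidationPlugin interface {
	Name() string
	Validate(info *Info, oldJob, newJob *TrainJob) (warnings []string, errs []error)
}

type torchPlugin struct{}

func (torchPlugin) Name() string { return "Torch" }

// Validate emits a warning (an invented rule) when Torch is used without
// numProcPerNode.
func (torchPlugin) Validate(info *Info, _, newJob *TrainJob) ([]string, []error) {
	if info.MLPolicy == "torch" && newJob.NumProcPerNode == "" {
		return []string{"numProcPerNode is not set"}, nil
	}
	return nil, nil
}

type jobSetPlugin struct{}

func (jobSetPlugin) Name() string { return "JobSet" }

// Validate rejects (an invented rule) a runtime template with no jobs.
func (jobSetPlugin) Validate(info *Info, _, _ *TrainJob) ([]string, []error) {
	if len(info.ReplicatedJobs) == 0 {
		return nil, []error{errors.New("runtime template has no replicated jobs")}
	}
	return nil, nil
}

// runCustomValidationPlugins runs every plugin against the same Info and
// aggregates the results, prefixing warnings with the plugin name.
func runCustomValidationPlugins(plugins []ValidationPlugin, info *Info, oldJob, newJob *TrainJob) ([]string, []error) {
	var warnings []string
	var errs []error
	for _, p := range plugins {
		w, e := p.Validate(info, oldJob, newJob)
		for _, msg := range w {
			warnings = append(warnings, p.Name()+": "+msg)
		}
		errs = append(errs, e...)
	}
	return warnings, errs
}

func main() {
	plugins := []ValidationPlugin{torchPlugin{}, jobSetPlugin{}}
	warnings, errs := runCustomValidationPlugins(plugins, &Info{MLPolicy: "torch"}, nil, &TrainJob{})
	fmt.Println(warnings, errs)
}
```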

tenzen-y · 2025-02-14T18:58:08Z

pkg/runtime.v2/framework/plugins/jobset/jobset.go

Could you add UTs?

tenzen-y · 2025-02-14T19:00:51Z

test/integration/webhook.v2/trainjob_test.go

+		gomega.Expect(k8sClient.DeleteAllOf(ctx, &kubeflowv2.TrainJob{}, client.InNamespace(ns.Name))).To(gomega.Succeed())
+	})
+
+	ginkgo.When("Creating TrainJob", func() {


Both are useful. Basically, we add UTs covering all test cases, including every edge case, so that we can easily identify the root cause of any problem. The integration tests aim to verify that the entire webhook mechanism works correctly.

If we rely only on integration (or E2E) tests, it's challenging to identify the root cause and debug.

google-oss-prow bot requested a review from jinchihe October 24, 2024 21:34

google-oss-prow bot requested a review from kuizhiqing October 24, 2024 21:34

google-oss-prow bot added the size/XL label Oct 24, 2024

akshaychitneni force-pushed the webhookv2 branch 5 times, most recently from 892a40b to f1a06c4 Compare October 25, 2024 16:36

google-oss-prow bot added size/L and removed size/XL labels Oct 25, 2024

akshaychitneni force-pushed the webhookv2 branch 3 times, most recently from ce983eb to 736a759 Compare October 25, 2024 17:09

akshaychitneni force-pushed the webhookv2 branch 2 times, most recently from f85da83 to ba32e68 Compare October 25, 2024 18:23

google-oss-prow bot added size/XL and removed size/L labels Oct 25, 2024

akshaychitneni force-pushed the webhookv2 branch from ba32e68 to 20136ef Compare October 25, 2024 18:46

google-oss-prow bot added size/L and removed size/XL labels Oct 25, 2024

akshaychitneni force-pushed the webhookv2 branch from 20136ef to 0aa9ee0 Compare October 25, 2024 18:50

tenzen-y reviewed Oct 28, 2024

akshaychitneni force-pushed the webhookv2 branch from 0aa9ee0 to 1b675c5 Compare November 3, 2024 17:52

google-oss-prow bot added size/XL and removed size/L labels Nov 3, 2024

akshaychitneni force-pushed the webhookv2 branch 2 times, most recently from 0e12bb4 to a3ea261 Compare November 4, 2024 22:39

akshaychitneni force-pushed the webhookv2 branch 5 times, most recently from f4d1430 to a93ffb7 Compare November 6, 2024 18:08

andreyvelich reviewed Nov 7, 2024

akshaychitneni force-pushed the webhookv2 branch from a93ffb7 to 241c4f1 Compare November 11, 2024 17:46

Adding v2 trainjob validation webhook

cb8c6c3

fixing runtime Signed-off-by: Akshay Chitneni <[email protected]>

akshaychitneni force-pushed the webhookv2 branch from 241c4f1 to cb8c6c3 Compare November 11, 2024 18:37

tenzen-y reviewed Feb 14, 2025

tenzen-y mentioned this pull request Feb 15, 2025

KEP-2170: Implement validations for TrainingRuntime and ClusterTrainingRuntime #2219

Open
