feat(sdk/backend): Add support for placeholders in resource limits #11501

Merged: 1 commit into kubeflow:master from the fix-param-resources branch on Jan 9, 2025

Conversation

@mprahl (Contributor) commented on Jan 6, 2025

Description of your changes:

The API introduced new fields prefixed with Resource (e.g. ResourceCpuLimit) to replace the old fields without the prefix. The Driver hadn't been updated to honor those fields, but the SDK started using them, which led to unexpected behavior.

The Driver now honors both fields but prioritizes the new fields. The SDK now only sets the new fields.

The outcome is that resource limits/requests can now use input parameters.
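
To illustrate, here's a minimal sketch (the component and pipeline names are hypothetical, not from this PR) of wiring input parameters straight into the resource setters:

    from kfp import dsl

    @dsl.component
    def train_model():
        print("training...")

    @dsl.pipeline(name="dynamic-resources")
    def my_pipeline(cpu_limit: str = "4", memory_limit: str = "8Gi"):
        # With this change, these limits compile to input-parameter
        # placeholders in the new Resource-prefixed fields
        # (resourceCpuLimit, resourceMemoryLimit), which the Driver
        # resolves at runtime.
        task = train_model()
        task.set_cpu_limit(cpu_limit)
        task.set_memory_limit(memory_limit)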

Note that pipeline_spec_builder.py was doing some validation on the limits/requests being set, but that's already handled in the user-facing method (e.g. set_cpu_limit).

Resolves:
#11500


@mprahl force-pushed the fix-param-resources branch 2 times, most recently from 9a90c6a to c0343b9, on January 6, 2025 at 22:26
@mprahl (Contributor, Author) commented on Jan 6, 2025

/cc @HumairAK @rimolive

Note that I labeled this as a feature, but it's also a bug fix, as mentioned in this Slack thread:
https://cloud-native.slack.com/archives/C073N7BMLB1/p1735746170303549

Additionally, it seems that this recent unreleased commit added placeholder support to the wrong accelerator type field, so it'd be nice to merge this PR before the next KFP release:
#11404

@google-oss-prow bot requested review from HumairAK and rimolive on January 6, 2025 at 22:27
@mprahl force-pushed the fix-param-resources branch from c0343b9 to 4d49c1d on January 6, 2025 at 22:36
@HumairAK added this to the KFP 2.4.0 milestone on Jan 8, 2025
@tomzx commented on Jan 8, 2025

I've tested the SDK portion, as we're eagerly awaiting the ability to set CPU/RAM/accelerator type and count dynamically in our projects.
I can confirm that it works: it passes compilation and can be submitted to GCP Vertex AI, for example.
The Vertex AI pipeline does use the dynamically defined CPU/RAM/accelerator type and count to find the proper machine and to attach accelerators if needed.

I also see the following in the compiled pipeline configuration for a step using those dynamic properties:

        resources:
          accelerator:
            resourceCount: '{{$.inputs.parameters[''pipelinechannel--some-step-accelerator_count'']}}'
            resourceType: '{{$.inputs.parameters[''pipelinechannel--some-step-accelerator_type'']}}'
          resourceCpuLimit: '{{$.inputs.parameters[''pipelinechannel--some-step-cpu'']}}'
          resourceMemoryLimit: '{{$.inputs.parameters[''pipelinechannel--some-step-ram'']}}'
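
For context (not part of the PR itself; component and parameter names below are hypothetical, and the accelerator defaults are just examples), SDK calls along these lines would compile to placeholders like those shown above:

    from kfp import dsl

    @dsl.component
    def some_step():
        pass

    @dsl.pipeline(name="dynamic-accelerators")
    def my_pipeline(
        accelerator_type: str = "NVIDIA_TESLA_T4",
        accelerator_count: str = "1",
        cpu: str = "8",
        ram: str = "32G",
    ):
        task = some_step()
        # Each setter accepts a pipeline input parameter; the compiler
        # emits it as a '{{$.inputs.parameters[...]}}' placeholder like
        # the YAML above.
        task.set_accelerator_type(accelerator_type)
        task.set_accelerator_limit(accelerator_count)
        task.set_cpu_limit(cpu)
        task.set_memory_limit(ram)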

Review threads on backend/src/v2/driver/driver.go: all resolved (two marked outdated).
@HumairAK (Collaborator) commented on Jan 9, 2025

Note that this also resolves: #11375

Looks like a safe change. Folks on pre-2.4 and post-2.4 KFP will have continued support for the old fields, even when submitting pipelines rendered with pre-2.10 SDKs, so removing them from the pipeline render makes sense. Recompiling will drop these old fields, which does seem API-changing (the API is implicit here in the YAML even when not defined in the proto), but it looks backwards compatible since we'll continue to handle them on the Driver side. Since the initial change completely removed them from the Vertex side, I don't think this is likely to have downstream effects there, but regardless, FYI @chensun.
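
As a simplified sketch of the fallback order described above (hypothetical field names; the real logic is Go code in backend/src/v2/driver/driver.go):

    def effective_cpu_limit(resources: dict):
        """Prefer the new Resource-prefixed field; fall back to the
        deprecated one so pipelines compiled by older SDKs keep working."""
        # New field: a string that may contain an input-parameter placeholder.
        new = resources.get("resourceCpuLimit")
        if new:
            return new
        # Deprecated field: numeric, set by pre-2.10 SDK renders.
        old = resources.get("cpuLimit")
        if old is not None:
            return str(old)
        return None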

I tested this with deprecated and new values, and with parameter values, which are correctly resolved. That's great; thanks @mprahl!

Some nit comments above, @mprahl; up to you whether to address them. Let me know, otherwise I can approve.

/lgtm

@google-oss-prow bot added the lgtm label on Jan 9, 2025
@mprahl force-pushed the fix-param-resources branch from 4d49c1d to 5abf3ea on January 9, 2025 at 14:14
@google-oss-prow bot removed the lgtm label on Jan 9, 2025
@mprahl requested a review from HumairAK on January 9, 2025 at 14:15
@HumairAK (Collaborator) commented on Jan 9, 2025

/lgtm
/approve

Thanks!

@mprahl force-pushed the fix-param-resources branch from 5abf3ea to d282374 on January 9, 2025 at 14:54
@google-oss-prow bot removed the lgtm label on Jan 9, 2025
@HumairAK (Collaborator) commented on Jan 9, 2025

/lgtm
/approve

@google-oss-prow bot added the lgtm label on Jan 9, 2025
@google-oss-prow bot commented on Jan 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot merged commit 7c931ae into kubeflow:master on Jan 9, 2025
58 checks passed