Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(backend): move comp logic to workflow params #10979
feat(backend): move comp logic to workflow params #10979
Changes from 4 commits
bde8025
15aa780
65ce19f
f962ef3
4c513eb
68d4c0b
a56df16
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not every component has a visible function name. Why not just use the original name as-is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, Chen!
So this PR tackles two separate problems:
This PR moves the component logic from the metadata annotations (which have a 256KB limit) to the spec (which has a much higher, configurable limit defined by the etcd request size limit). This in it of itself solves a major regression in v2.
In addition, this PR deduplicates component logic in the compiled AWF manifest. This duplication is a long-standing issue that was carried over from v1 and was reported as early as 2020. Since the IR compiler appends an incrementing index to duplicate tasks, and the IR -> AWF compiler doesn't properly account for this, the underlying component logic gets duplicated in full in the compiled AWF manifest for each instance. This is kind of like copying a function in full every time you want to invoke it. In a sense, we're kind of using the function name as a key in a set instead of adding each task's logic to a list. Since components can get quite large, with all this duplicate logic, it's easy to hit even the etcd requests size limit. Several of our internal model development teams have run into this exact problem and have been forced to partition large pipelines or minify compiled manifests (i.e. strip verbose annotations and comments), but minification only works in v1 where compilation happens client-side.
If the
--function_to_execute
is not specified in the executor, thenextractFunctionName
just returns the original (indexed) component name, i.e. "the original name as-is." So that's precisely what happens when the function name is not specified.At a bare minimum, this PR circumvents the 256KB limit, but it also substantially reduces the size of the compiled manifest (when possible) thanks to deduplication (we saw a 70% file size reduction in our tests).
Apologies if I'm over-explaining anything!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly as this function checks multiple cases, but doesn't guarantee any improvement over the original, so It feels to me an over-engineering that doesn't help much. I would suggest we remove it and use the original name as-is for simplicity, if the goal it to guarantee a short length and/or unique name, then we should use some hashing method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning the original name as is would guarantee redundancy in the compiled manifest. Hashing is an interesting approach! We could hash the
IR.deploymentSpec.executors.<executor>.container.command.
and use that as the key. One concern with that approach is that it significantly reduces the human readability of the AWF manifest. Also, could hashing hundreds of large components increase AWF compilation time?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: Under what conditions does a component not include
--function_to_execute
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For hashing, I was thinking hash the entire ComponentSpec (and possibly the expanded ExecutorSpec when applicable). That would guarantee uniqueness. I don't think it would add noticable AWF compilation time, but admit I didn't do any experiment.
Exactly what I was planning to explain. :)
The issue with the current
extraFunctionName implementation
is that it only makes a difference for maybe 50%-60% of the time. IMO, it's not worth the added complexity and it may even hurts AWF readability--if we value that--one needs to look into this implementation to figure out where the name comes from.I can list a number of examples where the current approach fails to improve over original name as-is:
pipelines/sdk/python/test_data/pipelines/pipeline_with_loops_and_conditions.yaml
Lines 64 to 84 in e128bdb
pipelines/sdk/python/test_data/pipelines/pipeline_in_pipeline_complex.yaml
Lines 62 to 80 in e128bdb
--function_to_execute
, and may comes with very generic entry point likemain
,run
, etc. And large enterprise may sometimes build a shared container with a single entrypoint that takes arguments to execute different component function. e.g:pipelines/sdk/python/test_data/components/container_io.yaml
Lines 20 to 28 in e128bdb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my suggestion for this PR is to simply remove the
extractFunctionName
part and use original component name as-is. I think the major concern on the size limit has been solved by moving the spec from annotation to argument. If further size reduction is needed, we can then look into a hashing solution that would work for 100%.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for the thorough response and additional context, @chensun. We really appreciate it.
We POCed the hash-based approach that you suggested before we saw your comment and pushed it up since the work was done anyway. It works like a charm!
We tested against the three edge cases that you posted. Hashing successfully deduplicates all of the corresponding component logic. It's true that duplicate conditionals, DAGs, and nested pipelines are not deduplicated per say, but we confirmed that their underlying shared components, which are what typically inflate the manifest file size beyond the 1.5 MB etcd limit, are in fact deduplicated.
For example, pipeline_in_pipeline_complex.yaml references
print_op1
three times: once at the root ofmy_pipeline
and twice in thegraph_component
sub_pipeline (since it's parallelized).We confirmed that the compiled AWF manifest consolidates these various tasks under the shared key,
implementations-83cf3e248270dfedd9adf4317961878cb8ebdf352ee8121750c35bfee5d408ae
, and therefore only stores the logic for the corresponding function once.We think the change may be small enough in scope and lines of code and the underlying problem significant enough to warrant keeping it in this PR. That being said, we don't mind tackling the hash approach in a discrete PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the
component_io.yaml
example slightly to demonstrate deduplication (with hashing) for container components that don't list the function name:The resulting AWF manifest looks like this:
The tasks that share logic share the same hash in the
Spec.Arguments.Parameters
so the component logic is not duplicated.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With respect to the pipelines_with_loops_and_conditions.yaml edge case you mentioned: it has 10 instances of the
print_text
component.Thanks to the deduplication (through hashing), it stores just one copy of the component logic!
print_text
is a small component (though there's plenty of fluff around even one line components in the form of pip installs and what not), but it's not uncommon to parallelize massive components; that 10x increase in storage consumption makes reaching the etcdmax-request-bytes
threshold inevitable for our more complex models.