This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Fixed TaskNodeInfo not being assigned properly for dynamic parent nodes #501

Merged

Conversation

MorpheusXAUT
Contributor

TL;DR

dynamicNodeTaskNodeHandler now properly modifies the transition to propagate TaskNodeInfo for dynamic parent nodes.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

While updating the transition of a dynamic sub-node, the ExecutionInfo (in particular the TaskNodeInfo containing metadata about the cached outputs) was not set properly on the handler.Transition; the assignment/return was missing. This led to flyteadmin not recognising the cached output for dynamic task parent nodes.
Additionally, the previous implementation dropped the OutputInfo already present on the transition.

Tests were adapted in a minimal fashion to verify the expected TaskNodeMetadata is returned on completion of sub-nodes.
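
To make the intent of the fix concrete, here is a minimal, self-contained Go sketch of the pattern described above. The types and method names below are simplified stand-ins for the flytepropeller handler types, not the real definitions; the point is only that the rebuilt ExecutionInfo must carry both the new TaskNodeInfo and the OutputInfo already present, and that the updated transition must actually be assigned back/returned.

package main

import "fmt"

// Simplified stand-in types; NOT flytepropeller's actual handler definitions,
// they only mirror the shape of the bug described above.
type TaskNodeInfo struct{ CacheStatus string }
type OutputInfo struct{ OutputURI string }

type ExecutionInfo struct {
	OutputInfo   *OutputInfo
	TaskNodeInfo *TaskNodeInfo
}

// Transition is returned by a node handler; WithInfo returns a modified copy,
// so forgetting to assign the result silently drops the change.
type Transition struct{ info ExecutionInfo }

func (t Transition) WithInfo(ei ExecutionInfo) Transition { return Transition{info: ei} }
func (t Transition) Info() ExecutionInfo                  { return t.info }

func main() {
	// A transition that already carries the sub-workflow's OutputInfo.
	trns := Transition{info: ExecutionInfo{OutputInfo: &OutputInfo{OutputURI: "s3://bucket/outputs.pb"}}}

	// Buggy pattern: the rebuilt transition is never assigned back, and the
	// existing OutputInfo is not copied over, so flyteadmin never receives
	// the cache metadata for the dynamic parent node.
	trns.WithInfo(ExecutionInfo{TaskNodeInfo: &TaskNodeInfo{CacheStatus: "CACHE_POPULATED"}})

	// Fixed pattern: keep the OutputInfo already present, add the new
	// TaskNodeInfo, and assign the result back so it is actually propagated.
	trns = trns.WithInfo(ExecutionInfo{
		OutputInfo:   trns.Info().OutputInfo,
		TaskNodeInfo: &TaskNodeInfo{CacheStatus: "CACHE_POPULATED"},
	})

	fmt.Printf("outputs kept: %t, cache metadata propagated: %t\n",
		trns.Info().OutputInfo != nil, trns.Info().TaskNodeInfo != nil)
}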

Tracking Issue

fixes flyteorg/flyte#3096

Follow-up issue

NA

@codecov

codecov bot commented Nov 23, 2022

Codecov Report

Merging #501 (26c0a17) into master (4114572) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@hamersaw
Contributor

@MorpheusXAUT thanks for the fix here! Two questions:
(1) any idea why there are still subnodes listed for the second dynamic run (i.e. cache hit)? If the dynamic task doesn't run, then it shouldn't report a subworkflow back to FlyteAdmin. At least, I don't think it should.
(2) I tried testing this locally and for some reason it's not working (i.e. on the first run the dynamic task still does not show CACHE_POPULATED). The first run still shows "cache disabled", which this should fix:
[screenshot: first-run]
The second run works as expected:
[screenshot: second-run]

My test workflow is

from flytekit import Resources, dynamic, task, workflow

@task(cache=True, cache_version="1.0", requests=Resources(cpu="1", mem="400Mi"), limits=Resources(cpu="2", mem="600Mi"))
def cached_square(n: int) -> int:
    return n * n

@dynamic(cache=True, cache_version="1.0", requests=Resources(cpu="1", mem="400Mi"), limits=Resources(cpu="2", mem="600Mi"))
def cached_dynamic_parallel_square(a: int, count: int) -> int:
    # Fan out `count` cached sub-tasks; their results are intentionally
    # discarded since only the caching behaviour is under test.
    for i in range(count):
        cached_square(n=a)
    return a

@workflow
def cached_dynamic_parallel_square_wf(a: int, count: int) -> int:
    return cached_dynamic_parallel_square(a=a, count=count)

@MorpheusXAUT
Contributor Author

MorpheusXAUT commented Nov 23, 2022

(1) any idea why there are still subnodes listed for the second dynamic run (i.e. cache hit)? If the dynamic task doesn't run, then it shouldn't report a subworkflow back to FlyteAdmin. At least, I don't think it should.

I was wondering the same, but didn't have time to investigate that yet; I can continue to do so tomorrow. I thought this might've been intentional, but I agree, it doesn't really make sense to show them, at least not in an unknown state.

(2) I tried testing this locally and for some reason it's not working (i.e. on the first run the dynamic task still does not show CACHE_POPULATED).

Odd, it worked as intended for all my test runs, but I will check out your workflow tomorrow to see if there's something I missed; thanks for the example.
It's also extra strange that your second run doesn't have any sub-tasks listed at all...

Will update this when I know more!

@MorpheusXAUT
Contributor Author

@hamersaw Strangely enough, your test workflow seems to be working as expected for me 🤔

[screenshot]
The first execution updated its parent node status accordingly once all child executions finished successfully.

[screenshot]
The second execution shows the parent node as retrieved from cache; however, I still have the unknown child nodes displayed...

Running locally as a single binary on a k3d cluster, using the latest commit of the flyte repository (f534819) with flytepropeller replaced by this commit (replace github.com/flyteorg/flytepropeller => github.com/blackshark-ai/flytepropeller v0.16.48-0.20221123132352-14a1b00391c3).
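
For anyone wanting to reproduce this setup: a replace directive like the one above goes into the go.mod of the module being built (here, the flyte single-binary repository). A minimal sketch; the module path and go version below are assumptions, and only the replace line itself is taken from this comment:

// go.mod of the consuming repository (module path assumed; the existing
// require entry for flytepropeller is omitted here).
module github.com/flyteorg/flyte

go 1.19 // illustrative

// Point the flytepropeller dependency at the fork containing this fix.
replace github.com/flyteorg/flytepropeller => github.com/blackshark-ai/flytepropeller v0.16.48-0.20221123132352-14a1b00391c3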

I'll see if I can reproduce your behavior while I continue working on the cache eviction (as that includes re-running a lot of workflows during testing anyway), but I'm not sure why we're seeing different outcomes here.
I'll try to look into the unknown child tasks as well; however, that's currently a lower priority, at least until I can get the first version of cache eviction done and ready as a draft.

hamersaw previously approved these changes Nov 28, 2022
Contributor

@hamersaw left a comment


Tested again and everything works great! Not sure what I was doing the first time; thanks for looking into it a second (unnecessary) time!

I just submitted a PR to update the boilerplate, which will increase the timeout on end2end tests. This seems to be causing the failure right now. Once we get that figured out, let's merge!

@hamersaw merged commit 46439c6 into flyteorg:master on Dec 5, 2022
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
Fixed TaskNodeInfo not being assigned properly for dynamic parent nodes (flyteorg#501)

Signed-off-by: Nick Müller <[email protected]>
Development

Successfully merging this pull request may close these issues.

[BUG] Cache status of dynamic (parent) tasks is not updated properly