This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Fixed TaskNodeInfo not being assigned properly for dynamic parent nodes #501

Merged

Conversation

MorpheusXAUT
Contributor

TL;DR

dynamicNodeTaskNodeHandler now properly modifies the transition to propagate TaskNodeInfo for dynamic parent nodes.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

While updating the transition of a dynamic sub-node, the ExecutionInfo (in particular the TaskNodeInfo containing metadata about the cached outputs) was not set properly on the handler.Transition; the assignment/return was missing. This led to flyteadmin not recognising the cached output for dynamic task parent nodes.
Additionally, the previous implementation dropped the OutputInfo already present on the transition.

Tests were adapted in a minimal fashion to verify the expected TaskNodeMetadata is returned on completion of sub-nodes.
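
To make the intent of the fix concrete, here is a minimal, self-contained Go sketch of the pattern described above. The types and method names below are simplified stand-ins for the flytepropeller handler types, not the real definitions; the point is only that the rebuilt ExecutionInfo must carry both the new TaskNodeInfo and the OutputInfo already present, and that the updated transition must actually be assigned back/returned.

package main

import "fmt"

// Simplified stand-in types; NOT flytepropeller's actual handler definitions,
// they only mirror the shape of the bug described above.
type TaskNodeInfo struct{ CacheStatus string }
type OutputInfo struct{ OutputURI string }

type ExecutionInfo struct {
	OutputInfo   *OutputInfo
	TaskNodeInfo *TaskNodeInfo
}

// Transition is returned by a node handler; WithInfo returns a modified copy,
// so forgetting to assign the result silently drops the change.
type Transition struct{ info ExecutionInfo }

func (t Transition) WithInfo(ei ExecutionInfo) Transition { return Transition{info: ei} }
func (t Transition) Info() ExecutionInfo                  { return t.info }

func main() {
	// A transition that already carries the sub-workflow's OutputInfo.
	trns := Transition{info: ExecutionInfo{OutputInfo: &OutputInfo{OutputURI: "s3://bucket/outputs.pb"}}}

	// Buggy pattern: the rebuilt transition is never assigned back, and the
	// existing OutputInfo is not copied over, so flyteadmin never receives
	// the cache metadata for the dynamic parent node.
	trns.WithInfo(ExecutionInfo{TaskNodeInfo: &TaskNodeInfo{CacheStatus: "CACHE_POPULATED"}})

	// Fixed pattern: keep the OutputInfo already present, add the new
	// TaskNodeInfo, and assign the result back so it is actually propagated.
	trns = trns.WithInfo(ExecutionInfo{
		OutputInfo:   trns.Info().OutputInfo,
		TaskNodeInfo: &TaskNodeInfo{CacheStatus: "CACHE_POPULATED"},
	})

	fmt.Printf("outputs kept: %t, cache metadata propagated: %t\n",
		trns.Info().OutputInfo != nil, trns.Info().TaskNodeInfo != nil)
}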

Tracking Issue

fixes flyteorg/flyte#3096

Follow-up issue

NA

@codecov

codecov bot commented Nov 23, 2022

Codecov Report

Merging #501 (26c0a17) into master (4114572) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@hamersaw
Contributor

@MorpheusXAUT thanks for the fix here! Two questions:
(1) any idea why there are still subnodes listed for the second dynamic run (i.e. cache hit)? If the dynamic task doesn't run, then it shouldn't report a subworkflow back to FlyteAdmin. At least, I don't think it should.
(2) I tried testing this locally and for some reason it's not working (i.e. on the first run the dynamic task still does not show CACHE_POPULATED). The first run still shows "cache disabled", which this should fix:
[screenshot: first-run]
The second run works as expected:
[screenshot: second-run]

My test workflow is

from flytekit import Resources, dynamic, task, workflow

@task(cache=True, cache_version="1.0", requests=Resources(cpu="1", mem="400Mi"), limits=Resources(cpu="2", mem="600Mi"))
def cached_square(n: int) -> int:
    return n * n

@dynamic(cache=True, cache_version="1.0", requests=Resources(cpu="1", mem="400Mi"), limits=Resources(cpu="2", mem="600Mi"))
def cached_dynamic_parallel_square(a: int, count: int) -> int:
    # Fan out `count` cached sub-tasks; their results are intentionally
    # discarded since only the caching behaviour is under test.
    for i in range(count):
        cached_square(n=a)
    return a

@workflow
def cached_dynamic_parallel_square_wf(a: int, count: int) -> int:
    return cached_dynamic_parallel_square(a=a, count=count)

@MorpheusXAUT
Contributor Author

MorpheusXAUT commented Nov 23, 2022

(1) any idea why there are still subnodes listed for the second dynamic run (i.e. cache hit)? If the dynamic task doesn't run, then it shouldn't report a subworkflow back to FlyteAdmin. At least, I don't think it should.

I was wondering the same, but didn't have time to investigate that yet; I can continue to do so tomorrow. I thought this might've been intentional, but I agree, it doesn't really make sense to show them, at least not in an unknown state.

(2) I tried testing this locally and for some reason it's not working (i.e. on the first run the dynamic task still does not show CACHE_POPULATED).

Odd, it worked as intended for all my test runs, but I will check out your workflow tomorrow to see if there's something I missed; thanks for the example.
It's also extra strange that your second run doesn't have any sub-tasks listed at all...

Will update this when I know more!

@MorpheusXAUT
Contributor Author

@hamersaw Strangely enough, your test workflow seems to be working as expected for me 🤔

[screenshot]
The first execution updated its parent node status accordingly once all child executions finished successfully.

[screenshot]
The second execution shows the parent node as retrieved from cache; however, I still have the unknown child nodes displayed...

Running locally as a single binary on a k3d cluster, using the latest commit of the flyte repository (f534819) with flytepropeller replaced by this commit (replace github.com/flyteorg/flytepropeller => github.com/blackshark-ai/flytepropeller v0.16.48-0.20221123132352-14a1b00391c3).
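
For anyone wanting to reproduce this setup: a replace directive like the one above goes into the go.mod of the module being built (here, the flyte single-binary repository). A minimal sketch; the module path and go version below are assumptions, and only the replace line itself is taken from this comment:

// go.mod of the consuming repository (module path assumed; the existing
// require entry for flytepropeller is omitted here).
module github.com/flyteorg/flyte

go 1.19 // illustrative

// Point the flytepropeller dependency at the fork containing this fix.
replace github.com/flyteorg/flytepropeller => github.com/blackshark-ai/flytepropeller v0.16.48-0.20221123132352-14a1b00391c3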

I'll see if I can reproduce your behavior while I continue working on the cache eviction (as that includes re-running a lot of workflows during testing anyway), but I'm not sure why we're seeing different outcomes here.
I'll try to look into the unknown child tasks as well; however, that's currently a lower priority, at least until I can get the first version of cache eviction done and ready as a draft.

hamersaw previously approved these changes Nov 28, 2022
Contributor

@hamersaw left a comment


Tested again and everything works great! Not sure what I was doing the first time; thanks for looking into it a second (unnecessary) time!

I just submitted a PR to update the boilerplate, which will increase the timeout on end2end tests. This seems to be causing the failure right now. Once we get that figured out, let's merge!

@hamersaw merged commit 46439c6 into flyteorg:master on Dec 5, 2022
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
Fixed TaskNodeInfo not being assigned properly for dynamic parent nodes (flyteorg#501)

Signed-off-by: Nick Müller <[email protected]>
Development

Successfully merging this pull request may close these issues.

[BUG] Cache status of dynamic (parent) tasks is not updated properly