exp: show failed runs with --temp #10616

Open
gregstarr opened this issue Nov 8, 2024 · 13 comments
Labels
A: experiments (Related to dvc exp), feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint)

Comments

@gregstarr

Bug Report

Description

Experiments run with --temp that fail appear to go missing. It sounds like this was an issue previously and was fixed; however, I am still having this problem. I am using a recent version of DVC, and the previous issue was from 2 years ago.

#8612

Reproduce

  1. dvc exp run --temp
  2. dvc exp show properly lists experiment
  3. first stage success
  4. second stage fail
  5. dvc exp show doesn't show the failed experiment

Expected

  1. dvc exp show shows the failed experiment
  2. if for some reason it shouldn't show the failed experiment, then note this in the docs, because it seems like a fairly significant behavior difference between running with the queue and with --temp

Environment information

Output of dvc doctor:

DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.11.10 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 3.16.6
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.8
Supports:
        http (aiohttp = 3.10.8, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.8, aiohttp-retry = 2.8.3)
Config:
        Global: /home/starrgw1/.config/dvc
        System: /etc/xdg/dvc
@gregstarr
Author

gregstarr commented Nov 8, 2024

Here is the experiment that failed that I can't find, this is the JSON from the tmp/exps/run directory:

{
  "git_url": "path/.dvc/tmp/exps/standalone/tmpqkp9v9u1",
  "baseline_rev": "ce7991feda5aabbfa18d9541c5d90191850b7b09",
  "location": "tempdir",
  "root_dir": "path/.dvc/tmp/exps/standalone/tmpqkp9v9u1",
  "dvc_dir": ".dvc",
  "name": "mean_depth_norm",
  "wdir": ".",
  "result_hash": null,
  "result_ref": null,
  "result_force": false,
  "status": 1
}
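In case it helps anyone else hunting for a lost run, here is a minimal sketch that walks a directory and lists every JSON file that looks like the one above but has no recorded result. The directory path (`.dvc/tmp/exps/run`) and the idea that a null `result_hash` means the run did not finish successfully are assumptions based only on the JSON in this comment, not on documented DVC behavior. The function name is hypothetical.

```python
import json
from pathlib import Path


def find_runs_without_result(run_dir):
    """Return (path, info) pairs for JSON files under run_dir whose
    "result_hash" key is present and null.

    Assumption: a null result_hash marks a run that did not produce a
    result, as in the JSON shown above. Files that are not valid JSON
    objects are skipped silently.
    """
    matches = []
    for path in sorted(Path(run_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            info = json.loads(path.read_text())
        except (OSError, ValueError):
            # Unreadable, binary, or non-JSON file: not an info file.
            continue
        if isinstance(info, dict) and "result_hash" in info and info["result_hash"] is None:
            matches.append((path, info))
    return matches


if __name__ == "__main__":
    # Assumed location, taken from the comment above.
    for path, info in find_runs_without_result(".dvc/tmp/exps/run"):
        print(path, info.get("name"), info.get("status"))
```

The script only reads files, so it should be safe to run against a real repository. It does not try to interpret the `status` value.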

@gregstarr
Author

Wow I can git cat-file the sha from the name of the json file:

> tree a55df7458541c3aeecc816840830de454d66472d
parent ce7991feda5aabbfa18d9541c5d90191850b7b09
parent 1e3c6c8e652abbaf9fd1f47a63cdeadb4e8b5697
author starrgw1 email 1730829745 +0000
committer starrgw1 email 1730829745 +0000

dvc-exp:ce7991feda5aabbfa18d9541c5d90191850b7b09:ce7991feda5aabbfa18d9541c5d90191850b7b09:mean_depth_norm

And in fact I can apply the commit through either git or DVC. However, it doesn't look like I can get back the stage 1 results? dvc.lock seems unchanged, so nothing gets checked out.
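The lookup above can be scripted. This is a rough sketch, assuming the commit message always carries a line of the form `dvc-exp:<field>:<field>:<name>` as in the output shown here. I only know that the first field matches the first parent and that the last field matches the experiment name, so the middle field is left unnamed. `parse_exp_marker` and `read_commit` are hypothetical helper names; the only external call is `git cat-file -p`.

```python
import subprocess

MARKER = "dvc-exp:"


def parse_exp_marker(commit_text):
    """Extract the fields of the first "dvc-exp:" line in a commit object.

    Returns a dict, or None when no such line is found. The field names
    are guesses based on the single example in this thread.
    """
    for line in commit_text.splitlines():
        line = line.strip()
        if line.startswith(MARKER):
            fields = line[len(MARKER):].split(":", 2)
            if len(fields) == 3:
                return {
                    "baseline": fields[0],
                    "second_field": fields[1],  # meaning not confirmed
                    "name": fields[2],
                }
    return None


def read_commit(sha, repo_dir="."):
    """Return the raw commit object text via `git cat-file -p <sha>`."""
    result = subprocess.run(
        ["git", "cat-file", "-p", sha],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Usage would be something like `parse_exp_marker(read_commit(sha))`, where `sha` is taken from the name of the JSON file.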

@shcheklein
Member

@gregstarr could you share a script / project to reproduce this?

@shcheklein shcheklein added A: experiments Related to dvc exp triage Needs to be triaged labels Nov 8, 2024
@gregstarr
Author

gregstarr commented Nov 9, 2024

Here is a minimal example: https://github.com/gregstarr/minimum-dvc

  1. clone
  2. install a recent version of dvc
  3. dvc exp run --temp --name "exp1"
  4. dvc exp show -A

The third command should fail in the third stage, and the fourth command then does not show the experiment "exp1". I would expect dvc exp show -A to show the failed experiment. I would also expect the output of stage 1, which finished successfully, to be in the cache.

@shcheklein
Member

Thanks @gregstarr, from what I see it was indeed discussed here #8612 (comment). I see that we completely drop the directory with failed --temp experiments and we don't collect them in show. @pmrowla do you remember by chance: is it just not-yet-implemented functionality, or was it by design for some reason (e.g. we drop directories to save space and thus we don't show them since users won't be able to restore / apply them)?

@pmrowla
Contributor

pmrowla commented Nov 16, 2024

IIRC this is expected behavior for --temp runs. Failed experiments are only collected when run via the queue. The failure state for a queued experiment run is tied to the celery/dvc-task job for that run, but --temp runs don't touch celery at all.

In the codebase you can see that queue runs have an implementation for collect_failed_data

def collect_failed_data(

but workspace runs do not (and --temp runs extend the workspace implementation)

def collect_failed_data(
    self,
    baseline_revs: Optional[Collection[str]],
    **kwargs,
) -> dict[str, list["ExpRange"]]:
    raise NotImplementedError

@shcheklein
Member

@pmrowla yep, thanks. But what was the reason for this? (if you remember :) )

@pmrowla
Contributor

pmrowla commented Nov 16, 2024

It's leftover from --temp predating the queue. Workspace runs have never had any kind of saved execution state (other than the git commit on success), and --temp was originally just the extension of "do exactly what we do in a workspace run except in an isolated directory".

@shcheklein shcheklein added feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint and removed triage Needs to be triaged labels Nov 16, 2024
@shcheklein shcheklein changed the title exp: failed runs with --temp go missing exp: show failed runs with --temp Nov 16, 2024
@gregstarr
Author

I have a shell script which calls dvc exp run --temp, which is nice because the command blocks until the pipeline completes. How would I replicate this behavior using the queue so that I can see when experiments fail and examine the failed state?

@shcheklein
Member

@gregstarr how about analyzing the results of the dvc queue status command?

(I wonder if we should just do dvc queue wait or something)

@gregstarr
Author

So have Python or Bash call dvc queue status periodically and check whether the queue is empty?

@shcheklein
Member

yep, something like that
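A rough sketch of that polling loop, for reference. It shells out to `dvc queue status` and keeps waiting while any line of the output contains a task in an active state. The status labels `Queued` and `Running` and the whitespace-separated table layout are assumptions about the command's text output, so check them against your DVC version. `wait_for_queue` and its parameters are hypothetical names, not part of DVC.

```python
import subprocess
import time

# Assumed labels in the Status column of `dvc queue status`.
ACTIVE_STATES = ("Queued", "Running")


def queue_status_text():
    """Return the text printed by `dvc queue status`."""
    result = subprocess.run(
        ["dvc", "queue", "status"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_active_tasks(status_text):
    """True if any whitespace-separated token on any line is an active state."""
    return any(
        state in line.split()
        for line in status_text.splitlines()
        for state in ACTIVE_STATES
    )


def wait_for_queue(get_status=queue_status_text, interval=5.0, timeout=None, sleep=time.sleep):
    """Block until no task is queued or running.

    Returns True once the queue is idle, or False if `timeout` seconds
    pass first. `get_status` and `sleep` are injectable for testing.
    """
    start = time.monotonic()
    while has_active_tasks(get_status()):
        if timeout is not None and time.monotonic() - start >= timeout:
            return False
        sleep(interval)
    return True
```

A shell script could then queue the run with `dvc exp run --queue`, start a worker with `dvc queue start`, call `wait_for_queue()`, and inspect `dvc queue status` or `dvc queue logs` afterwards to see which tasks failed.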

@gregstarr
Author

OK I think I will just continue to use --temp. Even though failures aren't captured, I like the fact that it is simple and output isn't captured by any intermediary i.e. the queue.

Is this issue an easy fix or pretty complicated? If it's going to be a while, what do you think about adding a note in the docs mentioning the difference in behavior when using --temp?

dvc queue wait sounds great!
