snakemake jobs failing due to missing output files which do exist #123

Closed
kelly-sovacool opened this issue Oct 21, 2024 · 16 comments · Fixed by #130

@kelly-sovacool
Member

@kopardev found jobs will sometimes fail spontaneously and work on the re-run. It seems to be a file latency issue?

kelly-sovacool added the bug label on Oct 22, 2024

@kelly-sovacool
Member Author

Wilfried ran into this issue too: the snakemake rules succeeded, but the overall slurm job failed.

kelly-sovacool self-assigned this on Oct 22, 2024

@kelly-sovacool
Member Author

retries is already set to 2 for both local and slurm mode 🤔

CHARLIE/charlie, line 489 in d8f9cf0:

--retries 2 \

CHARLIE/charlie, line 531 in d8f9cf0:

--retries 2 \

it seems snakemake is not honoring it?

@kopardev
Member

@kelly-sovacool ... can you point me to the output folder.. I am looking for the jobinfo.short file.

@kelly-sovacool
Member Author

kelly-sovacool commented Oct 23, 2024

@kopardev

@kelly-sovacool ... can you point me to the output folder.. I am looking for the jobinfo.short file.

Wilfried's is here:
/data/CCBR/charlie_test_wil/charlie/

The jobby short file is at /data/charlie_test_wil/charlie/logs/snakemake.log.jobby.short

@kopardev
Member

  • I have access to /data/CCBR/charlie_test_wil/charlie folder .. but there is no jobby related file there ... why? @kelly-sovacool
  • I do not have access to /data/charlie_test_wil/charlie/logs/snakemake.log.jobby.short @wilfriedguiblet

@kelly-sovacool
Member Author

kelly-sovacool commented Oct 23, 2024

  • I have access to /data/CCBR/charlie_test_wil/charlie folder .. but there is no jobby related file there ... why? @kelly-sovacool

@kopardev charlie writes the jobby files in logs/

@kelly-sovacool
Member Author

kelly-sovacool commented Oct 23, 2024

Getting back to the retries / file latency issue:

Here's another output dir where I only ran charlie once and did not manually resubmit it: /data/CCBR/projects/techDev/charlie_test_rel-7/

grep FAIL logs/snakemake.log.jobby

star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044706.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044709.star_circrnafinder.sample=GI1_N.err
merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048346.merge_alignment_stats..err
merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048355.merge_alignment_stats..err
create_hq_bams.sample=GI1_N     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048998.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049001.create_hq_bams.sample=GI1_T.err
create_hq_bams.sample=GI1_N     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049044.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049045.create_hq_bams.sample=GI1_T.err
create_hq_bams.sample=GI1_N     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049057.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049058.create_hq_bams.sample=GI1_T.err

It looks like it is correctly resubmitting failed jobs with --retries 2.
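
To check the retry behaviour across a whole run rather than by eye, the grep output above can be grouped by job name. This is a minimal Python sketch, assuming each jobby record has the three whitespace-separated columns shown above (job name, state, path to the .err file); the function names are hypothetical and the paths in the sample are shortened placeholders. With `--retries 2`, a job gets at most three attempts in total.

```python
from collections import defaultdict


def summarize_attempts(lines):
    """Group jobby-style records by job name, keeping states in log order."""
    attempts = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, state, _err_path = fields
        attempts[name].append(state)
    return dict(attempts)


def exhausted_retries(attempts, retries=2):
    """Return job names that never completed within 1 + retries attempts."""
    return sorted(
        name
        for name, states in attempts.items()
        if "COMPLETED" not in states and len(states) >= retries + 1
    )


# Sample records in the same shape as the log; paths are placeholders.
sample = [
    "star_circrnafinder.sample=GI1_N FAILED  /logs/1.err",
    "star_circrnafinder.sample=GI1_N FAILED  /logs/2.err",
    "star_circrnafinder.sample=GI1_N COMPLETED       /logs/3.err",
    "create_hq_bams.sample=GI1_N     FAILED  /logs/4.err",
    "create_hq_bams.sample=GI1_N     FAILED  /logs/5.err",
    "create_hq_bams.sample=GI1_N     FAILED  /logs/6.err",
]
summary = summarize_attempts(sample)
print(summary)
print(exhausted_retries(summary))
```

Jobs that fail and then complete indicate a transient problem, while jobs that use up every attempt point to a persistent one.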

But rules seem to be failing due to missing output files on the first attempt even though they do exist.

star_circrnafinder

Error message for attempt 1:

Waiting at most 120 seconds for missing files.
MissingOutputException in rule star_circrnafinder in file /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/rules/align.smk, line 438:
Job 0 completed successfully, but some output files are missing. Missing files after 120 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.sam
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.junction
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.SJ.out.tab

Error message for attempt 2:

FATAL INPUT error, could not open input file with junctions from the 1st pass=GI1_N._STARpass1//SJ.out.tab

It completed successfully on the 3rd attempt:

grep star_circrnafinder.sample=GI1_N logs/snakemake.log.jobby

star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044706.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044709.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N COMPLETED       /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39046188.star_circrnafinder.sample=GI1_N.err
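
For context on the "Waiting at most 120 seconds for missing files" message in attempt 1, a bounded wait of this kind can be sketched as a polling loop. This is only an illustration of the general idea, not Snakemake's actual implementation; `wait_for_files` and its parameters are hypothetical names.

```python
import os
import time


def wait_for_files(paths, latency_wait=120.0, poll_interval=1.0):
    """Poll until every path exists or latency_wait seconds have elapsed.

    Returns the list of paths that are still missing (empty on success).
    """
    deadline = time.monotonic() + latency_wait
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(poll_interval)
        # Only re-check the paths that were missing on the previous pass.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

In this sketch the loop returns as soon as all files are visible, so a larger limit only costs extra time when files are genuinely absent or slow to appear.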

merge_alignment_stats

Error message for attempt 1:

paste: /data/CCBR/projects/techDev/charlie_test_rel-7/results/alignmentstats.txt: No such file or directory

Error message for attempt 2:

cp: cannot create regular file '/data/CCBR/projects/techDev/charlie_test_rel-7/results/alignmentstats.txt': File exists

It completed successfully on the 3rd attempt.

grep merge_alignment_stats logs/snakemake.log.jobby

merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048346.merge_alignment_stats..err
merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048355.merge_alignment_stats..err
merge_alignment_stats.  COMPLETED       /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048379.merge_alignment_stats..err
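
The two different errors above (file missing on attempt 1, file already present on attempt 2) suggest that a retry can run into whatever an earlier attempt left behind. One general way to make an output step safe to re-run is to write to a temporary file and rename it into place. The sketch below is an assumption-laden illustration in Python: the rule's actual commands are not shown in this thread, and `write_atomically` is a hypothetical helper, not part of CHARLIE.

```python
import os
import tempfile


def write_atomically(dest, text):
    """Write text to dest via a temp file in the same directory, then rename."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp_")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # os.replace overwrites any file left over from an earlier attempt.
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the rename replaces the destination in one step, running the same step a second time overwrites the previous result instead of failing on it.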

create_hq_bams

All of these jobs failed due to an import error which will be resolved by upgrading the base container to v7 (#125). This is unrelated to the current issue.

grep "ImportError" logs/*create_hq*

logs/39030852.39048998.create_hq_bams.sample=GI1_N.err:    raise ImportError(
logs/39030852.39048998.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049001.create_hq_bams.sample=GI1_T.err:    raise ImportError(
logs/39030852.39049001.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
logs/39030852.39049044.create_hq_bams.sample=GI1_N.err:    raise ImportError(
logs/39030852.39049044.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049045.create_hq_bams.sample=GI1_T.err:    raise ImportError(
logs/39030852.39049045.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
logs/39030852.39049057.create_hq_bams.sample=GI1_N.err:    raise ImportError(
logs/39030852.39049057.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049058.create_hq_bams.sample=GI1_T.err:    raise ImportError(
logs/39030852.39049058.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
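
To separate latency-related failures from unrelated ones like this ImportError, the contents of each .err file can be bucketed by the messages quoted in this thread. A minimal Python sketch follows; the labels and the `classify_log` name are hypothetical, and the pattern list only covers the errors seen above.

```python
# (label, substring) pairs, checked in order; substrings come from the logs above.
PATTERNS = [
    ("missing_output", "MissingOutputException"),
    ("import_error", "ImportError"),
    ("star_input", "FATAL INPUT error"),
]


def classify_log(text):
    """Return the label of the first known pattern found in text, else 'other'."""
    for label, needle in PATTERNS:
        if needle in text:
            return label
    return "other"
```

Reading each .err file and passing its contents to `classify_log` gives a quick count of how many failures are candidates for the file latency explanation.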

kelly-sovacool changed the title from "make sure snakemake --retries is set to 2" to "snakemake jobs failing due to missing output files which do exist" on Oct 23, 2024

@kelly-sovacool
Member Author

kelly-sovacool commented Oct 23, 2024

@kopardev So I think the overall workflow will succeed once we fix the container problem with #125. But for this issue, it would be best if we could figure out how to avoid needing to retry rules multiple times. Should we try increasing latency wait?

@kopardev
Member

@kelly-sovacool:

  • what do you have in mind for --latency-wait? 300?
  • another peculiar observation: if something fails .. it fails twice and succeeds on attempt no. 3... I could not find any rule which failed on attempt no. 1 and succeeded on attempt no. 2 ... Have you?

@kelly-sovacool
Member Author

  • what do you have in mind for --latency-wait? 300?

I hesitate to go too high because that will needlessly delay the overall pipeline run completion. Should we reach out to biowulf staff about this?

  • another peculiar observation: if something fails .. it fails twice and succeeds on attempt no. 3... I could not find any rule which failed on attempt no. 1 and succeeded on attempt no. 2 ... Have you?

I thought this was true, until we encountered #127

@kopardev
Member

kopardev commented Nov 6, 2024

@kelly-sovacool is there a good root-cause for this yet? Else, we move this to Backlog with latency set to 300 and reaching out to Biowulf staff.

@kelly-sovacool
Member Author

@kelly-sovacool is there a good root-cause for this yet? Else, we move this to Backlog with latency set to 300 and reaching out to Biowulf staff.

So far I have not encountered this error again, even with the original --latency-wait 120, despite multiple charlie runs that failed for other reasons (#127, #128).
