Pair detection fails for files containing multiple _1 or _R1 substrings #57

pansapiens · 2021-05-13T01:47:40Z

When running with the commandline options -extn .fq.gz -paired -pairIds _1,_2 and input FASTQs named like R_10_1.fq.gz and R_10_2.fq.gz, RNAsik fails with the error:

Fatal error: /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.4/bin/../opt/rnasik-1.5.4/src/sikFqFiles.bds, line 247, pos 17. -paired set to true, but can't find _2 read. Is it single-end data? Also check your -pairIds _1,_2
Stack trace:
error "-paired set to $paired, but can't find $pai ...  # /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.4/bin/../opt/rnasik-1.5.4/src/sikFqFiles.bds:247
  samplesSheet = makeSamplesSheet( fqFiles,fqRgxs, ...  # /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.4/bin/../opt/rnasik-1.5.4/src/RNAsik.bds:93

I believe this is because the filename is generated incorrectly when it looks for the corresponding _2 file here:
https://github.com/MonashBioinformaticsPlatform/RNAsik-pipe/blob/master/src/sikFqFiles.bds#L242

eg, string chkR2 = fq.replace(pairIdsList[0], pairIdsList[1])

if:

fq = "R_10_1.fq.gz"
pairIdsList = ["_1", "_2"]

then chkR2 is set to "R_20_2.fq.gz" rather than the expected "R_10_2.fq.gz", because replace substitutes every occurrence of _1 in the filename, not just the one that marks the read pair.

There are also corner cases where this could result in mis-pairing of files (e.g., if the incorrectly generated filename "R_20_2.fq.gz" happened to exist).
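To make the failure concrete, here is a minimal Python sketch (not the BDS code from sikFqFiles.bds) that reproduces the same substitution and shows one possible way around it. The function names `naive_mate` and `mate_by_last_occurrence` are hypothetical, and the choice to replace only the last occurrence of the pair ID before the extension is an assumption about what a fix could look like, not what RNAsik does.

```python
def naive_mate(fq, pair_ids):
    """Mimic the reported behaviour: replace every occurrence of the R1 id."""
    return fq.replace(pair_ids[0], pair_ids[1])


def mate_by_last_occurrence(fq, pair_ids, extn):
    """Hypothetical fix: swap only the last R1 id found before the extension."""
    r1_id, r2_id = pair_ids
    if not fq.endswith(extn):
        raise ValueError(f"{fq!r} does not end with extension {extn!r}")
    stem = fq[: -len(extn)]
    # rpartition splits on the LAST occurrence of r1_id in the stem
    head, sep, tail = stem.rpartition(r1_id)
    if not sep:
        raise ValueError(f"{fq!r} does not contain pair id {r1_id!r}")
    return head + r2_id + tail + extn


if __name__ == "__main__":
    ids = ["_1", "_2"]
    print(naive_mate("R_10_1.fq.gz", ids))                         # R_20_2.fq.gz
    print(mate_by_last_occurrence("R_10_1.fq.gz", ids, ".fq.gz"))  # R_10_2.fq.gz
```

Replacing the last occurrence is still only a heuristic: it would pick the wrong position if the pair ID substring also appeared later in the name than the real read marker.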


serine · 2021-05-16T04:51:33Z

Thanks @pansapiens, I understand it is annoying to have to find and fight these bugs.

You are right, this is yet another bug in the sample file guessing and pre-processing.

I'm not sure what I should do here. This is an edge case, although it will most certainly keep coming up. I have been working on a very different approach to sample file parsing and input, and to how args get passed to RNAsik. All args will be passed via a single config file, e.g.

# main configuration

outDir = sikRun
metadata = metadata.tsv
#aligner = star
aligner = bwaMem
#refFiles = supplementary/examples/refFilesDir
#genomeIdx = supplementary/examples/refFilesDir/Mus_musculus.GRCm38.dna_sm.primary_assembly.starIdx
geneModels = ../refFiles/annotation.gff3
fasta = ../refFiles/other.fa
counts = true
#refFiles = refFiles
qc = true
paired = true
mdups = false
featureCounts = -t CDS, -g locus_tag

and the metadata file, a.k.a. the samples sheet, looks like this

name: sample_A_rep1
group: sample_A
r1: raw-data/960_0#22_R1.fastq.gz
r1: raw-data/070_0#22_R1.fastq.gz
r2: raw-data/967_0#22_R2.fastq.gz
r2: raw-data/077_0#22_R2.fastq.gz

name: sample_B_rep1
group: sampleB
r1: raw-data/964_1#28_R1.fastq.gz
r1: raw-data/074_1#28_R1.fastq.gz
r2: raw-data/965_1#28_R2.fastq.gz
r2: raw-data/075_1#28_R2.fastq.gz
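As an illustration of how a sheet in this shape could be read, here is a minimal Python sketch. It is not the RNAsik 2.0.0 parser; the function name `parse_samples_sheet` is hypothetical, and it assumes that records are separated by blank lines, that each line is `key: value`, and that repeated keys such as `r1` and `r2` accumulate in order (for example, one entry per sequencing lane).

```python
from collections import defaultdict


def parse_samples_sheet(text):
    """Parse blank-line-separated 'key: value' records into a list of dicts.

    Every value is stored in a list so that repeated keys (r1, r2) are kept.
    """
    samples = []
    current = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one
            if current:
                samples.append(dict(current))
                current = defaultdict(list)
            continue
        # Split on the first ':' only, so values may contain ':' or '#'
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        current[key.strip()].append(value.strip())
    if current:
        samples.append(dict(current))
    return samples


if __name__ == "__main__":
    sheet = """\
name: sample_A_rep1
group: sample_A
r1: raw-data/960_0#22_R1.fastq.gz
r1: raw-data/070_0#22_R1.fastq.gz
r2: raw-data/967_0#22_R2.fastq.gz
r2: raw-data/077_0#22_R2.fastq.gz
"""
    for sample in parse_samples_sheet(sheet):
        print(sample["name"], sample["r1"], sample["r2"])
```

Because R1 and R2 files are listed explicitly, a reader like this never has to guess mates from filenames, which sidesteps the pair ID substitution problem entirely.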

A CSV (tabular) format would be more familiar to most users and (arguably) easier to generate or export from a spreadsheet. However, I felt that a single line of text per sample is hard to read (on the CLI) and, more importantly, hard to generalise when samples have been split across multiple sequencing lanes.

I think the nf-core design.csv format is nice and simple, but again it is not as extensible if, for example, we want to pass through additional metadata information.

All of this has been implemented in version 2.0.0; here is the bit of code that does the parsing

pansapiens added the bug label May 13, 2021
