Pair detection fails for files containing multiple _1 or _R1 substrings #57

Open
pansapiens opened this issue May 13, 2021 · 1 comment

@pansapiens
Contributor

When running with the command-line options -extn .fq.gz -paired -pairIds _1,_2 and input FASTQs named like R_10_1.fq.gz and R_10_2.fq.gz, RNAsik fails with the error:

Fatal error: /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.4/bin/../opt/rnasik-1.5.4/src/sikFqFiles.bds, line 247, pos 17. -paired set to true, but can't find _2 read. Is it single-end data? Also check your -pairIds _1,_2
Stack trace:
error "-paired set to $paired, but can't find $pai ...  # /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.4/bin/../opt/rnasik-1.5.4/src/sikFqFiles.bds:247
  samplesSheet = makeSamplesSheet( fqFiles,fqRgxs, ...  # /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.4/bin/../opt/rnasik-1.5.4/src/RNAsik.bds:93

I believe this is because when it looks for the corresponding _2 file here, the filename is incorrectly generated:
https://github.com/MonashBioinformaticsPlatform/RNAsik-pipe/blob/master/src/sikFqFiles.bds#L242

eg, string chkR2 = fq.replace(pairIdsList[0], pairIdsList[1])

if:

fq = "R_10_1.fq.gz"
pairIdsList = ["_1", "_2"]

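then we would get chkR2 set to "R_20_2.fq.gz" rather than the expected "R_10_2.fq.gz", because replace substitutes both occurrences of "_1" (the one inside "_10" as well as the pair ID).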

There are also corner cases where this could result in mis-pairing of files (eg, if the incorrectly generated filename "R_20_2.fq.gz" happened to exist).
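To make the failure mode concrete, here is a minimal Python sketch (not BDS, and not the RNAsik code itself). The function names `mate_name_naive` and `mate_name_last` are hypothetical, introduced only for illustration. The first mirrors the replace-everything behaviour described above; the second replaces only the last occurrence of the R1 pair ID, which is one possible workaround rather than a confirmed fix.

```python
def mate_name_naive(fq, pair_ids):
    """Mirror the reported behaviour: every occurrence of the R1 ID is replaced."""
    return fq.replace(pair_ids[0], pair_ids[1])


def mate_name_last(fq, pair_ids):
    """Replace only the last occurrence of the R1 ID (a heuristic, see note below)."""
    r1, r2 = pair_ids
    head, sep, tail = fq.rpartition(r1)
    if not sep:
        # The R1 ID does not occur in the filename at all.
        raise ValueError(f"pair ID {r1!r} not found in {fq!r}")
    return head + r2 + tail


pair_ids = ["_1", "_2"]
print(mate_name_naive("R_10_1.fq.gz", pair_ids))  # R_20_2.fq.gz (wrong mate name)
print(mate_name_last("R_10_1.fq.gz", pair_ids))   # R_10_2.fq.gz
```

Replacing the last occurrence is still only a heuristic: a name such as A_1_10.fq.gz, where the pair ID is followed by another field starting with "_1", would still be rewritten incorrectly. An explicit samples sheet avoids guessing altogether.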

pansapiens added the bug label May 13, 2021
@serine
Collaborator

serine commented May 16, 2021

Thanks @pansapiens, I understand it is annoying having to find and fight those bugs.

You are right, this is yet another bug with sample file guessing and pre-processing.

I'm not sure what I should do here. This is an edge case, although it will most certainly keep coming up. I have been working on a very different approach to sample file parsing and input, and to how args get passed to RNAsik. All args will be passed via a single config file, e.g.

# main configuration

outDir = sikRun
metadata = metadata.tsv
#aligner = star
aligner = bwaMem
#refFiles = supplementary/examples/refFilesDir
#genomeIdx = supplementary/examples/refFilesDir/Mus_musculus.GRCm38.dna_sm.primary_assembly.starIdx
geneModels = ../refFiles/annotation.gff3
fasta = ../refFiles/other.fa
counts = true
#refFiles = refFiles
qc = true
paired = true
mdups = false
featureCounts = -t CDS, -g locus_tag

and the metadata file (a.k.a. samples sheet) looks like this

name: sample_A_rep1
group: sample_A
r1: raw-data/960_0#22_R1.fastq.gz
r1: raw-data/070_0#22_R1.fastq.gz
r2: raw-data/967_0#22_R2.fastq.gz
r2: raw-data/077_0#22_R2.fastq.gz

name: sample_B_rep1
group: sampleB
r1: raw-data/964_1#28_R1.fastq.gz
r1: raw-data/074_1#28_R1.fastq.gz
r2: raw-data/965_1#28_R2.fastq.gz
r2: raw-data/075_1#28_R2.fastq.gz

CSV (tabular) format would be more familiar to most users and (arguably) easier to generate or export from a spreadsheet. However, I felt that a single line of text per sample is hard to read (on the CLI) and, more importantly, hard to generalise when samples have been split across multiple sequencing lanes.
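As an illustration of why the block layout handles multiple lanes naturally, here is a minimal Python sketch of a reader for the samples sheet shown above. It is not the RNAsik 2.0.0 parser (which is written in BDS); the function name `parse_samples_sheet` and the assumed rules (records separated by blank lines, `key: value` per line, repeated `r1`/`r2` keys collected into lists) are inferred from the example only.

```python
def parse_samples_sheet(text):
    """Parse blank-line-separated 'key: value' records into a list of dicts.

    Repeated r1/r2 keys are collected into lists, one entry per lane.
    """
    samples = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line closes the current record
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        key, value = key.strip(), value.strip()
        if current is None:
            current = {"r1": [], "r2": []}
            samples.append(current)
        if key in ("r1", "r2"):
            current[key].append(value)  # one file per sequencing lane
        else:
            current[key] = value
    return samples


sheet = """\
name: sample_A_rep1
group: sample_A
r1: raw-data/960_0#22_R1.fastq.gz
r1: raw-data/070_0#22_R1.fastq.gz
r2: raw-data/967_0#22_R2.fastq.gz
r2: raw-data/077_0#22_R2.fastq.gz
"""

for sample in parse_samples_sheet(sheet):
    print(sample["name"], len(sample["r1"]), "R1 file(s)", len(sample["r2"]), "R2 file(s)")
```

Because R1 and R2 are listed explicitly, no filename substitution is needed to find the mate, and extra metadata keys can be added to a record without changing the layout.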

I think the nf-core design.csv format is nice and simple, but again it is not as extensible if, for example, we want to pass through additional metadata.

All of this has been implemented in version 2.0.0; here is the bit of code that does the parsing
