Host map: ensure paired results #286

Open
necrolyte2 opened this issue Mar 3, 2016 · 10 comments

Comments

@necrolyte2
Member

If R1 is filtered in host map, remove R2 as well.
If R2 is filtered in host map, remove R1 as well.

Reasoning:
If one read of a pair matches the host, then both should have matched, so both should be removed.
Right now, a bunch of R2 reads that should have been filtered out end up going through the costly BLAST stages.

@necrolyte2 necrolyte2 modified the milestone: host map paired and summary Mar 3, 2016
@averagehat averagehat self-assigned this Mar 30, 2016
@averagehat
Contributor

It wasn't obvious to me how to do this (without destroying RAM) until I realized the discarded reads are available to look at.

  1. Sort the discarded reads (using unix sort)
  2. Open host_map_R1.fastq and R1.discard
  3. Iterate through host_map_R1 as a generator:
    • If the current ID in host_map_R1 equals the top ID in R1.discard:
      1. Filter that read out of host_map_R1
      2. Advance to the next ID in R1.discard
         (essentially treating R1.discard like a stack)
  4. Do the same for R2

This way we only load one record/ID into memory at a time.

NB: It would probably be better to chunk the writes rather than writing records one at a time. Anyway, this will just pass SeqIO.write a generator, and I'm not sure whether it does chunking or not.
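
The steps above amount to a single merge-style pass over two ID-sorted streams. Here is a minimal sketch of that idea, not the pipeline's actual code: `drop_discarded` and the `(id, record)` tuple shape are hypothetical, IDs are assumed to be integers sorted ascending in both inputs, and skipping discard IDs that never show up in the reads is an extra safety step added here.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: (integer read ID, raw record text).
Record = Tuple[int, str]


def drop_discarded(reads: Iterable[Record],
                   discard_ids: Iterable[int]) -> Iterator[Record]:
    """Yield reads whose ID is not in discard_ids.

    Assumes both inputs are sorted ascending by integer ID, so only the
    current read and the current ("top") discard ID are held in memory.
    """
    discards = iter(discard_ids)
    top = next(discards, None)
    for read_id, record in reads:
        # Added safety step: skip discard IDs that are not present in reads.
        while top is not None and top < read_id:
            top = next(discards, None)
        if top is not None and top == read_id:
            # Matching ID: filter this read and advance the discard stream.
            top = next(discards, None)
            continue
        yield read_id, record


reads = [(1, "a"), (2, "b"), (3, "c"), (5, "e")]
kept = list(drop_discarded(reads, [2, 4, 5]))
```

Because the function is a generator, its output can be handed straight to a writer that accepts an iterable of records.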

@necrolyte2
Member Author

is host_map_R1 sorted already?

@averagehat
Contributor

No, I thought it would be, but it doesn't appear to be. Edit: this is wrong.


@necrolyte2
Member Author

So you will have to sort both files then, right?

@averagehat
Contributor

Oops, I misread your comment.
host_map_R1 is sorted. R1.discard is not.

I'm not sure where the sorting for host_map_R1 comes from, or rather why it's maintained when the other isn't. But it's a precondition for this algorithm to work.

@necrolyte2
Member Author

OK. Might want to verify beforehand that the sort is enforced somehow.

@averagehat
Contributor

At the very least it can be an assertion somewhere.
If you finish and R1.discard is non-empty, you know something went wrong.
Or:

current = next(seq)
assert current.id > last.id
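
One way to fold that assertion into the streaming pass is a small pass-through generator. This is only a sketch: `assert_ascending` is a hypothetical helper, and it assumes the IDs have already been converted to integers. Biopython's `record.id` is a string, so comparing the raw values would order "10" before "9".

```python
from typing import Iterable, Iterator


def assert_ascending(ids: Iterable[int]) -> Iterator[int]:
    """Pass IDs through unchanged, failing fast if they are not strictly ascending."""
    last = None
    for current in ids:
        if last is not None and current <= last:
            raise ValueError(f"IDs out of order: {current} follows {last}")
        last = current
        yield current


checked = list(assert_ascending([1, 2, 5]))

try:
    list(assert_ascending([1, 3, 2]))
    out_of_order_detected = False
except ValueError:
    out_of_order_detected = True
```

Wrapping the input stream this way checks the precondition lazily, without a separate pass over the file.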

@necrolyte2
Member Author

Are they always going to be integers?

@averagehat
Contributor

Yeah, this is after the pipeline has set the IDs to integers.


@averagehat
Contributor

Whoops, I phrased this wrong. You run it with host_map_R1 and R2.discard, and with host_map_R2 and R1.discard.
The reads in R1.discard will already have been dropped from host_map_R1; what you need to drop is whatever was dropped from the other read of the pair.
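
The crossed pairing can be sketched as follows. This is an illustrative outline under assumptions, not the pipeline's implementation: `drop_ids` and `enforce_pairs` are hypothetical names, reads are modeled as `(id, record)` tuples, and every input is assumed to be sorted ascending by integer ID.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str]  # hypothetical shape: (integer read ID, raw record text)


def drop_ids(reads: Iterable[Record], discard_ids: Iterable[int]) -> Iterator[Record]:
    """Yield reads whose ID is absent from discard_ids (both sorted ascending)."""
    discards = iter(discard_ids)
    top = next(discards, None)
    for read in reads:
        # Advance past discard IDs smaller than the current read ID.
        while top is not None and top < read[0]:
            top = next(discards, None)
        if top != read[0]:
            yield read


def enforce_pairs(r1, r2, r1_discard, r2_discard):
    """Filter each mate file against the OTHER mate's discard list."""
    kept_r1 = drop_ids(r1, r2_discard)  # host_map_R1 vs R2.discard
    kept_r2 = drop_ids(r2, r1_discard)  # host_map_R2 vs R1.discard
    return kept_r1, kept_r2


# Read 2 was discarded from R1 only, read 4 from R2 only.
host_map_r1 = [(1, "r1-1"), (3, "r1-3"), (4, "r1-4")]
host_map_r2 = [(1, "r2-1"), (2, "r2-2"), (3, "r2-3")]
kept_r1, kept_r2 = enforce_pairs(host_map_r1, host_map_r2, [2], [4])
paired_r1, paired_r2 = list(kept_r1), list(kept_r2)
```

After both passes the two outputs contain the same set of IDs, which is the paired result the issue asks for.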

@necrolyte2 necrolyte2 removed this from the host map paired and summary milestone Apr 5, 2016