Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suspicious "Ran out of path" Error #2

Open
glennhickey opened this issue Nov 17, 2015 · 3 comments
Open

Suspicious "Ran out of path" Error #2

glennhickey opened this issue Nov 17, 2015 · 3 comments

Comments

@glennhickey
Copy link
Contributor

I have 4 brca1 graphs, (extracted from the high-coverage-reads/indexes):
/cluster/home/hickey/ga4gh/hgvm-graph-bakeoff-evalutations/high_coverage_graphs/
cactus-brca1.vg
refonly-brca1.vg
snp1000g-brca1.vg
trivial-brca1.vg

I can successfully run corg on all pairs of these graphs except cactus and trivial, which gives me

Processing path: GI262359905
We ran out of path in one graph and not in the other!
We have a mapping
Our mapping: {"position": {"node_id": 856, "offset": 95}, "is_reverse": true, "rank": 842}
terminate called after throwing an instance of 'std::runtime_error'
what(): Ran out of mappings on one path before the other!
Aborted (core dumped)

If these two graphs have paths of different length, which seems to be what this error is saying, then I would not expect them both to compare successfully to refonly and snp1000g.

@glennhickey
Copy link
Contributor Author

Some other corg questions:

  • I now only include graphs if len(graphA) == len(corg(graphA, graphA)). Is this reasonable?
  • for cactus_sma, the above doesn't hold (the difference is only a few bases). is this a corg bug?

@adamnovak
Copy link
Owner

OK, I have looked into this. Here's what's going on.

snp1000g has no paths in common with any of the other graphs (not even a path called "ref"). It appears to contain a single path named "17". Thus, it can merge with any graph without error, since no path merging actually happens; the graphs just get placed side by side in one file.

To fix this, I've added a warning that will print if there aren't any common paths to merge on.

To debug the rest of it, I've added a pre-check of the path lengths to be merged, to make sure they are actually the same length before doing any merging. It will be slower, but probably safer, as it's now impossible to merge two graphs that disagree about a path's length and not have an error.

With this change, I can successfully merge:

Cactus and Camel

The length of GI262359905 doesn't match for:

Cactus and Trivial
Camel and Trivial

This is weird, but it seems sane and consistent. Cactus and Camel agree with each other about the path length, and both disagree with trivial. It seems like Cactus and Camel put the path at 81,191 bp and Trivial puts it at 81,189 bp, or 2 bases shorter.

Going back in with vg, it seems like this is actually the case:

[anovak@kolossus core-graph]$ vg mod -k GI262359905 -u /cluster/home/hickey/ga4gh/hgvm-graph-bakeoff-evalutations/high_coverage_graphs/cactus-brca1.vg | vg view -j - | jq '.node[] | {"length": .sequence | length, "name": .name}'
{
  "length": 2,
  "name": "GI262359905_0_2_0"
}
{
  "length": 81189,
  "name": null
}
[anovak@kolossus core-graph]$ vg mod -k GI262359905 -u /cluster/home/hickey/ga4gh/hgvm-graph-bakeoff-evalutations/high_coverage_graphs/trivial-brca1.vg | vg view -j - | jq '.node[] | {"length": .sequence | length, "name": .name}'
{
  "length": 81189,
  "name": null
}

Looking at the original FASTA, it seems as though the shorter length, which trivial is using, is correct:

$ cat /cluster/home/anovak/public_html/hgvm/BRCA1/GI262359905.fa | grep -v ">" | tr -d '\n' | wc -c
81189

There's an extra length-2 node tacked on (I think with a reversing join) in both the Cactus and Camel graphs.

Interestingly, it looks like the lengths are correct in the original HAL files:

[anovak@kolossus core-graph]$ halStats /cluster/home/hickey/ga4gh/cactus/results/BRCA1/brca1_star.hal
hal v2.1
(ref:0.001,GI262359905:0.001,GI528476558:0.001)Anc0;

GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
Anc0, 3, 81189, 1, 0, 28
ref, 0, 81189, 1, 28, 0
GI262359905, 0, 81189, 1, 28, 0
GI528476558, 0, 81168, 1, 26, 0

[anovak@kolossus core-graph]$ halStats /cluster/home/anovak/hive/sgdev/output/out3/BRCA1/zipCmis2Min20Ed2Range100.hal

hal v2.1
(ref,GI528476558,GI262359905)rootSeq;

GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
rootSeq, 3, 81189, 1, 0, 28
ref, 0, 81189, 1, 28, 0
GI528476558, 0, 81168, 1, 27, 0
GI262359905, 0, 81189, 1, 28, 0

It does show up, however, after sg2vg:

[anovak@kolossus core-graph]$ sg2vg http://ga4gh-test1.cloudapp.net/camel-brca1/v0.6.g | vg view -Jv - | vg mod -k GI262359905 - | vg view -j - | jq '[.node[].sequence | length] | add'
Downloading References... (14 references retrieved)
Downloading Sequences... (14 sequences retrieved)
Downloading Joins... (26 joins retrieved)
Downloading allele paths... (3 paths retrieved)
Converting Side Graph to VG Sequence Graph
Writing VG JSON to stdout
81191
[anovak@kolossus core-graph]$ sg2vg http://ga4gh-test1.cloudapp.net/trivial-brca1/v0.6.g | vg view -Jv - | vg mod -k GI262359905 - | vg view -j - | jq '[.node[].sequence | length] | add'
Downloading References... (3 references retrieved)
Downloading Sequences... (3 sequences retrieved)
Downloading Joins... (0 joins retrieved)
Downloading allele paths... (3 paths retrieved)
Converting Side Graph to VG Sequence Graph
Writing VG JSON to stdout
81189

It could be that vg is messing this up, even with the simple commands I'm using here, but I don't think it's likely.

As for your other questions:

  • len(graphA) == len(corg(graphA, graphA)) only when the graph is completely covered by paths. I have added a check and a warning to see if either of the graphs has uncovered nodes.
  • cactus-sma.vg seems to not be covered by paths. It looks like it includes a node with ID 1 and sequence "NNNNNNNNNN" that has no paths visiting it.

@glennhickey
Copy link
Contributor Author

Hey Adam,

Thanks for this! I'm going to look into sg2vg right now to see what's up
with the path lengths and uncovered nodes.

-Glenn

On Fri, Dec 4, 2015 at 5:29 PM, adamnovak [email protected] wrote:

OK, I have looked into this. Here's what's going on.

snp1000g has no paths in common with any of the other graphs (not even a
path called "ref"). It appears to contain a single path named "17". Thus,
it can merge with any graph without error, since no path merging actually
happens; the graphs just get placed side by side in one file.

To fix this, I've added a warning that will print if there aren't any
common paths to merge on.

To debug the rest of it, I've added a pre-check of the path lengths to be
merged, to make sure they are actually the same length before doing any
merging. It will be slower, but probably safer, as it's now impossible to
merge two graphs that disagree about a path's length and not have an
error.

With this change, I can successfully merge:

Cactus and Camel

The length of GI262359905 doesn't match for:

Cactus and Trivial
Camel and Trivial

This is weird, but it seems sane and consistent. Cactus and Camel agree
with each other about the path length, and both disagree with trivial. It
seems like Cactus and Camel put the path at 81,191 bp and Trivial puts it
at 81,189 bp, or 2 bases shorter.

Going back in with vg, it seems like this is actually the case:

[anovak@kolossus core-graph]$ vg mod -k GI262359905 -u /cluster/home/hickey/ga4gh/hgvm-graph-bakeoff-evalutations/high_coverage_graphs/cactus-brca1.vg | vg view -j - | jq '.node[] | {"length": .sequence | length, "name": .name}'
{
"length": 2,
"name": "GI262359905_0_2_0"
}
{
"length": 81189,
"name": null
}
[anovak@kolossus core-graph]$ vg mod -k GI262359905 -u /cluster/home/hickey/ga4gh/hgvm-graph-bakeoff-evalutations/high_coverage_graphs/trivial-brca1.vg | vg view -j - | jq '.node[] | {"length": .sequence | length, "name": .name}'
{
"length": 81189,
"name": null
}

Looking at the original FASTA, it seems as though the shorter length,
which trivial is using, is correct:

$ cat /cluster/home/anovak/public_html/hgvm/BRCA1/GI262359905.fa | grep -v ">" | tr -d '\n' | wc -c
81189

There's an extra length-2 node tacked on (I think with a reversing join)
in both the Cactus and Camel graphs.

Interestingly, it looks like the lengths are correct in the original HAL
files:

[anovak@kolossus core-graph]$ halStats /cluster/home/hickey/ga4gh/cactus/results/BRCA1/brca1_star.hal
hal v2.1
(ref:0.001,GI262359905:0.001,GI528476558:0.001)Anc0;

GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
Anc0, 3, 81189, 1, 0, 28
ref, 0, 81189, 1, 28, 0
GI262359905, 0, 81189, 1, 28, 0
GI528476558, 0, 81168, 1, 26, 0

[anovak@kolossus core-graph]$ halStats /cluster/home/anovak/hive/sgdev/output/out3/BRCA1/zipCmis2Min20Ed2Range100.hal

hal v2.1
(ref,GI528476558,GI262359905)rootSeq;

GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
rootSeq, 3, 81189, 1, 0, 28
ref, 0, 81189, 1, 28, 0
GI528476558, 0, 81168, 1, 27, 0
GI262359905, 0, 81189, 1, 28, 0

It does show up, however, after sg2vg:

[anovak@kolossus core-graph]$ sg2vg http://ga4gh-test1.cloudapp.net/camel-brca1/v0.6.g | vg view -Jv - | vg mod -k GI262359905 - | vg view -j - | jq '[.node[].sequence | length] | add'
Downloading References... (14 references retrieved)
Downloading Sequences... (14 sequences retrieved)
Downloading Joins... (26 joins retrieved)
Downloading allele paths... (3 paths retrieved)
Converting Side Graph to VG Sequence Graph
Writing VG JSON to stdout
81191
[anovak@kolossus core-graph]$ sg2vg http://ga4gh-test1.cloudapp.net/trivial-brca1/v0.6.g | vg view -Jv - | vg mod -k GI262359905 - | vg view -j - | jq '[.node[].sequence | length] | add'
Downloading References... (3 references retrieved)
Downloading Sequences... (3 sequences retrieved)
Downloading Joins... (0 joins retrieved)
Downloading allele paths... (3 paths retrieved)
Converting Side Graph to VG Sequence Graph
Writing VG JSON to stdout
81189

It could be that vg is messing this up, even with the simple commands I'm
using here, but I don't think it's likely.

As for your other questions:

len(graphA) == len(corg(graphA, graphA)) only when the graph is
completely covered by paths. I have added a check and a warning to see if
either of the graphs has uncovered nodes.

cactus-sma.vg seems to not be covered by paths. It looks like it
includes a node with ID 1 and sequence "NNNNNNNNNN" that has no paths
visiting it.


Reply to this email directly or view it on GitHub
#2 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants