Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AssertionError #7

Open
SNRNS opened this issue Feb 3, 2016 · 5 comments
Open

AssertionError #7

SNRNS opened this issue Feb 3, 2016 · 5 comments

Comments

@SNRNS
Copy link

SNRNS commented Feb 3, 2016

I successfully analysed the example files with RunTheta, however when I tried to analyse my sample it produces an error. I have the normal and primary tumour whole exome sequencing bam files. I created the Theta input file and the snp.withCounts files, one for normal and one for tumour files following the instructions. However when I run in the THetA-master directory:

bin/RunTHetA normal_primary.input --TUMOR_FILE BAF_OUTPUT/primary_snp.withCounts --NORMAL_FILE BAF_OUTPUT/normal_snp.withCounts

I get:

=================================================
Arguments are:
       Query File: normal_primary.input
       k: 3
       tau: 2
       Output Directory: ./
       Output Prefix: normal_primary
       Num Processes: 1
       Graph extension: .pdf

Valid sample for THetA analysis:
       Ratio Deviation: 0.1
       Min Fraction of Genome Aberrated: 0.05
       Program WILL cluster intervals.
=================================================
Reading in query file...
Frac with potential copy numbers: 0.274523486464
Reading SNP file at BAF_OUTPUT/primary_snp.withCounts
Reading SNP file at BAF_OUTPUT/normal_snp.withCounts
Reading interval file at normal_primary.input
Calculating BAFs
Determining heterozygosity.
Calculating BAFs.
First round of clustering...
Traceback (most recent call last):
 File "/PATH/TO/.../THetA-master/bin/../python/RunTHetA.py", line 504, in <module>

   main()

 File "/PATH/TO/.../THetA-master/bin/../python/RunTHetA.py", line 282, in main

   resultsfile2, boundsfile2 = run_fixed_N(2, args, intervals)

 File "/PATH/TO/.../THetA-master/bin/../python/RunTHetA.py", line 319, in run_fixed_N

   lengths, tumorCounts, normalCounts, m, upper_bounds, lower_bounds, clusterAssignments, numClusters, clusterMeans, normalInd = clustering_BAF(n, intervals=intervals, missingData=missingData, prefix=prefix, outdir=directory, numProcesses=num_processes)

 File "/PATH/TO/.../THetA-master/python/ClusteringBAF.py", line 90, in clustering_BAF

   metaData = generate_meta_data(intervals, byChrm, numProcesses, sampleName, generateData, outdir)

 File "/PATH/TO/.../THetA-master/python/ClusteringBAF.py", line 143, in generate_meta_data

   results = p.map(cluster_wrapper, zip(intervals, linearizedSampleName, linearizedChrm, linearizedGenerateData))

 File "/PATH/TO/miniconda2/envs/theta2/lib/python2.7/multiprocessing/pool.py", line 251, in map

   return self.map_async(func, iterable, chunksize).get()

 File "/PATH/TO/miniconda2/envs/theta2/lib/python2.7/multiprocessing/pool.py", line 567, in get

   raise self._value

AssertionError

The normal_snp.withCounts file looks like this:

head -20 BAF_OUTPUT/normal_snp.withCounts

#Chrm   pos     A       C       G       T       total   refCount        mutCount
1       564621  0       0       0       0       0       0       0
1       564773  0       0       0       0       0       0       0
1       567753  0       0       0       0       0       0       0
1       721290  0       0       3       0       3       3       0
1       740857  0       0       0       0       0       0       0
1       752566  0       0       0       0       0       0       0
1       761732  0       0       0       0       0       0       0
1       765269  0       0       0       0       0       0       0
1       777122  0       0       0       0       0       0       0
1       785989  0       0       0       0       0       0       0
1       792480  0       0       0       0       0       0       0
1       798959  0       0       0       0       0       0       0
1       799463  0       0       0       0       0       0       0
1       888659  0       0       0       47      47      47      0
1       918573  0       0       0       0       0       0       0
1       926431  0       0       0       4       4       0       4
1       947034  0       0       0       0       0       0       0
1       949608  83      0       0       0       83      0       83

The primary_snp.withCounts looks like this:

head -20 BAF_OUTPUT/primary_snp.withCounts

#Chrm   pos     A       C       G       T       total   refCount        mutCount
1       564621  0       0       0       0       0       0       0
1       564773  0       0       0       0       0       0       0
1       567753  0       0       0       0       0       0       0
1       721290  0       0       0       0       0       0       0
1       740857  0       0       0       0       0       0       0
1       752566  0       0       0       0       0       0       0
1       761732  0       0       0       0       0       0       0
1       765269  0       0       0       0       0       0       0
1       777122  0       0       0       0       0       0       0
1       785989  0       0       0       0       0       0       0
1       792480  0       0       0       1       1       0       1
1       798959  0       0       0       0       0       0       0
1       799463  0       0       0       0       0       0       0
1       888659  0       0       0       165     165     165     0
1       918573  1       0       0       0       1       1       0
1       926431  0       0       0       0       0       0       0
1       947034  2       0       0       0       2       0       2
1       949608  191     0       0       0       191     0       191
1       990417  0       0       0       125     125     125     0

The normal_primary.input looks like this:

head normal_primary.input

#ID     chrm    start   end     tumorCount      normalCount
start_1_1:end_1_20000           1       1       20000   531     215
start_1_20001:end_1_40000       1       20001   40000   0       0
start_1_40001:end_1_60000       1       40001   60000   0       0
start_1_60001:end_1_80000       1       60001   80000   242     90
start_1_80001:end_1_100000      1       80001   100000  0       0
start_1_100001:end_1_120000     1       100001  120000  0       0
start_1_120001:end_1_140000     1       120001  140000  832     493
start_1_140001:end_1_160000     1       140001  160000  0       0
start_1_160001:end_1_180000     1       160001  180000  0       0

I would appreciate very much any help provided.

Many thanks,
Alejandro

@ChiragNepal
Copy link

Hi Alejandro,

Can you please share how you solved the problem?

I am getting the following.

Arguments are:
Query File: Test2Segment
k: 3
tau: 2
Output Directory: ./
Output Prefix: Test2Segment
Num Processes: 6
Graph extension: .pdf

Valid sample for THetA analysis:
Ratio Deviation: 0.1
Min Fraction of Genome Aberrated: 0.05

Reading in query file...
Traceback (most recent call last):
File "/usr/local/src/THetA/python/RunTHetA.py", line 504, in
main()
File "/usr/local/src/THetA/python/RunTHetA.py", line 275, in main
intervals = read_interval_file(args[0])
File "/usr/local/src/THetA/python/FileIO.py", line 400, in read_interval_file
tumor_counts.append(int(line[4]))
ValueError: invalid literal for int() with base 10: '38.3'

Thanks !

@SNRNS
Copy link
Author

SNRNS commented Jun 8, 2016

Hi @ChiragNepal,

I couldn't solve the problem. I used other tools in the end. For relative copy number alterations I used CopywriteR, for tumour purity and absolute copy numbers I used ABSOLUTE, and finally to get the cellular prevalence of the mutations I used PyClone.

I hope that helps.

Best,
Alejandro

@dpmccabe
Copy link

I'm also having this problem. Removing multiprocessing and changing p.map to map in ClusteringBAF.py results in this error instead:

First round of clustering...
Traceback (most recent call last):
  File "/xchip/scarter/dmccabe/THetA/bin/../python/RunTHetA.py", line 504, in <module>
    main()
  File "/xchip/scarter/dmccabe/THetA/bin/../python/RunTHetA.py", line 282, in main
    resultsfile2, boundsfile2 = run_fixed_N(2, args, intervals)
  File "/xchip/scarter/dmccabe/THetA/bin/../python/RunTHetA.py", line 319, in run_fixed_N
    lengths, tumorCounts, normalCounts, m, upper_bounds, lower_bounds, clusterAssignments, numClusters, clusterMeans, normalInd = clustering_BAF(n, intervals=intervals, missingData=missingData, prefix=prefix, outdir=directory, numProcesses=num_processes)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 90, in clustering_BAF
    metaData = generate_meta_data(intervals, byChrm, numProcesses, sampleName, generateData, outdir)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 146, in generate_meta_data
    results = map(cluster_wrapper, zip(intervals, linearizedSampleName, linearizedChrm, linearizedGenerateData))
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 203, in cluster_wrapper
    mus, sigmas, clusterAssignments, numPoints, numClusters = cluster(binnedChrm, sampleName, chrm=chrm)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 259, in cluster
    Data = format_data(data, sampleName, chrm)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 302, in format_data
    Data = bnpy.data.XData(X=npArray)
  File "/xchip/scarter/dmccabe/bnpy-dev/bnpy/data/XData.py", line 97, in __init__
    self._check_dims()
  File "/xchip/scarter/dmccabe/bnpy-dev/bnpy/data/XData.py", line 121, in _check_dims
    assert self.X.flags.owndata
AssertionError

I don't have any problems running the example, for some reason. Just my own data.

@BaptisteAmeline
Copy link

It has been suggested in another forum that at least 1 SNP has to be present in every single interval_count. I assume that the example works perfectly because the number of interval_count is very limited compared to the number of SNP. Can someone confirm this hypothesis ? Is THetA still supported by the developper ?

If it is confirmed, as a consequence, if you want to be more resolutive in terms of interval_count (eg : in case of WGS presenting chromothripsis/chromoplexy events) it will be almost impossible to have 1 SNP per interval.

@simozacca
Copy link
Contributor

simozacca commented May 25, 2018

Dear @BaptisteAmeline,

As you can read in the manual (https://github.com/raphael-group/THetA/blob/master/doc/MANUAL.txt) the usage of SNPs is part of a recommended but optional step.
As far as I am aware of, there is no requirement about having a SNP per each interval_count.

However, to clarify your point (since almost all the allele-specific copy-number callers consider only segments having at least one heterozygous SNP), germinal SNPs are considered and their location is with respect to the reference genome which you used to align the reads. Therefore, catastrophic events as chromothripsis/chromoplexy do not affect in any way the position or presence of germinal-heterozygous SNPs in the reference genome. In the human-reference genome you may expect 1 SNP every 1k bases, and as such you should expect to have many SNPs in your intervals considering standard-bin sizes. The allele counts from these SNPs is used to infer the allele-specific copy numbers of each segment in the tumor sample due the effects of any aberration (including chromothripsis/chromoplexy). E.g. heterozygous SNPs should have a proportion of the alleles of 50%, if one of the alleles is lost, the expected proportions is 0%, assuming this occurs in all cells of the sample. You can search for B-allele frequency (BAF) to get to know more about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants