RuntimeError #8

Open
pengxi1209 opened this issue Nov 18, 2024 · 3 comments

@pengxi1209

Hello, when I ran OncodriveFML with hg19 as the reference, I got this error: "RuntimeError: Sequence 'UN_GL000228' not found in genome build 'hg19' (/datasets/genomereference/hg19-20190201/UN_GL000228.txt)". How can I resolve it?
The command I ran was:
    oncodrivefml -i paad.txt.gz -e cds.tsv.gz -s wgs -c oncodrivefml_v2.conf -o result --force

@pengxi1209
Author

Additionally, could you please tell me how to obtain the hg38 cds.tsv.gz file? I couldn't find it on UCSC or Ensembl.

@FedericaBrando
Member

Hi, could you try re-downloading the genome reference with the following steps:

  1. Remove the ~/.bgdata folder.
  2. Download the CADD scores:
    bgdata get genomicscores/caddpack/1.0
  3. Download the genome reference:
    bgdata get datasets/genomereference/hg19
  4. Set the environment variables BGDATA_LOCAL and BGDATA_OFFLINE:
    export BGDATA_LOCAL=~/.bgdata
    export BGDATA_OFFLINE=TRUE
  5. Run the example.
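
Independently of the re-download, it may help to see which sequence names in the input files fall outside the primary chromosomes, since the error names 'UN_GL000228'. The following is only a diagnostic sketch, not part of OncodriveFML or bgdata: the helper names, the default column name `CHROMOSOME`, and the set of "primary" names are assumptions to adapt to your files.

```python
import csv
import gzip

# Assumed set of primary sequence names (autosomes, sex chromosomes, mitochondrion).
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "M", "MT"}


def read_tsv(path):
    """Yield rows of a tab-separated file with a header line (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def non_primary_sequences(rows, column="CHROMOSOME"):
    """Return the sorted sequence names that are not primary chromosomes.

    A leading "chr" is ignored for the comparison only; the names are
    reported exactly as they appear in the input.
    """
    names = set()
    for row in rows:
        name = row[column]
        bare = name[3:] if name.startswith("chr") else name
        if bare not in PRIMARY:
            names.add(name)
    return sorted(names)
```

For example, `non_primary_sequences(read_tsv("cds.tsv.gz"))` would list any scaffold-like names in the regions file. The same call can be pointed at the mutations file if it carries a column with that name.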

> Additionally, could you please tell me how to obtain the hg38 cds.tsv.gz file? I couldn't find it on UCSC or Ensembl.

You can build it yourself, following the structure described at https://oncodrivefml.readthedocs.io/en/latest/files.html#regions-file-format, and then pass it to OncodriveFML.
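
As a rough illustration of building such a file, here is a minimal sketch that turns the CDS features of a GTF annotation into tab-separated region rows. Everything in it is an assumption to check against the linked format page: the output column names (`CHROMOSOME`, `START`, `END`, `STRAND`, `ELEMENT`, `SEGMENT`, `SYMBOL`), the choice of gene ID as the element, the stripping of a `chr` prefix, and the function names, which are hypothetical.

```python
import csv
import gzip
import re

# Assumed header for the regions file; verify against the format documentation.
COLUMNS = ["CHROMOSOME", "START", "END", "STRAND", "ELEMENT", "SEGMENT", "SYMBOL"]

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_cds_rows(lines):
    """Yield one region row per CDS feature found in GTF-formatted lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue  # keep coding segments only, not whole genes or exons
        attrs = dict(ATTR_RE.findall(fields[8]))
        chrom = fields[0][3:] if fields[0].startswith("chr") else fields[0]
        yield {
            "CHROMOSOME": chrom,
            "START": int(fields[3]),  # GTF coordinates are 1-based, inclusive
            "END": int(fields[4]),
            "STRAND": fields[6],
            "ELEMENT": attrs.get("gene_id", ""),
            "SEGMENT": attrs.get("transcript_id", ""),
            "SYMBOL": attrs.get("gene_name", ""),
        }


def write_regions(gtf_path, out_path):
    """Write the CDS rows of a (possibly gzipped) GTF file to a gzipped TSV."""
    opener = gzip.open if gtf_path.endswith(".gz") else open
    with opener(gtf_path, "rt") as src, gzip.open(out_path, "wt", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(gtf_cds_rows(src))
```

Note that a GTF file usually lists several transcripts per gene, so this sketch emits overlapping segments for the same element. You would probably want to keep a single transcript per gene before using the output.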

@pengxi1209
Author

Thank you very much for your patient answer. I downloaded the CADD scores and the hg19 and hg38 genome references correctly, and the example ran successfully, producing the result files (ex_paad-oncodrivefml.html, ex_paad-oncodrivefml.png, ex_paad-oncodrivefml.tsv.gz).

However, running my own data did not go smoothly. Since my paad.txt.gz is based on hg38, I need an hg38 cds.tsv.gz. I have carefully read the description of each column on the file format page you linked, but I still don't know how to obtain the data for each column.

I tried selecting "Chromosome/scaffold name", "Gene start (bp)", "Gene end (bp)", "Gene stable ID" and "Gene name" in the Ensembl database and built a list of more than 70,000 rows, whereas your hg19 example 'cds.tsv.gz' has 200,000 lines. When I run OncodriveFML with this hg38 cds.tsv.gz, many warnings appear ("background mismatch at position xxxx"), and ultimately no result files like those from the example are produced.
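
One plausible reason for the difference in row counts, offered as a guess rather than a confirmed diagnosis, is that gene start and end coordinates give one row per gene (introns included), while a CDS regions file has one row per coding segment. A small sketch like the one below can compare the two files. The column names `ELEMENT`, `START` and `END` and the inclusive-coordinate convention are assumptions, and both function names are hypothetical.

```python
import csv
import gzip
from collections import Counter


def read_regions(path):
    """Yield rows of a gzipped, tab-separated regions file with a header line."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def summarize_regions(rows):
    """Return row count, element count, largest segment count, and total length."""
    rows = list(rows)
    per_element = Counter(row["ELEMENT"] for row in rows)
    # Assumes START and END are both inclusive, hence the + 1.
    total_bp = sum(int(row["END"]) - int(row["START"]) + 1 for row in rows)
    return {
        "rows": len(rows),
        "elements": len(per_element),
        "max_rows_per_element": max(per_element.values(), default=0),
        "total_bp": total_bp,
    }
```

If `summarize_regions(read_regions(...))` reports about one row per element for the hg38 file but many rows per element for the hg19 example, the hg38 file is describing whole genes rather than coding segments.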
