Variables in the training data missing in newdata #1
Comments
Do you mind sharing some dummy input file for me to reproduce the error?
I found that methylation alone did not generate good enough predictions, so I did not pursue this further.
Please find attached the files used as input. compute_score gives the error: Error in predict.randomForest(reptile_classifier, epimark, type = "prob") : Files:
Dear Dr. He,
I have shared the dummy input files in the GitHub issue. Please find them
attached here as well. The initial file contains the methylation rate for
each base pair. Is this issue due to the bigWig files being generated from
.bed files that have one base pair per line, or due to the fact that I
am using the mm39 mouse genome?
Sincerely,
Karamveer
On Sun, Jun 2, 2024 at 11:31 PM Yupeng He wrote:
Do you mind sharing some dummy input file for me to reproduce the error?
Are there any specific trained model available for only DNA methylation
data to predict enhancers.
I found that methylation alone did not generate good enough predictions, so
I did not pursue this further.
--
Sincerely,
Karamveer
Post-doctoral Scholar
Department of Pediatrics
Penn State College of Medicine
Penn State University
Thanks. I will take a look. Using mm39 will be an issue, but it probably isn't the cause of the error you saw. Unfortunately, all models were trained and tested on data processed against mm10. If I am able to fix the error and you still want to run REPTILE on your data, I would suggest reprocessing your data on mm10.
Thanks for the suggestion. I will try it with mm10 data as well, but my data is for mm39 and I am doing some other complementary analysis on mm39, so I cannot convert it to mm10.
I see. REPTILE probably won't generate what you want. I would recommend using peak calls from H3K27ac, or the overlapping peaks of H3K27ac and H3K4me1, as predicted enhancers.
Hi, can you share the full dataset used for training? Only a subset (Chr19) is available in the example data.
Hi,
I am trying to run REPTILE with the pre-trained model mm_model_coreMarks.reptile using methylation data. Is there any issue with my bigWig generation? I have methylation base-call BED files containing chromosome, start, end, and methylation rate. I converted them into bigWig files using the following commands:
awk '{printf "%s\t%d\t%d\t%2.3f\n" , $1,$2,$3,$4}' myBed.bed > myFile.bedgraph
sort -k1,1 -k2,2n myFile.bedgraph > myFile_sorted.bedgraph
bedGraphToBigWig myFile_sorted.bedgraph myChrom.sizes myBigWig.bw
I tried the Meth epimark alone as well as all four marks (Meth, H3K4me1, H3K4me3, H3K27ac) given for the mm_model_coreMarks.reptile model. The output of REPTILE_preprocess.py is the file preprocessed.region_with_epimark.tsv, which looks like this:
chr start end id Meth_E4 H3K4me1_E4 H3K4me3_E4 H3K27ac_E4
chr1 0 2000 bin_0 0.0 0.0 0.0 0.0
chr1 100 2100 bin_1 0.0 0.0 0.0 0.0
chr1 200 2200 bin_2 0.0 0.0 0.0 0.0
chr1 300 2300 bin_3 0.0 0.0 0.0 0.0
chr1 400 2400 bin_4 0.0 0.0 0.0 0.0
chr1 500 2500 bin_5 0.0 0.0 0.0 0.0
chr1 600 2600 bin_6 0.0 0.0 0.0 0.0
chr1 700 2700 bin_7 0.0 0.0 0.0 0.0
chr1 800 2800 bin_8 0.0 0.0 0.0 0.0
chr1 900 2900 bin_9 0.0 0.0 0.0 0.0
chr1 1000 3000 bin_10 0.0 0.0 0.0 0.0
.
.
chr1 3211200 3213200 bin_32112 5.0 5.0 5.0 5.0
chr1 3211300 3213300 bin_32113 5.0 5.0 5.0 5.0
chr1 3211400 3213400 bin_32114 5.0 5.0 5.0 5.0
chr1 3211500 3213500 bin_32115 4.0 4.0 4.0 4.0
chr1 3211600 3213600 bin_32116 3.3 3.3 3.3 3.3
chr1 3211700 3213700 bin_32117 2.54545 2.54545 2.54545 2.54545
chr1 3211800 3213800 bin_32118 2.69231 2.69231 2.69231 2.69231
chr1 3211900 3213900 bin_32119 3.0 3.0 3.0 3.0
chr1 3212000 3214000 bin_32120 2.85714 2.85714 2.85714 2.85714
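In the rows printed above, the four mark columns carry the same value on every line, which may be worth checking across the whole file before scoring. Below is a minimal sketch in Python, assuming the file is tab-separated with a header line and four leading annotation columns (chr, start, end, id). The function name `find_identical_columns` and the `SAMPLE` text are illustrative only and are not part of REPTILE.

```python
import csv
import io
from itertools import combinations


def find_identical_columns(handle, n_meta=4):
    """Return pairs of signal columns whose text matches on every row.

    Assumes a tab-separated table with a header line and `n_meta` leading
    annotation columns (chr, start, end, id). Values are compared as
    written, so "0" and "0.0" count as different.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    names = header[n_meta:]
    # Start with every pair as a candidate; drop a pair at its first mismatch.
    candidates = set(combinations(range(len(names)), 2))
    for row in reader:
        if len(row) < len(header):
            continue  # skip blank or truncated lines
        values = row[n_meta:]
        candidates = {(i, j) for i, j in candidates if values[i] == values[j]}
        if not candidates:
            break
    return [(names[i], names[j]) for i, j in sorted(candidates)]


# Two rows copied from the excerpt above, re-typed with tab separators.
SAMPLE = (
    "chr\tstart\tend\tid\tMeth_E4\tH3K4me1_E4\tH3K4me3_E4\tH3K27ac_E4\n"
    "chr1\t3211500\t3213500\tbin_32115\t4.0\t4.0\t4.0\t4.0\n"
    "chr1\t3211600\t3213600\tbin_32116\t3.3\t3.3\t3.3\t3.3\n"
)

print(find_identical_columns(io.StringIO(SAMPLE)))
```

If every pair of marks is reported for the full file, one possibility is that the same bigWig file was listed for each mark, although nothing in this thread confirms that.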
Now when I run the compute score command:
REPTILE_compute_score.R -i data_info_file2 -m mm_model_coreMarks.reptile -a tmp/mm39_w2kb_s100bp_preprocessed.region_with_epimark.tsv -s E4 -o tmp/E4__compute_pred
I get the following error:
Error in predict.randomForest(reptile_classifier, epimark, type = "prob") :
variables in the training data missing in newdata
Calls: reptile_predict_genome_wide ... reptile_predict_one_mode -> predict -> predict.randomForest
Execution halted
Is there a trained model available that predicts enhancers from DNA methylation data alone?
Note: I tried both genome-wide and region-specific runs.
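As far as I understand, predict.randomForest stops with "variables in the training data missing in newdata" when at least one predictor name stored in the fitted forest is not among the column names of the data passed to it. One way to narrow this down is to compare the header of the preprocessed table with the names the model expects. The sketch below is in Python and assumes a tab-separated header with four leading annotation columns. The `EXPECTED` list is a placeholder, because the variable names stored in mm_model_coreMarks.reptile are not given in this thread. `missing_variables` is a hypothetical helper, not a REPTILE function.

```python
def missing_variables(header_line, expected, n_meta=4):
    """List the names in `expected` that are absent from the table header.

    `header_line` is the first line of the preprocessed TSV; the first
    `n_meta` columns (chr, start, end, id) are ignored.
    """
    columns = header_line.rstrip("\r\n").split("\t")
    present = set(columns[n_meta:])
    return [name for name in expected if name not in present]


# Header from the excerpt above, re-typed with tab separators.
HEADER = "chr\tstart\tend\tid\tMeth_E4\tH3K4me1_E4\tH3K4me3_E4\tH3K27ac_E4"

# Placeholder only: replace with the variable names read from the model.
EXPECTED = ["Meth_E4", "H3K27ac_E4", "some_other_variable"]

print(missing_variables(HEADER, EXPECTED))
```

Any name this reports would be a column the model was trained on but that the preprocessing step did not produce under that exact name.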