chromBPNet training questions #200
** The above issue was accidentally closed - help would still be appreciated with the above questions.

Hi, I have run into some additional questions regarding the training of my bias model since I last posted this question. I am concerned about the results of the training reports, specifically the training/validation loss curves. Below I have described the 3 models I have trained so far as well as my concerns about them.

Model 1: This model was trained on the total scATAC-seq data set stemming from muscle lysate. I found that the total loss for both the training and the validation loss curves was generally very high (between 400 and 500) and that the separation between the two curves is also concerningly large (~60). This makes me think that the model may not be generalizing from the training data very well and that the training set size may need to be increased. However, the current (and suggested) train/validation split has 3 test chromosomes, 2 validation chromosomes, and 16 training chromosomes, making it hard to increase the training set.

Model 2: I tried subsetting the original ATAC-seq data to a specific cell type of interest (while keeping the original split), and this decreases the total loss for the training and validation loss curves (between 230 and 265) but keeps the gap at around 28.

Model 3: Lastly, I decreased model 2's training set to ensure that the model is not learning specific patterns (5 test, 2 validation, 14 training chromosomes), but this did not change the curves and worsened other metrics in the QC report.

It is my understanding that the definition of a "good" training/validation loss curve varies depending on the project. My concern stems from the fact that the example QC report provided by chromBPNet features a graph with a training loss between 155 and 161 and a gap of only 4. Though all other metrics adhere to the cut-offs, I am wondering if this may have to do with the fact that my pearsonr value for nonpeaks tends to veer towards the lower end (0.13, 0.29, and 0.00 respectively). If so, how would you suggest fixing this issue? I have attached the QC reports below for your convenience.

Model_1_QC.pdf

Thank you! |
Model 1 looks just fine. Note the y-axis of the loss plots: they do not run from 0 to X but span the range of the training and validation losses. You can't really compare loss values between runs directly, and the separation between training and validation loss is not unexpected. It does not indicate overfitting, since both loss curves are decreasing across epochs, and early stopping is implemented to avoid overfitting (i.e., training loss decreasing while validation loss increases). The learned features look just fine. I would go ahead with the model.
-A
|
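(As an aside on the early stopping mentioned above: below is a minimal, generic Keras sketch of that mechanism, using a toy model and random data purely for illustration - it is not chromBPNet's actual training code, and the patience value is a placeholder.)

import numpy as np
import tensorflow as tf

# Toy inputs and a tiny model, only to show the early-stopping behaviour;
# the real chromBPNet architecture and data pipeline are far more involved.
x = np.random.rand(1000, 20).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Training halts once val_loss stops improving, and the best-epoch weights are
# restored - which is why a persistent train/validation gap by itself does not
# imply a badly overfit final model.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stopping])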
Anshul - thank you for this guidance! I went ahead and tried training chromBPNet using biasModel_2. During the training, I received the following error message and am unsure how to resolve it. This is not an error I've run into either when following the tutorial or when training the bias models. Thank you in advance.
|
Hey @kaillahs, you haven't seen this error when running the tutorial, and the HTMLs were generated correctly? |
Hi Anusri - thank you for your reply! I did not see this error when running the tutorial. All files were generated, and the overall report looked good. |
Can you share the command used to run this? |
I am in the process of training two chromBPNet models - one on a full scATAC-seq data set and one trained on a specific population of interest. The above error from Friday came up when training the model on the population of interest using this command:
I also just encountered the following error a couple of minutes ago while trying to train the model on the full ATAC-seq data set using bias model 1, as Anshul suggested.
|
You are receiving the latest error because you do not have write permissions on your /tmp directory, which is being used for sorting your BAM. You can provide an alternate dir for your temp files using the corresponding argument. Are you using a different cluster / machine / environment to do your tutorial runs versus these runs? Is the setup exactly the same? |
I am just realizing that I am running low on disk space - I'll try the alternate temp dir argument once that's sorted out. I ran the tutorial and trained all bias models in Jupyter notebooks using shell commands, but decided to start training models directly in the terminal as the notebooks started crashing and giving errors related to the volume and speed of the output. |
It's unlikely that you would see these errors only on your data and not in the tutorials - unless your setup has changed or the data does not follow the same formatting. But I think the issue is the former; I would try to make sure the setup is the same for your Jupyter environment versus your command line. Also run them in Docker. |
Ok, thank you. I am running into some unrelated software issues, but once that is figured out, I will try to train the tutorial model in the terminal to confirm that the setup works. I'll let you know whether or not it ends up working. |
I haven't set up Docker but will continue to work on that. I just tried running the tutorial in the terminal and ran into a series of warnings followed by the same 'range iterator' error message. I am confused as to why running the code in a Jupyter notebook using shell commands works but running it directly in the shell doesn't...
|
I am running the tutorial in Docker and will let you know if the same error occurs. |
I just finished running the tutorial in Docker and ran into the same error as the one mentioned in issue #201:
It looks like the package on docker installed scipy version 1.13.1 and numpy version 1.23.4. Thank you in advance for your help! |
what installation approach did you use? |
Installed Docker, pulled the Docker image, and then ran the container.
|
I meant how did you install the chrombpnet repo? Can you try this open |
The code is still running, but making that change resolved the nanmean error. It is almost done generating the profile shap scores so I hope it doesn't encounter an error in that step as it did two runs ago. |
The tutorial pipeline ran successfully and the report matched the tutorial report posted on GitHub. I will make sure to go in and apply that edit going forward. Now that I have ensured that my environment is set up correctly, I wanted to refer back to my original post to get your guidance on the following questions:
- Would you recommend training chromBPNet on only our cell population of interest, or on all nuclei derived from our tissue lysate? My QC-filtered dataset has 3580 cells of my population of interest. I'm inclined to train with a subset of my population of interest only, but am unsure whether my sample is sufficiently large for model training and subsequent analysis. In an above comment, Anshul had approved my bias_model_1, which was trained on the whole population. Should I move forward with this model when training the chrombpnet model?
- How should I handle peak calling files? I was unsure how to handle peak calling for the bias pipeline and have been training various versions of bias models accordingly. I have treated (n = 2 bio replicates) and control (n = 1 bio replicate) data available. Each of these files has a peak.bed file associated with it. According to Anusri in issue #117 on GitHub, I should not be using the peak.bed files generated by 10x. Am I correct in understanding that the recommendation is to take the merged.bam file created in the previous step and peak call manually using MACS2? In doing this, I would use the following command:
!macs2 callpeak -t data/downloads/merged.bam -f BAMPE -n "MACS2Peaks" -g "mm" -p 0.01 --shift -75 --extsize 150 --nomodel -B --SPMR --keep-dup all --call-summits --outdir data/downloads/MACS2PeakCallingPE
as recommended by the ENCODE pipeline, with the exception of changing the -f input to "BAMPE" instead of "BAM" as we are working with paired-end data. However, Anshul advised against this change in #176. Does this mean that I should keep the -f argument as "BAM" even though I am working with paired-end data?
- Multiple folds: I would like to confirm my understanding of the usage of multiple folds. As I understand it, I should be creating multiple folds (is there a recommended number?) in the splits folder, each of which contains a different combination of training and validation chromosomes. I would then train a bias model and a chrombpnet model for each fold separately. Later on, when using the tools, I would have to average the bigwig or h5 files before inputting them into a given tool. Please let me know if this sounds right (see the illustrative sketch just after this comment).
Thank you for your guidance! |
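(A purely illustrative sketch of the chromosome-level fold definitions described in the comment above - the chromosome choices, file names, and JSON layout are placeholders, not chromBPNet's required split-file format; check the chromBPNet docs for the exact layout it expects.)

import json

# Hypothetical folds for a mouse genome: each fold holds out different
# chromosomes for testing/validation and trains on the rest.
folds = [
    {"test": ["chr1", "chr3", "chr6"], "valid": ["chr8", "chr16"]},
    {"test": ["chr2", "chr10", "chr13"], "valid": ["chr7", "chr15"]},
    {"test": ["chr4", "chr11", "chr12"], "valid": ["chr9", "chr18"]},
]

all_chroms = [f"chr{i}" for i in range(1, 20)] + ["chrX"]

for i, fold in enumerate(folds):
    held_out = set(fold["test"]) | set(fold["valid"])
    fold["train"] = [c for c in all_chroms if c not in held_out]
    with open(f"fold_{i}.json", "w") as fh:
        json.dump(fold, fh, indent=2)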
I just checked my GPUs with the nvidia-smi command and it looks like everything is installed correctly and compatible (the Docker container and my system are both using CUDA 12.4). Is there a reason that this run is taking so much longer, or is this to be expected? |
This is something you need to figure out based on your hardware. It should not take 19 hours. It is likely that it's not using the GPU. You said you previously got it to run in 4-5 hours. What did you change?
|
I had to change PCs as the old one is having disk space issues and crashing. The 4-5 hour runs were on the old PC via Jupyter notebook or the terminal, whereas I am now using a Docker container on the new PC. Based on issue #95, I think it may be because my PC is using Python 3.12.1, which is keeping me from installing tensorflow-gpu version 2.8.
|
I just tried creating a new container and running the pipeline after ensuring that, within the container, I have CUDA 11.2, cuDNN 8.1, TensorRT 7.2.2, and tensorflow-gpu 2.8, but it looks like it's still only using CPUs. |
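(If it helps with the CPU-only issue above, a quick sanity check - run inside the container - of whether this TensorFlow build actually sees the GPU; these are standard TF 2.x calls, nothing chromBPNet-specific.)

import tensorflow as tf

# Report the TF build and any GPUs visible to it; an empty GPU list means
# training will silently fall back to the CPU.
print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))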
I've successfully trained one chrombpnet model and figured out how to use the various tools. It is my understanding that the outputs are going to be more accurate when using multiple models each trained on different folds and then averaging out their outputs. Should each model be using the same bias model or do I also need to train several separate ones? |
Hi - I've trained 3 models across different folds for the full data set as well as the population of interest. I am hesitant to average out the results because I am unsure of how to check the models' performance before moving forward with their outputs. As of now, I plan on choosing the model with the best performance. However, looking at the individual models, I am unsure of which one to move forward with, as I am concerned about the validation losses only decreasing by around 3%:
Subset_0: 3.8%
Subset_1: 2.4%
Subset_2: 2%
Total_0: 2.6%
Total_1: 2.9%
Total_2: 3.4%
I wanted to check in and see if this is standard for this pipeline or if there is something I should be adjusting on my end, since all other metrics meet the given thresholds. Thank you! |
What are the other performance metrics for the 3 folds (correlation of observed and predicted log counts across test set peaks) as output by the code? Please don't use the loss to evaluate models. The loss values are not necessarily comparable across models or folds, and they are not particularly calibrated against an interpretable upper/lower bound.
|
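(For concreteness, the count-level metric referred to above - correlation of observed and predicted log total counts over test-set peaks - can be computed along these lines; the arrays here are random placeholders standing in for real per-peak counts.)

import numpy as np
from scipy.stats import pearsonr

# Fake per-peak counts, only to show the calculation.
obs_counts = np.random.poisson(lam=200, size=5000).astype(float)
pred_counts = obs_counts * np.random.lognormal(mean=0.0, sigma=0.3, size=5000)

# Pearson correlation of log(1 + counts), analogous to the peaks.pearsonr
# value reported in the chromBPNet QC output.
r, _ = pearsonr(np.log1p(obs_counts), np.log1p(pred_counts))
print(f"pearsonr of log counts: {r:.2f}")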
@akundaje - thank you for the prompt response! The other metrics are fairly constant for all 6 models:
peaks.pearsonr ~ 0.75
peaks.mse ~ 0.25
peaks.median_jsd ~ 0.37 for total, 0.45 for subset
peaks.median_norm_jsd ~ 0.37 for total, 0.32 for subset
average of max profiles ~ 0.002 |
Looks good. You should average predictions and contribution scores across folds rather than using the "best fold".
|
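(A rough sketch of the fold-averaging suggested above; the file paths and dataset key are placeholders - inspect the actual chromBPNet output h5 files for the real names, and make sure all folds cover the same regions in the same order before averaging.)

import h5py
import numpy as np

# Hypothetical per-fold contribution-score files and dataset key.
fold_files = ["fold_0_contribs.h5", "fold_1_contribs.h5", "fold_2_contribs.h5"]
dataset_key = "contrib_scores"  # placeholder; check the real key in your files

per_fold = []
for path in fold_files:
    with h5py.File(path, "r") as fh:
        per_fold.append(fh[dataset_key][:])

# Element-wise mean across folds gives the averaged contribution scores.
averaged = np.mean(np.stack(per_fold, axis=0), axis=0)
print(averaged.shape)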
Ok, thank you! I will average out the results and see what I get. |
Hi,
I am integrating chromBPNet analysis to refine snATAC-seq data (10x Genomics) derived from skeletal muscle lysate. I am running into the following questions when training chromBPNet:
- Is there a minimum number of cells or read depth necessary for training chromBPNet? Our total number of cells pre-QC filtering is 6305 and 4705 post-QC filtering.
- Would you recommend training chromBPNet on only our cell population of interest, or on all nuclei derived from our tissue lysate? My QC-filtered dataset has 3580 cells of my population of interest. I'm inclined to train with a subset of my population of interest only, but am unsure whether my sample is sufficiently large for model training and subsequent analysis.
- How should I handle peak calling files? I have treated (n = 2 bio replicates) and control (n = 1 bio replicate) data available. Each of these files has a peak.bed file associated with it. According to Anusri in issue #117 on GitHub, I should not be using the peak.bed files generated by 10x. Am I correct in understanding that the recommendation is to take the merged.bam file created in the previous step and peak call manually using MACS2? In doing this, I would use the following command:
!macs2 callpeak -t data/downloads/merged.bam -f BAMPE -n "MACS2Peaks" -g "mm" -p 0.01 --shift -75 --extsize 150 --nomodel -B --SPMR --keep-dup all --call-summits --outdir data/downloads/MACS2PeakCallingPE
as recommended by the ENCODE pipeline, with the exception of changing the -f input to "BAMPE" instead of "BAM" as we are working with paired-end data. However, Anshul advised against this change in issue #176. Does this mean that I should keep the -f argument as "BAM" even though I am working with paired-end data?
- Multiple folds: I would like to confirm my understanding of the usage of multiple folds. As I understand it, I should be creating multiple folds (is there a recommended number?) in the splits folder, each of which contains a different combination of training and validation chromosomes. I would then train a bias model and a chrombpnet model for each fold separately. Later on, when using the tools, I would have to average the bigwig or h5 files before inputting them into a given tool. Please let me know if this sounds right.
Thank you!