Compatibility with small genomes #1
Comments
Hello, Sorry not to have answered earlier. I believed I would receive an email when an issue was opened. The "small_genome" option is currently not used. All prokaryotic genomes are treated as small genomes in this version. Since you got results with Porphyromonas gingivalis W83 and Porphyromonas asaccharolytica, I suppose your GenBank files all include the complete nucleotide sequence at the end. Please check the GenBank files for the pairs of genomes that give no results (if no sequence can be extracted, no computation is done and you get no result). The 96-98% similarity threshold for 16S is advice based on my use of the program on Streptomyces, Mycobacterium, Escherichia coli, and Staphylococcus aureus. It is only intended to avoid highly conserved intergenic regions that are not significant. You must also be sure that gene IDs provided by the MBGD database ( Another possibility is that the gene density is so high that you have very small intergenic regions (or even no intergenic regions?). To increase sensitivity, you can change some parameters. At line 245, you can set 4 instead of 8 in
(this is the minimum number of ortholog pairs that must have a given motif upstream for the motif to be considered interesting; this change is not suitable for every genome) Optionally, you can also set the following boolean to 1 at line 220:
If you believe it is a bug, you can run the program with the parameters you used and put at the end of line To check whether you have orthology relationships, verify that Let me know. Best regards
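The GenBank check mentioned in this reply can be done before launching a run. The sketch below is not part of SIGffRid; the function name `genbank_sequence_length` is hypothetical, and the code assumes only the standard GenBank flat-file layout (an `ORIGIN` line followed by numbered sequence lines, with each record closed by `//`).

```python
def genbank_sequence_length(text: str) -> int:
    """Count nucleotides in the ORIGIN section(s) of a GenBank flat file.

    Returns 0 when the file carries annotations only (no sequence).
    """
    in_origin = False
    length = 0
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            in_origin = False
            continue
        if in_origin:
            # Sequence lines look like: "        1 gatcctccat atacaacggt ..."
            # Only letters are counted; position numbers and spaces are skipped.
            length += sum(ch.isalpha() for ch in line)
    return length
```

Reading each GenBank file into a string and passing it to this function should give 0 for files that would yield no extractable sequence, which would be consistent with a run that finishes in about a minute with empty result files.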
Hi, Thank you for getting back to me. I was able to get the program to run successfully with P. gingivalis W83 and P. gingivalis TDC60 after looking at the MBGD file more closely. Apparently the genome files that MBGD uses have different annotations from those of the default/representative genomes from NCBI. My issue now is that the program is taking too long. I had it running for 7 days on a Slurm node with 40 GB of RAM and 16 CPUs (64 cores) requested, but it still ended up timing out. I know that 64 GB of RAM is specified in the readme, but while I was watching it, it never seemed to use more than 25 GB of resident memory. From monitoring the process IDs, it seemed that the last task running was for the motif "www_wwnw". I assumed the extra run time was due to the need to extend this motif. Nonetheless, I do have motifs in my results directory, albeit the list is quite long. Given that these two genomes are from the same species but different strains, would that make them inherently too similar for the program to work on? I have another two genomes I can try that are more distantly related to W83 than TDC60 is, but they are still the same species, just different strains. I had thought about changing the my $nb_mini_occ parameter that you mentioned, but in this case it seems that may not solve the problem of the long run time. Let me know what you think,
Hi, I advise 64 GB, but it can be less. I have no way to know the amount of memory needed in advance (it is not proportional to bacterial genome size or to the number of genes). SIGffRid can produce many motifs, but at the end of the process you obtain two files with these motifs sorted by R (the higher the better; a typical bacterial SFBS will have 0.7 or 0.8 for this value) and by LRT (significance of over-representation in upstream regions; again, the higher the better). The motif "www_wwnw" is the pair of seeds used to find conserved regions in the upstream sequences of orthologous genes. Once occurrences are found, the first seed (here, the one with no "wildcard letter") is extended until the occurrences are significantly over-represented in the upstream regions of the genome (or until there is no sequence left or no significance). Changing $nb_mini_occ is not recommended. If you decrease it, SIGffRid will find many more motifs to extend, so it will take even more time to run. Increasing it is dangerous because, even for a strong SFBS, the initial conserved motifs are rarely very frequent in pairs of upstream regions; 8 is a good balance. I hope I answered your questions.
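To make the trade-off around $nb_mini_occ concrete, here is a toy Python sketch (it is not SIGffRid's own Perl code). It assumes a hypothetical mapping from each seed motif to the set of ortholog pairs that have the motif upstream, and keeps only the motifs supported by at least `nb_mini_occ` pairs. The motif names, the data, and the function name are all invented for illustration.

```python
def motifs_to_extend(support, nb_mini_occ=8):
    """Return motifs found upstream of at least nb_mini_occ ortholog pairs."""
    return sorted(m for m, pairs in support.items() if len(pairs) >= nb_mini_occ)


# Hypothetical support data: motif -> set of ortholog-pair identifiers.
toy_support = {
    "motif_A": {f"pair{i}" for i in range(10)},
    "motif_B": {f"pair{i}" for i in range(5)},
    "motif_C": {f"pair{i}" for i in range(3)},
}

print(motifs_to_extend(toy_support, nb_mini_occ=8))  # ['motif_A']
print(motifs_to_extend(toy_support, nb_mini_occ=4))  # ['motif_A', 'motif_B']
```

In this toy case, lowering the threshold from 8 to 4 lets a second candidate through. Every extra candidate has to go through the extension step, which is why a lower threshold is expected to lengthen the run rather than shorten it.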
Hello, thanks again for your response. I've been able to run the program with a few different related species. I have one last question regarding the output files you mentioned. I'm having some difficulty interpreting the results, specifically knowing which genes are associated with a given motif. When I look in the FINAL_RESULTS directory, I can see the motifs and their R/LRT scores, but I am not sure whether I am checking the correct files to find the list of genes associated with these motifs. To do this, I have been looking at the files under MOTIF_ANNOT_UPST. Is this the right place to look? Thanks
Hello. Please take into account that SIGffRid reports only the first gene downstream of each occurrence. You can have syntenies, meaning that additional genes may be regulated by this potential SFBS even if they are not directly downstream of the regulatory motif.
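One rough way to follow up on this point is to list the genes that sit just after the reported gene on the same strand. The sketch below is only a crude heuristic under stated assumptions, not something SIGffRid does: the `Gene` record, the function name, the gene names, and the 50 bp `max_gap` cutoff are all hypothetical, and gene coordinates are assumed to come from your own annotation.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def candidate_cotranscribed(genes: List[Gene], first: str, max_gap: int = 50) -> List[str]:
    """List same-strand genes that follow `first` with small intergenic gaps."""
    ordered = sorted(genes, key=lambda g: g.start)
    idx = next(i for i, g in enumerate(ordered) if g.name == first)
    strand = ordered[idx].strand
    step = 1 if strand == "+" else -1  # direction of transcription along the genome
    result = []
    i = idx
    while 0 <= i + step < len(ordered):
        cur, nxt = ordered[i], ordered[i + step]
        # Gap between the current gene and the next gene in transcription order.
        gap = (nxt.start - cur.end - 1) if step == 1 else (cur.start - nxt.end - 1)
        if nxt.strand != strand or gap > max_gap:
            break
        result.append(nxt.name)
        i += step
    return result


# Hypothetical annotation: geneA is the gene reported downstream of a motif.
toy = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 420, 900, "+"),
    Gene("geneC", 930, 1500, "+"),
    Gene("geneD", 2000, 2600, "+"),
]
print(candidate_cotranscribed(toy, "geneA"))  # ['geneB', 'geneC']
```

The genes returned are only candidates for further inspection; a short gap on the same strand does not by itself show that a gene is regulated by the motif.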
Hello,
I noticed in sigffrid_cmd_line.pl that there is an option "small_genome" which doesn't appear to ever be used. I'm currently trying to run an analysis with genomes of around 2 Mbases, which I figured would be considered small compared to the genomes in the example analyses shown for the program. I have only been able to get results from the program when using one specific comparison genome, and that one is not 97-98% similar based on 16S comparison (my genome of interest is Porphyromonas gingivalis W83, and the comparison genome that gave results is Porphyromonas asaccharolytica). When running with other comparison genomes (some with greater than 98% 16S similarity, some lower, but all higher than P. asaccharolytica), the program runs for maybe a minute, and after it completes, the main results files are empty. It seems that perhaps the program is able to find motifs for these comparisons, which are then not deemed significant. If this is a result of the smaller genome size, are there any parameters that can or should be adjusted to accommodate it, or is this more likely an issue of % similarity? Any recommendations would be sincerely appreciated!