Compatibility with small genomes #1

Closed
blew42 opened this issue Feb 25, 2022 · 5 comments

@blew42

blew42 commented Feb 25, 2022

Hello,

I noticed in sigffrid_cmd_line.pl that there is an option "small_genome" which doesn't appear to ever be used. I'm currently trying to run an analysis with genomes of around 2 Mbases, which I figured would be considered small compared to the genomes in the example analyses shown for the program. I have only been able to get results when using one specific comparison genome, and that one is not 97-98% similar based on 16S comparison (my genome of interest is Porphyromonas gingivalis W83, and the comparison genome that gave results is Porphyromonas asaccharolytica). When running with other comparison genomes (some above 98% 16S similarity, some below, but all more similar than P. asaccharolytica), the program runs for maybe a minute, and after it completes, the main results files are empty. It seems like perhaps the program is able to find motifs for these comparisons which are then not deemed significant. If this is a result of the smaller genome size, are there any parameters that can or should be adjusted to accommodate it, or is this more likely an issue of % similarity? Any recommendations would be sincerely appreciated!

@FTouzain
Member

FTouzain commented Apr 26, 2022

Hello,

Sorry for not answering earlier. I believed I would receive an email when an issue was opened.

The "small_genome" option is currently not used. All prokaryotic genomes are considered small genomes in this version.

As you got results with Porphyromonas gingivalis W83 and Porphyromonas asaccharolytica, I suppose your GenBank files all include the complete nucleotide sequence at the end. Please check the GenBank files for the pairs of genomes that give no results (if no sequence can be extracted, no computation is done and you get no result).
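A quick way to run this check is sketched below. It is a minimal illustration, not part of SIGffRid: the function name `count_origin_bases` is hypothetical, and it only assumes the standard GenBank flat-file layout, where the sequence follows an `ORIGIN` line and the record ends with `//`.

```python
def count_origin_bases(lines):
    """Count sequence letters found between ORIGIN and // in GenBank lines."""
    total = 0
    in_origin = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True          # sequence block starts on the next line
        elif line.startswith("//"):
            in_origin = False         # end of the current record
        elif in_origin:
            # Sequence lines look like "       61 acgtacgtac gtacgtacgt";
            # only the letters are counted, not the position number.
            total += sum(1 for ch in line if ch.isalpha())
    return total
```

Calling it on an open file handle, for example `count_origin_bases(open("genome.gbk"))` with a placeholder file name, returns 0 when the file carries annotations only and no sequence.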

The 96-98% similarity threshold for 16S is advice based on my use of the program on Streptomyces, Mycobacterium, Escherichia coli and Staphylococcus aureus. It is only intended to avoid highly conserved intergenic regions, where conservation is not significant
(the hypothesis of SIGffRid is that regulatory motifs are more conserved than other intergenic sequences).

You must also make sure that the gene IDs provided by the MBGD database (ortho....txt) are the ones found in the "locus_tag" fields of the related GenBank file, so that the intergenic upstream regions of genes can be extracted by identifying orthologous genes. (After a reannotation, the IDs may differ.)
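The ID comparison can be sketched as follows. This is only an illustration under assumptions: the helper names are hypothetical, the ortholog IDs are passed in as a plain list because the layout of the ortho....txt file is not described here, and the GenBank side relies only on the standard `/locus_tag="..."` qualifier.

```python
import re

# Standard GenBank qualifier: /locus_tag="SOME_ID"
LOCUS_TAG = re.compile(r'/locus_tag="([^"]+)"')

def locus_tags(genbank_text):
    """Return the set of locus_tag values found in GenBank flat-file text."""
    return set(LOCUS_TAG.findall(genbank_text))

def missing_ids(ortholog_ids, genbank_text):
    """Return, sorted, the ortholog IDs that have no matching locus_tag."""
    return sorted(set(ortholog_ids) - locus_tags(genbank_text))
```

If `missing_ids` returns most of the IDs, the ortholog table and the GenBank file probably come from different annotations.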

Another possibility is that the gene density is so high that you have very small intergenic regions (or even no intergenic regions?).
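To get a rough idea of intergenic region sizes, the gaps between annotated genes can be computed from their coordinates. The sketch below is a simplification under assumptions: `intergenic_lengths` is a hypothetical helper, coordinates are 1-based inclusive (start, end) pairs as in GenBank, and strand is ignored, so it does not reproduce how SIGffRid itself extracts upstream regions.

```python
def intergenic_lengths(gene_intervals):
    """Return gap lengths between consecutive genes, merging overlapping genes."""
    gaps = []
    furthest_end = None
    for start, end in sorted(gene_intervals):
        if furthest_end is not None and start > furthest_end + 1:
            # Positions furthest_end+1 .. start-1 belong to no gene.
            gaps.append(start - furthest_end - 1)
        furthest_end = end if furthest_end is None else max(furthest_end, end)
    return gaps
```

A list made almost entirely of very short gaps would support the high gene density explanation.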

To increase sensitivity, you can change some parameters in the Ftest_rech...pl program.

On line 245, you can set 4 instead of 8 in

my $nb_mini_occ   = 8;

(this is the minimum number of ortholog pairs that must have a given motif upstream for the motif to be considered interesting; lowering it is not appropriate for every genome)

Optionally, you can set the following boolean to 1 on line 220:

my $bool_fixed_motif_threshold = 0;

If you believe it is a bug, you can run the program with the parameters you used and append the following to the end of the command line:
> debug.txt 2>&1
This creates a debug.txt log file that you can send to me (fabrice.touzain@ with the domain anses.fr)

To check that you have orthology relationships, verify that the ortho_...txt file is not empty.
You must have non-empty RMES_...txt files (first search for slightly over-represented words).
You must have non-empty markov_mod_order3...txt files (Markov model computation for the statistics used in motif extension).
You must have many non-empty not_fasta_...txt files (used for various statistics and extracted from the GenBank files).
If one of these conditions is not met, there was a problem either in the parameters or in running the processes.
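These checks can be scripted. The sketch below is an assumption-laden illustration: the glob patterns simply replace the elided "..." in the file names with `*`, all files are assumed to sit in one directory, and `check_intermediate_files` is a hypothetical name.

```python
from pathlib import Path

# Assumed patterns: the elided part of each file name is replaced by "*".
EXPECTED_PATTERNS = [
    "ortho_*.txt",
    "RMES_*.txt",
    "markov_mod_order3*.txt",
    "not_fasta_*.txt",
]

def check_intermediate_files(directory, patterns=EXPECTED_PATTERNS):
    """Map each pattern to 'missing', 'empty' (any zero-byte match) or 'ok'."""
    report = {}
    for pattern in patterns:
        files = list(Path(directory).glob(pattern))
        if not files:
            report[pattern] = "missing"
        elif any(f.stat().st_size == 0 for f in files):
            report[pattern] = "empty"
        else:
            report[pattern] = "ok"
    return report
```

Any pattern reported as "missing" or "empty" points to the step that failed.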

Let me know. Best regards

@FTouzain FTouzain self-assigned this Apr 26, 2022
@FTouzain FTouzain added help wanted Extra attention is needed question Further information is requested labels Apr 26, 2022
@blew42
Author

blew42 commented May 18, 2022

Hi,

Thank you for getting back to me. I was able to successfully get the program to run with P.gingivalis W83 and P.gingivalis TDC60 after looking at the MBGD file more closely. Apparently the genome files that MBGD uses have different annotations than those of the default/representative genomes from NCBI.

My issue now seems to be that, in fact, the program is taking too long. I had it running for 7 days on a slurm node with 40 GB of RAM and 16 cpus (64 cores) requested, but it still ended up timing out. I know that 64 GB of RAM is specified in the readme, but while I was watching it, it never seemed to use more than 25 GB of resident memory. From monitoring the process IDs, it seemed like the last task which was running was for the motif "www_wwnw". I assumed this extra length was due to the need for extension of this motif. Nonetheless, I do have motifs in my results directory, albeit the list is quite long.

Given that these two genomes are of the same species, but different strains, would that make them inherently too similar for the program to work on? I have another two genomes which I can try to use that are more distantly related to W83 than TDC60 is, but they are still the same species, just different strains. I had thought about changing the my $nb_mini_occ parameter which you had mentioned, but in this case, it seems like that may not solve the issue of it taking too long to run.

Let me know what you think,
Thanks

@FTouzain
Member

Hi,

I advise 64 GB, but less can be enough. I have no way to know the required amount of memory in advance (it is not proportional to bacterial genome size or to the number of genes).
A SIGffRid run is very long and can take 14 days.
Yes, I think you must not use different strains of the same bacterial species (first, in this case, what are orthologs?). Sequences will be too similar even in intergenic regions and you will get many spurious matches.
The best choice is a pair of bacterial species close enough to share the same regulatory mechanisms, transcription factors and motifs, and distant enough not to have similar intergenic regions.

SIGffRid can provide many motifs but, at the end of the process, you obtain two files with these motifs sorted by R (the higher the better; a typical bacterial SFBS will have 0.7 or 0.8 for this value) and by LRT (significance of over-representation in upstream regions; again, the higher the better), respectively.

The motif "www_wwnw" is the pair of seeds used to find conserved regions in upstream sequences of orthologous genes. Once occurrences are found, the first seed (the first one here because it contains no wildcard letter) is extended until the occurrences are significantly over-represented in upstream regions of the genome (or until there is no more sequence or no significance).
(Seed similarity is evaluated on pairs of upstream sequences; extension is done independently for each bacterium and can therefore find a distinct motif in each bacterium.)
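To illustrate how permissive such seeds are, the sketch below finds the occurrences of a single seed in a sequence. It assumes the seed letters follow the IUPAC nucleotide codes (w = a or t, n = any base), which the thread does not state explicitly. The function `seed_positions` is hypothetical, and the way SIGffRid combines the two seeds around the "_" is not modeled.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}

def seed_positions(seed, sequence):
    """Return 0-based start positions of all (overlapping) seed occurrences."""
    body = "".join(IUPAC[ch] for ch in seed.lower())
    # A lookahead matches without consuming, so overlapping hits are reported.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(sequence.lower())]
```

For example, `seed_positions("www", "gattaca")` returns `[1, 2]` and `seed_positions("wwnw", "gattaca")` returns `[1, 3]`. Seeds this short and degenerate match very often in AT-rich regions, which is consistent with a long extension step.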

Changing $nb_mini_occ is not recommended. If you decrease it, SIGffRid will find many more motifs to extend and will therefore take even more time to run. (Increasing it is dangerous because, even for a strong SFBS, the initial conserved motifs are rarely very frequent in pairs of upstream regions; 8 is a good balance.)

I hope I answered your questions.

@blew42
Author

blew42 commented Jun 6, 2022

Hello, thanks again for your response. I've been able to run the program with a few different related species. I had one last question regarding the output files you mentioned. I'm having some difficulty with looking at the results, specifically knowing which genes are involved with a certain motif. When I look in the FINAL_RESULTS directory, I can see the motifs and their R/LRT scores, but I am not sure if I am checking the correct files to verify the list of genes which are associated with these motifs. To do this, I have been looking at the files under MOTIF_ANNOT_UPST - is this the right place to look?

Thanks

@FTouzain
Member

FTouzain commented Jun 6, 2022

Hello.
You're welcome.
Yes, it is. Annotations of downstream genes are in the MOTIF_ANNOT_UPST directory, in the file
annot_motif[IDbact]_[motif].txt

Please take into account that SIGffRid reports only the first gene downstream of each occurrence, but you can have syntenies, meaning that additional genes may be regulated by this potential SFBS even if they are not directly downstream of the regulatory motif.
I hope this helps.

@FTouzain FTouzain pinned this issue Jul 12, 2023