v1.3.2: Fixed an issue with the hashing of the StORF IDs.
The hashes of StORF IDs could differ between the GFF/DNA outputs and the AA output. The hash is now computed from the input GFF filename or, for a Pyrodigal run, from the input FASTA filename.
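A minimal sketch of why the hashes could diverge, assuming (as the removed lines in the diff suggest) that the old code hashed the stem of each output filename. The helper `storf_hash`, the example StORF ID and the input name `Prokka_E-coli.gff` are hypothetical and used only for illustration; the `shake_256(...).hexdigest(8)` call and the two output filenames are taken from this commit.

```python
import hashlib


def storf_hash(file_name: str, storf_id: str) -> str:
    """Hash '<filename stem>_<StORF ID>' the way the diff does (16 hex characters)."""
    stem = file_name.split('.')[0]  # everything before the first '.'
    to_hash = stem + '_' + storf_id
    return hashlib.shake_256(to_hash.encode()).hexdigest(8)


# Hypothetical StORF ID in the '<contig>_UR_<...>' shape used by the diff.
storf_id = 'contig_1_UR_100_200_StORF_0'

# Old behaviour (assumed): each writer hashed its own output filename,
# so the AA FASTA ('..._aa.fasta') produced a different stem from the GFF.
old_gff = storf_hash('Prokka_E-coli_StORF-Reporter_Extended.gff', storf_id)
old_aa = storf_hash('Prokka_E-coli_StORF-Reporter_Extended_aa.fasta', storf_id)

# New behaviour: the hash is keyed on the single input filename
# (hypothetical name here), computed once and reused for every output.
shared = storf_hash('Prokka_E-coli.gff', storf_id)
new_gff = shared
new_aa = shared

print(old_gff == old_aa)  # stems differ, so the hashes differ
print(new_gff == new_aa)  # same input stem, same hash
```

Keying the hash on the input rather than on each output means the identifier no longer depends on which output formats were requested.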
NickJD committed Feb 7, 2024
1 parent 7915c28 commit a89e774
Showing 8 changed files with 46,066 additions and 88,731 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -64,7 +64,7 @@ usage: StORF_Reporter.py [-h]
[-olap_filt [{none,single-strand,both-strand}]] [-start_filt {True,False}] [-so [{start_pos,strand}]] [-f_type [{StORF,CDS,ORF}]] [-olap OVERLAP_NT] [-ao ALLOWED_OVERLAP] [-overwrite {True,False}]
[-verbose {True,False}] [-v]

StORF-Reporter v1.3.1: StORF-Reporter Run Parameters.
StORF-Reporter v1.3.2: StORF-Reporter Run Parameters.

Required Options:
-anno [{Prokka,Bakta,Out_Dir,Multiple_Out_Dirs,Single_GFF,Multiple_GFFs,Ensembl,Feature_Types,Single_Genome,Multiple_Genomes,Single_Combined_GFF,Multiple_Combined_GFFs,Pyrodigal,Single_FASTA,Multiple_FASTA} ...]
@@ -174,7 +174,7 @@ usage: StORF_Extractor.py [-h] [-storf_input {Combined,Separate}] [-p PATH] [-gf
[-lw {True,False}] [-stop_ident {True,False}] [-oname O_NAME] [-odir O_DIR] [-gz {True,False}]
[-verbose {True,False}] [-v]

Single_Genome v1.3.1: StORF-Extractor Run Parameters.
Single_Genome v1.3.2: StORF-Extractor Run Parameters.

Required Arguments:
-storf_input {Combined,Separate}
@@ -210,7 +210,7 @@ usage: StORF_Finder.py [-h] [-f FASTA] [-ua {True,False}] [-wc {True,False}] [-p
[-stop_ident {True,False}] [-f_type [{StORF,CDS,ORF}]] [-minorf MIN_ORF] [-maxorf MAX_ORF] [-codons STOP_CODONS] [-olap OVERLAP_NT] [-s SUFFIX] [-so [{start_pos,strand}]] [-spos {True,False}] [-oname O_NAME] [-odir O_DIR] [-gff {True,False}] [-aa {True,False}] [-aa_only {True,False}]
[-lw {True,False}] [-gff_fasta {True,False}] [-gz {True,False}] [-verbose {True,False}] [-v]

StORF-Reporter v1.3.1: StORF-Finder Run Parameters.
StORF-Reporter v1.3.2: StORF-Finder Run Parameters.

Required Arguments:
-f FASTA Input FASTA File - (UR_Extractor output)
@@ -274,7 +274,7 @@ StORF-Extractor -storf_input Combined -p .../Test_Datasets/Combined_GFFs/E-coli_
```python
usage: StORF_Extractor.py [-h] [-storf_input {Combined,Separate}] [-p PATH] [-gff_out {True,False}] [-oname O_NAME] [-odir O_DIR] [-gz {True,False}] [-verbose {True,False}] [-v]

StORF-Reporter v1.3.1: StORF-Extractor Run Parameters.
StORF-Reporter v1.3.2: StORF-Extractor Run Parameters.

Required Arguments:
-storf_input {Combined,Separate}
@@ -307,7 +307,7 @@ StORF-Remover -gff .../Test_Datasets/StORF_Extractor_And_Remover/Myco_UR_StORF-R
usage: StORF_Remover.py [-h] [-gff GFF] [-blast BLAST] [-min_score MINSCORE] [-oname O_NAME] [-odir O_DIR] [-gz {True,False}]
[-verbose {True,False}] [-v]

StORF-Reporter v1.3.1: UR-Extractor Run Parameters.
StORF-Reporter v1.3.2: UR-Extractor Run Parameters.

Required Arguments:
-gff GFF GFF annotation file for the FASTA
85,592 changes: 0 additions & 85,592 deletions Test_Datasets/Prokka_E-coli/Prokka_E-coli_StORF-Reporter_Extended.fasta

This file was deleted.

6,192 changes: 3,096 additions & 3,096 deletions Test_Datasets/Prokka_E-coli/Prokka_E-coli_StORF-Reporter_Extended.gff

Large diffs are not rendered by default.

35,535 changes: 35,535 additions & 0 deletions Test_Datasets/Prokka_E-coli/Prokka_E-coli_StORF-Reporter_Extended_aa.fasta

Large diffs are not rendered by default.

7,382 changes: 7,382 additions & 0 deletions Test_Datasets/Pyrodigal/E-coli_Pyrodigal_StORF-Reporter_Extended.gff

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,6 +1,6 @@
[metadata]
name = StORF-Reporter
version = v1.3.1
version = v1.3.2
author = Nicholas Dimonaco
author_email = [email protected]
description = StORF-Reporter - A a tool that takes an annotated genome and returns missing CDS genes (Stop-to-Stop) from unannotated regions.
2 changes: 1 addition & 1 deletion src/StORF_Reporter/Constants.py
@@ -1 +1 @@
StORF_Reporter_Version = 'v1.3.1'
StORF_Reporter_Version = 'v1.3.2'
82 changes: 46 additions & 36 deletions src/StORF_Reporter/StORF_Reporter.py
@@ -59,6 +59,23 @@ def get_directory_names(path):
return directories
############################

def compute_hash(StORF, Reporter_options,track_contig):
### Compute the hash/locus tag here
ID = track_contig + '_UR_' + StORF[0] + '_' + StORF[10] + '_' + str(StORF[9])
try:
to_hash = Reporter_options.gff.split('.')[0] + '_' + ID # create unique hash from outfile name and ID
except AttributeError: # Will go here for Pyrodigal
to_hash = Reporter_options.fasta.split('.')[
0] + '_' + ID # create unique hash from outfile name and ID
if len(StORF[11].split(',')) >= 3: # fix for con-storfs - TBD
StORF_Type = 'Con-StORF_'
else:
StORF_Type = 'StORF_'
StORF_Hash = StORF_Type + hashlib.shake_256(to_hash.encode()).hexdigest(8)
return StORF_Hash, ID


############################
def get_outfile_name(Reporter_options):
if Reporter_options.o_dir == None and Reporter_options.o_name == None and Reporter_options.alt_filename == None:
if Reporter_options.annotation_type[1] in ['Out_Dir', 'Multiple_Out_Dirs']:
@@ -125,13 +142,7 @@ def get_outfile_name(Reporter_options):

############################

def GFF_StORF_write(Reporter_options, track_contig, gff_out, StORF, StORF_Num): # Consistency in outfile
ID = track_contig + '_UR_' + StORF[0] + '_' + StORF[10] + '_'+ str(StORF[9])
try:
to_hash = gff_out.split('.')[0] + '_' + ID # create unique hash from outfile name and ID
except AttributeError:
to_hash = gff_out.name.split('.')[0] + '_' + ID # create unique hash from outfile name and ID
locus_tag = hashlib.shake_256(to_hash.encode()).hexdigest(8)
def GFF_StORF_write(Reporter_options, track_contig, gff_out, StORF, StORF_Num, StORF_Hash, ID): # Consistency in outfile
### Write out new GFF entry -
strand = StORF[7]
start = StORF[3]
@@ -171,23 +182,16 @@ def GFF_StORF_write(Reporter_options, track_contig, gff_out, StORF, StORF_Num):
StORF_length = int(gff_stop) + 1 - int(gff_start) # +1 to adjust for base-1

gff_out.write(track_contig + '\tStORF-Reporter\t' + Reporter_options.feature_type + '\t' + str(gff_start) + '\t' + str(gff_stop) + '\t.\t' +
StORF[7] + '\t0\t' + StORF_Type + locus_tag + ';locus_tag=' + ID + ';INFO=Additional_Annotation_StORF-Reporter;UR_Stop_Locations=' + StORF[11].replace(',','-') + ';Name=' +
StORF[7] + '\t0\tID=' + StORF_Hash + ';locus_tag=' + ID + ';INFO=Additional_Annotation_StORF-Reporter;UR_Stop_Locations=' + StORF[11].replace(',','-') + ';Name=' +
StORF[10] + '_' + str(StORF_Num) + ';' + StORF[10] + '_Num_In_UR=' + str(StORF[9]) + ';' + StORF[10] + '_Length=' + str(StORF_length) + ';' + StORF[10] +
'_Frame=' + str(frame) + ';UR_' + StORF[10] + '_Frame=' + str(StORF[6]) + ';Start_Stop=' + start_stop + ';Mid_Stops=' + mid_stop + ';End_Stop='
+ end_stop + ';StORF_Type=' + StORF[10] + '\n')


def FASTA_StORF_write(Reporter_options, track_contig, fasta_out, StORF): # Consistency in outfile
## This should compute the same hash as the one for GFF_Write - Should not be computed twice though
ID = track_contig + '_UR_' + StORF[0] + '_' + StORF[10] + '_'+ str(StORF[9])
to_hash = fasta_out.name.split('.')[0] + '_' + ID # create unique hash from outfile name and ID
if len(StORF[11].split(',')) >= 3: # fix for con-storfs - TBD
StORF_Type = 'Con-StORF_'
else:
StORF_Type = 'StORF_'
locus_tag = StORF_Type + hashlib.shake_256(to_hash.encode()).hexdigest(8)
def FASTA_StORF_write(Reporter_options, fasta_out, StORF, StORF_Hash): # Consistency in outfile

### Wrtie out new FASTA entry - Currently only write out as nt
fasta_out.write('>'+locus_tag+'\n')
fasta_out.write('>'+StORF_Hash+'\n')
sequence = StORF[-1]
if Reporter_options.translate == True:
sequence = translate_frame(sequence[0:])
@@ -204,7 +208,7 @@ def FASTA_StORF_write(Reporter_options, track_contig, fasta_out, StORF): # Cons
storf_fasta_outfile = open(fasta_out.name.replace('.fasta','_StORFs_Only.fasta'),'a')
else:
storf_fasta_outfile = open(fasta_out.name.replace('.fasta.gz','_StORFs_Only.fasta.gz'),'a')
storf_fasta_outfile.write('>' + locus_tag + '\n')
storf_fasta_outfile.write('>' + StORF_Hash + '\n')
if Reporter_options.line_wrap == True:
wrapped = textwrap.wrap(sequence, width=60)
for wrap in wrapped:
@@ -567,9 +571,11 @@ def StORF_Filler(Reporter_options, Reported_StORFs):
fasta_outfile.write(Original_Seq+'\n')
if StORFs:
for StORF in StORFs:
GFF_StORF_write(Reporter_options, track_contig, Reporter_options.gff_outfile, StORF, StORF_Num) # To keep consistency
###Compute hash/locus tag
StORF_Hash, ID = compute_hash(StORF,Reporter_options, track_contig)
GFF_StORF_write(Reporter_options, track_contig, Reporter_options.gff_outfile, StORF, StORF_Num, StORF_Hash, ID) # To keep consistency
if (Reporter_options.annotation_type[1] in ['Out_Dir', 'Multiple_Out_Dirs'] or Reporter_options.storfs_out == True):
FASTA_StORF_write(Reporter_options, track_contig, fasta_outfile, StORF)
FASTA_StORF_write(Reporter_options, fasta_outfile, StORF, StORF_Hash)
StORF_Num += 1
if line != written_line:
Reporter_options.gff_outfile.write(line.strip()+'\n')
@@ -579,9 +585,11 @@ def StORF_Filler(Reporter_options, Reported_StORFs):
StORFs = find_after_StORFs(Reporter_options, Contig_URS, track_prev_start, track_prev_stop, track_prev_contig) # Changed to prev stop because we are switching from previous contig
if StORFs:
for StORF in StORFs:
GFF_StORF_write(Reporter_options, track_prev_contig, Reporter_options.gff_outfile, StORF, StORF_Num) # To keep consistency
###Compute hash/locus tag
StORF_Hash, ID = compute_hash(StORF,Reporter_options, track_contig)
GFF_StORF_write(Reporter_options, track_prev_contig, Reporter_options.gff_outfile, StORF, StORF_Num, StORF_Hash, ID) # To keep consistency
if Reporter_options.annotation_type[1] in ['Out_Dir', 'Multiple_Out_Dirs'] or Reporter_options.storfs_out == True:
FASTA_StORF_write(Reporter_options, track_contig, fasta_outfile, StORF)
FASTA_StORF_write(Reporter_options, fasta_outfile, StORF, StORF_Hash)
StORF_Num += 1
Reporter_options.gff_outfile.write(line.strip() + '\n')

@@ -764,8 +772,8 @@ def main():
else:
exit('StORF-Reporter: error: the following arguments are required: -anno, -p')

if Reporter_options.translate == True and Reporter_options.storfs_out == False and Reporter_options.annotation_type[1] != 'Multiple_Out_Dirs':
exit('StORF-Reporter "-sout True" is required when "-aa True" is selected')
if Reporter_options.translate == True and Reporter_options.storfs_out == False and Reporter_options.annotation_type[1] not in ['Out_Dir', 'Multiple_Out_Dirs']:
exit('StORF-Reporter: "-sout True" is required when "-aa True" is selected')


print("Thank you for using StORF-Reporter -- A detailed user manual can be found at https://github.com/NickJD/StORF-Reporter\n"
@@ -809,15 +817,18 @@ def main():
if Reporter_options.annotation_type[0] in ('Prokka','Bakta') and Reporter_options.annotation_type[1] == 'Out_Dir':
Reporter_options.output_file = output_file
#### Checking and cleaning
for fname in os.listdir(Reporter_options.path):
if '_StORF-Reporter_Extended' in fname and Reporter_options.overwrite == False:
parser.error(
'Prokka/Bakta directory not clean and already contains a StORF-Reporter output. Please delete or use "-overwrite True" and try again.')
elif '_StORF-Reporter_Extended' in fname and Reporter_options.overwrite == True:
file_path = os.path.join(Reporter_options.path, fname)
os.remove(file_path)
if Reporter_options.verbose == True:
print('StORF-Reporter output ' + fname + ' will be overwritten.')
try:
for fname in os.listdir(Reporter_options.path):
if '_StORF-Reporter_Extended' in fname and Reporter_options.overwrite == False:
parser.error(
'Prokka/Bakta directory not clean and already contains a StORF-Reporter output. Please delete or use "-overwrite True" and try again.')
elif '_StORF-Reporter_Extended' in fname and Reporter_options.overwrite == True:
file_path = os.path.join(Reporter_options.path, fname)
os.remove(file_path)
if Reporter_options.verbose == True:
print('StORF-Reporter output ' + fname + ' will be overwritten.')
except FileNotFoundError:
sys.exit("Incorrect file path '" + Reporter_options.path + "' - Please check input")
####
Reporter_options.gene_ident = "misc_RNA,gene,mRNA,CDS,rRNA,tRNA,tmRNA,CRISPR,ncRNA,regulatory_region,oriC,pseudo"
Contigs, Reporter_options = run_UR_Extractor_Directory(Reporter_options)
@@ -828,7 +839,7 @@ def main():
print("Finished: " + Reporter_options.gff.split(os.sep)[-1])

############## Setup for Multi[;e Prokka/Bakta output directories
if Reporter_options.annotation_type[0] in ('Prokka','Bakta') and Reporter_options.annotation_type[1] == 'Multiple_Out_Dirs':
elif Reporter_options.annotation_type[0] in ('Prokka','Bakta') and Reporter_options.annotation_type[1] == 'Multiple_Out_Dirs':
fixed_path = Reporter_options.path # So we can modify the path variable later
directories = get_directory_names(fixed_path)
for directory in directories:
@@ -948,7 +959,6 @@ def main():
else:
Reporter_options.output_file = output_file


if Reporter_options.verbose == True:
print("Starting: " + str(gff.split(os.sep)[-1]))
if Reporter_options.annotation_type[1] in ('Single_Combined_GFF', 'Multiple_Combined_GFFs'):
