
Refactor of read-bioinfo-metadata module for tool compatibility expansion #258

Merged: 59 commits, Apr 16, 2024
Commits
59 commits
24c73ae
added yml containing bioinfo results pattern
Daniel-VM Feb 27, 2024
ef3c161
enable to parse yaml config file for bioinfo metadata
Daniel-VM Feb 27, 2024
56d6039
add automated scanning of files in a directory
Daniel-VM Feb 29, 2024
64636d0
add validation steps on required files
Daniel-VM Feb 29, 2024
3e05aa7
organize bioinfo config settings according to software
Daniel-VM Mar 4, 2024
1221083
Added methods for scanning bioinfo results files regardless they sub-…
Daniel-VM Mar 11, 2024
03b49ac
Fixed divergent branch (origin - develop)
Daniel-VM Mar 11, 2024
408dd6d
update scanning folder method to detect both required and non-require…
Daniel-VM Mar 12, 2024
9c238f2
rm version scope and rename property field names in bioinfo config
Daniel-VM Mar 13, 2024
01f33a5
dummy changes - rename method, add todos and update comments
Daniel-VM Mar 14, 2024
aaf9902
add validation required files method
Daniel-VM Mar 14, 2024
255dbe1
rename variables for better understanding
Daniel-VM Mar 14, 2024
f3a5f99
added file format handlers and new method to add bioinfo metadata
Daniel-VM Mar 15, 2024
4a944ca
add method to handle program versions info in multiqc_report.html
Daniel-VM Mar 22, 2024
04f0010
forgot to remove old method to parse sw versions
Daniel-VM Mar 22, 2024
49ea8bd
update fixed values module and minor code changes
Daniel-VM Mar 22, 2024
7bc6d93
remove old unused methods that are now already embebed into map_metad…
Daniel-VM Mar 22, 2024
952b5d8
renaming variables for better understanding
Daniel-VM Mar 22, 2024
ecc68a5
added workaround to generalize parsing of sample-specific files
Daniel-VM Mar 22, 2024
adc0934
fix typo in config and remove todo
Daniel-VM Mar 25, 2024
3b9adba
fix assignment of variable in scope consensus
Daniel-VM Mar 25, 2024
2244890
fix remaining old methods after code refactoring
Daniel-VM Mar 25, 2024
cf8f495
remove tqdm dependency
Daniel-VM Mar 25, 2024
fd33f3a
clean unused method for required files and update method's name and d…
Daniel-VM Mar 25, 2024
3c92550
add method to vaildate software name
Daniel-VM Mar 25, 2024
a30e38e
rename input variables to avoid variable naming collision
Daniel-VM Mar 26, 2024
03e8639
added method to report process log and update methods: scann, add fil…
Daniel-VM Mar 26, 2024
e63c905
improve method add_bioinfo_results_metadata and add it feature to upd…
Daniel-VM Mar 27, 2024
9c1dc56
fix stderr.print identation and minor update of method description
Daniel-VM Mar 27, 2024
b6fa530
allow regex in scann_directory
Daniel-VM Apr 1, 2024
17b91c0
add function to select the most recent file of a sample given a list …
Daniel-VM Apr 2, 2024
6243562
add new strategy to: first, handle files; and second, mapping bioinfo…
Daniel-VM Apr 3, 2024
70a8148
refactor methods pangolin, consensus, and software version to adhere …
Daniel-VM Apr 3, 2024
d4add27
fix method handle_consensus_fasta
Daniel-VM Apr 4, 2024
7c4377b
exported method select-most-recent-file from read-bioinfo-metadata to…
Daniel-VM Apr 4, 2024
204c188
add todos and missing imported library
Daniel-VM Apr 4, 2024
79c49a0
recover independent management of software-config and files-found var…
Daniel-VM Apr 4, 2024
3e3eac1
remove method and config items that are not necessary anymore
Daniel-VM Apr 4, 2024
db8f57b
add new methods to show progress log report
Daniel-VM Apr 5, 2024
0da2f66
update methods description, rdelete accomplished todos and add small …
Daniel-VM Apr 5, 2024
1e78a24
recover method from older version to parse viralrecon long table
Daniel-VM Apr 5, 2024
97d5e6a
adaptation of method parse_long_table to refactored module
Daniel-VM Apr 5, 2024
ddd5522
1-relocate module that adds custom long-table data. 2-Now j_data is e…
Daniel-VM Apr 5, 2024
8581669
mv report logs methods to the new class BioinfoReportLog
Daniel-VM Apr 5, 2024
3682008
pipeline-specific methods are now invoked from assets/pipeline_utils
Daniel-VM Apr 6, 2024
d84358b
improve description and remove already accomplished todo tasks
Daniel-VM Apr 6, 2024
d9522a2
Merge remote-tracking branch 'upstream/develop' into develop
Daniel-VM Apr 6, 2024
2ba03f7
fix flake8 linting
Daniel-VM Apr 7, 2024
ddfc9b5
fix black linting
Daniel-VM Apr 7, 2024
4696d68
arranged methods and added new coments
Daniel-VM Apr 7, 2024
d8b7229
send remaining pipeline-specific file to assets
Daniel-VM Apr 7, 2024
f2b152c
Added method for appending bioinfo filepaths to readlab metadata, rep…
Daniel-VM Apr 7, 2024
5a5bf1f
fix missing mapping key in handle_consensus_fasta
Daniel-VM Apr 7, 2024
f3563ef
replace sample mapping key according to #240
Daniel-VM Apr 7, 2024
8324482
add image summarizing the process of read-bioinfo-metadata
Daniel-VM Apr 8, 2024
bdbb1d3
Merge branch 'develop' into develop
Daniel-VM Apr 8, 2024
1f28585
add reviewer sugestions #258
Daniel-VM Apr 9, 2024
2de9f15
added fixed and remaining reviewer suggestions on #258
Daniel-VM Apr 15, 2024
27db5f7
dummy change to fix rebarse-merge problem
Daniel-VM Apr 16, 2024
12 changes: 8 additions & 4 deletions relecov_tools/__main__.py
@@ -10,13 +10,13 @@
 import rich.traceback
 
 import relecov_tools.utils
+import relecov_tools.assets.pipeline_utils.viralrecon
 import relecov_tools.read_lab_metadata
 import relecov_tools.sftp_handle
 import relecov_tools.json_validation
 import relecov_tools.map_schema
 import relecov_tools.feed_database
 import relecov_tools.read_bioinfo_metadata
-import relecov_tools.long_table_parse
 import relecov_tools.metadata_homogeneizer
 import relecov_tools.gisaid_upload
 import relecov_tools.upload_ena_protocol
@@ -418,12 +418,16 @@ def update_db(user, password, json, type, databaseServer):
 )
 @click.option("-i", "--input_folder", type=click.Path(), help="Path to input files")
 @click.option("-o", "--out_dir", type=click.Path(), help="Path to save output file")
-def read_bioinfo_metadata(json_file, input_folder, out_dir):
+@click.option("-s", "--software_name", help="Name of the software/pipeline used.")
+def read_bioinfo_metadata(json_file, input_folder, out_dir, software_name):
     """
     Create the json compliant from the Bioinfo Metadata.
     """
     new_bioinfo_metadata = relecov_tools.read_bioinfo_metadata.BioinfoMetadata(
-        json_file, input_folder, out_dir
+        json_file,
+        input_folder,
+        out_dir,
+        software_name,
     )
 
     new_bioinfo_metadata.create_bioinfo_file()
@@ -439,7 +443,7 @@ def read_bioinfo_metadata(json_file, input_folder, out_dir):
 )
 @click.option("-o", "--output", type=click.Path(), help="Path to save json output")
 def long_table_parse(longtable_file, output):
-    new_json_parse = relecov_tools.long_table_parse.LongTableParse(
+    new_json_parse = relecov_tools.assets.pipeline_utils.viralrecon.LongTableParse(
         longtable_file, output
     )
     """Create Json file from long table"""
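The hunk above wires the new `-s/--software_name` option into the `read-bioinfo-metadata` command so the tool knows which pipeline section of the config to use. As a rough sketch of the expanded interface (using stdlib argparse rather than the click decorators the module actually uses, and with made-up file paths):

```python
import argparse

def build_parser():
    # Sketch: mirrors the click options on the expanded command; the real CLI
    # in relecov_tools/__main__.py is click-based, not argparse-based.
    p = argparse.ArgumentParser(prog="read-bioinfo-metadata")
    p.add_argument("-j", "--json_file", help="json file containing lab metadata")
    p.add_argument("-i", "--input_folder", help="Path to input files")
    p.add_argument("-o", "--out_dir", help="Path to save output file")
    p.add_argument("-s", "--software_name", help="Name of the software/pipeline used.")
    return p

# Hypothetical invocation: all paths here are made-up examples.
args = build_parser().parse_args(
    ["-j", "meta.json", "-i", "results/", "-o", "out/", "-s", "viralrecon"]
)
print(args.software_name)  # -> viralrecon
```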
333 changes: 333 additions & 0 deletions relecov_tools/assets/pipeline_utils/viralrecon.py
@@ -0,0 +1,333 @@
#!/usr/bin/env python
import json
import os
import sys
import re
import logging
import rich
import os.path

from pathlib import Path
from datetime import datetime

import relecov_tools.utils
from relecov_tools.config_json import ConfigJson
from relecov_tools.read_bioinfo_metadata import BioinfoReportLog

log = logging.getLogger(__name__)
stderr = rich.console.Console(
stderr=True,
style="dim",
highlight=False,
force_terminal=relecov_tools.utils.rich_force_colors(),
)


# INIT Class
class LongTableParse:
"""
- parse_a_list_of_dictionaries() : returns generated_JSON
- saving_file(generated_JSON)
- parsing_csv() : It manages all this proccess:
- calling first to parse_a_list_of_dictionaries() and then calling to saving_file()
"""

def __init__(self, file_path=None, output_directory=None):
if file_path is None:
self.file_path = relecov_tools.utils.prompt_path(
msg="Select the csv file which contains variant long table information"
)
else:
self.file_path = file_path

if not os.path.exists(self.file_path):
log.error("Variant long table file %s does not exist ", self.file_path)
stderr.print(
f"[red] Variant long table file {self.file_path} does not exist"
)
sys.exit(1)

if not self.file_path.endswith(".csv"):
log.error("Variant long table file %s is not a csv file ", self.file_path)
stderr.print(
f"[red] Variant long table file {self.file_path} must be a csv file"
)
sys.exit(1)

if output_directory is None:
use_default = relecov_tools.utils.prompt_yn_question("Use default path?: ")
if use_default:
self.output_directory = os.getcwd()
else:
self.output_directory = relecov_tools.utils.prompt_path(
msg="Select the output folder:"
)
else:
self.output_directory = output_directory
Path(self.output_directory).mkdir(parents=True, exist_ok=True)

json_file = os.path.join(
os.path.dirname(__file__), "..", "..", "conf", "bioinfo_config.json"
)
config_json = ConfigJson(json_file)
self.software_config = config_json.get_configuration("viralrecon")
self.long_table_heading = self.software_config["variants_long_table"]["content"]

def validate_file(self, heading):
"""Check if long table file has all mandatory fields defined in
configuration file
"""
for field in self.long_table_heading:
if field not in heading:
log.error("Incorrect format file. %s is missing", field)
stderr.print(f"[red]Incorrect Format. {field} is missing in file")
sys.exit(1)
return True

def parse_file(self):
"""This function generates a json file from the csv file entered by
the user (long_table.csv).
Validate the file by checking the header line
"""

with open(self.file_path, encoding="utf-8-sig") as fh:
lines = fh.readlines()

stderr.print("[green]\tSuccessful checking heading fields")
heading_index = {}
headings_from_csv = lines[0].strip().split(",")
for heading in self.long_table_heading.values():
heading_index[heading] = headings_from_csv.index(heading)

samp_dict = {}
for line in lines[1:]:
line_s = line.strip().split(",")

sample = line_s[heading_index["SAMPLE"]]
if sample not in samp_dict:
samp_dict[sample] = []

variant_dict = {
key: (
{key2: line_s[heading_index[val2]] for key2, val2 in value.items()}
if isinstance(value, dict)
else line_s[heading_index[value]]
)
for key, value in self.long_table_heading.items()
}

if re.search("&", line_s[heading_index["GENE"]]):
# Example
# 215184,NC_045512.2,27886,AAACGAACATGAAATT,A,PASS,1789,1756,1552,0.87,ORF7b&ORF8,gene_fusion,n.27887_27901delAACGAACATGAAATT,.,.,ivar,B.1.1.318
# This only occurs (for now) as gene fusion, so we just duplicate lines with same values
genes = re.split("&", line_s[heading_index["GENE"]])
for gene in genes:
variant_dict_copy = variant_dict.copy()
variant_dict_copy["Gene"] = gene
samp_dict[sample].append(variant_dict_copy)
else:
variant_dict["Gene"] = line_s[heading_index["GENE"]]
samp_dict[sample].append(variant_dict)
stderr.print("[green]\tSuccessful parsing data")
return samp_dict
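The `&`-splitting branch of `parse_file` above can be isolated into a small sketch; `expand_gene_fusions` is an illustrative helper (not part of the module) and the dictionary contents are made-up sample values:

```python
import re

def expand_gene_fusions(variant_dict, gene_field):
    # Mirrors parse_file(): a GENE value like "ORF7b&ORF8" (a gene fusion)
    # yields one duplicated row per gene; otherwise a single row is emitted.
    rows = []
    if re.search("&", gene_field):
        for gene in re.split("&", gene_field):
            row = variant_dict.copy()
            row["Gene"] = gene
            rows.append(row)
    else:
        row = variant_dict.copy()
        row["Gene"] = gene_field
        rows.append(row)
    return rows

fused = expand_gene_fusions({"POS": "27886", "ALT": "A"}, "ORF7b&ORF8")
print([r["Gene"] for r in fused])  # -> ['ORF7b', 'ORF8']
```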

    def convert_to_json(self, samp_dict):
        j_list = []
        # Grab the analysis date from the filename
        result_regex = re.search(
            r"variants_long_table(?:_\d{8})?\.csv", os.path.basename(self.file_path)
        )
        if result_regex is None:
            stderr.print(
                "[red]\tWARN: Couldn't find variants long table file. Expected file name is:"
            )
            stderr.print(
                "[red]\t\t- variants_long_table.csv or variants_long_table_YYYYMMDD.csv. Aborting..."
            )
            sys.exit(1)
        else:
            date_regex = re.search(r"(\d{8})", result_regex.group())
            if date_regex is not None:
                analysis_date = date_regex.group()
                stderr.print(
                    f"[green]\tDate {analysis_date} found in {self.file_path}"
                )
            else:
                analysis_date = "Not Provided [GENEPIO:0001668]"
                stderr.print(
                    f"[yellow]\tWARN: No analysis date found in long table: {self.file_path}"
                )

        for key, values in samp_dict.items():
            j_dict = {"sample_name": key, "analysis_date": analysis_date}
            j_dict["variants"] = values
            j_list.append(j_dict)
        return j_list
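The filename handling in `convert_to_json` reduces to this pattern (a standalone sketch; `analysis_date_from_filename` is a hypothetical helper, not part of the module):

```python
import os.path
import re

def analysis_date_from_filename(path):
    # Same pattern convert_to_json() expects: variants_long_table.csv with an
    # optional _YYYYMMDD suffix. Returns None when the filename doesn't match.
    m = re.search(r"variants_long_table(?:_(\d{8}))?\.csv", os.path.basename(path))
    if m is None:
        return None
    # Fall back to the GENEPIO "Not Provided" tag when no date is embedded
    return m.group(1) or "Not Provided [GENEPIO:0001668]"

print(analysis_date_from_filename("run1/variants_long_table_20240405.csv"))  # -> 20240405
print(analysis_date_from_filename("variants_long_table.csv"))
```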

def save_to_file(self, j_list):
"""Transform the parsed data into a json file"""
date_now = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
file_name = "long_table_" + date_now + ".json"
file_path = os.path.join(self.output_directory, file_name)

try:
with open(file_path, "w") as fh:
fh.write(json.dumps(j_list, indent=4))
stderr.print("[green]\tParsed data successfully saved to file:", file_path)
except Exception as e:
stderr.print("[red]\tError saving parsed data to file:", str(e))

def parsing_csv(self):
"""
Function called when using the relecov-tools long-table-parse function.
"""
# Parsing longtable file
parsed_data = self.parse_file()
j_list = self.convert_to_json(parsed_data)
# Saving long table data into a file
self.save_to_file(j_list)
stderr.print("[green]\tProcess completed")
return


# END of Class


# START util functions
def handle_pangolin_data(files_list):
"""File handler to parse pangolin data (csv) into JSON structured format.

Args:
files_list (list): A list with paths to pangolin files.

Returns:
        pango_data_processed: A dictionary with the processed pangolin data.
"""
method_name = f"{handle_pangolin_data.__name__}"
method_log_report = BioinfoReportLog()

# Handling pangolin data
pango_data_processed = {}
valid_samples = []
try:
files_list_processed = relecov_tools.utils.select_most_recent_files_per_sample(
files_list
)
for pango_file in files_list_processed:
try:
pango_data = relecov_tools.utils.read_csv_file_return_dict(
pango_file, sep=","
)
# Add custom content in pangolin
pango_data_key = next(iter(pango_data))
pango_data[pango_data_key]["lineage_analysis_date"] = (
relecov_tools.utils.get_file_date(pango_file)
)

# Rename key in f_data
pango_data_updated = {
key.split()[0]: value for key, value in pango_data.items()
}
pango_data_processed.update(pango_data_updated)
valid_samples.append(pango_data_key.split()[0])
except (FileNotFoundError, IndexError) as e:
method_log_report.update_log_report(
method_name,
"warning",
f"Error occurred while processing file {pango_file}: {e}",
)
continue
except Exception as e:
method_log_report.update_log_report(
method_name, "warning", f"Error occurred while processing files: {e}"
)
if len(valid_samples) > 0:
method_log_report.update_log_report(
method_name,
"valid",
f"Successfully handled data in samples: {', '.join(valid_samples)}",
)
method_log_report.print_log_report(method_name, ["valid", "warning"])
return pango_data_processed
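`handle_pangolin_data` above leans on `relecov_tools.utils.select_most_recent_files_per_sample`, whose implementation is not shown in this diff. A minimal sketch of the assumed behavior, operating on (path, mtime) pairs instead of the real filesystem, with made-up file names:

```python
import os.path

def most_recent_per_sample(files):
    # Illustrative stand-in for select_most_recent_files_per_sample (assumed
    # behavior): keep only the newest file per sample. files is an iterable of
    # (path, mtime) pairs; the sample id is assumed to be the filename prefix
    # before the first dot.
    latest = {}
    for path, mtime in files:
        sample = os.path.basename(path).split(".")[0]
        if sample not in latest or mtime > latest[sample][1]:
            latest[sample] = (path, mtime)
    return [path for path, _ in latest.values()]

files = [  # hypothetical pangolin outputs; two runs exist for sample S1
    ("results/S1.pangolin.20240301.csv", 1709251200),
    ("results/S1.pangolin.20240322.csv", 1711065600),
    ("results/S2.pangolin.20240322.csv", 1711065600),
]
print(sorted(most_recent_per_sample(files)))
```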


def parse_long_table(files_list):
"""File handler to retrieve data from long table files and convert it into a JSON structured format.
This function utilizes the LongTableParse class to parse the long table data.
Since this utility handles and maps data using a custom way, it returns None to be avoid being transferred to method read_bioinfo_metadata.BioinfoMetadata.mapping_over_table().

Args:
files_list (list): A list of paths to long table files.

Returns:
None: Indicates that the function does not return any meaningful value.
"""
method_name = f"{parse_long_table.__name__}"
method_log_report = BioinfoReportLog()

# Handling long table data
if len(files_list) == 1:
files_list_processed = files_list[0]
if not os.path.isfile(files_list_processed):
method_log_report.update_log_report(
method_name, "error", f"{files_list_processed} given file is not a file"
)
sys.exit(method_log_report.print_log_report(method_name, ["error"]))
long_table = LongTableParse(files_list_processed)
# Parsing long table data and saving it
long_table.parsing_csv()
elif len(files_list) > 1:
method_log_report.update_log_report(
method_name,
"warning",
f"Found {len(files_list)} variants_long_table files. This version is unable to process more than one variants long table each time.",
)
    # This needs to return None to avoid being parsed by mapping_over_table()
return None


def handle_consensus_fasta(files_list):
"""File handler to parse consensus data (fasta) into JSON structured format.

    Args:
        files_list (list): A list with paths to consensus files.

    Returns:
        consensus_data_processed: A dictionary with the processed consensus data.
    """
method_name = f"{handle_consensus_fasta.__name__}"
method_log_report = BioinfoReportLog()

consensus_data_processed = {}
missing_consens = []
for consensus_file in files_list:
try:
record_fasta = relecov_tools.utils.read_fasta_return_SeqIO_instance(
consensus_file
)
except FileNotFoundError as e:
missing_consens.append(e.filename)
continue
sample_key = os.path.basename(consensus_file).split(".")[0]

# Update consensus data for the sample key
consensus_data_processed[sample_key] = {
"sequence_name": record_fasta.description,
"genome_length": str(len(record_fasta)),
"sequence_filepath": os.path.dirname(consensus_file),
"sequence_filename": sample_key,
"sequence_md5": relecov_tools.utils.calculate_md5(consensus_file),
# TODO: Not sure this is correct. If not, recover previous version: https://github.com/BU-ISCIII/relecov-tools/blob/09c00c1ddd11f7489de7757841aff506ef4b7e1d/relecov_tools/read_bioinfo_metadata.py#L211-L218
"number_of_base_pairs_sequenced": len(record_fasta.seq),
}

# Report missing consensus
conserrs = len(missing_consens)
if conserrs >= 1:
method_log_report.update_log_report(
method_name,
"warning",
f"{conserrs} samples missing in consensus file: {missing_consens}",
)
method_log_report.print_log_report(method_name, ["valid", "warning"])
return consensus_data_processed
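`handle_consensus_fasta` builds one metadata entry per fasta file via helpers in `relecov_tools.utils`. A stdlib-only sketch of the same per-sample entry (the single-record fasta reader and the temp file are illustrative; the module itself reads fastas through a SeqIO-based utility):

```python
import hashlib
import os
import tempfile

def consensus_entry(fasta_path):
    # Minimal single-record FASTA reader; collects the same fields that
    # handle_consensus_fasta() assembles for each sample.
    with open(fasta_path) as fh:
        header = fh.readline().lstrip(">").strip()
        seq = "".join(line.strip() for line in fh)
    with open(fasta_path, "rb") as fh:
        md5 = hashlib.md5(fh.read()).hexdigest()
    sample = os.path.basename(fasta_path).split(".")[0]
    return {
        "sequence_name": header,
        "genome_length": str(len(seq)),
        "sequence_filepath": os.path.dirname(fasta_path),
        "sequence_filename": sample,
        "sequence_md5": md5,
        "number_of_base_pairs_sequenced": len(seq),
    }

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "SAMPLE1.consensus.fa")
    with open(path, "w") as fh:
        fh.write(">SAMPLE1 consensus\nACGTACGT\nACGT\n")
    entry = consensus_entry(path)
    print(entry["genome_length"], entry["number_of_base_pairs_sequenced"])  # -> 12 12
```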