
Incremental data upload for mutations, case lists, patient and sample attributes #32

Merged: 42 commits, May 15, 2024
5dfe298
Add clinical_attribute_meta records to the seed mini
forus Mar 21, 2024
531b10a
Implement sample attribute rewriting flag
forus Mar 21, 2024
248a08c
Add --overwrite-existing for the rest of test cases
forus Mar 21, 2024
2bc7271
Test that mutations stay after updating the sample attributes
forus Mar 21, 2024
31e3194
Add overwrite-existing support for mutations data
forus Mar 21, 2024
bd023a9
Fix --overwrite-existing flag description for importer of profile data
forus Mar 22, 2024
c49bbf3
Add loader command to update case list with sample ids
forus Mar 28, 2024
1f5695d
Add option to remove sample ids from the remaining case lists
forus Mar 28, 2024
77cd6a8
Make removing sample ids from not mentioned case lists a default beha…
forus Mar 29, 2024
bd8c4b2
Make update case list command to read case lists files
forus Mar 29, 2024
5fc633b
Fix test clinical data headers
forus Mar 29, 2024
f7132c9
Test incremental patient upload
forus Apr 1, 2024
f45e1e8
Add flag to reload patient clinical attributes
forus Apr 2, 2024
8cc95a0
Add TODO comment to remove MIXED_ATTRIBUTES data type
forus Apr 3, 2024
fa32b7f
WIP adopt py script to incremental upload
forus Apr 3, 2024
f044c3b
Fix java.sql.SQLException: Generated keys not requested
forus Apr 4, 2024
48fca03
Clean alteration_driver_annotation during mutations inc. upload
forus Apr 5, 2024
1302a8e
Fix validator and importer py scripts for inc. upload
forus Apr 5, 2024
659f352
Add test/demo data for incremental loading of study_es_0 study
forus Apr 5, 2024
b5952e3
Rename and move incremental tests to incrementalTest folder
forus Apr 8, 2024
753119b
Update TODO comment how to deal with multiple sample files
forus Apr 9, 2024
5725d42
Move study_es_0_inc to the new test data folder
forus Apr 9, 2024
299466a
Fix removing patient attributes on samples inc. upload
forus Apr 9, 2024
c0c28e2
Change study_es_0_inc to contain more diverse data
forus Apr 11, 2024
c6eddbb
Specify that data_directory for incremental data
forus Apr 11, 2024
595d24f
Disambiguate clinical data constants names
forus Apr 11, 2024
c8b4c73
Remove not necessary TODO comments
forus Apr 11, 2024
efd34d8
Remove MSK copyright mistakenly copy-pasted
forus Apr 11, 2024
3b39e0d
Fix comment of UpdateCaseListsSampleIds.run() method
forus Apr 11, 2024
fc785f6
Make --overwrite-existing flag description more generic
forus Apr 11, 2024
e782951
Add TODO comments for possible reuse of the code
forus Apr 11, 2024
b53c8c4
Update case lists for multiple clinical sample files
forus Apr 11, 2024
99550b5
Extract and reuse common logic to read and validate case lists
forus Apr 11, 2024
1829842
Fix TestIntegrationTest
forus Apr 30, 2024
e785a53
Revert RESOURCE_DEFINITION_DICTIONARY initialisation to empty set
forus Apr 30, 2024
e09e1e2
Minor improvements. Apply PR feedback
forus Apr 30, 2024
7b527b6
Make tests fail the build. Conduct exit status of tests correctly
forus May 1, 2024
f5e8217
Write Validation complete only in case of successful validation
forus May 1, 2024
8d3aaed
Add python tests for incremental/full data import
forus May 1, 2024
1b6ba41
Add unit test for incremental data validation
forus May 1, 2024
d252001
Test rough order of importer commands. Remove sorting in the script t…
forus May 3, 2024
c27b8f1
Extract smaller functions from the big one in py script
forus May 3, 2024
4 changes: 2 additions & 2 deletions .github/workflows/validate-python.yml
@@ -14,7 +14,7 @@ jobs:
       - name: 'Validate tests'
         working-directory: ./cbioportal-core
         run: |
-          docker run -v ${PWD}:/cbioportal-core python:3.6 /bin/bash -c '
+          docker run -v ${PWD}:/cbioportal-core python:3.6 /bin/sh -c '
             cd cbioportal-core &&
             pip install -r requirements.txt &&
-            source test_scripts.sh'
+            ./test_scripts.sh'
2 changes: 1 addition & 1 deletion README.md
@@ -78,7 +78,7 @@ After you are done with the setup, you can build and test the project.
 
 1. Execute tests through the provided script:
    ```bash
-   source test_scripts.sh
+   ./test_scripts.sh
    ```
 
 2. Build the loader jar using Maven (includes testing):
155 changes: 121 additions & 34 deletions scripts/importer/cbioportalImporter.py
@@ -12,6 +12,7 @@
import logging
import re
from pathlib import Path
from typing import Dict, Tuple

# configure relative imports if running as a script; see PEP 366
# it might passed as empty string by certain tooling to mark a top level module
@@ -39,6 +40,8 @@
from .cbioportal_common import ADD_CASE_LIST_CLASS
from .cbioportal_common import VERSION_UTIL_CLASS
from .cbioportal_common import run_java
from .cbioportal_common import UPDATE_CASE_LIST_CLASS
from .cbioportal_common import INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES


# ------------------------------------------------------------------------------
@@ -101,8 +104,17 @@ def remove_study_id(jvm_args, study_id):
args.append("--noprogress") # don't report memory usage and % progress
run_java(*args)

def update_case_lists(jvm_args, meta_filename, case_lists_file_or_dir = None):
args = jvm_args.split(' ')
args.append(UPDATE_CASE_LIST_CLASS)
args.append("--meta")
args.append(meta_filename)
if case_lists_file_or_dir:
args.append("--case-lists")
args.append(case_lists_file_or_dir)
run_java(*args)

def import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity = None, meta_file_dictionary = None):
def import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity = None, meta_file_dictionary = None, incremental = False):
args = jvm_args.split(' ')

# In case the meta file is already parsed in a previous function, it is not
@@ -133,6 +145,10 @@ def import_study_data(jvm_args, meta_filename, data_filename, update_generic_ass
importer = IMPORTER_CLASSNAME_BY_META_TYPE[meta_file_type]

args.append(importer)
if incremental:
if meta_file_type not in INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES:
raise NotImplementedError("This type does not support incremental upload: {}".format(meta_file_type))
args.append("--overwrite-existing")
if IMPORTER_REQUIRES_METADATA[importer]:
args.append("--meta")
args.append(meta_filename)
@@ -212,11 +228,20 @@ def process_command(jvm_args, command, meta_filename, data_filename, study_ids,
else:
raise RuntimeError('Your command uses both -id and -meta. Please, use only one of the two parameters.')
elif command == IMPORT_STUDY_DATA:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity)
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity)
elif command == IMPORT_CASE_LIST:
import_case_list(jvm_args, meta_filename)

def process_directory(jvm_args, study_directory, update_generic_assay_entity = None):
def get_meta_filenames(data_directory):
meta_filenames = [
os.path.join(data_directory, meta_filename) for
meta_filename in os.listdir(data_directory) if
re.search(r'(\b|_)meta(\b|[_0-9])', meta_filename,
flags=re.IGNORECASE) and
not (meta_filename.startswith('.') or meta_filename.endswith('~'))]
return meta_filenames
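The regex filter factored out into `get_meta_filenames` above is easy to get wrong at the boundaries; here is a standalone sketch of the same predicate (the pattern is copied from the diff, the sample filenames are invented for illustration):

```python
import re

def looks_like_meta_file(name):
    """Same filter as get_meta_filenames: 'meta' delimited by a word
    boundary, underscore or digit; hidden and backup files excluded."""
    return bool(
        re.search(r'(\b|_)meta(\b|[_0-9])', name, flags=re.IGNORECASE)
        and not (name.startswith('.') or name.endswith('~'))
    )

candidates = ['meta_study.txt', 'study_meta.txt', 'metadata.txt',
              '.meta_clinical.txt', 'meta_cna.txt~', 'data_mutations.txt']
print([n for n in candidates if looks_like_meta_file(n)])
# → ['meta_study.txt', 'study_meta.txt']
```

Note that `metadata.txt` is rejected: `meta` must be followed by a word boundary, underscore, or digit, and the `d` in `metadata` is none of these.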

def process_study_directory(jvm_args, study_directory, update_generic_assay_entity = None):
"""
Import an entire study directory based on meta files found.

@@ -241,12 +266,7 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
cna_long_filepair = None

# Determine meta filenames in study directory
meta_filenames = (
os.path.join(study_directory, meta_filename) for
meta_filename in os.listdir(study_directory) if
re.search(r'(\b|_)meta(\b|[_0-9])', meta_filename,
flags=re.IGNORECASE) and
not (meta_filename.startswith('.') or meta_filename.endswith('~')))
meta_filenames = get_meta_filenames(study_directory)

# Read all meta files (excluding case lists) to determine what to import
for meta_filename in meta_filenames:
@@ -353,53 +373,53 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
raise RuntimeError('No sample attribute file found')
else:
meta_filename, data_filename = sample_attr_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, we need to import resource definitions for resource data
if resource_definition_filepair is not None:
meta_filename, data_filename = resource_definition_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, we need to import sample definitions for resource data
if sample_resource_filepair is not None:
meta_filename, data_filename = sample_resource_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, import everything else except gene panel, structural variant data, GSVA and
# z-score expression. If in the future more types refer to each other, (like
# in a tree structure) this could be programmed in a recursive fashion.
for meta_filename, data_filename in regular_filepairs:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import structural variant data
if structural_variant_filepair is not None:
meta_filename, data_filename = structural_variant_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import cna data
if cna_long_filepair is not None:
meta_filename, data_filename = cna_long_filepair
import_study_data(jvm_args=jvm_args, meta_filename=meta_filename, data_filename=data_filename,
meta_file_dictionary=study_meta_dictionary[meta_filename])
import_data(jvm_args=jvm_args, meta_filename=meta_filename, data_filename=data_filename,
meta_file_dictionary=study_meta_dictionary[meta_filename])

# Import expression z-score (after expression)
for meta_filename, data_filename in zscore_filepairs:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import GSVA genetic profiles (after expression and z-scores)
if gsva_score_filepair is not None:

# First import the GSVA score data
meta_filename, data_filename = gsva_score_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Second import the GSVA p-value data
meta_filename, data_filename = gsva_pvalue_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

if gene_panel_matrix_filepair is not None:
meta_filename, data_filename = gene_panel_matrix_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import the case lists
case_list_dirname = os.path.join(study_directory, 'case_lists')
@@ -412,6 +432,70 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
# enable study
update_study_status(jvm_args, study_id)

def get_meta_filenames_by_type(data_directory) -> Dict[str, Tuple[str, Dict]]:
"""
Read all meta files in the data directory and return meta information (filename, content) grouped by type.
"""
meta_file_type_to_meta_files = {}

# Determine meta filenames in study directory
meta_filenames = get_meta_filenames(data_directory)

# Read all meta files (excluding case lists) to determine what to import
for meta_filename in meta_filenames:

# Parse meta file
meta_dictionary = cbioportal_common.parse_metadata_file(
meta_filename, logger=LOGGER)

# Retrieve meta file type
meta_file_type = meta_dictionary['meta_file_type']
if meta_file_type is None:
# invalid meta file, let's die
raise RuntimeError('Invalid meta file: ' + meta_filename)
if meta_file_type not in meta_file_type_to_meta_files:
meta_file_type_to_meta_files[meta_file_type] = []

meta_file_type_to_meta_files[meta_file_type].append((meta_filename, meta_dictionary))
return meta_file_type_to_meta_files
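The grouping in `get_meta_filenames_by_type` is the standard append-to-list-per-key pattern; a minimal sketch with faked parse results (the real code obtains the dictionaries from `parse_metadata_file`, and the filenames here are made up):

```python
# each entry stands in for (meta_filename, parsed meta dictionary)
parsed = [
    ('meta_clinical_samples.txt',  {'meta_file_type': 'SAMPLE_ATTRIBUTES'}),
    ('meta_mutations.txt',         {'meta_file_type': 'MUTATION'}),
    ('meta_clinical_samples2.txt', {'meta_file_type': 'SAMPLE_ATTRIBUTES'}),
]

by_type = {}
for filename, meta in parsed:
    if meta['meta_file_type'] is None:
        raise RuntimeError('Invalid meta file: ' + filename)
    # setdefault plays the role of the explicit 'not in' check in the diff
    by_type.setdefault(meta['meta_file_type'], []).append((filename, meta))

print(sorted(by_type))                    # → ['MUTATION', 'SAMPLE_ATTRIBUTES']
print(len(by_type['SAMPLE_ATTRIBUTES']))  # → 2
```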

def import_incremental_data(jvm_args, data_directory, update_generic_assay_entity, meta_file_type_to_meta_files):
"""
Load all data types that are available and support incremental upload
"""
for meta_file_type in INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES:
meta_pairs = meta_file_type_to_meta_files[meta_file_type]
for meta_pair in meta_pairs:
meta_filename, meta_dictionary = meta_pair
data_filename = os.path.join(data_directory, meta_dictionary['data_filename'])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, meta_dictionary, incremental=True)

def update_case_lists_from_folder(jvm_args, data_directory, meta_file_type_to_meta_files):
"""
Updates case lists if clinical sample provided.
The command takes case_list/ folder as optional argument.
If folder exists case lists will be updated accordingly.
"""
if MetaFileTypes.SAMPLE_ATTRIBUTES in meta_file_type_to_meta_files:
case_list_dirname = os.path.join(data_directory, 'case_lists')
sample_attributes_metas = meta_file_type_to_meta_files[MetaFileTypes.SAMPLE_ATTRIBUTES]
for meta_pair in sample_attributes_metas:
meta_filename, meta_dictionary = meta_pair
LOGGER.info('Updating case lists with sample ids', extra={'filename_': meta_filename})
update_case_lists(jvm_args, meta_filename, case_lists_file_or_dir=case_list_dirname if os.path.isdir(case_list_dirname) else None)

def process_data_directory(jvm_args, data_directory, update_generic_assay_entity = None):
"""
Incremental import of data directory based on meta files found.
"""

meta_file_type_to_meta_files = get_meta_filenames_by_type(data_directory)

not_supported_meta_types = meta_file_type_to_meta_files.keys() - INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES
if not_supported_meta_types:
raise NotImplementedError("These types do not support incremental upload: {}".format(", ".join(not_supported_meta_types)))
import_incremental_data(jvm_args, data_directory, update_generic_assay_entity, meta_file_type_to_meta_files)
update_case_lists_from_folder(jvm_args, data_directory, meta_file_type_to_meta_files)
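The whitelist guard at the top of `process_data_directory` is a set difference between the meta types found on disk and the supported list; isolated below with hypothetical type names (the real constants live in `cbioportal_common`):

```python
SUPPORTED = {'PATIENT_ATTRIBUTES', 'SAMPLE_ATTRIBUTES', 'MUTATION'}

def reject_unsupported(found_types):
    # the diff subtracts a list directly from dict.keys(), which works
    # because keys() is a set-like view; plain sets are used in this sketch
    unsupported = set(found_types) - SUPPORTED
    if unsupported:
        raise NotImplementedError(
            "These types do not support incremental upload: "
            + ", ".join(sorted(unsupported)))

reject_unsupported(['MUTATION', 'SAMPLE_ATTRIBUTES'])  # passes silently
try:
    reject_unsupported(['MUTATION', 'CNA_DISCRETE'])
except NotImplementedError as e:
    print(e)  # → These types do not support incremental upload: CNA_DISCRETE
```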

def usage():
# TODO : replace this by usage string from interface()
@@ -435,26 +519,27 @@ def check_files(meta_filename, data_filename):
print('data-file cannot be found:' + data_filename, file=ERROR_FILE)
sys.exit(2)

def check_dir(study_directory):
def check_dir(data_directory):
# check existence of directory
if not os.path.exists(study_directory) and study_directory != '':
print('Study cannot be found: ' + study_directory, file=ERROR_FILE)
if not os.path.exists(data_directory) and data_directory != '':
print('Directory cannot be found: ' + data_directory, file=ERROR_FILE)
sys.exit(2)

def add_parser_args(parser):
parser.add_argument('-s', '--study_directory', type=str, required=False,
help='Path to Study Directory')
data_source_group = parser.add_mutually_exclusive_group()
data_source_group.add_argument('-s', '--study_directory', type=str, help='Path to Study Directory')
data_source_group.add_argument('-d', '--data_directory', type=str, help='Path to Data Directory')
parser.add_argument('-jvo', '--java_opts', type=str, default=os.environ.get('JAVA_OPTS'),
help='Path to specify JAVA_OPTS for the importer. \
(default: gets the JAVA_OPTS from the environment)')
(default: gets the JAVA_OPTS from the environment)')
parser.add_argument('-jar', '--jar_path', type=str, required=False,
help='Path to scripts JAR file')
help='Path to scripts JAR file')
parser.add_argument('-meta', '--meta_filename', type=str, required=False,
help='Path to meta file')
parser.add_argument('-data', '--data_filename', type=str, required=False,
help='Path to Data file')

def interface():
def interface(args=None):
parent_parser = argparse.ArgumentParser(description='cBioPortal meta Importer')
add_parser_args(parent_parser)
parser = argparse.ArgumentParser()
@@ -484,7 +569,7 @@ def interface():
# TODO - add same argument to metaimporter
# TODO - harmonize on - and _

parser = parser.parse_args()
parser = parser.parse_args(args)
if parser.command is not None and parser.subcommand is not None:
print('Cannot call multiple commands')
sys.exit(2)
@@ -547,14 +632,16 @@ def main(args):

# process the options
jvm_args = "-Dspring.profiles.active=dbcp " + args.java_opts
study_directory = args.study_directory

# check if DB version and application version are in sync
check_version(jvm_args)

if study_directory != None:
check_dir(study_directory)
process_directory(jvm_args, study_directory, args.update_generic_assay_entity)
if args.data_directory is not None:
check_dir(args.data_directory)
process_data_directory(jvm_args, args.data_directory, args.update_generic_assay_entity)
elif args.study_directory is not None:
check_dir(args.study_directory)
process_study_directory(jvm_args, args.study_directory, args.update_generic_assay_entity)
else:
check_args(args.command)
check_files(args.meta_filename, args.data_filename)
@@ -564,5 +651,5 @@
# ready to roll

if __name__ == '__main__':
parsed_args = interface()
parsed_args = interface(args)
main(parsed_args)
9 changes: 9 additions & 0 deletions scripts/importer/cbioportal_common.py
@@ -37,6 +37,7 @@
IMPORT_CANCER_TYPE_CLASS = "org.mskcc.cbio.portal.scripts.ImportTypesOfCancers"
IMPORT_CASE_LIST_CLASS = "org.mskcc.cbio.portal.scripts.ImportSampleList"
ADD_CASE_LIST_CLASS = "org.mskcc.cbio.portal.scripts.AddCaseList"
UPDATE_CASE_LIST_CLASS = "org.mskcc.cbio.portal.scripts.UpdateCaseListsSampleIds"
VERSION_UTIL_CLASS = "org.mskcc.cbio.portal.util.VersionUtil"

PORTAL_PROPERTY_DATABASE_USER = 'db.user'
@@ -364,6 +365,14 @@ class MetaFileTypes(object):
},
}

# in order of they should be loaded
INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES = [
MetaFileTypes.PATIENT_ATTRIBUTES,
MetaFileTypes.SAMPLE_ATTRIBUTES,
MetaFileTypes.MUTATION,
# TODO Add more types here as incremental upload is enabled
]
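Because `INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES` is ordered (patients before samples before mutations), `import_incremental_data` only has to walk it front to back. A sketch of that driving loop, with the Java import call replaced by a print and a `.get()` so absent types are skipped — the `.get()` is an assumption for this sketch, since the diff indexes the dictionary directly:

```python
SUPPORTED_IN_ORDER = ['PATIENT_ATTRIBUTES', 'SAMPLE_ATTRIBUTES', 'MUTATION']

# pretend grouping result: only two of the three types are present
meta_by_type = {
    'MUTATION': [('meta_mutations.txt', {})],
    'SAMPLE_ATTRIBUTES': [('meta_clinical_samples.txt', {})],
}

for meta_type in SUPPORTED_IN_ORDER:  # order preserves referential integrity
    for meta_filename, _meta in meta_by_type.get(meta_type, []):
        print('import', meta_type, meta_filename)
# → import SAMPLE_ATTRIBUTES meta_clinical_samples.txt
# → import MUTATION meta_mutations.txt
```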

IMPORTER_CLASSNAME_BY_META_TYPE = {
MetaFileTypes.STUDY: IMPORT_STUDY_CLASS,
MetaFileTypes.CANCER_TYPE: IMPORT_CANCER_TYPE_CLASS,
17 changes: 10 additions & 7 deletions scripts/importer/metaImport.py
@@ -56,8 +56,11 @@ class Color(object):

def interface():
parser = argparse.ArgumentParser(description='cBioPortal meta Importer')
parser.add_argument('-s', '--study_directory', type=str, required=True,
help='path to directory.')
data_source_group = parser.add_mutually_exclusive_group()
data_source_group.add_argument('-s', '--study_directory',
type=str, help='path to study directory.')
data_source_group.add_argument('-d', '--data_directory',
type=str, help='path to data directory for incremental upload.')
portal_mode_group = parser.add_mutually_exclusive_group()
portal_mode_group.add_argument('-u', '--url_server',
type=str,
@@ -115,7 +118,7 @@ def interface():
# supply parameters that the validation script expects to have parsed
args.error_file = False

study_dir = args.study_directory
data_dir = args.data_directory if args.data_directory is not None else args.study_directory

# Validate the study directory.
print("Starting validation...\n", file=sys.stderr)
@@ -139,9 +142,9 @@ def interface():
# Import OncoKB annotations when asked, and there are no validation warnings or warnings are overruled
study_is_valid = exitcode == 0 or (exitcode == 3 and args.override_warning)
if study_is_valid and args.import_oncokb:
mutation_meta_file_path = libImportOncokb.find_meta_file_by_fields(study_dir, {'genetic_alteration_type': 'MUTATION_EXTENDED'})
mutation_meta_file_path = libImportOncokb.find_meta_file_by_fields(data_dir, {'genetic_alteration_type': 'MUTATION_EXTENDED'})
mutation_data_file_name = libImportOncokb.find_data_file_from_meta_file(mutation_meta_file_path)
mutation_data_file_path = os.path.join(study_dir, mutation_data_file_name)
mutation_data_file_path = os.path.join(data_dir, mutation_data_file_name)
study_is_modified = False
print("\n")
if os.path.exists(mutation_data_file_path):
@@ -163,9 +166,9 @@ def interface():
for log_handler in validator_logger.handlers:
log_handler.close()
validator_logger.handlers = []
cna_meta_file_path = libImportOncokb.find_meta_file_by_fields(study_dir, {'genetic_alteration_type': 'COPY_NUMBER_ALTERATION', 'datatype': 'DISCRETE'})
cna_meta_file_path = libImportOncokb.find_meta_file_by_fields(data_dir, {'genetic_alteration_type': 'COPY_NUMBER_ALTERATION', 'datatype': 'DISCRETE'})
cna_data_file_name = libImportOncokb.find_data_file_from_meta_file(cna_meta_file_path)
cna_data_file_path = os.path.join(study_dir, cna_data_file_name)
cna_data_file_path = os.path.join(data_dir, cna_data_file_name)
if os.path.exists(cna_data_file_path):
print("Starting import of OncoKB annotations for discrete CNA file ...\n", file=sys.stderr)
try: