Merge pull request #4831 from thehyve/docu_custom_attributes

Improve documentation on custom clinical attributes and fix broken URLs
cBioPortal · Nov 28, 2018 · 851f1fc
2 parents 864859b + 8e73f8e
Showing 7 changed files with 27 additions and 30 deletions.
diff --git a/core/src/main/scripts/importer/validateData.py b/core/src/main/scripts/importer/validateData.py
@@ -1913,7 +1913,8 @@ class ClinicalValidator(Validator):
                       'datatype',
                       'priority')
 
-    # Only a core set of attributes must be either specific in the patient or sample clinical data.
+    # A core set of attributes must be specified in either the patient or the sample clinical data.
+    # See GET /api/ at http://oncotree.mskcc.org/cdd/swagger-ui.html#/
     PREDEFINED_ATTRIBUTES = {
         'AGE': {
             'is_patient_attribute': '1',

diff --git a/core/src/test/scripts/unit_tests_validate_data.py b/core/src/test/scripts/unit_tests_validate_data.py
@@ -254,7 +254,7 @@ def test_hardcoded_attributes(self):
         self.assertEqual(record.column_number, 2)
         self.assertIn(record.cause, 'STRING')
 
-        # Expect warning for sample attribute in patient clinical data
+        # Expect error for sample attribute in patient clinical data
         record = next(record_iterator)
         self.assertEqual(record.levelno, logging.ERROR)
         self.assertEqual(record.line_number, 5)

diff --git a/docs/Data-Loading-How-the-loader-deals-with-Hugo-symbols.md b/docs/Data-Loading-How-the-loader-deals-with-Hugo-symbols.md
diff --git a/docs/Docker-Prerequisites.md b/docs/Docker-Prerequisites.md
@@ -9,12 +9,7 @@ First, make sure that you have the latest version of Docker installed on your ma
 
 ## 2. Download Seed DB
 
-The latest cBioPortal Seed files are available from the [cBioPortal Datahub](https://github.com/cBioPortal/datahub/tree/master/seedDB).    
-You can download these files by using the links below:
-
-- **Schema 1.3.0**: [SQL file with create table statements for portal release 1.3.1](https://raw.githubusercontent.com/cBioPortal/cbioportal/v1.3.1/core/src/main/resources/db/cgds.sql) 
-- **Seed data, part1**: [cbioportal-seed SQL (.gz) file - part1 (no pdb_ tables)](https://github.com/cbioportal/datahub/raw/b69c86803c40d543080bf31a645721d06c82d08d/seedDB/seed-cbioportal_no-pdb_hg19.sql.gz)
-- **Seed data, part2 (optional)** [cbioportal-seed SQL (.gz) file - part2 (only pdb_ tables)](https://github.com/cbioportal/datahub/raw/b69c86803c40d543080bf31a645721d06c82d08d/seedDB/seed-cbioportal_only-pdb.sql.gz)
+The latest cBioPortal seed file and database schema are available from [cBioPortal Datahub](https://github.com/cBioPortal/datahub/tree/master/seedDB).
 
 ## 3. Prepare Configuration Files
 
@@ -26,4 +21,3 @@ You will need the following configuration files.
 - [Download log4j.properties.EXAMPLE](../src/main/resources/log4j.properties.EXAMPLE)
 - [settings.xml Reference](Pre-Build-Steps.md#create-a-maven-settings-file)
 - [context.xml Reference](Deploying.md#set-up-the-database-connection-pool)
-- [gene_sets.txt](../core/src/main/resources/sample_data/gene_sets.txt)
diff --git a/docs/File-Formats.md b/docs/File-Formats.md
@@ -89,7 +89,7 @@ brca<TAB>Breast Invasive Carcinoma<TAB>breast,breast invasive<TAB>HotPink<TAB>Br
 
 The clinical data is used to capture both clinical attributes and the mapping between patient and sample ids. The software supports multiple samples per patient.  
 
-As of March 2016, the clinical file is split into a patient file and a clinical file. The *sample* file is required, whereas the *patient* file is optional.
+As of March 2016, the clinical file is split into a patient file and a sample file. The *sample* file is required, whereas the *patient* file is optional. cBioPortal has specific functionality for a core set of patient and sample columns, but can also display custom columns (see section ["Custom columns in clinical data"](#custom-columns-in-clinical-data)).
 
 #### Meta files
 The two clinical metadata files (or just one metadata file if you choose to leave the *patient* file out) have to contain the following fields:
@@ -188,8 +188,8 @@ These columns, when provided, add additional information to the patient descript
 - **AGE**: Age at which the condition or disease was first diagnosed, in years (number)
 - **TUMOR_SITE**
 
-Optional attributes:
-- **Other Clinical Attribute Headers**: Clinical attribute headers are free-form. You can add any additional clinical attribute and cBioPortal will add them to the database. Be sure to provide the correct `'Datatype'`, as described above, for optimal search, sorting, filtering (in [clinical data tab](http://www.cbioportal.org/study?id=brca_tcga#clinical)) and display.
+Custom attributes:
+- **Custom Clinical Attribute Headers**: Any other custom attribute can be added as well. See section ["Custom columns in clinical data"](#custom-columns-in-clinical-data).
 
 ###### Example *patient* data file
 ```
@@ -210,7 +210,7 @@ The file containing the sample attributes has two **required** columns:
 
 By adding `PATIENT_ID` here, cBioPortal will map the given sample to this patient. This enables one to associate multiple samples to one patient. For example, a single patient may have had multiple biopsies, each of which has been genomically profiled. See [this example for a patient with multiple samples](http://www.cbioportal.org/patient?studyId=lgg_ucsf_2014&caseId=P04).
 
-The following columns are required if you want the [pan-cancer summary statistics tab in a pan-cancer study](http://www.cbioportal.org/index.do?cancer_study_list=cellline_ccle_broad&cancer_study_id=cellline_ccle_broad&genetic_profile_ids_PROFILE_MUTATION_EXTENDED=cellline_ccle_broad_mutations&genetic_profile_ids_PROFILE_COPY_NUMBER_ALTERATION=cellline_ccle_broad_CNA&Z_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=cellline_ccle_broad_cnaseq&case_ids=&patient_case_select=sample&gene_set_choice=prostate-cancer%3A-ar-signaling-%2810-genes%29&gene_list=SOX9+RAN+TNK2+EP300+PXN+NCOA2+AR+NRIP1+NCOR1+NCOR2&clinical_param_selection=null&tab_index=tab_visualize&Action=Submit#pancancer_study_summary):
+The following columns are required for the pan-cancer summary statistics tab ([example](http://www.cbioportal.org/index.do?cancer_study_id=msk_impact_2017&Z_SCORE_THRESHOLD=2&RPPA_SCORE_THRESHOLD=2&data_priority=0&case_set_id=msk_impact_2017_cnaseq&gene_list=BRAF&geneset_list=+&tab_index=tab_visualize&Action=Submit&genetic_profile_ids_PROFILE_MUTATION_EXTENDED=msk_impact_2017_mutations&genetic_profile_ids_PROFILE_COPY_NUMBER_ALTERATION=msk_impact_2017_cna)).
 - **CANCER_TYPE**: Cancer Type
 - **CANCER_TYPE_DETAILED**: Cancer Type Detailed, a sub-type of the specified CANCER_TYPE
 
@@ -226,6 +226,9 @@ The following columns additionally affect the [Timeline data](#timeline-data) vi
     - If set to `metastatic` or `metastasis`: red
     - If set to `primary` or otherwise: black
 
+Custom attributes:
+- **Custom Clinical Attribute Headers**: Any other custom attribute can be added as well. See section ["Custom columns in clinical data"](#custom-columns-in-clinical-data).
+
 ###### Example sample data file
 ```
 #Patient Identifier<TAB>Sample Identifier<TAB>Subtype<TAB>...
@@ -238,7 +241,7 @@ PATIENT_ID_2<TAB>SAMPLE_ID_2<TAB>Her2 enriched<TAB>...
 ...
 ```
 
-##### Other columns with specific functionality
+##### Columns with specific functionality
 These columns can be in either the patient or sample file.
 - **CANCER_TYPE**: Overrides study wide cancer type
 - **CANCER_TYPE_DETAILED**
@@ -253,8 +256,10 @@ These columns can be in either the patient or sample file.
 - **SERUM_PSA**
 - **DRIVER_MUTATIONS**
 
-##### Other columns without specific functionality
-You can add any additional columns with clinical data to either the patient or sample file. If correctly formatted with the 5-row header, cBioPortal will add them to the database. Be sure to provide the correct `Datatype`, as described above (for the header lines), for optimal search, sorting, filtering (in [clinical data tab](http://www.cbioportal.org/study?id=brca_tcga#clinical)) and visualization.
+##### Custom columns in clinical data
+cBioPortal supports custom columns with clinical data in either the patient or sample file. They should follow the previously described 5-row header format. Be sure to provide the correct `Datatype` for optimal search, sorting, filtering (in the [clinical data tab](http://www.cbioportal.org/study?id=brca_tcga#clinical)) and visualization.
+
+The Clinical Data Dictionary from MSKCC is used to normalize clinical data and should be followed to keep clinical data comparable between studies. For each attribute, the dictionary defines whether it belongs on the patient or sample level and provides a name, description and datatype. The data curator can choose to ignore these proposed definitions. Doing so should not break any cBioPortal functionality, but it might make comparing data between studies more difficult. See GET /api/ at [http://oncotree.mskcc.org/cdd/swagger-ui.html#/](http://oncotree.mskcc.org/cdd/swagger-ui.html#/) for the data dictionary of all known clinical attributes.
 
 ##### Banned column names
 `MUTATION_COUNT` and `FRACTION_GENOME_ALTERED` are auto populated clinical attributes, and should therefore not be present in clinical data files.
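As a reviewer-side illustration of the 5-row header and the banned column names described above, here is a minimal Python sketch. It is not the logic of `validateData.py`. The function name `check_clinical_header`, the order of the four `#`-prefixed metadata rows, and the set of allowed datatypes are assumptions made for this example, and the custom column `TUMOR_GRADE` is hypothetical.

```python
# Assumed order of the four '#'-prefixed metadata rows. The fifth header row
# holds the attribute (column) names.
METADATA_ROWS = ("display_name", "description", "datatype", "priority")
# Assumed set of allowed datatypes, for illustration only.
ALLOWED_DATATYPES = {"STRING", "NUMBER", "BOOLEAN"}
# Auto-populated attributes named in the "Banned column names" section.
BANNED_COLUMNS = {"MUTATION_COUNT", "FRACTION_GENOME_ALTERED"}


def check_clinical_header(lines):
    """Return a list of problems found in the 5-row header of a clinical file."""
    header = [line.rstrip("\n") for line in lines[:5]]
    if len(header) < 5:
        return ["file has fewer than 5 header rows"]
    problems = []
    meta = {}
    for name, line in zip(METADATA_ROWS, header[:4]):
        if not line.startswith("#"):
            problems.append(f"{name} row does not start with '#'")
        meta[name] = line.lstrip("#").split("\t")
    columns = header[4].split("\t")
    # Every metadata row needs one cell per column.
    for name, values in meta.items():
        if len(values) != len(columns):
            problems.append(
                f"{name} row has {len(values)} cells, expected {len(columns)}"
            )
    for column, datatype in zip(columns, meta["datatype"]):
        if datatype not in ALLOWED_DATATYPES:
            problems.append(f"{column}: unknown datatype '{datatype}'")
    for column in columns:
        if column in BANNED_COLUMNS:
            problems.append(f"{column}: banned, auto-populated attribute")
    return problems


# Hypothetical sample file header with one bad datatype and one banned column.
example = [
    "#Patient Identifier\tSample Identifier\tTumor Grade\tMutation Count",
    "#Patient identifier\tSample identifier\tHypothetical custom column\tNot allowed",
    "#STRING\tSTRING\tINTEGER\tNUMBER",
    "#1\t1\t1\t1",
    "PATIENT_ID\tSAMPLE_ID\tTUMOR_GRADE\tMUTATION_COUNT",
]
for problem in check_clinical_header(example):
    print(problem)
```

The sketch only inspects the header. Whether a custom attribute belongs in the patient or sample file is a curation decision that the Clinical Data Dictionary informs, and this check does not attempt it.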
@@ -513,7 +518,7 @@ Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
 ...
 ```
 
-An example data file which includes both Hugo_Symbo and Entrez_Gene_Id would look like (supported, but not recommended as it increases the chances of errors regarding [ambiguous Hugo symbols](Data-Loading-How-the-loader-deals-with-Hugo-symbols.md)):
+An example data file which includes both Hugo_Symbol and Entrez_Gene_Id would look like this (supported, but not recommended as it increases the chances of errors regarding ambiguous gene symbols):
 ```
 Hugo_Symbol<TAB>Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
 ACAP3<TAB>116983<TAB>-0.005<TAB>-0.550<TAB>...
@@ -522,7 +527,7 @@ AGRN<TAB>375790<TAB>0.142<TAB>0.091<TAB>...
 ...
 ```
 
-An example data file with only Hugo_Symbol column (supported, but not recommended as it increases the chances of errors regarding [ambiguous Hugo symbols](Data-Loading-How-the-loader-deals-with-Hugo-symbols.md):
+An example data file with only the Hugo_Symbol column (supported, but not recommended as it increases the chances of errors regarding ambiguous gene symbols):
 ```
 Hugo_Symbol<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
 ACAP3<TAB>-0.005<TAB>-0.550<TAB>...
@@ -1004,7 +1009,7 @@ This will lead to a timeline track for Lab_test with an additional subtrack spec
 
 
 ## Gistic Data
-Running GISTIC 2.0 on e.g. GenePattern not only provides the [Discrete Copy Number Data](#discrete-copy-number-data), but also provides an amp_genes and a del_genes file. These cannot be directly imported into cBioPortal, but first have to be converted to a different file format. Currently, there is no easy way available to do this. However, the cBioPortal team is aiming to make the necessary [cbioportal_pipelines](https://github.com/cBioPortal/cbioportal-pipelines/blob/master/importer/src/main/java/org/mskcc/cbio/importer/converter/internal/GisticGenesConverterImpl.java) functionality available via issue [#873](https://github.com/cBioPortal/cbioportal/issues/873). 
+Running GISTIC 2.0 on e.g. GenePattern not only provides the [Discrete Copy Number Data](#discrete-copy-number-data), but also provides an amp_genes and a del_genes file. These cannot be directly imported into cBioPortal, but first have to be converted to a different file format. An example can be found in the [ACC TCGA study](https://github.com/cBioPortal/datahub/blob/master/public/acc_tcga/data_gistic_genes_amp.txt) on cBioPortal Datahub.
 
 After uploading a gistic_amp and/or gistic_del file, a new button becomes available in the Enter Gene Set section, called "Select Genes from Recurrent CNAs (Gistic)". 
 
@@ -1156,7 +1161,7 @@ In both of these cases, the sample-profile matrix file does not need to be provi
 ## Gene Set Data
 A description of importing gene sets (which are required before loading a gene set study) can be found [here](Import-Gene-Sets.md). This page also contains a description of how to import gene set hierarchy data, which is required to show a hierarchical tree on the query page to select gene sets.
 
-cBioPortal supports GSVA scores and GSVA-like scores, such as ssGSEA. The [Gene Set Variation Analysis]([http://www.bioconductor.org/packages/release/bioc/html/GSVA.html]) method in R (GSVA, [Hänzelmann, 2013](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7)) can calculate several types of scores (specified with the `methods=` argument) and outputs a score between -1 and 1. The GSVA method also calculates a p-value per score using a boothstrapping method. 
+cBioPortal supports GSVA scores and GSVA-like scores, such as ssGSEA. The [Gene Set Variation Analysis](http://www.bioconductor.org/packages/release/bioc/html/GSVA.html) method in R (GSVA, [Hänzelmann, 2013](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7)) can calculate several types of scores (specified with the `methods=` argument) and outputs a score between -1 and 1. The GSVA method also calculates a p-value per score using a bootstrapping method.
 
 To import the GSVA(-like) data, a score and p-value data file are required. It is important that the dimensions of the score and p-value file are the same and that they contain the same gene sets and samples. Both data files require a meta file.
 
@@ -1194,7 +1199,7 @@ geneset_def_version: msigdb_6.1
 ### GSVA score data file
 The data file will be a simple tab separated format, similar to the expression data file: each sample is a column, each gene set a row, each cell contains the GSVA score for that sample x gene set combination.
 
-The first column is the `geneset_id` and contains the name of the gene sets. Gene set names should be formatted in uppercase. The other columns are sample columns: An additional column for each sample in the dataset using the sample id as the column header.
+The first column is `geneset_id` and contains the names of the gene sets. Gene set names should be formatted in uppercase. The other columns are sample columns: An additional column for each sample in the dataset using the sample id as the column header.
 
 The cells contain the GSVA(-like) score: a real number between -1.0 and 1.0, representing the score for the gene set in the respective sample, or NA when the score for the gene set in the respective sample could not be (or was not) calculated. Example with 2 gene sets and 3 samples: 
 
@@ -1238,7 +1243,7 @@ geneset_def_version: msigdb_6.1
 ### GSVA p-value data file
 The data file will be a simple tab separated format, similar to the score file: each sample is a column, each gene set a row, each cell contains the p-value for the score found for sample x gene set combination.
 
-The first column is the `geneset_id` and contains the name of the gene sets. Gene set names should be formatted in uppercase. The other columns are sample columns: An additional column for each sample in the dataset using the sample id as the column header.
+The first column is `geneset_id` and contains the names of the gene sets. Gene set names should be formatted in uppercase. The other columns are sample columns: An additional column for each sample in the dataset using the sample id as the column header.
 
 The cells contain the p-value for the GSVA score: A real number, between 0.0 and 1.0, representing the p-value for the GSVA score calculated for the gene set in the respective sample, or NA when the score for the gene is also NA. Example with 2 gene sets and 3 samples: 
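The constraints stated for the two GSVA data files (same gene sets and samples, `geneset_id` as first column, uppercase gene set names, scores in [-1.0, 1.0], p-values in [0.0, 1.0], `NA` allowed) can be checked with a short script before import. The following is a minimal sketch and not the cBioPortal validator. The helper names `read_matrix`, `in_range` and `check_gsva_pair` are hypothetical, as are the gene set names and sample ids in the example data.

```python
import csv
import io


def read_matrix(text):
    """Parse a tab-separated gene set matrix into (sample_ids, {geneset_id: cells})."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    if header[0] != "geneset_id":
        raise ValueError("first column must be geneset_id")
    return header[1:], {row[0]: row[1:] for row in body}


def in_range(cell, low, high):
    """True when the cell is NA or a real number within [low, high]."""
    if cell == "NA":
        return True
    try:
        return low <= float(cell) <= high
    except ValueError:
        return False


def check_gsva_pair(score_text, pvalue_text):
    """Return a list of problems found across a score file and a p-value file."""
    problems = []
    score_samples, scores = read_matrix(score_text)
    pvalue_samples, pvalues = read_matrix(pvalue_text)
    # Membership is compared as sets; column and row order are not checked.
    if set(score_samples) != set(pvalue_samples):
        problems.append("score and p-value files list different samples")
    if set(scores) != set(pvalues):
        problems.append("score and p-value files list different gene sets")
    for geneset_id, cells in scores.items():
        if geneset_id != geneset_id.upper():
            problems.append(f"{geneset_id}: gene set name is not uppercase")
        if not all(in_range(cell, -1.0, 1.0) for cell in cells):
            problems.append(f"{geneset_id}: score outside [-1.0, 1.0]")
    for geneset_id, cells in pvalues.items():
        if not all(in_range(cell, 0.0, 1.0) for cell in cells):
            problems.append(f"{geneset_id}: p-value outside [0.0, 1.0]")
    return problems


# Hypothetical data: GENE_SET_B has an out-of-range score and no p-value row.
scores_text = (
    "geneset_id\tSAMPLE_ID_1\tSAMPLE_ID_2\n"
    "GENE_SET_A\t0.25\t-0.5\n"
    "GENE_SET_B\tNA\t1.2\n"
)
pvalues_text = (
    "geneset_id\tSAMPLE_ID_1\tSAMPLE_ID_2\n"
    "GENE_SET_A\t0.01\t0.2\n"
)
for problem in check_gsva_pair(scores_text, pvalues_text):
    print(problem)
```

Comparing samples and gene sets as sets keeps the sketch short. A stricter check could also require identical ordering in both files.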
 

diff --git a/docs/README.md b/docs/README.md
@@ -53,14 +53,12 @@ We also maintain an active [list of RFCs (Requests for Comments)](RFC-List.md) w
 * [Data Loading Overview](Data-Loading.md)
    * [Using the Dataset Validator](Using-the-dataset-validator.md)
    * [Using the metaImport Script](Using-the-metaImport-script.md)
-   * [Development, Debugging and Maintenance Using cbioportalImporter](Development,-debugging-and-maintenance-mode-using-cbioportalImporter.md)
    * [File Formats](File-Formats.md)
        * [Z-Score Normalization](Z-Score-normalization-script.md)
-   * [Data Loading: How the loader deals with Hugo Symbols](Data-Loading-How-the-loader-deals-with-Hugo-symbols.md) (TODO)    
-   * [Data Loading: What You Need To Change](Data-Loading-What-You-Need-To-Change.md)
-   * [Data Loading: Tips and Best Practices](Data-Loading-Tips-and-Best-Practices.md)
-   * [Importer Tool](Importer-Tool.md) (deprecated)
-   * [Importing Gene Sets](Import-Gene-Sets.md)
+   * [Maintaining Studies](Data-Loading-Maintaining-Studies.md)
+   * [For Developers](Data-Loading-For-Developers.md)
+   * [Tips and Best Practices](Data-Loading-Tips-and-Best-Practices.md)
+   * [Importing Gene Sets for Gene Set Scoring Data](Import-Gene-Sets.md)
 
 ### 5.2 Datasets
 * [Downloads](Downloads.md)

diff --git a/docs/Using-the-metaImport-script.md b/docs/Using-the-metaImport-script.md
@@ -75,4 +75,4 @@ This example imports the study to the localhost, creates an html report and show
 By adding `-o`, warnings will be overridden and import will start after validation.
 
 ## Development / debugging mode
-For developers and specific testing purposes, an extra script, cbioportalImporter.py, is available which imports data regardless of validation results. Check [this](Development,-debugging-and-maintenance-mode-using-cbioportalImporter.md) page for more information on how to use it.
+For developers and specific testing purposes, an extra script, cbioportalImporter.py, is available which imports data regardless of validation results. Check [this](Data-Loading-For-Developers.md) page for more information on how to use it.