
FlatteningStats

Dian SUN edited this page Apr 23, 2020 · 1 revision

SCALPEL-Analysis: The Flattening Stats

In the SCALPEL-Analysis project, we provide new statistics to check the result of the join operation, visualise the flattening stats, and compare them with previous flattening runs from a Jupyter notebook.

In this article, I’ll show how to use the Flattening Stats API efficiently.

The tutorial assumes that you have a valid metadata JSON file, which is produced by SCALPEL-Flattening.

1. Add SCALPEL-Analysis to the PYTHONPATH

SCALPEL-Analysis is not yet available through pip or Conda channels. However, as with any Python library, it is straightforward to add and use.

Before proceeding, you will need to download SCALPEL-Analysis as a zip from here. Put the zip wherever suits you. Please make sure that you meet all the requirements listed here.

I have downloaded the zip file, and put it under the path /home/user/builds/dist/scalpel.zip.

There are two ways to add it:

  1. Permanently add the path to the PYTHONPATH environment variable. This lets you add it once and for all.
  2. Add it at runtime through the sys module, as shown below.

import sys
project_path = '/home/user/builds/dist/scalpel.zip'
sys.path.append(project_path)
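
If the notebook is re-run often, a small guard can avoid appending the same path twice and fail early when the zip is missing. This is only a minimal sketch: `add_to_sys_path` is a hypothetical helper name and is not part of SCALPEL-Analysis.

```python
import os
import sys


def add_to_sys_path(path):
    """Append `path` to sys.path once. Returns True if it was added.

    Hypothetical helper, not part of SCALPEL-Analysis.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(f"No such file or directory: {path}")
    if path in sys.path:
        return False  # already on the import path, nothing to do
    sys.path.append(path)
    return True


# Usage, with the path used in this tutorial:
# add_to_sys_path('/home/user/builds/dist/scalpel.zip')
```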

2. Load a Metadata JSON file

As stated in the first comment cell, you will need to import the Metadata JSON.

Flattening produces a metadata file in JSON format that contains the paths of the output data. This JSON file is the input of the Flattening Stats API. So the first step is to read the metadata file generated by the flattening job (run using SCALPEL-Flattening) and load the flat tables that you need into the Jupyter notebook.

The following cell shows how (I'll use the DCIR cohort as an example).

from scalpel.flattening.flat_table_collection import FlatTableCollection

with open("metadata_flattening_2020_03_27_12_22_48.json", "r") as f:
    data_collection = FlatTableCollection.from_json(f.read())

# show the names of the single tables that make up a flat table
data_collection.single_table_names_from_flat_table("DCIR")

# fetch a flat table from the data collection
dcir = data_collection.get("DCIR")
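
Before passing the file to `FlatTableCollection.from_json`, it can be useful to confirm that it parses as JSON and to glance at its top-level structure. The sketch below makes no assumption about the metadata schema; `describe_json` is a hypothetical helper, not part of SCALPEL-Analysis.

```python
import json


def describe_json(text):
    """Parse a JSON string and summarise its top-level structure.

    Hypothetical helper: it assumes nothing about the metadata schema.
    """
    parsed = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if isinstance(parsed, dict):
        # Map each top-level key to the Python type of its value.
        return {key: type(value).__name__ for key, value in parsed.items()}
    return type(parsed).__name__


# Usage with the metadata file from this tutorial:
# with open("metadata_flattening_2020_03_27_12_22_48.json", "r") as f:
#     print(describe_json(f.read()))
```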

3. Confidence Level

The columns that will take part in the subsequent extraction (performed with SCALPEL-Extraction) should be validated by their confidence level. The objective is to check whether the joins worked correctly on the current single tables.

import matplotlib.pyplot as plt
from IPython.display import display
from scalpel.stats.flattening_confidence_degree import plot_flat_table_confidence_degree

# calculate and show the confidence degree of the DCIR cohort
plot_flat_table_confidence_degree(plt.figure(figsize=(12, 8)), dcir, show=True, show_func=display)
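
The exact computation behind `plot_flat_table_confidence_degree` is not described on this page. As a rough illustration of the kind of question such a check answers (did the values of a single table survive the join?), here is a toy sketch in plain Python. The function `retention_ratio` and the sample values are hypothetical and are not the SCALPEL-Analysis implementation.

```python
def retention_ratio(single_table_values, flat_table_values):
    """Share of distinct non-null values of a single-table column that
    also appear in the same column of the flat table.

    Illustrative only: not the formula used by SCALPEL-Analysis.
    """
    source = {v for v in single_table_values if v is not None}
    if not source:
        return 1.0  # no values in the source column, so nothing can be lost
    joined = {v for v in flat_table_values if v is not None}
    return len(source & joined) / len(source)


# Toy data: the key "k4" was lost during the join.
single = ["k1", "k2", "k3", "k4", None]
flat = ["k1", "k1", "k2", "k3"]
print(retention_ratio(single, flat))  # 0.75
```

A ratio well below 1.0 for a column would suggest that the join dropped rows and deserves a closer look.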


4. Flattening Stats

The Flattening Stats API provides several plotting functions that summarise the content of a flat table.

| Method name | Visualisation description |
| --- | --- |
| plot_patient_events_each_year_on_months | Visualises the 'patient events each year on months' stat in a seaborn context |
| plot_patients_each_year_on_months | Visualises the 'patients each year on months' stat in a seaborn context |
| plot_patient_events_on_years | Visualises the 'patient events on years' stat in a seaborn context |
| plot_patients_on_years | Visualises the 'patients on years' stat in a seaborn context |

from scalpel.stats.flattening_stat import plot_patient_events_each_year_on_months 

# show the number of patient events per month for each year from 2010 to 2014
plot_patient_events_each_year_on_months(plt.figure(figsize=(12, 8)), dcir, years=[2010, 2011, 2012, 2013, 2014])
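
To make the stat names more concrete, the toy sketch below assumes that "patient events" counts event rows and "patients" counts distinct patients, both grouped by period. This is an interpretation of the method names rather than something stated on this page. The data and the helper `count_by_month` are made up for illustration; the library itself works on the flat table, not on Python lists.

```python
from collections import defaultdict
from datetime import date


def count_by_month(events):
    """Return two dicts keyed by (year, month): the number of events and
    the number of distinct patients.

    `events` is an iterable of (patient_id, event_date) pairs.
    """
    n_events = defaultdict(int)
    patients = defaultdict(set)
    for patient_id, event_date in events:
        key = (event_date.year, event_date.month)
        n_events[key] += 1
        patients[key].add(patient_id)
    n_patients = {key: len(ids) for key, ids in patients.items()}
    return dict(n_events), n_patients


# Toy data (hypothetical): patient "p1" has two events in January 2010.
events = [
    ("p1", date(2010, 1, 5)),
    ("p1", date(2010, 1, 20)),
    ("p2", date(2010, 1, 7)),
    ("p2", date(2011, 3, 1)),
]
n_events, n_patients = count_by_month(events)
print(n_events[(2010, 1)], n_patients[(2010, 1)])  # 3 2
```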


5. Flattening History

You may want to compare the results of successive flattening runs. To do this, pass a save_path parameter when calling a flattening stat function, and then use the history API that we supply.

| Method name | Visualisation description |
| --- | --- |
| compare_stats_patient_events_on_months | Compares histories of patient events on months |
| compare_stats_patients_on_months | Compares histories of patients on months |
| compare_stats_patient_events_on_years | Compares histories of patient events on years |
| compare_stats_patients_on_years | Compares histories of patients on years |

# show and save the result of the stat "patient events each year on months"
plot_patient_events_each_year_on_months(plt.gcf(), dcir, save_path="/user/ds/CNAM412/flattening/stat/dcir/patient_events_on_months", years=[2010, 2011, 2012, 2013, 2014])

# imagine that we have two saved stat histories, labelled "A" and "B"
his = {"A":"/user/ds/CNAM412/flattening/stat/dcir/patient_events_on_months", "B":"/user/ds/CNAM412/flattening/stat/dcir/patient_events_on_months"}
from scalpel.stats.flattening_stat_history import compare_stats_patient_events_on_months

# compare the histories of patient events on months
compare_stats_patient_events_on_months(plt.gcf(), his, show=True, show_func=display)
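
Conceptually, comparing two histories means lining up the same periods from two runs and looking at where the counts differ. The sketch below does this for two plain dictionaries. The numbers and the helper `diff_counts` are invented for illustration, and nothing is assumed here about the format in which SCALPEL-Analysis stores a stat under save_path.

```python
def diff_counts(run_a, run_b):
    """Return {period: count_b - count_a} for every period seen in either
    run, treating a missing period as a count of 0.
    """
    periods = sorted(set(run_a) | set(run_b))
    return {p: run_b.get(p, 0) - run_a.get(p, 0) for p in periods}


# Toy monthly counts from two hypothetical flattening runs.
run_a = {"2010-01": 120, "2010-02": 98}
run_b = {"2010-01": 120, "2010-02": 101, "2010-03": 7}
print(diff_counts(run_a, run_b))
# {'2010-01': 0, '2010-02': 3, '2010-03': 7}
```

Non-zero entries point at the periods where the two runs disagree, which is where a visual comparison is most worth inspecting.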
