FlatteningStats
In the SCALPEL-Analysis project, we have added new statistics to check the result of the join operation, visualise the flattening stats, and compare them with previous flattening operations from a Jupyter notebook.
In this article, I'll show how to use the Flattening Stats API efficiently.
The tutorial assumes that you have a valid metadata JSON file, which is produced by SCALPEL-Flattening.
SCALPEL-Analysis is not yet available through PIP or Conda channels. However, as with any Python library, it is straightforward to add and use.
Before proceeding, you will need to download SCALPEL-Analysis as a zip from here. Put the zip wherever suits you. Please make sure that you meet all the requirements listed here.
I have downloaded the zip file and put it under the path /home/user/builds/dist/scalpel.zip.
There are two ways of making the library importable:
- Permanently add the zip path to the PYTHONPATH environment variable. This lets you add it once and for all.
- Add it at runtime through a sys import, as shown below.
import sys
project_path = '/home/user/builds/dist/scalpel.zip'
sys.path.append(project_path)
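Appending the zip path works because Python can import packages directly from a zip archive listed on sys.path. The sketch below illustrates that mechanism with a throwaway archive. The package name demo_pkg and its contents are hypothetical stand-ins and are not part of SCALPEL-Analysis.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip containing a hypothetical package, standing in for scalpel.zip.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("demo_pkg/__init__.py", "GREETING = 'imported from zip'\n")

# Appending the archive to sys.path makes its top-level packages importable.
if zip_path not in sys.path:
    sys.path.append(zip_path)
importlib.invalidate_caches()

demo_pkg = importlib.import_module("demo_pkg")
print(demo_pkg.GREETING)  # imported from zip
```

If an import from the real archive fails after the path has been appended, a reasonable first check is that the scalpel package directory sits at the top level of the zip.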
As stated in the first comment cell, you will need to import the Metadata JSON.
Flattening produces a metadata file in JSON format that contains the paths of the output data. That JSON is the input of the Flattening Stats API. So the first step is to read the metadata file generated by the flattening job (run using SCALPEL-Flattening) and load the flat tables that you need into the Jupyter notebook.
The following cell shows how (I'll use the DCIR cohort as an example).
from scalpel.flattening.flat_table_collection import FlatTableCollection
# Read the metadata file produced by the flattening job
with open("metadata_flattening_2020_03_27_12_22_48.json", "r") as f:
    data_collection = FlatTableCollection.from_json(f.read())
# Show the single tables that make up a flat table
data_collection.single_table_names_from_flat_table("DCIR")
# Fetch a flat table from the data collection
dcir = data_collection.get("DCIR")
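Since from_json receives the file content as a string, a missing or malformed metadata file only surfaces when the collection is built. A minimal sketch of a pre-check using only the standard library follows. read_metadata_text is a hypothetical helper, not part of SCALPEL-Analysis, and it assumes nothing about the metadata schema beyond it being valid JSON.

```python
import json
from pathlib import Path


def read_metadata_text(path):
    """Return the raw text of a metadata file after checking that it parses as JSON.

    Hypothetical helper: it validates JSON syntax only, not the SCALPEL metadata schema.
    """
    metadata_path = Path(path)
    if not metadata_path.is_file():
        raise FileNotFoundError(f"Metadata file not found: {metadata_path}")
    text = metadata_path.read_text(encoding="utf-8")
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        raise ValueError(f"{metadata_path} is not valid JSON: {error}") from error
    return text
```

The returned string can then be passed to FlatTableCollection.from_json in place of f.read().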
The columns that will take part in the subsequent extraction (performed by SCALPEL-Extraction) should be validated through their confidence degree. The objective is to check that the joins between the current single tables ran well.
import matplotlib.pyplot as plt
from IPython.display import display
from scalpel.stats.flattening_confidence_degree import plot_flat_table_confidence_degree
# Compute and show the confidence degree of the DCIR cohort
plot_flat_table_confidence_degree(plt.figure(figsize=(12, 8)), dcir, show=True, show_func=display)
The Flattening Stats API provides several statistics that summarise the content of a flat table.
Method name | Visualisation description |
---|---|
plot_patient_events_each_year_on_months | This method is used to visualize the 'patient events each year on months stat' in seaborn context |
plot_patients_each_year_on_months | This method is used to visualize the 'patients each year on months stat' in seaborn context |
plot_patient_events_on_years | This method is used to visualize the 'patient events on years stat' in seaborn context |
plot_patients_on_years | This method is used to visualize the 'patients on years stat' in seaborn context |
from scalpel.stats.flattening_stat import plot_patient_events_each_year_on_months
# Show the number of patient events per month, from 2010 to 2014
plot_patient_events_each_year_on_months(plt.figure(figsize=(12, 8)), dcir, years=[2010, 2011, 2012, 2013, 2014])
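To make the "patient events each year on months" and "patients each year on months" ideas concrete, here is a toy sketch of per-year, per-month counts in plain Python. The records and both helper functions are invented for illustration. This is not how the Flattening Stats API computes its figures, only a picture of the kind of numbers being plotted.

```python
from collections import Counter
from datetime import date

# Invented (patient_id, event_date) records, standing in for rows of a flat table.
events = [
    ("p1", date(2010, 1, 5)),
    ("p1", date(2010, 1, 20)),
    ("p2", date(2010, 2, 3)),
    ("p2", date(2011, 1, 9)),
]


def count_events_by_year_month(records):
    """Count events per (year, month) pair."""
    return Counter((day.year, day.month) for _, day in records)


def count_patients_by_year_month(records):
    """Count distinct patients per (year, month) pair."""
    distinct = {(patient, day.year, day.month) for patient, day in records}
    return Counter((year, month) for _, year, month in distinct)


print(count_events_by_year_month(events)[(2010, 1)])    # 2
print(count_patients_by_year_month(events)[(2010, 1)])  # 1
```

The two counts differ for January 2010 because patient p1 has two events in that month.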
You will probably want to see how the statistics differ between flattening runs. To do this, pass a save_path parameter when calling a flattening stat, then use the history API that we supply.
Method name | Visualisation description |
---|---|
compare_stats_patient_events_on_months | This method is used to compare histories of patient events on months |
compare_stats_patients_on_months | This method is used to compare histories of patients on months |
compare_stats_patient_events_on_years | This method is used to compare histories of patient events on years |
compare_stats_patients_on_years | This method is used to compare histories of patients on years |
# Show and save the result of the stat "patient events each year on months"
plot_patient_events_each_year_on_months(plt.gcf(), dcir, save_path="/user/ds/CNAM412/flattening/stat/dcir/patient_events_on_months", years=[2010, 2011, 2012, 2013, 2014])
# Imagine that we have two saved histories of this stat
# (both labels point to the same path here; in practice each would presumably point to a different run)
his = {"A": "/user/ds/CNAM412/flattening/stat/dcir/patient_events_on_months", "B": "/user/ds/CNAM412/flattening/stat/dcir/patient_events_on_months"}
from scalpel.stats.flattening_stat_history import compare_stats_patient_events_on_months
# Compare the histories of patient events on months
compare_stats_patient_events_on_months(plt.gcf(), his, show=True, show_func=display)
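Conceptually, comparing two histories amounts to lining up the same statistic from two runs and looking at where they diverge. The toy sketch below does this for two invented sets of monthly counts. The data and the diff_counts helper are hypothetical and say nothing about the format in which save_path stores results.

```python
# Invented monthly event counts for two flattening runs, keyed by (year, month).
run_a = {(2010, 1): 120, (2010, 2): 98, (2010, 3): 110}
run_b = {(2010, 1): 120, (2010, 2): 101, (2010, 4): 7}


def diff_counts(first, second):
    """Return {key: second - first} for every key whose counts differ.

    A key missing from one run is treated as a count of 0.
    """
    keys = sorted(set(first) | set(second))
    deltas = {key: second.get(key, 0) - first.get(key, 0) for key in keys}
    return {key: delta for key, delta in deltas.items() if delta != 0}


print(diff_counts(run_a, run_b))
# {(2010, 2): 3, (2010, 3): -110, (2010, 4): 7}
```

A month that appears in only one run, such as March or April 2010 here, shows up with the full count as its difference, which is often the first sign that a join dropped or duplicated rows.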