Releases: X-DataInitiative/SCALPEL-Extraction
Featuring 2.1 pre-release
This release contains:
- N-level exposures and Prescription
Featuring 2.0 pre-release
This includes:
- Refactoring of all Event Extractors.
- Bulk main where all Events are extracted from Source.
This release is backward compatible with previous studies.
An important documentation effort is needed before the beta release.
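For illustration only, the refactored extraction can be pictured roughly as below; the names Event, Sources, EventExtractor and BulkExtraction are hypothetical and do not reflect the actual API of this package.

import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical sketch: each Event type gets its own extractor reading from the
// shared Sources, so a bulk main only has to run every extractor and union the results.
case class Event(patientID: String, category: String, value: String, start: java.sql.Timestamp)
case class Sources(dcir: DataFrame, pmsiMco: DataFrame)

trait EventExtractor {
  def extract(sources: Sources): Dataset[Event]
}

object BulkExtraction {
  // "Bulk main" idea: extract all Events from the Sources in one pass over the extractors.
  def extractAll(extractors: Seq[EventExtractor], sources: Sources): Dataset[Event] =
    extractors.map(_.extract(sources)).reduce(_ union _)
}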
Featuring 1.1
PureConfig-compatible release.
This release uses PureConfig for configuration throughout the package.
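As a rough illustration of what a PureConfig-based load looks like (the configuration case classes and keys below are hypothetical, not the actual classes of this package):

import pureconfig._
import pureconfig.generic.auto._

// Hypothetical configuration classes, for illustration only.
case class ExposuresConfig(minPurchases: Int, startDelay: Int)
case class StudyConfig(exposures: ExposuresConfig)

object ConfigLoading {
  // PureConfig derives readers for the case classes and maps the HOCON
  // configuration onto them, failing loudly if a key is missing or mistyped.
  def load(): StudyConfig = ConfigSource.default.loadOrThrow[StudyConfig]
}

(Older PureConfig versions expose pureconfig.loadConfig instead of ConfigSource; the exact entry point depends on the version used.)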
Cumulative Cox 1.0.0
Important: to run this release, you need a version of the DCIR that contains the column ER_PHA_F_PHA_ACT_QSN (with "_" instead of "."), and line 36 of filtering/implicits/package.scala should be changed from:
.extract(path).persist().where(col("`ER_PHA_F.PHA_ACT_QSN`") <= upperBoundIrphaQuantity && col("`ER_PHA_F.PHA_ACT_QSN`")>0)
to:
.extract(path).where(col("ER_PHA_F_PHA_ACT_QSN") <= upperBoundIrphaQuantity && col("ER_PHA_F_PHA_ACT_QSN")>0)
Note: The uploaded jar already has this change, but the source code doesn't.
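For context, the backticks are only needed because the original column name contains a dot, which Spark would otherwise parse as a struct-field access; with the underscore-named column a plain reference is enough. A minimal sketch (keepPositiveQuantities is a hypothetical helper, not part of the package):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Dotted column names must be wrapped in backticks:
//   col("`ER_PHA_F.PHA_ACT_QSN`")
// whereas underscore-named columns can be referenced directly:
def keepPositiveQuantities(df: DataFrame): DataFrame =
  df.where(col("ER_PHA_F_PHA_ACT_QSN") > 0)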
MLPP Featuring 1.4.0
Same as the previous release, with a new filter for removing patients who did not have a target cancer within the study period.
A new entry was added to the config file:
mlpp_parameters = {
  ...
  exposures = {
    ...
    filter_never_sick_patients = false
  }
}
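For illustration, the kind of filtering enabled by filter_never_sick_patients could look like the sketch below; the dataset and column names (targetCancerEvents, patientID, start) are assumptions, not the actual implementation.

import org.apache.spark.sql.DataFrame

// Hypothetical sketch: keep only patients with at least one target cancer event
// inside the study period, then restrict the feature events to those patients.
def filterNeverSickPatients(
    events: DataFrame,
    targetCancerEvents: DataFrame,
    studyStart: java.sql.Timestamp,
    studyEnd: java.sql.Timestamp): DataFrame = {

  val sickPatients = targetCancerEvents
    .where(targetCancerEvents("start").between(studyStart, studyEnd))
    .select("patientID")
    .distinct()

  // the inner join drops every patient who never had a target cancer in the window
  events.join(sickPatients, Seq("patientID"), "inner")
}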
MLPP Featuring 1.3.0
Same as previous except for two main changes:
- Allows lists for the bucket_size and lag_count parameters.
- Added a new parameter, include_death_bucket, which determines whether the bucket in which a patient died should be filled with zeroes in the final matrix (if false) or not.
The final default mlpp_parameters config object is:
mlpp_parameters = {
  bucket_size = [30] # in days
  lag_count = [10]
  min_timestamp = [2006, 1, 1]
  max_timestamp = [2009, 12, 31, 23, 59, 59]
  include_death_bucket = false
  exposures = {
    min_purchases = 1
    start_delay = 0
    purchases_window = 0
    only_first = false
    filter_lost_patients = false
    filter_diagnosed_patients = true
    diagnosed_patients_threshold = 0
    filter_delayed_entries = true
    delayed_entry_threshold = 12
  }
}
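To make the bucket-related parameters concrete, here is a hedged, plain-Scala sketch of how exposures could be laid out in fixed-size buckets and how include_death_bucket might act on the bucket of death; the types and field names are illustrative only, not the actual featuring code.

// Illustrative only: one row of the final matrix, with one exposure value per time bucket.
case class PatientRow(patientID: String, exposuresByBucket: Vector[Double], deathBucket: Option[Int])

// Bucket index of an event happening `days` days after min_timestamp,
// for a given bucket_size (30 days in the default config above).
def bucketIndex(days: Int, bucketSize: Int = 30): Int = days / bucketSize

// include_death_bucket = false: the bucket in which the patient died is filled with zeroes.
def applyDeathBucket(row: PatientRow, includeDeathBucket: Boolean): Vector[Double] =
  row.deathBucket match {
    case Some(b) if !includeDeathBucket => row.exposuresByBucket.updated(b, 0.0)
    case _ => row.exposuresByBucket
  }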
MLPP Featuring 1.2.0
Code run at the CNAM to get MLPP features.
Steps to run the featuring:
1) Run the jar with spark-submit. Example:
spark-submit \
--executor-memory 110G \
--class fr.polytechnique.cmap.cnam.filtering.mlpp.MLPPMain \
./SNIIRAM-flattening-assembly.jar conf=./mlpp_config.conf env=cnam
Where mlpp_config.conf is the custom configuration file.
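As a rough illustration of the conf=/env= argument style (not the actual MLPPMain implementation), such key=value arguments can be parsed as follows:

// Hypothetical sketch of parsing arguments like "conf=./mlpp_config.conf env=cnam".
def parseArgs(args: Array[String]): Map[String, String] =
  args.filter(_.contains("=")).map { arg =>
    val Array(key, value) = arg.split("=", 2)
    key -> value
  }.toMap

// Example:
//   val params = parseArgs(Array("conf=./mlpp_config.conf", "env=cnam"))
//   val confPath = params.getOrElse("conf", "./default.conf")
//   val env = params.getOrElse("env", "test")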
2) The CSV features will be written to the path found in mlpp_config.conf under the key mlpp_features, so a call to hdfs dfs -get is needed. Example:
mkdir mlpp && cd mlpp
hdfs dfs -get /shared/mlpp_features/csv/*
3) Copy the MLPP_featuring.py script to the same directory as the local features and run it. Example:
cp MLPP_featuring.py mlpp && cd mlpp
python MLPP_featuring.py
Cox Experiment
The attached results.zip file contains the results of the Cox model (5 .txt files + 1 .R file) run at the CNAM using the fr.polytechnique.cmap.cnam.filtering.cox.CoxMain class, with 4 different configuration changes relative to the src/main/resources/config/filtering-default.conf file, as follows:
- cox_parameters.exposures.Start delay = 0
- cox_parameters.exposures.Min Purchases = 1
- cox_parameters.Follow-up delay = 4 months
- cox_parameters.Follow-up delay = 2 months
These runs used the R script cox_pio.R and yielded the corresponding result files:
- startDelay0Result.txt
- minPurchase1Result.txt
- followupDelay4Result.txt
- followupDelay2Result.txt
- coxDefaultResult.txt (Result without any changes in the config file)
MLPP Featuring 1.1.0
Code run at the CNAM to get MLPP features.
Steps to run the featuring:
1) Run the jar with spark-submit. Example:
spark-submit \
--executor-memory 110G \
--class fr.polytechnique.cmap.cnam.filtering.mlpp.MLPPProvisoryMain \
./SNIIRAM-flattening-assembly-1.0.jar cnam 10 30
Note: the expected arguments are, respectively, environment, lagCount and bucketSize (in days).
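As a small illustration (not the actual MLPPProvisoryMain code), the three positional arguments could be read like this:

// Hypothetical sketch: read <environment> <lagCount> <bucketSize in days> from args.
def readArgs(args: Array[String]): (String, Int, Int) = {
  require(args.length >= 3, "expected: <environment> <lagCount> <bucketSize>")
  (args(0), args(1).toInt, args(2).toInt)
}

// readArgs(Array("cnam", "10", "30")) == ("cnam", 10, 30)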
2) The CSV features will be written to /shared/mlpp_features/<broad|narrow>/csv/, so a call to hdfs dfs -get is needed. Example:
mkdir mlpp_broad && cd mlpp_broad
hdfs dfs -get /shared/mlpp_features/broad/csv/*
3) Copy the MLPP_featuring.py script to the same directory as the local features and run it. Example:
cp MLPP_featuring.py mlpp_broad && cd mlpp_broad
python MLPP_featuring.py
Note: results.tar contains the results of the longitudinal multinomial model implemented in MLPP-147. The archive contains an HTML export of the notebook used to produce the results, and the coefficients obtained for several parameters. The coefficients were saved to text files using numpy.savetxt.