Merge pull request #21 from X-DataInitiative/rename-library
Naming & doc prior to release
MaryanMorel authored Oct 17, 2019
2 parents f26cb5c + 4f26c28 commit 8ad569e
Showing 65 changed files with 1,214 additions and 747 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -30,7 +30,7 @@ jobs:
command: |
eval "$(pyenv init -)"
pyenv local 3.5.3
cat /dev/null | python -m nose --with-coverage --cover-package=src/exploration/core --cover-package=src/exploration/loaders --cover-package=src/exploration/flattening
cat /dev/null | python -m nose --with-coverage --cover-package=scalpel/core --cover-package=scalpel/loaders --cover-package=scalpel/flattening
- run:
name: Run coverage
8 changes: 4 additions & 4 deletions .coveragerc
@@ -1,6 +1,6 @@
[run]
branch = True
source = src/exploration
source = scalpel

[report]
exclude_lines =
@@ -11,6 +11,6 @@ exclude_lines =
ignore_errors = True
omit =
tests/*
src/libs/*
src/stats/*
src/study/*
scalpel/libs/*
scalpel/stats/*
scalpel/study/*
12 changes: 7 additions & 5 deletions .gitignore
@@ -1,13 +1,15 @@
.idea/


\.idea/

src/libs/
scalpel/libs/

test\.py

*.pyc

*.log

\.coverage
.coverage

spark-warehouse
scalpel/__pycache__/
scalpel/core/__pycache__/
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -15,7 +15,7 @@ repos:
hooks:
- id: nosetests
name: nosetests
entry: bash -ec 'nosetests --with-coverage --cover-package=src/exploration/core --cover-package=src/exploration/loaders --cover-package=src/exploration/flattening'
entry: bash -ec 'nosetests --with-coverage --cover-package=scalpel/core --cover-package=scalpel/drivers --cover-package=scalpel/flattening'
language: system
files: \.py$
- repo: local
4 changes: 3 additions & 1 deletion CONTRIBUTORS.md
@@ -1,7 +1,9 @@
# Contributors

The _SNIIRAM-exploration_ package was initially implemented by researchers, developers, and PhD students at [CMAP](http://www.cmap.polytechnique.fr/?lang=en).
The _SCALPEL-Analysis_ package was initially implemented by researchers, developers, and PhD students at [CMAP](http://www.cmap.polytechnique.fr/?lang=en).

## List of Contributors

- Youcef Sebiat
- Maryan Morel
- Dinh Phong Nguyen
2 changes: 1 addition & 1 deletion LICENSE.txt
@@ -1,6 +1,6 @@
BSD 3-Clause License

Copyright (c) 2018, The SNIIRAM-exploration developers
Copyright (c) 2019, The SCALPEL-Analysis developers
All rights reserved.

Redistribution and use in source and binary forms, with or without
4 changes: 3 additions & 1 deletion Makefile
@@ -1,3 +1,5 @@
# License: BSD 3 clause

help:
@echo "clean - remove all build, test, coverage and Python artifacts"
@echo "clean-pyc - remove Python file artifacts"
@@ -50,5 +52,5 @@ test:

build: clean
mkdir ./dist
zip -x main.py -x \*libs\* -r ./dist/exploration.zip .
zip -x main.py -x \*libs\* -r ./dist/scalpel.zip .
cd ./src/libs && zip -r ../../dist/libs.zip .
271 changes: 249 additions & 22 deletions README.md
@@ -1,35 +1,262 @@
# SNIIRAM-exploration
[![CircleCI](https://circleci.com/gh/X-DataInitiative/SCALPEL-Analysis/tree/master.svg?style=shield&circle-token=77551e927f0d9f66b6c4755743d2cb7f5753395c)](https://circleci.com/gh/X-DataInitiative/SCALPEL-Analysis)
[![codecov](https://codecov.io/gh/X-DataInitiative/SCALPEL-Analysis/branch/master/graph/badge.svg?token=f78o8HzmAl)](https://codecov.io/gh/X-DataInitiative/SCALPEL-Analysis)
[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
![Version](https://img.shields.io/github/v/release/X-DataInitiative/SCALPEL-Analysis?include_prereleases)

Library that offers util abstractions to explore data extracted
using SNIIRAM-featuring.
# SCALPEL-Analysis

Clone this repo and add it to the path to use it in notebooks.
SCALPEL-Analysis is a library that is part of the SCALPEL3 framework, which results from a research partnership between [École Polytechnique](https://www.polytechnique.edu/en) and
[Caisse Nationale d'Assurance Maladie](https://assurance-maladie.ameli.fr/qui-sommes-nous/fonctionnement/organisation/cnam-tete-reseau)
started in 2015 by [Emmanuel Bacry](http://www.cmap.polytechnique.fr/~bacry/) and [Stéphane Gaïffas](https://stephanegaiffas.github.io/).
Since then, many research engineers and PhD students have developed and used this framework
to do research on SNDS data. The full list of contributors is available in [CONTRIBUTORS.md](CONTRIBUTORS.md).
This library is based on [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html). It provides
useful abstractions that ease cohort data analysis and manipulation. While it can be used
standalone, it expects inputs formatted like the output of
SCALPEL-Extraction concept extraction, that is, a metadata.json file tracking the
cohort data on disk or on HDFS:

## Requirements
```json
{
"operations" : [ {
"name" : "base_population",
"inputs" : [ "DCIR", "MCO", "IR_BEN_R", "MCO_CE" ],
"output_type" : "patients",
"output_path" : "/some/path/to/base_population/data",
"population_path" : ""
}, {
"name" : "drug_dispenses",
"inputs" : [ "DCIR", "MCO", "MCO_CE" ],
"output_type" : "acts",
"output_path" : "/some/path/to/drug_dispenses/data",
"population_path" : "/some/path/to/drug_dispenses/patients"
}, ... ]
}
```

This needs python 3.5.3 or above.
where:

Make sure that you have an active environment based on requirements-dev.
- `name` contains the cohort name
- `inputs` indicates the data sources used to compute this cohort
- `output_type` indicates if the cohort contains only `patients` or some event type (can be custom)
- `output_path` contains the path to a parquet file containing the data
- When `output_type` is not `patients`, `output_path` is used to store events. In this case,
`population_path` points to a parquet file containing data on the population.
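
As a rough illustration of this layout, the sketch below parses such a file with Python's standard `json` module and reports, for each cohort, where its subjects and events are stored. It does not use SCALPEL-Analysis itself: the helper name `summarize_operations` and the inline sample are assumptions made for the example, and the paths are placeholders.

```python
import json

# Inline sample mirroring the metadata.json structure described above
# (paths are placeholders, not real locations).
SAMPLE_METADATA = """
{
  "operations": [
    {
      "name": "base_population",
      "inputs": ["DCIR", "MCO", "IR_BEN_R", "MCO_CE"],
      "output_type": "patients",
      "output_path": "/some/path/to/base_population/data",
      "population_path": ""
    },
    {
      "name": "drug_dispenses",
      "inputs": ["DCIR", "MCO", "MCO_CE"],
      "output_type": "acts",
      "output_path": "/some/path/to/drug_dispenses/data",
      "population_path": "/some/path/to/drug_dispenses/patients"
    }
  ]
}
"""


def summarize_operations(metadata):
    """Hypothetical helper: map each cohort name to its subjects/events paths."""
    summary = {}
    for op in metadata["operations"]:
        if op["output_type"] == "patients":
            # Patient-only cohort: output_path holds the subjects, no events.
            summary[op["name"]] = {"subjects": op["output_path"], "events": None}
        else:
            # Event cohort: output_path holds events, population_path the subjects.
            summary[op["name"]] = {
                "subjects": op["population_path"],
                "events": op["output_path"],
            }
    return summary


summary = summarize_operations(json.loads(SAMPLE_METADATA))
for name, paths in summary.items():
    print(name, paths)
```

In practice, `CohortCollection.from_json` is the intended entry point; this sketch only shows how the fields relate to each other.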

conda create -n exploration python=3.5.3
pip install -r requirements-dev.txt
In our example, the input DataFrames contain data in parquet format. If we import this
data with PySpark and output it as strings, it should look like this:

## Running tests
In your dev environment, just launch the following command in the root of the project:
```
base_population/data
+---------+------+-------------------+-------------------+
|patientID|gender| birthDate| deathDate|
+---------+------+-------------------+-------------------+
| Alice| 2|1934-07-27 00:00:00| null|
| Bob| 1|1951-05-01 00:00:00| null|
| Carole| 2|1942-01-12 00:00:00| null|
| Chuck| 1|1933-10-03 00:00:00|2011-06-20 00:00:00|
| Craig| 1|1943-07-27 00:00:00|2012-12-10 00:00:00|
| Dan| 1|1971-10-07 00:00:00| null|
| Erin| 2|1924-01-12 00:00:00| null|
+---------+------+-------------------+-------------------+
```

nosetests

## Development
```
drug_dispenses/data
+---------+--------+-------+-----+------+-------------------+-------------------+
|patientID|category|groupID|value|weight| start| end|
+---------+--------+-------+-----+------+-------------------+-------------------+
| Alice|exposure| null|DrugA| 1.0|2013-08-08 00:00:00|2013-10-07 00:00:00|
| Alice|exposure| null|DrugB| 1.0|2012-09-11 00:00:00|2012-12-30 00:00:00|
| Alice|exposure| null|DrugC| 1.0|2013-01-23 00:00:00|2013-03-24 00:00:00|
| Carole|exposure| null|DrugB| 1.0|2010-01-25 00:00:00|2010-12-13 00:00:00|
| Dan|exposure| null|DrugA| 1.0|2012-11-29 00:00:00|2013-01-28 00:00:00|
| Erin|exposure| null|DrugC| 1.0|2010-09-09 00:00:00|2011-01-17 00:00:00|
| Eve|exposure| null|DrugA| 1.0|2010-04-30 00:00:00|2010-08-02 00:00:00|
+---------+--------+-------+-----+------+-------------------+-------------------+
```

```
drug_dispenses/patients
+---------+
|patientID|
+---------+
| Alice|
| Carole|
| Dan|
| Erin|
| Eve|
+---------+
```

In these tables,

* `patientID` is a string identifying patients
* `gender` is an int indicating gender (1 for male, 2 for female; we use the same coding as the SNDS)
* `birthDate` and `deathDate` are datetimes; `deathDate` can be null
* `category` is a string, used to indicate event types (drug purchase, act, drug exposure, etc.). It can be custom.
* `groupID` is a string. It is a "free" field, which is often used to perform aggregations. For example, you can use it to
indicate drug ATC classes.
* `value` is a string, used to indicate the precise nature of the event. For example, it can
contain the CIP13 code of a drug or an ICD10 code of a disease.
* `weight` is a float. It can be used to represent quantitative information tied to the event,
such as the number of purchased boxes for drug purchase events.

An event is defined by the tuple `(patientID, category, groupID, value, weight, start, end)`.
`category`, `groupID`, `value` and `weight` are flexible fields, you can fill them with
the data which best suits your needs.
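
To make the event tuple concrete, here is a minimal in-memory sketch of one row of the `drug_dispenses` table. The `Event` class is purely illustrative and is not part of the library, which stores events as Spark DataFrame rows; the field types follow the descriptions above, with `groupID` optional because it is null in the sample data.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Event(NamedTuple):
    """Illustrative mirror of one row of an events table (not a library class)."""

    patientID: str
    category: str
    groupID: Optional[str]
    value: str
    weight: float
    start: datetime
    end: datetime


# First row of the drug_dispenses sample table.
event = Event(
    patientID="Alice",
    category="exposure",
    groupID=None,
    value="DrugA",
    weight=1.0,
    start=datetime(2013, 8, 8),
    end=datetime(2013, 10, 7),
)

# Length of the exposure period in days.
duration_days = (event.end - event.start).days
print(event.patientID, event.value, duration_days)
```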

Note that the sets of subjects present in `base_population` and `drug_dispenses` do not need to be exactly the same.

### Loading data into Cohorts
One can either create cohorts manually:

```python
from pyspark.sql import SparkSession
from scalpel.core.cohort import Cohort

spark = SparkSession.builder.appName('SCALPEL-Analysis-example').getOrCreate()
events = spark.read.parquet('/some/path/to/drug_dispenses/data')
subjects = spark.read.parquet('/some/path/to/drug_dispenses/patients')
drug_dispense_cohort = Cohort('drug_dispenses',
'Cohort of subjects having drug dispenses events',
subjects,
events)
```

or import all the cohorts from a metadata.json file:

```python
from scalpel.core.cohort_collection import CohortCollection
cc = CohortCollection.from_json('/path/to/metadata.json')
print(cc.cohorts_names) # Should print ['base_population', 'drug_dispenses']
drug_dispenses_cohort = cc.get('drug_dispenses')
base_population_cohort = cc.get('base_population')
# To access cohort data:
drug_dispenses_cohort.subjects
drug_dispenses_cohort.events
```

## Cohort manipulation

Cohorts can be manipulated easily, thanks to algebraic manipulations:

```python
# Subjects in base population who have drug dispenses
study_cohort = base_population_cohort.intersection(drug_dispenses_cohort)
# Subjects in base population who have no drug dispenses
study_cohort = base_population_cohort.difference(drug_dispenses_cohort)
# All the subjects either in base population or who have drug dispenses
study_cohort = base_population_cohort.union(drug_dispenses_cohort)
```

Note that these operations are not commutative, as
`base_population_cohort.union(drug_dispenses_cohort)` is not equivalent to
`drug_dispenses_cohort.union(base_population_cohort)`. Indeed, for now, these
operations are based on `cohort.subjects`. It means that the first result will not contain events,
as there are no events in `base_population`, while the second will contain the events
derived from `drug_dispenses_cohort`.
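
The asymmetry can be mimicked with a toy model in plain Python. `ToyCohort` below is a hypothetical stand-in, not the library's `Cohort` class, and it assumes that a union merges subjects while keeping only the events of the calling cohort, as described above. Here `foo` and `bar` name the two unions taken in opposite orders.

```python
class ToyCohort:
    """Toy stand-in for a cohort: a set of subject IDs plus optional events."""

    def __init__(self, name, subjects, events=None):
        self.name = name
        self.subjects = set(subjects)
        self.events = events  # list of (patientID, value) pairs, or None

    def union(self, other):
        # Subjects are merged symmetrically, but events are carried over
        # from the calling (left-hand) cohort only. This is an assumption
        # made to mirror the behaviour described in the text.
        return ToyCohort(
            "{} or {}".format(self.name, other.name),
            self.subjects | other.subjects,
            self.events,
        )


base_population = ToyCohort(
    "base_population",
    ["Alice", "Bob", "Carole", "Chuck", "Craig", "Dan", "Erin"],
)
drug_dispenses = ToyCohort(
    "drug_dispenses",
    ["Alice", "Carole", "Dan", "Erin", "Eve"],
    events=[("Alice", "DrugA"), ("Carole", "DrugB"), ("Eve", "DrugA")],
)

foo = base_population.union(drug_dispenses)
bar = drug_dispenses.union(base_population)

print(foo.subjects == bar.subjects)  # the subject sets agree
print(foo.events is None)            # base_population has no events to carry
print(bar.events is not None)        # events come from drug_dispenses
```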

We plan to extend these manipulations in the near future to allow performing operations on
subjects and events in a single line of code.

## CohortFlow
`CohortFlow` objects can be used to track the evolution of a study population during the
cohort design process. Let us assume that you have a `CohortCollection` containing
`base_population`, `exposed`, `cases`, respectively containing the base population of
your study, the subjects exposed to some drugs and their exposure events, the subjects
having some disease and their disease events.

`CohortFlow` allows you to check changes in your population structure while working
on your cohort:

```python
import matplotlib.pyplot as plt
from scalpel.stats.patients import distribution_by_gender_age_bucket
from scalpel.core.cohort_flow import CohortFlow

ordered_cohorts = [exposed, cases]

flow = CohortFlow(ordered_cohorts)
# We use 'base_population' as the base cohort
steps = flow.compute_steps(base_population)

for cohort in flow.steps:
    figure = plt.figure(figsize=(8, 4.5))
    distribution_by_gender_age_bucket(cohort=cohort, figure=figure)
    plt.show()
```

In this example, `CohortFlow` iteratively computes the intersection between the base
cohort (`base_population`) and the cohorts in `ordered_cohorts`, resulting in three
steps:

* `base_population` : all subjects
* `base_population.intersection(exposed)` : exposed subjects
* `base_population.intersection(exposed).intersection(cases)` : exposed subjects who
are cases

Calling `distribution_by_gender_age_bucket` at each step allows us to track any change
in demographics induced by restricting the subjects to the exposed cases.
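
The iterative intersection can be sketched with plain Python sets. This is not the `CohortFlow` implementation, only a model of the three steps listed above; the function name `compute_steps_sketch` is hypothetical, and the membership of `cases` (including "Frank") is invented for the example.

```python
def compute_steps_sketch(base, ordered_cohorts):
    """Return the base set followed by each successive intersection."""
    steps = [base]
    for cohort in ordered_cohorts:
        # Each step keeps only the subjects of the previous step
        # that are also present in the next cohort.
        steps.append(steps[-1] & cohort)
    return steps


base_population = {"Alice", "Bob", "Carole", "Chuck", "Craig", "Dan", "Erin"}
exposed = {"Alice", "Carole", "Dan", "Erin", "Eve"}
cases = {"Alice", "Dan", "Frank"}  # invented membership

steps = compute_steps_sketch(base_population, [exposed, cases])
for step in steps:
    print(len(step), sorted(step))
```

Subjects such as "Eve" or "Frank", who are absent from the base population, never appear in any step, which is the point of anchoring the flow on a base cohort.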

Many more plotting and statistical logging functions available in `scalpel.stats` can be used
the same way.

## Installation
Clone this repo and add it to the `PYTHONPATH` to use it in scripts or notebooks. To add
the library temporarily to your `PYTHONPATH`, just add

import sys
sys.path.append('/path/to/the/SCALPEL-Analysis')

at the beginning of your scripts.

> **Important remark**: This software is currently in alpha stage. It should be fairly stable,
> but the API might still change and the documentation is partial. We are currently doing our best
> to improve documentation coverage as quickly as possible.
### Requirements

Python 3.6.5 or above and libraries listed in
[requirements.txt](https://github.com/X-DataInitiative/SCALPEL-Analysis/blob/master/requirements.txt).

To create a virtual environment with `conda` and install the requirements, just run

conda create -n <env name> python=3.5.3
pip install -r requirements.txt

## Citation

If you use a library that is part of _SCALPEL3_ in a scientific publication, we would appreciate citations. You can use the following BibTeX entry:

@article{2019arXiv191007045,
author = {{Bacry}, E. and {Ga{\"{i}}ffas}, S. and {Leroy}, F. and {Morel}, M. and {Nguyen}, D. P. and {Sebiat}, Y. and {Sun}, D.},
title = {{SCALPEL3: a scalable open-source library for healthcare claims databases}},
journal = {ArXiv e-prints},
eprint = {1910.07045},
url = {http://arxiv.org/abs/1910.07045},
year = 2019,
month = oct
}

## Contributing
The development cycle is opinionated. Each time you commit, git will
launch four checks before it allows you to finish your commit:
1. Black: we encourage you to install it and integrate to your dev
tool such as Pycharm. Check this [link](https://github.com/ambv/black). We massively encourage
to use it with Pycharm as it will automatically
2. Flake8: enforces some extra checks.
3. Testing using Nosetests.
1. We use [black](https://github.com/ambv/black) to format the code.
We encourage you to install it and integrate it into your code editor or IDE.
2. Some extra checks are done using Flake8
3. Testing with Nosetests
4. Coverage checks if the minimum coverage is ensured.

After cloning, you have to run in the root of the repo:
To activate the pre-commit hook, you just have to install the
[requirements-dev.txt](https://github.com/X-DataInitiative/SCALPEL-Analysis/blob/master/requirements-dev.txt)
dependencies and run:

source activate exploration
pre-commit install
source activate <env name>
cd SCALPEL-Analysis
pre-commit install

To launch the tests, just run

cd SCALPEL-Analysis
nosetests
1 change: 1 addition & 0 deletions scalpel/__init__.py
@@ -0,0 +1 @@
# License: BSD 3 clause
1 change: 1 addition & 0 deletions scalpel/core/__init__.py
@@ -0,0 +1 @@
# License: BSD 3 clause