Simplify influenza data conversion script (#72)
twallema authored Jun 18, 2024
1 parent 41e7a57 commit 2e69d7b
Showing 9 changed files with 299 additions and 326 deletions.
14 changes: 7 additions & 7 deletions docs/influenza_1718.md
@@ -15,20 +15,20 @@ located in `~/tutorials/influenza_1718/`.

## Data

Data on the weekly incidence of visits to General Practitioners (GP) are made publicly available by the Belgian Scientific Institute of Public Health (Sciensano). These data were retrieved from the "End of season" report on Influenza in Belgium (see `data/raw/Influenza 2017-2018 End of Season_NL.pdf`). Using [Webplotdigitizer](https://automeris.io/WebPlotDigitizer/), the weekly numbers of GP visits in the different age groups were extracted (see `data/raw/dataset_influenza_1718.csv`). Then, the script `data_conversion.py` was used to convert the *raw* weekly incidence of Influenza cases in Belgium (per 100K inhabitants) during the 2017-2018 Influenza season into a better-suited format. The weekly incidence was first converted to an absolute count of GP visits by multiplying with the demographics. Then, it was assumed that the weekly incidence was the sum of seven equal counts throughout the week preceding the data collection, and hence the weekly data were divided by seven. The formatted data are located in `data/interim/data_influenza_1718_format.csv`.
Data on the weekly incidence of visits to General Practitioners (GP) for Influenza-like illness (ILI) are made publicly available by the Belgian Scientific Institute of Public Health (Sciensano). These data were retrieved from the "End of season" report on Influenza in Belgium (see `data/raw/Influenza 2017-2018 End of Season_NL.pdf`). Using [Webplotdigitizer](https://automeris.io/WebPlotDigitizer/), the weekly numbers of GP visits in the different age groups were extracted (see `data/raw/ILI_weekly_1718.csv`). Then, the script `data_conversion.py` was used to convert the *raw* weekly incidence of Influenza cases in Belgium (per 100K inhabitants) during the 2017-2018 Influenza season into a better-suited format. The week numbers in the raw dataset were replaced with the date of that week's Thursday, as an approximation of the midpoint of the week. Further, the number of GP visits per 100K inhabitants was converted to the absolute number of GP visits. The formatted data are located in `data/interim/ILI_weekly_100K.csv` and `data/interim/ILI_weekly_ABS.csv`.
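For illustration, a minimal sketch of the two conversion steps described above, under assumed column names and an assumed season boundary at week 40 (the actual `data_conversion.py` may differ):

```python
import pandas as pd
from datetime import datetime

# Toy stand-in for the digitized raw data (column names are assumed, not taken from the file)
raw = pd.DataFrame({'WEEK': [40, 41],
                    'AGE': ['[0, 5)', '[0, 5)'],
                    'CASES_100K': [3.53, 16.97]})

def week_to_thursday(week, season_start_year=2017):
    """Map an ISO week number of the 2017-2018 season to the date of that week's Thursday."""
    year = season_start_year if week >= 40 else season_start_year + 1
    return datetime.strptime(f'{year}-W{week:02d}-4', '%G-W%V-%u')

# Step 1: replace week numbers with the Thursday of that week (midpoint approximation)
raw['DATE'] = raw['WEEK'].apply(week_to_thursday)

# Step 2: convert incidence per 100K inhabitants to absolute GP visit counts
population = {'[0, 5)': 620914}  # demographics per age group (values assumed)
raw['CASES'] = raw['CASES_100K'] * raw['AGE'].map(population) / 100_000
```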

![data](/_static/figs/influenza_1718/data.png)

The data are loaded in our calibration script `~/tutorials/influenza_1718/calibration.py` as a `pd.DataFrame` with a `pd.MultiIndex`. The `time`/`date` axis is obligatory. The other index names and values are the same as the model's dimensions and coordinates. In this way, pySODM recognizes how the model prediction and the dataset must be aligned.
The absolute weekly numbers of GP visits are loaded in our calibration script `~/tutorials/influenza_1718/calibration.py` as a `pd.DataFrame` with a `pd.MultiIndex`. The weekly number of GP visits is divided by seven to approximate the daily incidence at the week's midpoint (which we'll use to calibrate our model). The `time`/`date` axis in the `pd.DataFrame` is obligatory. The other index names and values are the same as the model's dimensions and coordinates. In this way, pySODM recognizes how the model prediction and the dataset must be aligned.

```bash
date age_group
2017-11-27 (0, 5] 15.373303
(5, 15] 13.462340
(15, 65] 409.713333
(65, 120] 33.502705
2017-12-01 (0, 5] 15.727305
(5, 15] 13.240385
(15, 65] 407.778693
(65, 120] 32.379271
...
2018-05-07 (0, 5] 0.000000
2018-05-11 (0, 5] 0.000000
(5, 15] 0.000000
(15, 65] 0.000000
(65, 120] 0.000000
```
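A minimal sketch of how such a series can be constructed so that its index level names line up with the model's dimensions (placeholder dates and values; the age-group intervals follow the interim files):

```python
import pandas as pd

# Two-level index whose level names ('date', 'age_group') match the model's dimensions
age_groups = pd.IntervalIndex.from_tuples([(0, 5), (5, 15), (15, 65), (65, 120)], closed='left')
dates = pd.date_range('2017-12-01', periods=2, freq='7D')
index = pd.MultiIndex.from_product([dates, age_groups], names=['date', 'age_group'])
weekly = pd.Series(index=index, name='CASES',
                   data=[15.7, 13.2, 407.8, 32.4, 92.2, 63.8, 425.5, 106.4])

# Dividing weekly totals by seven approximates the daily incidence at the week's midpoint
daily = weekly / 7
```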
4 changes: 2 additions & 2 deletions tutorials/influenza_1718/README.md
@@ -30,6 +30,6 @@ First, by running the script `data_conversion.py`, the user converts the *raw* w

### Interim

+ `data_influenza_1718_format.csv`: Daily incidence of Influenza cases in Belgium during the 2017-2018 Influenza season. Data available for four age groups: [0,5), [5,15), [15,65), [65,120). Generated from `dataset_influenza_1718.csv` by executing the data conversion script `data_conversion.py`.

+ `ILI_weekly_100K.csv`: Weekly incidence of GP visits for Influenza-like illness per 100K inhabitants in Belgium during the 2017-2018 season. Data available for four age groups: [0,5), [5,15), [15,65), [65,120). Dates are the Thursday of the reported week, approximating the week's midpoint. Generated from `ILI_weekly_1718.csv` by executing the data conversion script `data_conversion.py`.

+ `ILI_weekly_ABS.csv`: Weekly incidence of GP visits for Influenza-like illness (absolute counts) in Belgium during the 2017-2018 season. Data available for four age groups: [0,5), [5,15), [15,65), [65,120). Dates are the Thursday of the reported week, approximating the week's midpoint. Generated from `ILI_weekly_1718.csv` by executing the data conversion script `data_conversion.py`.
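A minimal sketch of reading one of these interim files back in, mirroring what `calibration.py` does below (path relative to the tutorial folder):

```python
import pandas as pd

# DATE and AGE form the two index levels; the remaining column holds the values.
# Note: the AGE level is read back as plain strings, not as a pd.IntervalIndex.
df = pd.read_csv('data/interim/ILI_weekly_100K.csv', index_col=[0, 1], parse_dates=True)
series = df.squeeze()  # single data column -> pd.Series
```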
50 changes: 22 additions & 28 deletions tutorials/influenza_1718/calibration.py
@@ -32,41 +32,41 @@

tau = 0.50 # Timestep of Tau-Leaping algorithm
alpha = 0.03 # Overdispersion factor (based on COVID-19)
end_calibration = pd.Timestamp('2018-03-01') # End date of calibration
end_calibration = '2018-03-01' # End date of calibration
identifier = 'twallema_2018-03-01' # Give any output of this script an ID
n_pso = 3 # Number of PSO iterations
multiplier_pso = 36 # PSO swarm size
n_mcmc = 50 # Number of MCMC iterations
multiplier_mcmc = 36 # Total number of Markov chains = number of parameters * multiplier_mcmc
n_pso = 30 # Number of PSO iterations
multiplier_pso = 10 # PSO swarm size
n_mcmc = 500 # Number of MCMC iterations
multiplier_mcmc = 10 # Total number of Markov chains = number of parameters * multiplier_mcmc
print_n = 100 # Print diagnostics every print_n iterations
discard = 1000 # Discard first `discard` iterations as burn-in
discard = 50 # Discard first `discard` iterations as burn-in
thin = 10 # Thinning factor emcee chains
n = 100 # Repeated simulations used in visualisations
processes = int(os.getenv('SLURM_CPUS_ON_NODE', mp.cpu_count()/2)) # Automatically use half the number of available threads (typically corresponds to number of physical CPU cores)
processes = int(os.getenv('SLURM_CPUS_ON_NODE', mp.cpu_count())) # Retrieve CPU count

###############
## Load data ##
###############

# Load case data
data = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/data_influenza_1718_format.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/ILI_weekly_ABS.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data = data.squeeze()
# Load case data per 100K
data_100K = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/data_influenza_1718_format_100K.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data_100K = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/ILI_weekly_100K.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data_100K = data_100K.squeeze()
# Re-insert the pd.IntervalIndex (a pd.IntervalIndex is always loaded back as plain strings)
age_groups = pd.IntervalIndex.from_tuples([(0,5),(5,15),(15,65),(65,120)], closed='left')
iterables = [data.index.get_level_values('DATE').unique(), age_groups]
names = ['date', 'age_group']
index = pd.MultiIndex.from_product(iterables, names=names)
df_influenza = pd.Series(index=index, name='CASES', data=data.values)
df_influenza_100K = pd.Series(index=index, name='CASES', data=data_100K.values)
df_influenza = pd.Series(index=index, name='CASES', data=data.values)/7 # convert weekly cumulative to daily week midpoint
df_influenza_100K = pd.Series(index=index, name='CASES', data=data_100K.values)/7 # convert weekly cumulative to daily week midpoint
# Extract start and enddate
start_date = df_influenza.index.get_level_values('date').unique()[8]
end_date = df_influenza.index.get_level_values('date').unique()[-1]
start_calibration = start_date
# Hardcode Belgian demographics
initN = pd.Series(index=age_groups, data=np.array([606938, 1328733, 7352492, 2204478]))
start_visualisation = df_influenza.index.get_level_values('date').unique()[8].strftime("%Y-%m-%d")
end_visualisation = df_influenza.index.get_level_values('date').unique()[-1].strftime("%Y-%m-%d")
start_calibration = start_visualisation
# Hardcode Belgian demographics (Jan 1, 2018)
initN = pd.Series(index=age_groups, data=[620914, 1306826, 7317774, 2130556])

################
## Load model ##
@@ -160,7 +160,7 @@
# Assign results to model
model.parameters = assign_theta(model.parameters, pars, theta)
# Simulate model
out = model.sim([start_calibration, end_date], samples={}, N=n, tau=tau, output_timestep=1)
out = model.sim([start_calibration, end_visualisation], samples={}, N=n, tau=tau, output_timestep=1)
# Add negative binomial observational noise
out = add_negative_binomial_noise(out, alpha)
# Visualize
@@ -169,7 +169,7 @@
for id, age_class in enumerate(df_influenza.index.get_level_values('age_group').unique()):
# Data
axs[id].scatter(df_influenza[start_calibration:end_calibration].index.get_level_values('date').unique(), df_influenza.loc[slice(start_calibration,end_calibration), age_class], color='black', alpha=0.3, linestyle='None', facecolors='None', s=60, linewidth=2, label='observed')
axs[id].scatter(df_influenza[end_calibration:end_date].index.get_level_values('date').unique(), df_influenza.loc[slice(end_calibration,end_date), age_class], color='red', alpha=0.3, linestyle='None', facecolors='None', s=60, linewidth=2, label='unobserved')
axs[id].scatter(df_influenza[end_calibration:end_visualisation].index.get_level_values('date').unique(), df_influenza.loc[slice(end_calibration,end_visualisation), age_class], color='red', alpha=0.3, linestyle='None', facecolors='None', s=60, linewidth=2, label='unobserved')
# Model trajectories
for i in range(n):
axs[id].plot(out['date'],out['Im_inc'].sel(age_group=age_class).isel(draws=i), color='blue', alpha=0.05, linewidth=1)
@@ -195,18 +195,12 @@
# Perturbate previously obtained estimate
ndim, nwalkers, pos = perturbate_theta(theta, pert=0.30*np.ones(len(theta)), multiplier=multiplier_mcmc, bounds=expanded_bounds)
# Write some useful settings to a pickle file (no pd.Timestamps or np.arrays allowed!)
settings={'start_calibration': start_calibration.strftime("%Y-%m-%d"), 'end_calibration': end_calibration.strftime("%Y-%m-%d"),
settings={'start_calibration': start_calibration, 'end_calibration': end_calibration,
'n_chains': nwalkers, 'starting_estimate': list(theta), 'labels': expanded_labels, 'tau': tau}
# Sample n_mcmc iterations
sampler = run_EnsembleSampler(pos, n_mcmc, identifier, objective_function, objective_function_kwargs={'simulation_kwargs': {'warmup': 0, 'tau':tau}},
fig_path=fig_path, samples_path=samples_path, print_n=print_n, backend=None, processes=processes, progress=True,
settings_dict=settings)
# Sample 3*n_mcmc more, resuming from the saved backend
for i in range(3):
backend = emcee.backends.HDFBackend(os.path.join(os.getcwd(),samples_path+identifier+'_BACKEND_'+run_date+'.hdf5'))
sampler = run_EnsembleSampler(pos, n_mcmc, identifier, objective_function, objective_function_kwargs={'simulation_kwargs': {'warmup': 0, 'tau':tau}},
fig_path=fig_path, samples_path=samples_path, print_n=print_n, backend=backend, processes=processes, progress=True,
settings_dict=settings)
settings_dict=settings)
# Generate a sample dictionary and save it as .json for long-term storage
# Have a look at the script `emcee_sampler_to_dictionary.py`, which does the same thing as the function below but can be used while your MCMC is running.
samples_dict = emcee_sampler_to_dictionary(samples_path, identifier, discard=discard, thin=thin)
@@ -229,7 +223,7 @@ def draw_fcn(param_dict, samples_dict):
param_dict['f_ud'] = np.array([slice[idx] for slice in samples_dict['f_ud']])
return param_dict
# Simulate model
out = model.sim([start_date, end_date], N=n, tau=tau, output_timestep=1, samples=samples_dict, draw_function=draw_fcn, processes=processes)
out = model.sim([start_visualisation, end_visualisation], N=n, tau=tau, output_timestep=1, samples=samples_dict, draw_function=draw_fcn, processes=processes)
# Add negative binomial observation noise
out_noise = add_negative_binomial_noise(out, alpha)

@@ -243,7 +237,7 @@ def draw_fcn(param_dict, samples_dict):
for id, age_class in enumerate(df_influenza_100K.index.get_level_values('age_group').unique()):
# Data
axs[id].plot(df_influenza_100K[start_calibration:end_calibration].index.get_level_values('date').unique(), df_influenza_100K.loc[slice(start_calibration,end_calibration), age_class]*7, color='black', marker='o', label='Observed data')
axs[id].plot(df_influenza_100K[end_calibration-pd.Timedelta(days=7):end_date].index.get_level_values('date').unique(), df_influenza_100K.loc[slice(end_calibration-pd.Timedelta(days=7),end_date), age_class]*7, color='red', marker='o', label='Unobserved data')
axs[id].plot(df_influenza_100K[pd.Timestamp(end_calibration)-pd.Timedelta(days=7):end_visualisation].index.get_level_values('date').unique(), df_influenza_100K.loc[slice(pd.Timestamp(end_calibration)-pd.Timedelta(days=7),end_visualisation), age_class]*7, color='red', marker='o', label='Unobserved data')
# Model trajectories
axs[id].plot(out['date'],out['Im_inc'].sel(age_group=age_class).mean(dim='draws')/initN.loc[age_class]*100000*7, color='black', linestyle='--', alpha=0.7, linewidth=1, label='Model mean')
axs[id].fill_between(out_noise['date'].values,out_noise['Im_inc'].sel(age_group=age_class).quantile(dim='draws', q=0.025)/initN.loc[age_class]*100000*7, out_noise['Im_inc'].sel(age_group=age_class).quantile(dim='draws', q=0.975)/initN.loc[age_class]*100000*7, color='black', alpha=0.15, label='Model 95% CI')
129 changes: 129 additions & 0 deletions tutorials/influenza_1718/data/interim/ILI_weekly_100K.csv
@@ -0,0 +1,129 @@
DATE,AGE,CASES_100K
2017-10-06,"[0, 5)",3.53073360514691
2017-10-06,"[5, 15)",6.31047999020029
2017-10-06,"[15, 65)",14.5889596615352
2017-10-06,"[65, 120)",11.9210107267218
2017-10-13,"[0, 5)",16.9737694017181
2017-10-13,"[5, 15)",8.69528973038268
2017-10-13,"[15, 65)",28.0775650709766
2017-10-13,"[65, 120)",11.5929095140596
2017-10-20,"[0, 5)",13.9357952104024
2017-10-20,"[5, 15)",5.64212566811102
2017-10-20,"[15, 65)",16.6851618535429
2017-10-20,"[65, 120)",3.32475895497555
2017-10-27,"[0, 5)",35.7180701621351
2017-10-27,"[5, 15)",2.54339199296919
2017-10-27,"[15, 65)",21.9104774626053
2017-10-27,"[65, 120)",18.8986298493355
2017-11-03,"[0, 5)",4.56246232077729
2017-11-03,"[5, 15)",19.2275807263673
2017-11-03,"[15, 65)",13.894810397062
2017-11-03,"[65, 120)",3.22926973845097
2017-11-10,"[0, 5)",17.8943881440409
2017-11-10,"[5, 15)",16.5611955617146
2017-11-10,"[15, 65)",15.2280029793883
2017-11-10,"[65, 120)",7.22884748542992
2017-11-17,"[0, 5)",35.225891714284
2017-11-17,"[5, 15)",17.8943881440409
2017-11-17,"[15, 65)",21.8939658910203
2017-11-17,"[65, 120)",16.5611955617146
2017-11-24,"[0, 5)",21.8939658910203
2017-11-24,"[5, 15)",0.562884573797874
2017-11-24,"[15, 65)",29.8931213849783
2017-11-24,"[65, 120)",29.8931213849783
2017-12-01,"[0, 5)",17.7304964539012
2017-12-01,"[5, 15)",7.09219858156075
2017-12-01,"[15, 65)",39.0070921985816
2017-12-01,"[65, 120)",10.6382978723409
2017-12-08,"[0, 5)",92.1985815602839
2017-12-08,"[5, 15)",63.8297872340427
2017-12-08,"[15, 65)",42.5531914893618
2017-12-08,"[65, 120)",10.6382978723409
2017-12-15,"[0, 5)",17.7304964539012
2017-12-15,"[5, 15)",81.5602836879434
2017-12-15,"[15, 65)",46.0992907801419
2017-12-15,"[65, 120)",10.6382978723409
2017-12-22,"[0, 5)",0.0
2017-12-22,"[5, 15)",21.2765957446813
2017-12-22,"[15, 65)",67.3758865248228
2017-12-22,"[65, 120)",60.2836879432625
2017-12-29,"[0, 5)",138.297872340426
2017-12-29,"[5, 15)",3.5460992907806
2017-12-29,"[15, 65)",78.0141843971633
2017-12-29,"[65, 120)",21.2765957446813
2018-01-05,"[0, 5)",297.872340425532
2018-01-05,"[5, 15)",81.1914893617022
2018-01-05,"[15, 65)",209.219858156029
2018-01-05,"[65, 120)",85.10638297872359
2018-01-12,"[0, 5)",329.787234042553
2018-01-12,"[5, 15)",134.751773049646
2018-01-12,"[15, 65)",177.304964539007
2018-01-12,"[65, 120)",63.8297872340427
2018-01-19,"[0, 5)",404.255319148936
2018-01-19,"[5, 15)",336.879432624114
2018-01-19,"[15, 65)",265.957446808511
2018-01-19,"[65, 120)",88.6524822695037
2018-01-26,"[0, 5)",475.177304964539
2018-01-26,"[5, 15)",716.312056737589
2018-01-26,"[15, 65)",326.241134751773
2018-01-26,"[65, 120)",67.3758865248228
2018-02-02,"[0, 5)",599.290780141844
2018-02-02,"[5, 15)",656.028368794327
2018-02-02,"[15, 65)",460.992907801419
2018-02-02,"[65, 120)",134.751773049646
2018-02-09,"[0, 5)",797.872340425532
2018-02-09,"[5, 15)",1028.36879432624
2018-02-09,"[15, 65)",581.560283687943
2018-02-09,"[65, 120)",209.219858156028
2018-02-16,"[0, 5)",1230.49645390071
2018-02-16,"[5, 15)",535.460992907801
2018-02-16,"[15, 65)",620.567375886525
2018-02-16,"[65, 120)",223.404255319149
2018-02-23,"[0, 5)",556.737588652482
2018-02-23,"[5, 15)",570.921985815603
2018-02-23,"[15, 65)",748.22695035461
2018-02-23,"[65, 120)",297.872340425532
2018-03-02,"[0, 5)",943.262411347518
2018-03-02,"[5, 15)",819.148936170213
2018-03-02,"[15, 65)",684.397163120568
2018-03-02,"[65, 120)",283.687943262411
2018-03-09,"[0, 5)",982.2695035461
2018-03-09,"[5, 15)",1244.68085106383
2018-03-09,"[15, 65)",769.503546099291
2018-03-09,"[65, 120)",304.964539007092
2018-03-16,"[0, 5)",921.985815602837
2018-03-16,"[5, 15)",329.787234042553
2018-03-16,"[15, 65)",372.340425531915
2018-03-16,"[65, 120)",237.588652482269
2018-03-23,"[0, 5)",613.475177304965
2018-03-23,"[5, 15)",861.702127659575
2018-03-23,"[15, 65)",223.404255319149
2018-03-23,"[65, 120)",195.035460992908
2018-03-30,"[0, 5)",507.092198581561
2018-03-30,"[5, 15)",170.212765957447
2018-03-30,"[15, 65)",78.0141843971633
2018-03-30,"[65, 120)",106.382978723404
2018-04-06,"[0, 5)",106.382978723404
2018-04-06,"[5, 15)",70.921985815603
2018-04-06,"[15, 65)",24.8226950354615
2018-04-06,"[65, 120)",28.3687943262412
2018-04-13,"[0, 5)",99.2907801418442
2018-04-13,"[5, 15)",35.4609929078015
2018-04-13,"[15, 65)",14.184397163121
2018-04-13,"[65, 120)",10.6382978723409
2018-04-20,"[0, 5)",0.0
2018-04-20,"[5, 15)",0.0
2018-04-20,"[15, 65)",10.6382978723409
2018-04-20,"[65, 120)",0.0
2018-04-27,"[0, 5)",0.0
2018-04-27,"[5, 15)",0.0
2018-04-27,"[15, 65)",3.5460992907806
2018-04-27,"[65, 120)",0.0
2018-05-04,"[0, 5)",0.0
2018-05-04,"[5, 15)",0.0
2018-05-04,"[15, 65)",3.5460992907806
2018-05-04,"[65, 120)",3.5460992907806
2018-05-11,"[0, 5)",0.0
2018-05-11,"[5, 15)",0.0
2018-05-11,"[15, 65)",0.0
2018-05-11,"[65, 120)",0.0
