Simplify influenza data conversion script (#72)
twallema authored Jun 18, 2024
1 parent 41e7a57 commit 2e69d7b
Showing 9 changed files with 299 additions and 326 deletions.
14 changes: 7 additions & 7 deletions docs/influenza_1718.md
@@ -15,20 +15,20 @@ located in `~/tutorials/influenza_1718/`.

## Data

Data on the weekly incidence of visits to General Practitioners (GP) are made publicly available by the Belgian Scientific Institute of Public Health (Sciensano). These data were retrieved from the "End of season" report on Influenza in Belgium (see `data/raw/Influenza 2017-2018 End of Season_NL.pdf`). Using [Webplotdigitizer](https://automeris.io/WebPlotDigitizer/), the weekly numbers of GP visits in the different age groups were extracted (see `data/raw/dataset_influenza_1718.csv`). Then, the script `data_conversion.py` was used to convert the *raw* weekly incidence of Influenza cases in Belgium (per 100K inhabitants) during the 2017-2018 Influenza season into a better-suited format. The weekly incidence was first converted to an absolute count of GP visits by multiplying with the demographics. Then, it was assumed that the weekly incidence was the sum of seven equal counts throughout the week preceding the data collection, and hence the weekly data were divided by seven. The formatted data are located in `data/interim/data_influenza_1718_format.csv`.
Data on the weekly incidence of visits to General Practitioners (GP) for Influenza-like illness (ILI) are made publicly available by the Belgian Scientific Institute of Public Health (Sciensano). These data were retrieved from the "End of season" report on Influenza in Belgium (see `data/raw/Influenza 2017-2018 End of Season_NL.pdf`). Using [Webplotdigitizer](https://automeris.io/WebPlotDigitizer/), the weekly numbers of GP visits in the different age groups were extracted (see `data/raw/ILI_weekly_1718.csv`). Then, the script `data_conversion.py` was used to convert the *raw* weekly incidence of Influenza cases in Belgium (per 100K inhabitants) during the 2017-2018 Influenza season into a better-suited format. The week numbers in the raw dataset were replaced with the date of that week's Thursday, as an approximation of the midpoint of the week. Further, the number of GP visits per 100K inhabitants was converted to the absolute number of GP visits. The formatted data are located in `data/interim/ILI_weekly_100K.csv` and `data/interim/ILI_weekly_ABS.csv`.
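For illustration, a minimal sketch of the two conversion steps described above, under assumed column names and an assumed season boundary at week 40 (the actual `data_conversion.py` may differ):

```python
import pandas as pd
from datetime import datetime

# Toy stand-in for the digitized raw data (column names are assumed, not taken from the file)
raw = pd.DataFrame({'WEEK': [40, 41],
                    'AGE': ['[0, 5)', '[0, 5)'],
                    'CASES_100K': [3.53, 16.97]})

def week_to_thursday(week, season_start_year=2017):
    """Map an ISO week number of the 2017-2018 season to the date of that week's Thursday."""
    year = season_start_year if week >= 40 else season_start_year + 1
    return datetime.strptime(f'{year}-W{week:02d}-4', '%G-W%V-%u')

# Step 1: replace week numbers with the Thursday of that week (midpoint approximation)
raw['DATE'] = raw['WEEK'].apply(week_to_thursday)

# Step 2: convert incidence per 100K inhabitants to absolute GP visit counts
population = {'[0, 5)': 620914}  # demographics per age group (values assumed)
raw['CASES'] = raw['CASES_100K'] * raw['AGE'].map(population) / 100_000
```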

![data](/_static/figs/influenza_1718/data.png)

The data are loaded in our calibration script `~/tutorials/influenza_1718/calibration.py` as a `pd.DataFrame` with a `pd.MultiIndex`. The `time`/`date` axis is obligatory. The other index names and values are the same as the model's dimensions and coordinates. In this way, pySODM recognizes how the model prediction and the dataset must be aligned.
The absolute weekly numbers of GP visits are loaded in our calibration script `~/tutorials/influenza_1718/calibration.py` as a `pd.DataFrame` with a `pd.MultiIndex`. The weekly number of GP visits is divided by seven to approximate the daily incidence at the week's midpoint (which we'll use to calibrate our model). The `time`/`date` axis in the `pd.DataFrame` is obligatory. The other index names and values are the same as the model's dimensions and coordinates. In this way, pySODM recognizes how the model prediction and the dataset must be aligned.

```bash
date age_group
2017-11-27 (0, 5] 15.373303
(5, 15] 13.462340
(15, 65] 409.713333
(65, 120] 33.502705
2017-12-01 (0, 5] 15.727305
(5, 15] 13.240385
(15, 65] 407.778693
(65, 120] 32.379271
...
2018-05-07 (0, 5] 0.000000
2018-05-11 (0, 5] 0.000000
(5, 15] 0.000000
(15, 65] 0.000000
(65, 120] 0.000000
```
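A minimal sketch of how such a series can be constructed so that its index level names line up with the model's dimensions (placeholder dates and values; the age-group intervals follow the interim files):

```python
import pandas as pd

# Two-level index whose level names ('date', 'age_group') match the model's dimensions
age_groups = pd.IntervalIndex.from_tuples([(0, 5), (5, 15), (15, 65), (65, 120)], closed='left')
dates = pd.date_range('2017-12-01', periods=2, freq='7D')
index = pd.MultiIndex.from_product([dates, age_groups], names=['date', 'age_group'])
weekly = pd.Series(index=index, name='CASES',
                   data=[15.7, 13.2, 407.8, 32.4, 92.2, 63.8, 425.5, 106.4])

# Dividing weekly totals by seven approximates the daily incidence at the week's midpoint
daily = weekly / 7
```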
4 changes: 2 additions & 2 deletions tutorials/influenza_1718/README.md
@@ -30,6 +30,6 @@ First, by running the script `data_conversion.py`, the user converts the *raw* w

### Interim

+ `data_influenza_1718_format.csv`: Daily incidence of Influenza cases in Belgium during the 2017-2018 Influenza season. Data available for four age groups: [0,5), [5,15), [15,65), [65,120). Generated from `dataset_influenza_1718.csv` by executing the data conversion script `data_conversion.py`.

+ `ILI_weekly_100K.csv`: Weekly incidence of GP visits for Influenza-like illness per 100K inhabitants in Belgium during the 2017-2018 season. Data available for four age groups: [0,5), [5,15), [15,65), [65,120). Dates are the Thursday of the reported week, approximating the week's midpoint. Generated from `ILI_weekly_1718.csv` by executing the data conversion script `data_conversion.py`.

+ `ILI_weekly_ABS.csv`: Weekly incidence of GP visits for Influenza-like illness (absolute counts) in Belgium during the 2017-2018 season. Data available for four age groups: [0,5), [5,15), [15,65), [65,120). Dates are the Thursday of the reported week, approximating the week's midpoint. Generated from `ILI_weekly_1718.csv` by executing the data conversion script `data_conversion.py`.
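A minimal sketch of reading one of these interim files back in, mirroring what `calibration.py` does below (path relative to the tutorial folder):

```python
import pandas as pd

# DATE and AGE form the two index levels; the remaining column holds the values.
# Note: the AGE level is read back as plain strings, not as a pd.IntervalIndex.
df = pd.read_csv('data/interim/ILI_weekly_100K.csv', index_col=[0, 1], parse_dates=True)
series = df.squeeze()  # single data column -> pd.Series
```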
50 changes: 22 additions & 28 deletions tutorials/influenza_1718/calibration.py
@@ -32,41 +32,41 @@

tau = 0.50 # Timestep of Tau-Leaping algorithm
alpha = 0.03 # Overdispersion factor (based on COVID-19)
end_calibration = pd.Timestamp('2018-03-01') # End date of calibration
end_calibration = '2018-03-01' # End date of calibration
identifier = 'twallema_2018-03-01' # Give any output of this script an ID
n_pso = 3 # Number of PSO iterations
multiplier_pso = 36 # PSO swarm size
n_mcmc = 50 # Number of MCMC iterations
multiplier_mcmc = 36 # Total number of Markov chains = number of parameters * multiplier_mcmc
n_pso = 30 # Number of PSO iterations
multiplier_pso = 10 # PSO swarm size
n_mcmc = 500 # Number of MCMC iterations
multiplier_mcmc = 10 # Total number of Markov chains = number of parameters * multiplier_mcmc
print_n = 100 # Print diagnostics every print_n iterations
discard = 1000 # Discard first `discard` iterations as burn-in
discard = 50 # Discard first `discard` iterations as burn-in
thin = 10 # Thinning factor emcee chains
n = 100 # Repeated simulations used in visualisations
processes = int(os.getenv('SLURM_CPUS_ON_NODE', mp.cpu_count()/2)) # Automatically use half the number of available threads (typically corresponds to number of physical CPU cores)
processes = int(os.getenv('SLURM_CPUS_ON_NODE', mp.cpu_count())) # Retrieve CPU count

###############
## Load data ##
###############

# Load case data
data = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/data_influenza_1718_format.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/ILI_weekly_ABS.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data = data.squeeze()
# Load case data per 100K
data_100K = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/data_influenza_1718_format_100K.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data_100K = pd.read_csv(os.path.join(os.path.dirname(__file__),'data/interim/ILI_weekly_100K.csv'), index_col=[0,1], parse_dates=True, date_format='%Y-%m-%d')
data_100K = data_100K.squeeze()
# Re-insert the pd.IntervalIndex (a pd.IntervalIndex is always loaded back as plain strings)
age_groups = pd.IntervalIndex.from_tuples([(0,5),(5,15),(15,65),(65,120)], closed='left')
iterables = [data.index.get_level_values('DATE').unique(), age_groups]
names = ['date', 'age_group']
index = pd.MultiIndex.from_product(iterables, names=names)
df_influenza = pd.Series(index=index, name='CASES', data=data.values)
df_influenza_100K = pd.Series(index=index, name='CASES', data=data_100K.values)
df_influenza = pd.Series(index=index, name='CASES', data=data.values)/7 # convert weekly cumulative to daily week midpoint
df_influenza_100K = pd.Series(index=index, name='CASES', data=data_100K.values)/7 # convert weekly cumulative to daily week midpoint
# Extract start and enddate
start_date = df_influenza.index.get_level_values('date').unique()[8]
end_date = df_influenza.index.get_level_values('date').unique()[-1]
start_calibration = start_date
# Hardcode Belgian demographics
initN = pd.Series(index=age_groups, data=np.array([606938, 1328733, 7352492, 2204478]))
start_visualisation = df_influenza.index.get_level_values('date').unique()[8].strftime("%Y-%m-%d")
end_visualisation = df_influenza.index.get_level_values('date').unique()[-1].strftime("%Y-%m-%d")
start_calibration = start_visualisation
# Hardcode Belgian demographics (Jan 1, 2018)
initN = pd.Series(index=age_groups, data=[620914, 1306826, 7317774, 2130556])

################
## Load model ##
@@ -160,7 +160,7 @@
# Assign results to model
model.parameters = assign_theta(model.parameters, pars, theta)
# Simulate model
out = model.sim([start_calibration, end_date], samples={}, N=n, tau=tau, output_timestep=1)
out = model.sim([start_calibration, end_visualisation], samples={}, N=n, tau=tau, output_timestep=1)
# Add negative binomial observational noise
out = add_negative_binomial_noise(out, alpha)
# Visualize
@@ -169,7 +169,7 @@
for id, age_class in enumerate(df_influenza.index.get_level_values('age_group').unique()):
# Data
axs[id].scatter(df_influenza[start_calibration:end_calibration].index.get_level_values('date').unique(), df_influenza.loc[slice(start_calibration,end_calibration), age_class], color='black', alpha=0.3, linestyle='None', facecolors='None', s=60, linewidth=2, label='observed')
axs[id].scatter(df_influenza[end_calibration:end_date].index.get_level_values('date').unique(), df_influenza.loc[slice(end_calibration,end_date), age_class], color='red', alpha=0.3, linestyle='None', facecolors='None', s=60, linewidth=2, label='unobserved')
axs[id].scatter(df_influenza[end_calibration:end_visualisation].index.get_level_values('date').unique(), df_influenza.loc[slice(end_calibration,end_visualisation), age_class], color='red', alpha=0.3, linestyle='None', facecolors='None', s=60, linewidth=2, label='unobserved')
# Model trajectories
for i in range(n):
axs[id].plot(out['date'],out['Im_inc'].sel(age_group=age_class).isel(draws=i), color='blue', alpha=0.05, linewidth=1)
@@ -195,18 +195,12 @@
# Perturbate previously obtained estimate
ndim, nwalkers, pos = perturbate_theta(theta, pert=0.30*np.ones(len(theta)), multiplier=multiplier_mcmc, bounds=expanded_bounds)
# Write some useful settings to a pickle file (no pd.Timestamps or np.arrays allowed!)
settings={'start_calibration': start_calibration.strftime("%Y-%m-%d"), 'end_calibration': end_calibration.strftime("%Y-%m-%d"),
settings={'start_calibration': start_calibration, 'end_calibration': end_calibration,
'n_chains': nwalkers, 'starting_estimate': list(theta), 'labels': expanded_labels, 'tau': tau}
# Sample n_mcmc iterations
sampler = run_EnsembleSampler(pos, n_mcmc, identifier, objective_function, objective_function_kwargs={'simulation_kwargs': {'warmup': 0, 'tau':tau}},
fig_path=fig_path, samples_path=samples_path, print_n=print_n, backend=None, processes=processes, progress=True,
settings_dict=settings)
# Sample 3*n_mcmc more, resuming from the saved backend
for i in range(3):
backend = emcee.backends.HDFBackend(os.path.join(os.getcwd(),samples_path+identifier+'_BACKEND_'+run_date+'.hdf5'))
sampler = run_EnsembleSampler(pos, n_mcmc, identifier, objective_function, objective_function_kwargs={'simulation_kwargs': {'warmup': 0, 'tau':tau}},
fig_path=fig_path, samples_path=samples_path, print_n=print_n, backend=backend, processes=processes, progress=True,
settings_dict=settings)
settings_dict=settings)
# Generate a sample dictionary and save it as .json for long-term storage
# Have a look at the script `emcee_sampler_to_dictionary.py`, which does the same thing as the function below but can be used while your MCMC is running.
samples_dict = emcee_sampler_to_dictionary(samples_path, identifier, discard=discard, thin=thin)
@@ -229,7 +223,7 @@ def draw_fcn(param_dict, samples_dict):
param_dict['f_ud'] = np.array([slice[idx] for slice in samples_dict['f_ud']])
return param_dict
# Simulate model
out = model.sim([start_date, end_date], N=n, tau=tau, output_timestep=1, samples=samples_dict, draw_function=draw_fcn, processes=processes)
out = model.sim([start_visualisation, end_visualisation], N=n, tau=tau, output_timestep=1, samples=samples_dict, draw_function=draw_fcn, processes=processes)
# Add negative binomial observation noise
out_noise = add_negative_binomial_noise(out, alpha)

@@ -243,7 +237,7 @@ def draw_fcn(param_dict, samples_dict):
for id, age_class in enumerate(df_influenza_100K.index.get_level_values('age_group').unique()):
# Data
axs[id].plot(df_influenza_100K[start_calibration:end_calibration].index.get_level_values('date').unique(), df_influenza_100K.loc[slice(start_calibration,end_calibration), age_class]*7, color='black', marker='o', label='Observed data')
axs[id].plot(df_influenza_100K[end_calibration-pd.Timedelta(days=7):end_date].index.get_level_values('date').unique(), df_influenza_100K.loc[slice(end_calibration-pd.Timedelta(days=7),end_date), age_class]*7, color='red', marker='o', label='Unobserved data')
axs[id].plot(df_influenza_100K[pd.Timestamp(end_calibration)-pd.Timedelta(days=7):end_visualisation].index.get_level_values('date').unique(), df_influenza_100K.loc[slice(pd.Timestamp(end_calibration)-pd.Timedelta(days=7),end_visualisation), age_class]*7, color='red', marker='o', label='Unobserved data')
# Model trajectories
axs[id].plot(out['date'],out['Im_inc'].sel(age_group=age_class).mean(dim='draws')/initN.loc[age_class]*100000*7, color='black', linestyle='--', alpha=0.7, linewidth=1, label='Model mean')
axs[id].fill_between(out_noise['date'].values,out_noise['Im_inc'].sel(age_group=age_class).quantile(dim='draws', q=0.025)/initN.loc[age_class]*100000*7, out_noise['Im_inc'].sel(age_group=age_class).quantile(dim='draws', q=0.975)/initN.loc[age_class]*100000*7, color='black', alpha=0.15, label='Model 95% CI')
129 changes: 129 additions & 0 deletions tutorials/influenza_1718/data/interim/ILI_weekly_100K.csv
@@ -0,0 +1,129 @@
DATE,AGE,CASES_100K
2017-10-06,"[0, 5)",3.53073360514691
2017-10-06,"[5, 15)",6.31047999020029
2017-10-06,"[15, 65)",14.5889596615352
2017-10-06,"[65, 120)",11.9210107267218
2017-10-13,"[0, 5)",16.9737694017181
2017-10-13,"[5, 15)",8.69528973038268
2017-10-13,"[15, 65)",28.0775650709766
2017-10-13,"[65, 120)",11.5929095140596
2017-10-20,"[0, 5)",13.9357952104024
2017-10-20,"[5, 15)",5.64212566811102
2017-10-20,"[15, 65)",16.6851618535429
2017-10-20,"[65, 120)",3.32475895497555
2017-10-27,"[0, 5)",35.7180701621351
2017-10-27,"[5, 15)",2.54339199296919
2017-10-27,"[15, 65)",21.9104774626053
2017-10-27,"[65, 120)",18.8986298493355
2017-11-03,"[0, 5)",4.56246232077729
2017-11-03,"[5, 15)",19.2275807263673
2017-11-03,"[15, 65)",13.894810397062
2017-11-03,"[65, 120)",3.22926973845097
2017-11-10,"[0, 5)",17.8943881440409
2017-11-10,"[5, 15)",16.5611955617146
2017-11-10,"[15, 65)",15.2280029793883
2017-11-10,"[65, 120)",7.22884748542992
2017-11-17,"[0, 5)",35.225891714284
2017-11-17,"[5, 15)",17.8943881440409
2017-11-17,"[15, 65)",21.8939658910203
2017-11-17,"[65, 120)",16.5611955617146
2017-11-24,"[0, 5)",21.8939658910203
2017-11-24,"[5, 15)",0.562884573797874
2017-11-24,"[15, 65)",29.8931213849783
2017-11-24,"[65, 120)",29.8931213849783
2017-12-01,"[0, 5)",17.7304964539012
2017-12-01,"[5, 15)",7.09219858156075
2017-12-01,"[15, 65)",39.0070921985816
2017-12-01,"[65, 120)",10.6382978723409
2017-12-08,"[0, 5)",92.1985815602839
2017-12-08,"[5, 15)",63.8297872340427
2017-12-08,"[15, 65)",42.5531914893618
2017-12-08,"[65, 120)",10.6382978723409
2017-12-15,"[0, 5)",17.7304964539012
2017-12-15,"[5, 15)",81.5602836879434
2017-12-15,"[15, 65)",46.0992907801419
2017-12-15,"[65, 120)",10.6382978723409
2017-12-22,"[0, 5)",0.0
2017-12-22,"[5, 15)",21.2765957446813
2017-12-22,"[15, 65)",67.3758865248228
2017-12-22,"[65, 120)",60.2836879432625
2017-12-29,"[0, 5)",138.297872340426
2017-12-29,"[5, 15)",3.5460992907806
2017-12-29,"[15, 65)",78.0141843971633
2017-12-29,"[65, 120)",21.2765957446813
2018-01-05,"[0, 5)",297.872340425532
2018-01-05,"[5, 15)",81.1914893617022
2018-01-05,"[15, 65)",209.219858156029
2018-01-05,"[65, 120)",85.10638297872359
2018-01-12,"[0, 5)",329.787234042553
2018-01-12,"[5, 15)",134.751773049646
2018-01-12,"[15, 65)",177.304964539007
2018-01-12,"[65, 120)",63.8297872340427
2018-01-19,"[0, 5)",404.255319148936
2018-01-19,"[5, 15)",336.879432624114
2018-01-19,"[15, 65)",265.957446808511
2018-01-19,"[65, 120)",88.6524822695037
2018-01-26,"[0, 5)",475.177304964539
2018-01-26,"[5, 15)",716.312056737589
2018-01-26,"[15, 65)",326.241134751773
2018-01-26,"[65, 120)",67.3758865248228
2018-02-02,"[0, 5)",599.290780141844
2018-02-02,"[5, 15)",656.028368794327
2018-02-02,"[15, 65)",460.992907801419
2018-02-02,"[65, 120)",134.751773049646
2018-02-09,"[0, 5)",797.872340425532
2018-02-09,"[5, 15)",1028.36879432624
2018-02-09,"[15, 65)",581.560283687943
2018-02-09,"[65, 120)",209.219858156028
2018-02-16,"[0, 5)",1230.49645390071
2018-02-16,"[5, 15)",535.460992907801
2018-02-16,"[15, 65)",620.567375886525
2018-02-16,"[65, 120)",223.404255319149
2018-02-23,"[0, 5)",556.737588652482
2018-02-23,"[5, 15)",570.921985815603
2018-02-23,"[15, 65)",748.22695035461
2018-02-23,"[65, 120)",297.872340425532
2018-03-02,"[0, 5)",943.262411347518
2018-03-02,"[5, 15)",819.148936170213
2018-03-02,"[15, 65)",684.397163120568
2018-03-02,"[65, 120)",283.687943262411
2018-03-09,"[0, 5)",982.2695035461
2018-03-09,"[5, 15)",1244.68085106383
2018-03-09,"[15, 65)",769.503546099291
2018-03-09,"[65, 120)",304.964539007092
2018-03-16,"[0, 5)",921.985815602837
2018-03-16,"[5, 15)",329.787234042553
2018-03-16,"[15, 65)",372.340425531915
2018-03-16,"[65, 120)",237.588652482269
2018-03-23,"[0, 5)",613.475177304965
2018-03-23,"[5, 15)",861.702127659575
2018-03-23,"[15, 65)",223.404255319149
2018-03-23,"[65, 120)",195.035460992908
2018-03-30,"[0, 5)",507.092198581561
2018-03-30,"[5, 15)",170.212765957447
2018-03-30,"[15, 65)",78.0141843971633
2018-03-30,"[65, 120)",106.382978723404
2018-04-06,"[0, 5)",106.382978723404
2018-04-06,"[5, 15)",70.921985815603
2018-04-06,"[15, 65)",24.8226950354615
2018-04-06,"[65, 120)",28.3687943262412
2018-04-13,"[0, 5)",99.2907801418442
2018-04-13,"[5, 15)",35.4609929078015
2018-04-13,"[15, 65)",14.184397163121
2018-04-13,"[65, 120)",10.6382978723409
2018-04-20,"[0, 5)",0.0
2018-04-20,"[5, 15)",0.0
2018-04-20,"[15, 65)",10.6382978723409
2018-04-20,"[65, 120)",0.0
2018-04-27,"[0, 5)",0.0
2018-04-27,"[5, 15)",0.0
2018-04-27,"[15, 65)",3.5460992907806
2018-04-27,"[65, 120)",0.0
2018-05-04,"[0, 5)",0.0
2018-05-04,"[5, 15)",0.0
2018-05-04,"[15, 65)",3.5460992907806
2018-05-04,"[65, 120)",3.5460992907806
2018-05-11,"[0, 5)",0.0
2018-05-11,"[5, 15)",0.0
2018-05-11,"[15, 65)",0.0
2018-05-11,"[65, 120)",0.0
