-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data & Code] Predicting poverty and wealth from mobile phone metadata #11
Labels
Comments
# Script inputs:
# 1) Training file (implict from config)
# 2) Which objectives we wish to train (must have mapping in config file)
# 3) Any options : e.g tree based, specific features
# 4) It will generate both a model as well as a score file telling you how the model does
# import graphlab as gl
# import graphlab.aggregate as agg
import csv
#import ConfigParser
import sys
import datetime
import time
import math
import datetime
import numpy as np
import os.path
import code
from scipy.stats import mode
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn import linear_model
from sklearn.model_selections import cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import PredefinedSplit
from sklearn import metrics
from sklearn.svm import SVR
from sklearn import svm
from sklearn import tree
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn import preprocessing
from sklearn.feature_selection import RFE
sys.path.append('../libs/')
import ConfigLoader
import TrainingLib
import FeatureNamerLib
#config section
configFile = sys.argv[1]
cf = ConfigLoader.setupGraphLabEnv(configFile)
train_dir = cf.get("Filepath","data_abs_home")+"/"+cf.get("Filepath","train_rel_home") +"/"
train_file = train_dir + cf.get("Filepath","train_file_trainingset")
features_dir = cf.get("Filepath","data_abs_home")+"/"+cf.get("Filepath","features_rel_home") +"/"
#features_dir = features_dir+"/CDRV"+str(cf.getint("RawCDR","version"))+"_FeaturesV"+str(cf.getint("Features","version"))+"/"
features_file = features_dir +cf.get("Filepath","features_file_final")
districtWeights_file = cf.get("Filepath","data_abs_home")+"/"+cf.get("Filepath","aux_rel_home") +"/" + cf.get("Filepath","aux_file_districtWeights")
trainFeatures_excludeList = cf.get("Train","train_excludeprefix_list").split(",")
trainFeatures_exclusiveList = cf.get("Train","train_exclusiveFeatures_list").split(",")
if len(trainFeatures_exclusiveList) == 1 and trainFeatures_exclusiveList[0] =='':
trainFeatures_exclusiveList =[]
if len(trainFeatures_excludeList) == 1 and trainFeatures_excludeList[0] =='':
trainFeatures_excludeList =[]
train_runId = cf.get("Train","train_experimentId")
trainObj_regList = cf.get("Train","train_objreg_list").split(",")
trainObj_boolList = cf.get("Train","train_objbool_list").split(",")
train_option_windsorFeats = cf.getint("Train","train_featswindsor_int")
train_option_doLog = cf.getboolean("Train","train_logFeats_bool")
train_option_doQuad = cf.getboolean("Train","train_quadFeats_bool")
train_idCol = "Id"
run_anon = True #This is for the case where we don't have the ids (for privacy reaons)
if run_anon:
train_idCol = trainObj_regList[0]
#now read in modes to run in
analysis_singler2= "singler2" in sys.argv
analysis_sampleSize = "samplesize" in sys.argv
analysis_roc = "roc" in sys.argv
do_predict = "predict" in sys.argv
train_min_file = train_dir +"trainingSet_"+train_runId+"_mins.csv"
#train_preproc_file =
normalPrefixes =["CDR","SMS","Int","Fea","fea"] #The last two are in case we want to run with anonymized feature names
# PART 1: Loading in the features correctly
#1.A: in roder to use logarthmic transforms of features we need to compute the minimum of all features
if not os.path.exists(train_min_file):
print "***Calculating new minimum values ***\n"
TrainingLib.writeMins(features_file,train_min_file)
wg_minMap = TrainingLib.readMins(train_min_file)
#1.B: now load in the training set
#Loading in the file and getting all the features cleaned up, preprocessed etc.
rawFeats, trainHeaders, trainIndices, trainIds =TrainingLib.processFile(train_file, normalPrefixes, trainFeatures_excludeList, train_idCol,
fileDelimiter= ",", exclusiveList = trainFeatures_exclusiveList)
wg_mins = [wg_minMap[header] for header in trainHeaders]
distCol = TrainingLib.readDistrictAssign(train_file, "district")
dWeightMap = TrainingLib.makeDistWeightMap(districtWeights_file)
reWeigh_rowCols, reWeight_crossVal = TrainingLib.sampleWithRepl(distCol, dWeightMap)
rawFeats = rawFeats[reWeigh_rowCols]
#Now we apply the config defind cleaning/transforms etc
allFeats, unMeans, unStds, unWinds,unLogWinds,unQuadWinds = TrainingLib.featureCleanAndTrans(rawFeats, wg_mins,
train_option_doLog, train_option_doQuad,train_option_windsorFeats)
allHeaders = trainHeaders
if train_option_doLog:
allHeaders = allHeaders + ["Log " + s for s in trainHeaders]
if train_option_doQuad:
allHeaders = allHeaders + ["Quad " + s for s in trainHeaders]
#1.c: defining some basic training search params, algs etc.
crossVal_method = reWeight_crossVal
reg_method = [linear_model.ElasticNetCV, linear_model.ElasticNet,"ElasticNet"]
reg_alpha = [1000.0,100.0,10.0,1.0,0.1,0.01, 0.001, 0.0001, 0.00001, 0.000001]
reg_alphaLevel2 = True
reg_metric = "r2"
class_method = ["l1"] #Note that we always use log regression, so you jsut need to deifn the reg mehod
class_alpha = [0.001,0.01,0.1,1,10,100,1000]
class_alphaLevel2 = False
class_metric = "roc_auc"
#Part 2: Nowe move onto training. Part A consists in picking the objective, loading and cleaning it depending on whether its normal regression or logisitc regressiom
for objName in (trainObj_regList+trainObj_boolList):
dummyFeats, dummyHeaders, dummyIndices, dummyIds, objs =TrainingLib.processFile(train_file, [], trainFeatures_excludeList, train_idCol,objectiveName=objName)
currObj_reg = objName in trainObj_regList
#preparing the different training methods if normal regression or logistic regression
if currObj_reg:
objs = TrainingLib.cleanData(objs, train_option_windsorFeats)
currAlphas = reg_alpha
currMethodCV = reg_method[0](alphas = currAlphas, cv = crossVal_method)
currMethod = reg_method[1](max_iter=2000)
curr_alphaLevel2 = reg_alphaLevel2
currMetric = reg_metric
else:
objs = TrainingLib.cleanData(objs, 0)
objs = TrainingLib.processBoolObj(objName, objs)
currAlphas=class_alpha
currMethod = LogisticRegression(penalty=class_method[0], tol=0.01, class_weight="auto")
currMethodCV = LogisticRegressionCV(Cs = class_alpha, penalty=class_method[0], tol=0.01, class_weight="auto",cv = crossVal_method,solver='liblinear', max_iter=1000, scoring = class_metric)
curr_alphaLevel2=class_alphaLevel2
currMetric = class_metric
objs = objs[reWeigh_rowCols]
#Analysis operations (as defined in command line)
#investigating performance on smaller subsets of train data
if analysis_sampleSize and currObj_reg:
TrainingLib.doSubset(allFeats, objs, crossVal_method, train_dir+"sampleSize.csv")
continue
#generate R2 for single features to see which ones are best
if analysis_singler2:
if os.path.isfile(train_dir +objName+"_feature_r2.csv"):
continue
featR2= {}
reps =10
for i in range(0, len(trainHeaders)):
runSum = 0
for rep in range(0, reps):
newFeats = np.transpose(np.array([allFeats[:,i]]))
if currObj_reg:
score = np.mean(cross_validation.cross_val_score(linear_model.LinearRegression(), newFeats, objs,cv=crossVal_method, scoring="r2"))
else:
score = np.mean(cross_validation.cross_val_score(linear_model.LogisticRegression(), newFeats, objs,cv=crossVal_method, scoring="roc_auc"))
runSum += score
#print trainHeaders[i] +":" + `runSum/reps`
featR2[trainHeaders[i]] = runSum/reps
with open(train_dir+objName+"_feature_r2.csv", 'wb') as f:
writer = csv.writer(f)
for trainHeader in trainHeaders:
writer.writerow([trainHeader,`featR2[trainHeader]`])
continue
#2.B: Now the training params and data is set up we find the optimal reg params and get the train/test scores
(finalModel, finalModel_trainScore,finalModel_testScore) = TrainingLib.runTraining(allFeats, objs, currMethod,
currMethodCV, crossVal_method,
curr_alphaLevel2, currObj_reg,
currMetric)
if currObj_reg:
finalParam = finalModel.alpha
else:
finalParam = finalModel.C
print 'Objective: ' + objName
print 'Test score: ' + `finalModel_testScore`
print 'Train score: ' + `finalModel_trainScore`
print 'Regularization parameter: ' + `finalParam`
print 'Num Feats: ' + `np.count_nonzero(finalModel.coef_)`
#Analysis to generate roc curves for boolean predictions
#THis analysis has to be done AFTER the model is trained
if analysis_roc and not currObj_reg:
trainFeats = allFeats[crossVal_method.test_fold != 0]
trainObjs = objs[crossVal_method.test_fold != 0]
testFeats = allFeats[crossVal_method.test_fold == 0]
testObjs = objs[crossVal_method.test_fold == 0]
train_bestModel_roc = finalModel
train_bestModel_roc.fit(trainFeats,trainObjs)
logReg_probs = train_bestModel_roc.predict_proba(testFeats)
fpr, tpr, thresholds = metrics.roc_curve(testObjs,logReg_probs[:,1])
output = np.transpose(np.append(np.array([fpr]),np.array([tpr]),0))
np.savetxt(train_dir+objName+"_roc.csv", output,delimiter=",", fmt ='%3f')
if not do_predict:
continue
#Part 2C:Now we store the important model infromation that will enable us to repredict stuff
#code.interact(local=locals())
finalModel_colIndices = np.nonzero(finalModel.coef_)[0]
finalModel_bestHeaders = [allHeaders[i] for i in finalModel_colIndices]
finalModel_bestHeaders_repl = [word.replace("Log ","") for word in finalModel_bestHeaders]
finalModel_bestHeaders_repl= [word.replace("Quad ","") for word in finalModel_bestHeaders_repl]
#Part 3: Finally we load back in the all the features and make the prediction
wg_rawFeats, wg_headers, wg_indices, wg_ids = TrainingLib.processFile(features_file, normalPrefixes, trainFeatures_excludeList, "numId",
includeList = finalModel_bestHeaders_repl)
#Handle the transforms and cleans
wg_logIndices = [i for i in range(0,len(finalModel_bestHeaders )) if finalModel_bestHeaders[i].startswith("Log ")]
wg_quadIndices =[i for i in range(0,len(finalModel_bestHeaders )) if finalModel_bestHeaders[i].startswith("Quad ")]
wg_normIndices = [i for i in range(0,len(finalModel_bestHeaders )) if i not in wg_quadIndices
and i not in wg_logIndices]
wg_logHeaders = np.array(wg_headers)[wg_logIndices]
wg_logMins = [wg_minMap[s] for s in wg_logHeaders]
TrainingLib.cleanAndTransformSubset(wg_rawFeats, wg_normIndices, 0, func_transform=None)
TrainingLib.cleanAndTransformSubset(wg_rawFeats, wg_logIndices, 0, func_transform=TrainingLib.logTransform, wgMins = list(wg_logMins))
TrainingLib.cleanAndTransformSubset(wg_rawFeats, wg_quadIndices, 0, func_transform=TrainingLib.quadTransform)
unAllWinds = np.concatenate([unWinds, unLogWinds], axis = 1)
unAllWinds = np.concatenate([unAllWinds, unQuadWinds], axis= 1)
TrainingLib.windsor_apply(wg_rawFeats, unAllWinds[:,finalModel_colIndices])
wg_rawFeats = TrainingLib.removeLowVar(wg_rawFeats,t=1e-9)
unMeans_curr = np.array([unMeans])[:,finalModel_colIndices]
unStds_curr = np.array([unStds])[:,finalModel_colIndices]
for i in range(0, unStds_curr.shape[0]):
if unStds_curr[0][i] == 0:
unStds_curr[0][i] = 1
wg_allFeats = (wg_rawFeats-unMeans_curr)/unStds_curr
finalModel.coef_ = finalModel.coef_[finalModel_colIndices]
if currObj_reg:
wg_prediction = finalModel.predict(wg_allFeats)
wg_out = np.transpose(np.array([wg_ids,wg_prediction]))
else:
wg_predictionprob =finalModel.predict_proba(wg_allFeats)[:,1]
wg_out = np.transpose(np.array([wg_ids,wg_predictionprob]))
TrainingLib.outputPredictions(wg_out, train_dir+train_runId+"_"+objName+"_predictions.csv", stdevs = (currObj_reg))
if analysis_singler2:
FeatureNamerLib.joinR2Preds(trainObj_regList+trainObj_boolList, train_dir)
|
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
matplotlib.style.use('ggplot')
df = pd.read_csv('./modeling/data/train/PhoneSurveyFeatsHash.csv')
df.head()
X = df.drop(['Unnamed: 0',
'district',
'own_electric',
'many_radio',
'many_tv',
'own_fridge',
'many_bike',
'many_moto',
'type_toilet',
'hh_adults',
'hh_children',
'wealthActual',
'wealthPredicted'], axis = 1)
y = df['wealthActual']
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
train_size=0.8, test_size = 0.2)
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Predicting poverty and wealth from mobile phone metadata
Joshua Blumenstock, Gabriel Cadamuro, Robert On
Mobile phone data were supplied by an anonymous service provider in Rwanda and are not available for distribution.
All other data and code, including all intermediate data needed to replicate these results and apply these methods in other contexts, are available through the Inter-university Consortium for Political and Social Research (http://doi.org/10.3886/E50592V2). https://www.openicpsr.org/openicpsr/project/100144/version/V5/view
INTRODUCTION
The code and data required to replicate the findings in the paper are organized into three top-level directories:
/modeling: Used to construct features from call records, and to perform supervised learning on the phone survey sample
/validation: Used to aggregate and spatially join phone and DHS data
/analysis: Used to construct most figures and tables
SYSTEM REQUIREMENTS
* Python running sklearn 0.16.1 and graphlab 1.6.1
* JVM
* Spark
* SBT
* Edit paths in run_local.sh
* Additional details on individual files, including dependencies, is provided in this readme for convenience:
CODE OVERVIEW
Note: ICPSR does not allow certain file extensions, these are automatically replaced by .txt and should be changed back
/analysis
/analysis/fig1.R
/modeling/data/train/ROC.csv
/analysis/fig3.R
/validation/data/out/clusters2010.csv
/analysis/figS2.R
/analysis/figS3.R
/analysis/figS8.R
/validation/data/out/districts2007.csv
/analysis/rwmap/District_Boundary_2012/
/analysis/figS4/code/figS4.R
/analysis/figS4/data/in/trainFeats.csv
/analysis/figS4/data/in/trainObj.csv
/modeling
Note: all python files in /modeling/code/scripts require a link to a config file as the first command line argument
/modeling/code/scripts/PhoneSurveyPreProc.py
survey population weights
/modeling/data/preprocessing/survey_weights.csv
/modeling/data/preprocessing/survey_means.csv
/modeling/data/preprocessing/survey_stds.csv
PCA vectors, used to construct projections later
/modeling/code/scripts/FeatureGenerator.py
Performs feature engineering on original CDR files using graphlab. DFA structure specified in config file
/modeling/data/cdr/sms_cdr.csv
/modeling/data/cdr/intl_cdr.csv
Original call detail records (redacted)
/modeling/data/cdr/allIds.csv
List of all ids appearing in any CDR (redacted)
/modeling/code/libs/AggregateFeatureLib.py
/modeling/code/libs/FeatureLib.py
/modeling/code/libs/LocalNetworkLib.py
/modeling/data/train/PhoneSurveyFeats.csv
contains survey responses and CDR features for all survey respondents, including actual and predicted wealth composite (redacted).
/modeling/data/train/PhoneSurveyFeatsHash.csv
same as above, but includes original data and masks the call metrics for each subscriber.
Can be used for replication of e.g. Fig 1.
/modeling/code/scripts/TrainModel.py
Supervised learning models fit using cross-validation on training set, and out-of-sample predictions generated. Utilizes CLI’s to generate additional analysis files used to construct figures (see next to each output file for the command line required)
/modeling/data/train/PhoneSurveyFeatsHash.csv
/modeling/data/features/finalFeatures.csv
/modeling/data/aux/districtWeights.csv
/modeling/code/libs/TrainLib.py
/modeling/code/libs/FeatureNamerLib.py
/modeling/data/train/allPredictions.csv
Predicted values of wealth and assets for all mobile subscribers
/modeling/data/train/_roc.csv [with arg = “roc”]
Data used to generate ROC curves, based on predictions
/modeling/data/train/feature_r2.csv [with arg = “singler2”]
Bivariate R2 values from a regression of wealth index on each individual CDR feature. Reports AUC values for binary assets
/modeling/data/train/sampleSize.csv [with arg = “samplesize”]
Model performance as a function of sample size
/validation/
/validation/code/run_joint.sh
Joins CDR predictions with DHS data to validate predictions. Script first reads in asset and wealth predictions from CDR data and DHS data. Then reads location data from CDR towers and DHS cluster locations. Performs spatial join of wealth imputed from CDR locations to wealth recorded in DHS locations. Aggregates to district and cluster level
data/in/DhsRwandaClusters10.csv - 2010 DHS Cluster Data
data/in/DhsRwandaHH0708.tsv - 2007 DHS Household Survey Data
data/in/DhsRwandaHH10.tsv - 2010 DHS Household Survey Data
Note that Dhs* data is available at dhsprogram.com, these files only contain the header
data/in/Towers.csv - Cell tower lat/long coordinates (redacted)
data/in/*_HourlyCOG.csv - Approximate location of each subscriber (redacted)
modeling/data/train/allPredictions.csv - see above
/validation/data/out/districts2010.csv
/validation/data/out/clusters2010.csv
/validation/data/out/districts2007.csv
/validation/data/out/clusters2007.csv
The text was updated successfully, but these errors were encountered: