<!DOCTYPE html>
<html>
<head>
<title>Titanic survival prediction - datawerk</title>
<meta charset="utf-8" />
<link href="https://buhrmann.github.io/theme/css/bootstrap-custom.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/pygments.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/style.css" rel="stylesheet" />
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
<link rel="shortcut icon" type="image/png" href="https://buhrmann.github.io/theme/css/logo.png">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta name="author" contents="Thomas Buhrmann"/>
<meta name="keywords" contents="datawerk, R,kaggle,titanic,report,classification,"/>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-56071357-1', 'auto');
ga('send', 'pageview');
</script> </head>
<body>
<div class="wrap">
<div class="container-fluid">
<div class="header">
<div class="container">
<nav class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="https://buhrmann.github.io">
<!-- <span class="fa fa-pie-chart navbar-logo"></span> datawerk -->
<span class="navbar-logo"><img src="https://buhrmann.github.io/theme/css/logo.png" style=""></img></span>
</a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<!--<li><a href="https://buhrmann.github.io/archives.html">Archives</a></li>-->
<li><a href="https://buhrmann.github.io/posts.html">Blog</a></li>
<li><a href="https://buhrmann.github.io/pages/cv.html">Interactive CV</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Reports<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/p2p-loans.html">Interest rates on <span class="caps">P2P</span> loans</a>
</li>
<li >
<a href="https://buhrmann.github.io/activity-data.html">Categorisation of inertial activity data</a>
</li>
<li >
<a href="https://buhrmann.github.io/titanic-survival.html">Titanic survival prediction</a>
</li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Apps<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/elegans.html">C. elegans connectome explorer</a>
</li>
<li >
<a href="https://buhrmann.github.io/dash+.html">Dash+ visualization of running data</a>
</li>
</ul>
</li>
</ul>
</div>
</nav>
</div>
</div><!-- header -->
</div><!-- container-fluid -->
<div class="container main-content">
<div class="row row-centered">
<div class="col-centered col-max col-min col-sm-12 col-md-10 col-lg-10 main-content">
<section id="content" class="article content">
<header>
<span class="entry-title-info">Oct 23 · <a href="https://buhrmann.github.io/category/reports.html">Reports</a></span>
<h2 class="entry-title entry-title-tight">Titanic survival prediction</h2>
</header>
<div class="entry-content">
<p>In this report I will provide an overview of my solution to <a href="http://www.kaggle.com">kaggle’s</a> <a href="https://www.kaggle.com/c/titanic-gettingStarted">“Titanic” competition</a>. The aim of this competition is to predict the survival of passengers aboard the Titanic using information such as a passenger’s gender, age or socio-economic status. I will explain my data munging process, explore the available predictor variables, and compare a number of different classification algorithms in terms of their prediction performance. All analysis presented here was performed in R. The corresponding source code is available on <a href="https://github.com/synergenz/kaggle/tree/master/titanic">github</a>.</p>
<figure>
<img src="/images/titanic/titanic.jpg" alt="Titanic"/>
</figure>
<h3>Data munging</h3>
<p>The <a href="https://www.kaggle.com/c/titanic-gettingStarted/data">data set</a> provided by kaggle contains 1309 records of passengers aboard the Titanic at the time it sank. Each record contains 11 variables describing the corresponding person: survival (yes/no), class (1 = Upper, 2 = Middle, 3 = Lower), name, gender and age; the number of siblings and spouses aboard, the number of parents and children aboard, the ticket number, the fare paid, a cabin number, and the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton). Of the 1309 records, 891 include the survival label and thus constitute the training set, while the remaining 418 do not and are used by kaggle to assess the accuracy of submitted predictions.</p>
<p>To facilitate the training of classifiers for the prediction of survival, and for purposes of presentation, the data was preprocessed in the following way. All categorical variables were treated as factors (ordered where appropriate, e.g. in the case of class). From each passenger’s name the title was extracted and added as a new predictor variable.</p>
<div class="highlight"><pre><span></span>data<span class="o">$</span>title <span class="o">=</span> <span class="kp">sapply</span><span class="p">(</span>data<span class="o">$</span>name<span class="p">,</span> FUN<span class="o">=</span><span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span> <span class="kp">strsplit</span><span class="p">(</span>x<span class="p">,</span> split<span class="o">=</span><span class="s">'[,.]'</span><span class="p">)[[</span><span class="m">1</span><span class="p">]][</span><span class="m">2</span><span class="p">]})</span>
data<span class="o">$</span>title <span class="o">=</span> <span class="kp">sub</span><span class="p">(</span><span class="s">' '</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> data<span class="o">$</span>title<span class="p">)</span>
</pre></div>
<p>This resulted in a factor with a great number of levels, many of which could be considered similar in terms of implied societal status. To simplify matters the following levels were combined: ‘Mme’, ‘Mlle’ and ‘Ms’ were re-assigned to the level ‘Miss’; ‘Capt’, ‘Col’, ‘Don’, ‘Major’, ‘Sir’ and ‘Dr’, as male titles of rank or nobility, to the level ‘Sir’; and ‘Dona’, ‘Lady’, ‘the Countess’ and ‘Jonkheer’, as female titles of nobility, to the level ‘Lady’.</p>
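<p>In code, the merging looks roughly like this (a sketch; the groupings are exactly those listed above):</p>
<div class="highlight"><pre># Collapse rare titles into broader levels
data$title[data$title %in% c('Mme', 'Mlle', 'Ms')] = 'Miss'
data$title[data$title %in% c('Capt', 'Col', 'Don', 'Major', 'Sir', 'Dr')] = 'Sir'
data$title[data$title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] = 'Lady'
data$title = factor(data$title)
</pre></div>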
<p>The numbers of siblings/spouses and parents/children aboard were combined into a single family-size variable. In addition, a categorical variable was formed from this data by assigning records to three approximately equally sized levels of ‘singles’, ‘small’ and ‘big’ families. Also, another factor was added aimed at <em>uniquely</em> identifying big families. To this end each passenger’s surname was combined with the corresponding family size (resulting e.g. in the factor level “11Sage”), but such that families smaller than a certain size (n=4) were all assigned the level “small”.</p>
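<p>A minimal sketch of these transformations (the variable names match those used in the formulas below, but the exact bin boundaries are assumptions):</p>
<div class="highlight"><pre># Family size from siblings/spouses (sibsp) and parents/children (parch)
data$familysize = data$sibsp + data$parch + 1
# Three roughly equally sized categories
data$familysizefac = cut(data$familysize, breaks=c(0, 1, 4, Inf), labels=c('singles', 'small', 'big'))
# Identifier for big families, e.g. "11Sage"; smaller families are lumped together
surname = sapply(data$name, function(x) { strsplit(x, split='[,.]')[[1]][1] })
data$familyid = paste0(data$familysize, surname)
data$familyid[data$familysize < 4] = 'small'
data$familyid = factor(data$familyid)
</pre></div>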
<p>Age information was missing for many records (about 20%). Since age can be hypothesised to correlate well with information such as a person’s title (e.g. “Master” was used to refer politely to young children), this data was imputed using a regression tree (rpart) trained to predict age from the remaining variables:</p>
<div class="highlight"><pre><span></span>agefit <span class="o">=</span> rpart<span class="p">(</span>age <span class="o">~</span> pclass <span class="o">+</span> sex <span class="o">+</span> sibsp <span class="o">+</span> parch <span class="o">+</span> fare <span class="o">+</span> embarked <span class="o">+</span> title <span class="o">+</span> familysize<span class="p">,</span> data<span class="o">=</span>data<span class="p">[</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>data<span class="o">$</span>age<span class="p">),],</span> method<span class="o">=</span><span class="s">"anova"</span><span class="p">)</span>
data<span class="o">$</span>age<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>data<span class="o">$</span>age<span class="p">)]</span> <span class="o">=</span> predict<span class="p">(</span>agefit<span class="p">,</span> data<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>data<span class="o">$</span>age<span class="p">),</span> <span class="p">])</span>
</pre></div>
<p>From the imputed age variable a factor was constructed indicating whether or not a passenger is a “child” (age < 16).</p>
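<p>For example (the variable name is an assumption):</p>
<div class="highlight"><pre>data$child = factor(ifelse(data$age < 16, 'child', 'adult'))
</pre></div>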
<p>The fare variable contained 18 missing values (17 fares with a value of 0 and one <span class="caps">NA</span>), which were imputed using a decision tree analogous to the above method for the age variable. Since this variable was far from normally distributed (which might violate some algorithms’ assumptions), another factor was created splitting the fare into three approximately equally populated levels.</p>
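<p>A sketch of the fare imputation and binning (the cut points for the three levels are assumptions):</p>
<div class="highlight"><pre># Treat zero fares as missing, then impute with a regression tree as for age
data$fare[data$fare == 0] = NA
farefit = rpart(fare ~ pclass + sex + age + sibsp + parch + embarked + title + familysize,
                data=data[!is.na(data$fare),], method="anova")
data$fare[is.na(data$fare)] = predict(farefit, data[is.na(data$fare), ])
# Split into three approximately equally populated levels
data$farefac = cut(data$fare, breaks=quantile(data$fare, probs=seq(0, 1, length.out=4)),
                   include.lowest=TRUE, labels=c('low', 'mid', 'high'))
</pre></div>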
<p>Cabin and ticket information was sparse, i.e. missing for most passengers, and was not considered further as a predictor for classification. The embarkation variable contained a single missing value, which was replaced with the majority value (Southampton).</p>
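<p>For instance (assuming the empty value was read in as <span class="caps">NA</span>):</p>
<div class="highlight"><pre>data$embarked[is.na(data$embarked)] = 'S'  # majority port: Southampton
</pre></div>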
<p>All of the above transformations were performed on the joined train and test data, which was thereafter split again into the original two sets.</p>
<p>In summary, the processed data set contains the following features. Five unordered factors: gender, port of embarkation, title, child and family id. Three ordered factors: class, family-size category and fare category. And three numerical predictors: age, fare price and family size (of which only age is approximately normally distributed).</p>
<h3>Data exploration</h3>
<p>Some <a href="http://en.wikipedia.org/wiki/RMS_Titanic">background information</a> about the Titanic disaster might prove useful to formulate hypotheses about the type of people more likely to have survived, i.e. those more likely to have had access to lifeboats. The ship carried only enough lifeboats for slightly more than half the number of people on board (and many were launched half-full). In this respect, the most significant aspect of the rescue effort was the “women and children first” policy followed in the majority of lifeboat loadings. Additionally, those on the upper decks (i.e. those in the upper classes) had easier access to lifeboats, not least because of their closer physical proximity to the lifeboats than those on the lower decks. It should thus not come as a surprise that survival was heavily skewed towards women, children and, in general, those of the upper class.</p>
<p>As a first step let’s look at survival rates as a function of each factor variable in the training set, shown in Figure 1.
<figure >
<img src="/images/titanic/facBars.png" alt="Survival vs Factors" />
<img src="/images/titanic/isChildBars.png" alt="Survival vs Child"/>
<img src="/images/titanic/titleBars.png" alt="Survival vs Title"/>
<figcaption class="capCenter">Figure 1: Proportion of survivors as a function of several categorical predictors. Blue:survived, red: perished. For the title variable, proportions are relative to each level. For the remaining variables overall proportions are displayed. </figcaption>
</figure></p>
<p>Clearly, male passengers were at a huge disadvantage. They were about 5 times more likely to die than to survive. In contrast, female passengers were almost 3 times more likely to survive than to die. Next, while 1st class passengers were more likely to survive, chances were tilted badly against 3rd class passengers (in the 2nd class the chances were about equal). While a difference in survival rate can also be seen depending on the port of embarkation, this variable is so highly imbalanced that the differences could be spurious. Regarding family size, singles were much more likely to die than to survive. However, this balance is heavily affected by the fact that of the 537 singles 411 were male and only 126 female; gender thus confounds this family-size level. When considering only non-singles we see a slight effect of larger family size leading to a lower probability of survival. The fare variable essentially mirrors the class variable. Those who paid more for their ticket (and were thus probably of a higher socio-economic status) were somewhat more likely to survive than to perish, while passengers with the cheapest tickets were much more likely to die. The title variable mostly confirms the earlier trends. Passengers with female titles (Lady, Miss, Mrs), as well as young passengers (Master), were more likely to survive than adult male passengers (Mr, Sir, Reverend). And amongst the male adults, those of nobility (Sir) had a better chance of survival than “common” travellers (Mr). A slight effect of age on survival can also be seen in the “is child” variable (most children survived, while most adults died), but the number of children was relatively low overall.</p>
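<p>Proportions like these can be read off simple contingency tables, e.g. for gender:</p>
<div class="highlight"><pre># Survival proportions within each gender (rows sum to one)
prop.table(table(train$sex, train$survived), margin=1)
</pre></div>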
<p>The numeric variables further support the trend observed in the corresponding factors, as can be seen in Figure 2 below.
<figure>
<img src="/images/titanic/expContVar.png" alt="Numerical predictors" />
<figcaption class="capCenter">Figure 2: Survival distributions for numerical predictors (red=survived, blue=died). Left: A box plot of fair price, y axis is log-scaled. Right: density estimate of survival vs age. </figcaption>
</figure>
Those that survived travelled on a more expensive ticket on average than those who died. And for young children we see a peak in the probability of survival.</p>
<p>To develop some intuition about the importance of the different predictors, and how they might be used by a classifier, it may help to train a simple decision tree on the data, a model that is easy to interpret. Let’s start by sticking mostly to the original predictors (not including non-normal variables converted to factors, nor engineered variables like the title):</p>
<div class="highlight"><pre><span></span>dc1 <span class="o">=</span> rpart<span class="p">(</span>survived <span class="o">~</span> pclass <span class="o">+</span> sex <span class="o">+</span> age <span class="o">+</span> familysize <span class="o">+</span> fare <span class="o">+</span> embarked<span class="p">,</span> data<span class="o">=</span>train<span class="p">,</span> method<span class="o">=</span><span class="s">"class"</span><span class="p">)</span>
</pre></div>
<p>A tree trained on the remaining predictors is shown below in Figure 3.</p>
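<p>The tree can be visualised, for example, with the rpart.plot package (a sketch, not necessarily the code used to produce the figure):</p>
<div class="highlight"><pre>library(rpart.plot)
# extra=104: show per-class probabilities and the percentage of observations in each node
rpart.plot(dc1, type=2, extra=104)
</pre></div>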
<figure>
<img src="/images/titanic/dectree1.png" alt="Decision tree 1"/>
<figcaption class="capCenter">Figure 3: Decision tree predicting survival. Each node displays its survival prediction (yes=blue, no=red), the probability of belonging to each class conditioned on the node (sum to one within node), as well as the percentage of observations in each node (sum to 100% across leaves). </figcaption>
</figure>
<p>The resulting decision tree should not be surprising. Without any further information (at the root node) the classifier always predicts that a passenger would not survive, which is of course the best constant guess given that 62% of all passengers died while only 38% survived. Next, the tree splits on the gender variable. For male passengers over the age of 13 the classifier predicts death, while younger boys are predicted to survive unless they belong to a large family. On the female branch, those belonging to the upper classes are predicted to survive. Those in the third class, in contrast, are predicted to survive only when they belong to a relatively small family (size < 4.5) and are under the age of 36. Those who were older, or members of a bigger family, are more likely to have died. The fare and embarkation variables are not used in the final tree. Since we already know that fare correlates strongly with class, and that embarkation is strongly imbalanced, this is not surprising. “Factorised” variables derived from non-uniformly distributed predictors (fare category, family-size category and “is child”) are not needed when training the tree, as it automatically determines the best value at which to split each variable.</p>
<p>How about the engineered variables, a passenger’s title and family id? One possible problem here is that these factors contain relatively many levels. Standard decision trees split nodes by measures such as information gain, which are biased in favour of attributes with more levels. Regular trees will therefore often produce results in which such categorical variables dominate others. This biased predictor selection can be avoided using conditional inference trees (ctrees), which will be employed later when exploring different classifiers more systematically.</p>
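<p>As a preview, fitting such a tree with the party package takes a single call (a sketch; the formula is illustrative only):</p>
<div class="highlight"><pre>library(party)
cit = ctree(survived ~ pclass + sex + age + fare + title + familyid, data=train)
plot(cit)
</pre></div>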
<p>As a last step, we compare the distribution of variables between the train and the test set, to avoid potential surprises arising from imbalanced splits of the data. Instead of pulling out and displaying all tables for the categorical variables in both sets, we first use a chi-square test to single out those categorical variables whose levels are distributed differently:</p>
<div class="highlight"><pre><span></span>factabs <span class="o">=</span> <span class="kp">lapply</span><span class="p">(</span>varnames<span class="p">[</span>facvars<span class="p">],</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span> <span class="kt">data.frame</span><span class="p">(</span><span class="kp">cbind</span><span class="p">(</span><span class="kp">table</span><span class="p">(</span>train<span class="p">[,</span>x<span class="p">]),</span> <span class="kp">table</span><span class="p">(</span>test<span class="p">[,</span> x<span class="p">])))})</span>
pvals <span class="o">=</span> <span class="kp">sapply</span><span class="p">(</span>faccomp<span class="p">,</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> chisq.test<span class="p">(</span>x<span class="p">)</span><span class="o">$</span>p.value<span class="p">)</span>
faccomp<span class="p">[[</span><span class="kp">which</span><span class="p">(</span>pvals<span class="o"><</span><span class="m">0.05</span><span class="p">)]]</span>
</pre></div>
<p>Only the embarkation variable shows a slight but apparently significant difference between the train and test set, with the difference in the proportions of people embarked in Cherbourg vs. Southampton being slightly less pronounced in the test set (C=0.188, S=0.725 in the training set, and C=0.244, S=0.646 in the test set). Since the overall tendency is preserved we assume this difference will not affect the quality of our predictions. Comparing five-number summaries for the numerical variables showed no further differences in distribution between the train and test sets.</p>
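<p>The five-number comparison can be done along these lines (assuming numvars indexes the numerical variables analogously to facvars above):</p>
<div class="highlight"><pre>numcomp = lapply(varnames[numvars], function(x) {
  rbind(train=fivenum(train[,x]), test=fivenum(test[,x]))
})
</pre></div>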
<h3>Classifier training</h3>
<p>I decided to use the caret package in R to train and compare a variety of different models. I should note that finding a better way to preprocess, engineer and extend the available data is often more important than the small improvements gained from using a better classifier. However, since the Titanic data set is very small and consists mostly of categorical variables, and since I know of no way to collect more data on the problem (without cheating), I suspect that some classifiers might in this particular case perform better than others.</p>
<p>The caret package provides a unified interface for training a large number of different learning algorithms, including options for validating learners using cross-validation (and related techniques), which can simultaneously be used to tune model-specific hyper-parameters. My overall approach will be this: first I train a number of classifiers using repeated cross-validation to estimate their prediction accuracy. Next I create ensembles of these classifiers and compare their accuracy to that of the individual classifiers. Lastly, I choose the best (individual or ensemble) classifier to create predictions for the kaggle competition. Usually I would maintain a holdout set for validation and comparison of the various tuned algorithms. Because the data set is already small, however, I decided to rely on the results from repeated cross-validation (10 folds, 10 repeats). It might nevertheless be insightful to at least compare the cross-validated metrics (using the full data set) to those measured on a holdout set, even when ultimately training the final classifier on the whole training set. We’ll therefore start by training with 20% of the data reserved for a validation set.</p>
<p>Here’s my approach to more or less flexibly building a set of different classifiers using caret:</p>
<div class="highlight"><pre><span></span>rseed <span class="o">=</span> <span class="m">42</span>
scorer <span class="o">=</span> <span class="s">'ROC'</span> <span class="c1"># 'ROC' or 'Accuracy'</span>
summarizor <span class="o">=</span> <span class="kr">if</span><span class="p">(</span>scorer <span class="o">==</span> <span class="s">'Accuracy'</span><span class="p">)</span> defaultSummary <span class="kr">else</span> twoClassSummary
selector <span class="o">=</span> <span class="s">"best"</span> <span class="c1"># "best" or "oneSE"</span>
folds <span class="o">=</span> <span class="m">10</span>
repeats <span class="o">=</span> <span class="m">10</span>
pp <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">"center"</span><span class="p">,</span> <span class="s">"scale"</span><span class="p">)</span>
cvctrl <span class="o">=</span> trainControl<span class="p">(</span>method<span class="o">=</span><span class="s">"repeatedcv"</span><span class="p">,</span> number<span class="o">=</span>folds<span class="p">,</span> repeats<span class="o">=</span>repeats<span class="p">,</span> p<span class="o">=</span><span class="m">0.8</span><span class="p">,</span>
summaryFunction<span class="o">=</span>summarizor<span class="p">,</span> selectionFunction<span class="o">=</span>selector<span class="p">,</span> classProbs<span class="o">=</span><span class="bp">T</span><span class="p">,</span>
savePredictions<span class="o">=</span><span class="bp">T</span><span class="p">,</span> returnData<span class="o">=</span><span class="bp">T</span><span class="p">,</span>
index<span class="o">=</span>createMultiFolds<span class="p">(</span>trainset<span class="o">$</span>survived<span class="p">,</span> k<span class="o">=</span>folds<span class="p">,</span> times<span class="o">=</span>repeats<span class="p">))</span>
</pre></div>
<p>First, we set a random seed to make results reproducible. Next we select whether to optimise prediction accuracy or the area under the <span class="caps">ROC</span> curve, as well as the number of folds for cross-validation and the number of times to repeat the validation. Some algorithms require normalised data, which here means centering and scaling. Finally, we set up the training control structure expected by the caret package. We then define a number of formulas to be used by the classifiers:</p>
<div class="highlight"><pre><span></span>fmla0 <span class="o">=</span> survived <span class="o">~</span> pclass <span class="o">+</span> sex <span class="o">+</span> age
fmla1 <span class="o">=</span> survived <span class="o">~</span> pclass <span class="o">+</span> sex <span class="o">+</span> age <span class="o">+</span> fare <span class="o">+</span> embarked <span class="o">+</span> familysizefac <span class="o">+</span> title
<span class="kc">...</span>
fmla <span class="o">=</span> fmla1
</pre></div>
<p>No surprises here. Caret accepts parameter grids over which to search for the best hyperparameters. Here we set these up for our selected algorithms and combine them in a list, along with additional model parameters expected by caret (such as a string identifying the type of model):</p>
<div class="highlight"><pre><span></span>glmnetgrid <span class="o">=</span> <span class="kp">expand.grid</span><span class="p">(</span><span class="m">.</span>alpha <span class="o">=</span> <span class="kp">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0.1</span><span class="p">),</span> <span class="m">.</span>lambda <span class="o">=</span> <span class="kp">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0.1</span><span class="p">))</span>
<span class="kc">...</span>
rfgrid <span class="o">=</span> <span class="kt">data.frame</span><span class="p">(</span><span class="m">.</span>mtry <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
configs <span class="o">=</span> <span class="kt">list</span><span class="p">()</span>
configs<span class="o">$</span>glmnet <span class="o">=</span> <span class="kt">list</span><span class="p">(</span>method<span class="o">=</span><span class="s">"glmnet"</span><span class="p">,</span> tuneGrid<span class="o">=</span>glmnetgrid<span class="p">,</span> preProcess<span class="o">=</span>pp<span class="p">)</span>
<span class="kc">...</span>
configs<span class="o">$</span>rf <span class="o">=</span> <span class="kt">list</span><span class="p">(</span>method<span class="o">=</span><span class="s">"rf"</span><span class="p">,</span> tuneGrid<span class="o">=</span>rfgrid<span class="p">,</span> preProcess<span class="o">=</span><span class="kc">NULL</span><span class="p">,</span> ntree<span class="o">=</span><span class="m">2000</span><span class="p">)</span>
</pre></div>
<p>Now that we have a list of training algorithms along with their required parameters, it’s just a matter of looping over it to train the corresponding classifiers:</p>
<div class="highlight"><pre><span></span>arg <span class="o">=</span> <span class="kt">list</span><span class="p">(</span>form <span class="o">=</span> fmla<span class="p">,</span> data <span class="o">=</span> trainset<span class="p">,</span> trControl <span class="o">=</span> cvctrl<span class="p">,</span> metric <span class="o">=</span> scorer<span class="p">)</span>
models <span class="o">=</span> <span class="kt">list</span><span class="p">()</span>
<span class="kp">set.seed</span><span class="p">(</span>rseed<span class="p">)</span>
<span class="kr">for</span><span class="p">(</span>i <span class="kr">in</span> <span class="m">1</span><span class="o">:</span><span class="kp">length</span><span class="p">(</span>configs<span class="p">))</span>
<span class="p">{</span>
models<span class="p">[[</span>i<span class="p">]]</span> <span class="o">=</span> <span class="kp">do.call</span><span class="p">(</span><span class="s">"train.formula"</span><span class="p">,</span> <span class="kt">c</span><span class="p">(</span>arg<span class="p">,</span> configs<span class="p">[[</span>i<span class="p">]]))</span>
<span class="p">}</span>
<span class="kp">names</span><span class="p">(</span>models<span class="p">)</span> <span class="o">=</span> <span class="kp">sapply</span><span class="p">(</span>models<span class="p">,</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> x<span class="o">$</span>method<span class="p">)</span>
</pre></div>
<p>Let’s look at some comparisons of the individual classifiers (Table 1):</p>
<figure >
<div class="figCenter">
<TABLE class="table">
<TR> <TH> </TH> <TH> glmnet </TH> <TH> rf </TH> <TH> gbm </TH> <TH> ada </TH> <TH> svmRadial </TH> <TH> cforest </TH> <TH> blackboost </TH> <TH> earth </TH> <TH> gamboost </TH> <TH> bayesglm </TH> </TR>
<TR> <TD align="right"> train </TD> <TD align="right"> 0.838 </TD> <TD align="right"> 0.870 </TD> <TD align="right"> 0.891 </TD> <TD align="right"> 0.891 </TD> <TD align="right"> 0.850 </TD> <TD align="right"> 0.853 </TD> <TD align="right"> 0.843 </TD> <TD align="right"> 0.838 </TD> <TD align="right"> 0.842 </TD> <TD align="right"> 0.832 </TD> </TR>
<TR> <TD align="right"> val </TD> <TD align="right"> 0.808 </TD> <TD align="right"> 0.797 </TD> <TD align="right"> 0.853 </TD> <TD align="right"> 0.825 </TD> <TD align="right"> 0.825 </TD> <TD align="right"> 0.808 </TD> <TD align="right"> 0.802 </TD> <TD align="right"> 0.797 </TD> <TD align="right"> 0.785 </TD> <TD align="right"> 0.808 </TD> </TR>
</TABLE>
</div>
<figcaption class="capCenter">Table 1: Accuracy of individual classifiers on training and validation set.</figcaption>
</figure>
<p>The ada and gbm classifiers seem to do best in terms of accuracy, on both the training and the validation set, followed by the svm. However, since we have used the area under the <span class="caps">ROC</span> curve as the optimized metric, it might be more informative to drill down into how the classifiers perform in terms of <span class="caps">ROC</span>, specificity and sensitivity.</p>
<figure>
<img src="/images/titanic/Roc.png" alt="Dot plot of ROC metrics for individual classifiers obtained from resamples created during cross-validation."/>
<figcaption class="capCenter">Figure 4: Dot plot of <span class="caps">ROC</span> metrics for individual classifiers estimated from resampled data (10 repeats of 10-fold cross-validation). </figcaption>
</figure>
<p>Figure 4 uses the resample results from cross-validation to display means and 95% confidence intervals for the shown metrics. We note that though gbm and ada had the best accuracy on the validation set, there are other models that seem to find a better trade-off between sensitivity and specificity, at least as estimated on the resampled data. More specifically, gbm, ada and svm show relatively high sensitivity, but low specificity. The generalized linear and additive models (glm, gam) seem to do better. Also, while the svm has high accuracy on the validation set and high sensitivity (recall) in cross-validation, i.e. is good at identifying the survivors, it performs worst amongst all classifiers in correctly identifying those who died (specificity).</p>
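<p>Plots like Figure 4 can be obtained directly from caret’s resampling results, along these lines:</p>
<div class="highlight"><pre># Collect cross-validation resamples from all trained models
resamps = resamples(models)
summary(resamps)
dotplot(resamps, metric=c('ROC', 'Sens', 'Spec'))
</pre></div>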
<p>Finally, let’s create ensembles from the individual models and compare their <span class="caps">ROC</span> performance to the models on the validation set. Two ensembles are created with the help of Zach Mayer’s <a href="https://github.com/zachmayer/caretEnsemble">caretEnsemble</a> package (itself based on a paper by <a href="http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf">Caruana et al. 2004</a>): the first employs a greedy forward selection of individual models to incrementally add those to the ensemble that minimize the ensemble’s chosen error metric. The ensemble’s predictions are then essentially a weighted average of the individual predictions. The second ensemble simply trains a new caret model of choice using the matrix of individual model predictions as features (in this case I use a generalized linear model), also known as a “stack”.</p>
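<p>A minimal sketch using caretEnsemble’s interface at the time (function names may differ in later versions of the package):</p>
<div class="highlight"><pre>library(caretEnsemble)
greedyEns = caretEnsemble(models)              # greedy weighted average of model predictions
linearEns = caretStack(models, method='glm')   # generalized linear model stacked on the predictions
</pre></div>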
<figure>
<div style="display:table">
<TABLE class="table">
<TR> <TH> </TH> <TH> glmnet </TH> <TH> gbm </TH> <TH> LogitBoost </TH> <TH> earth </TH> <TH> blackboost </TH> <TH> bayesglm </TH> <TH> gamboost </TH> <TH> svmRadial </TH> <TH> ada </TH> </TR>
<TR> <TD align="right"> weight </TD> <TD align="right"> 0.387 </TD> <TD align="right"> 0.292 </TD> <TD align="right"> 0.203 </TD> <TD align="right"> 0.080 </TD> <TD align="right"> 0.019 </TD> <TD align="right"> 0.009 </TD> <TD align="right"> 0.007 </TD> <TD align="right"> 0.002 </TD> <TD align="right"> 0.001 </TD> </TR>
</TABLE>
<TABLE class="table">
<TR> <TH> </TH> <TH> earth </TH> <TH> gamboost </TH> <TH> blackboost </TH> <TH> cforest </TH> <TH> bayesglm </TH> <TH> glmnet </TH> <TH> svmRadial </TH> <TH> rf </TH> <TH> greedyEns </TH> <TH> ada </TH> <TH> gbm </TH> <TH> linearEns </TH> </TR>
<TR> <TD align="right"> <span class="caps">ROC</span> </TD> <TD align="right"> 0.836 </TD> <TD align="right"> 0.846 </TD> <TD align="right"> 0.846 </TD> <TD align="right"> 0.858 </TD> <TD align="right"> 0.861 </TD> <TD align="right"> 0.862 </TD> <TD align="right"> 0.862 </TD> <TD align="right"> 0.865 </TD> <TD align="right"> 0.873 </TD> <TD align="right"> 0.876 </TD> <TD align="right"> 0.878 </TD> <TD align="right"> 0.879 </TD> </TR>
</TABLE>
</div>
<figcaption class="capCenter">Table 2: Top: classifier weights determined by the greedy ensemble. Bottom: <span class="caps">ROC</span> measured on validation set for individual and ensemble classifiers.</figcaption>
</figure>
<p>On the unseen validation set we notice once again that ada and gbm perform best amongst the individual classifiers, not only in terms of accuracy as demonstrated above, but also in terms of the area under the <span class="caps">ROC</span> curve. Both, however, are outperformed slightly by the stacked ensemble (linearEns). </p>
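<p>Validation-set <span class="caps">AUC</span>s like these can be computed, for example, with the pROC package (a sketch; the name of the validation set and the positive class level are assumptions):</p>
<div class="highlight"><pre>library(pROC)
probs = predict(models$gbm, newdata=valset, type='prob')
auc(roc(valset$survived, probs[, 'yes']))
</pre></div>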
<p>Finally, let’s compare the performances on the validation set to those obtained from cross-validated training on the whole training set. Table 3 summarises corresponding metrics for all classifiers:</p>
<figure>
<div>
<TABLE class="table">
<TR> <TH> </TH> <TH> glmnet </TH> <TH> rf </TH> <TH> gbm </TH> <TH> ada </TH> <TH> svmRadial </TH> <TH> cforest </TH> <TH> blackboost </TH> <TH> earth </TH> <TH> gamboost </TH> <TH> bayesglm </TH> <TH> linearEns </TH> <TH> greedyEns </TH> </TR>
<TR> <TD align="right"> <span class="caps">ROC</span> </TD> <TD align="right"> 0.871 </TD> <TD align="right"> 0.875 </TD> <TD align="right"> 0.877 </TD> <TD align="right"> 0.875 </TD> <TD align="right"> 0.864 </TD> <TD align="right"> 0.871 </TD> <TD align="right"> 0.866 </TD> <TD align="right"> 0.869 </TD> <TD align="right"> 0.873 </TD> <TD align="right"> 0.870 </TD> <TD align="right"> 0.880 </TD> <TD align="right"> 0.878 </TD> </TR>
<TR> <TD align="right"> Sens </TD> <TD align="right"> 0.879 </TD> <TD align="right"> 0.910 </TD> <TD align="right"> 0.892 </TD> <TD align="right"> 0.897 </TD> <TD align="right"> 0.923 </TD> <TD align="right"> 0.908 </TD> <TD align="right"> 0.890 </TD> <TD align="right"> 0.883 </TD> <TD align="right"> 0.876 </TD> <TD align="right"> 0.871 </TD> <TD align="right"> 0.894 </TD> <TD align="right"> </TD> </TR>
<TR> <TD align="right"> Spec </TD> <TD align="right"> 0.750 </TD> <TD align="right"> 0.699 </TD> <TD align="right"> 0.743 </TD> <TD align="right"> 0.739 </TD> <TD align="right"> 0.678 </TD> <TD align="right"> 0.701 </TD> <TD align="right"> 0.723 </TD> <TD align="right"> 0.721 </TD> <TD align="right"> 0.740 </TD> <TD align="right"> 0.752 </TD> <TD align="right"> 0.733 </TD> <TD align="right"> </TD> </TR>
</TABLE>
</div>
<figcaption class="capCenter">Table 3: Area under the <span class="caps">ROC</span> curve, sensitivity and specificity of all models estimated in 10 repeats of 10-fold cross-validation after training on the whole data set (sens and spec are not calculated automatically by the greedy ensemble) . </figcaption>
</figure>
<p>The results seem to confirm our finding from predictions on the validation set. After training on the whole data set ada and gbm exhibit the best cross-validated <span class="caps">ROC</span> measures, but the ensemble classifiers do even better.</p>
<h3>Conclusions</h3>
<p>Based on an assessment of the area under the <span class="caps">ROC</span> curve, on both a validation subset of the data as well as repeated cross-validation on the whole set, boosted classification trees (<a href="http://dept.stat.lsa.umich.edu/~gmichail/ada_final.pdf">ada</a> and <a href="http://gradientboostedmodels.googlecode.com/git/gbm/inst/doc/gbm.pdf">gbm</a>) seem to perform best amongst single classifiers on the Titanic data set. Ensembles built from a range of different classifiers, in particular in the form of a stack, lead to a small but seemingly consistent improvement over the performance of individual classifiers. I therefore chose to submit the predictions of the generalized linear stack. Interestingly, this did not lead to my best submission score. The ensemble has an accuracy of 0.78947 on the public leaderboard, i.e. on the part of the test set used to score different submissions. In comparison, I also trained a single forest of conditional inference trees using the family id information as an additional predictor, which obtained an accuracy score of 0.81818 and ended up much higher on the leaderboard. Now, kaggle leaderboard position in itself <a href="http://blog.kaggle.com/2012/07/06/the-dangers-of-overfitting-psychopathy-post-mortem/">doesn’t always correlate well</a> with final performance on the whole test set, essentially because of overfitting to the leaderboard when many submissions are made and models are selected on the basis of achieved position. Nevertheless, before the end of the competition it might be worth comparing the above classifiers and ensembles with different formulas (combinations of predictors, including family identifiers). Another option is to perform the full training again with accuracy rather than <span class="caps">AUC</span> as the optimized metric, since accuracy is the metric kaggle uses to assess predictions in this competition. However, as many kagglers involved in past competitions have commented, it is probably better to rely on one’s own cross-validation scores, rather than potentially overinflated leaderboard scores, to predict a model’s final success.</p>
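<p>For reference, a submission file can be written along these lines (the PassengerId/Survived column names are kaggle’s required format; the id vector and the positive class level are assumptions):</p>
<div class="highlight"><pre>preds = predict(models$gbm, newdata=test)  # or the chosen ensemble
submission = data.frame(PassengerId = testids, Survived = as.numeric(preds == 'yes'))
write.csv(submission, file='submission.csv', row.names=FALSE)
</pre></div>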
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</div><!-- /.entry-content -->
<footer class="post-info">
Published on <span class="published">October 23, 2014</span><br>
Written by <span class="author">Thomas Buhrmann</span><br>
Posted in <span class="label label-default"><a href="https://buhrmann.github.io/category/reports.html">Reports</a></span>
~ Tagged
<span class="label label-default"><a href="https://buhrmann.github.io/tag/r.html">R</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/kaggle.html">kaggle</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/titanic.html">titanic</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/report.html">report</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/classification.html">classification</a></span>
</footer><!-- /.post-info -->
</section>
<div class="blogItem">
<h2>Comments</h2>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'datawerk';
var disqus_title = 'Titanic survival prediction';
var disqus_identifier = "titanic-survival.html";
(function() {
var dsq = document.createElement('script');
dsq.type = 'text/javascript';
dsq.async = true;
//dsq.src = 'http://' + disqus_shortname + '.disqus.com/embed.js';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] ||
document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>
Please enable JavaScript to view the
<a href="http://disqus.com/?ref_noscript=datawerk">
comments powered by Disqus.
</a>
</noscript>
</div>
</div>
</div><!-- row-->
</div><!-- container -->
<!-- <div class="push"></div> -->
</div> <!-- wrap -->
<div class="container-fluid aw-footer">
<div class="row-centered">
<div class="col-sm-3 col-sm-offset-1">
<h4>Author</h4>
<ul class="list-unstyled my-list-style">
<li><a href="http://www.ias-research.net/people/thomas-buhrmann/">Academic Home</a></li>
<li><a href="http://github.com/synergenz">Github</a></li>
<li><a href="http://www.linkedin.com/in/thomasbuhrmann">LinkedIn</a></li>
<li><a href="https://secure.flickr.com/photos/syngnz/">Flickr</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Categories</h4>
<ul class="list-unstyled my-list-style">
<li><a href="https://buhrmann.github.io/category/academia.html">Academia (4)</a></li>
<li><a href="https://buhrmann.github.io/category/data-apps.html">Data Apps (2)</a></li>
<li><a href="https://buhrmann.github.io/category/data-posts.html">Data Posts (9)</a></li>
<li><a href="https://buhrmann.github.io/category/reports.html">Reports (3)</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Tags</h4>
<ul class="tagcloud">
<li class="tag-4"><a href="https://buhrmann.github.io/tag/shiny.html">shiny</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/networks.html">networks</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sql.html">sql</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/hadoop.html">hadoop</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/mongodb.html">mongodb</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/visualization.html">visualization</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/smcs.html">smcs</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sklearn.html">sklearn</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/tf-idf.html">tf-idf</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/r.html">R</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/sna.html">sna</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/nosql.html">nosql</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/svm.html">svm</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/java.html">java</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/hive.html">hive</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/scraping.html">scraping</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/lda.html">lda</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/kaggle.html">kaggle</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/exploratory.html">exploratory</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/titanic.html">titanic</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/classification.html">classification</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/python.html">python</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/random-forest.html">random forest</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/text.html">text</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/big-data.html">big data</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/report.html">report</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/regression.html">regression</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/graph.html">graph</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/d3.html">d3</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/neo4j.html">neo4j</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/flume.html">flume</a></li>
</ul>
</div>
</div>
</div>
<!-- JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function($)
{
$("div.collapseheader").click(function () {
$header = $(this).children("span").first();
$codearea = $(this).children(".input_area");
$codearea.slideToggle(500, function () {
$header.text(function () {
return $codearea.is(":visible") ? "Collapse Code" : "Expand Code";
});
});
});
// $(window).resize(function(){
// var footerHeight = $('.aw-footer').outerHeight();
// var stickFooterPush = $('.push').height(footerHeight);
// $('.wrap').css({'marginBottom':'-' + footerHeight + 'px'});
// });
// $(window).resize();
// $(window).bind("load resize", function() {
// var footerHeight = 0,
// footerTop = 0,
// $footer = $(".aw-footer");
// positionFooter();
// function positionFooter() {
// footerHeight = $footer.height();
// footerTop = ($(window).scrollTop()+$(window).height()-footerHeight)+"px";
// console.log(footerHeight, footerTop);
// console.log($(document.body).height()+footerHeight, $(window).height());
// if ( ($(document.body).height()+footerHeight) < $(window).height()) {
// $footer.css({ position: "absolute" }).css({ top: footerTop });
// console.log("Positioning absolute");
// }
// else {
// $footer.css({ position: "static" });
// console.log("Positioning static");
// }
// }
// $(window).scroll(positionFooter).resize(positionFooter);
// });
});
</script>
</body>
</html>