there's a bug somewhere that's killing the accuracy.
scores on homesite and telstra are lagging dramatically behind what others are getting with xgboost alone.
my initial reaction was that we needed to tweak the parameters we're tuning for each algorithm, but i don't think that alone would explain the huge gap between others' scores and mine.
i have a feeling it's something in data-formatter.
i am introducing overfitting at the moment by calculating summary statistics on the entire dataset, rather than on each fold for cross-validation.
this is particularly true for the groupBy columns. i think the missing-value imputation script is probably alright, but groupBy is probably introducing a lot of overfitting.
there could also just be a bug in data-formatter somewhere. in particular, check that train and test have the same columns in the same order. they should, but since the test dataset may or may not include the output column or any ignored columns, we might be off by one.
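a quick sanity check along these lines might look like the sketch below — the frames and the "label" output column are hypothetical stand-ins, and the real check would run against data-formatter's actual output:

```python
import pandas as pd

def check_alignment(train, test, output_col=None):
    # the output column may legitimately be absent from test, so skip it
    train_cols = [c for c in train.columns if c != output_col]
    test_cols = list(test.columns)
    if train_cols != test_cols:
        missing = set(train_cols) - set(test_cols)
        extra = set(test_cols) - set(train_cols)
        raise ValueError(
            f"train/test columns disagree: missing from test={missing}, "
            f"extra in test={extra} (both empty means same columns, wrong order)")

# hypothetical frames; "label" stands in for whatever the output column is
train = pd.DataFrame({"x1": [1], "x2": [2], "label": [0]})
test = pd.DataFrame({"x1": [1], "x2": [2]})
check_alignment(train, test, output_col="label")  # passes silently
```

comparing ordered lists rather than sets is deliberate: it catches the off-by-one shift as well as outright missing columns.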
if possible, look into calculating stats on each cv fold individually. this would apply just to groupBy i guess.
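a minimal sketch of what per-fold groupBy stats could look like — the "category"/"target" columns and the mean aggregation are illustrative, not data-formatter's actual behavior. the point is that the group means are computed on the training fold only and then mapped onto the held-out fold:

```python
import pandas as pd
from sklearn.model_selection import KFold

def add_groupby_stats(train, test, group_col, target_col):
    # group means computed on the training fold only, then mapped onto both
    means = train.groupby(group_col)[target_col].mean()
    train = train.assign(group_mean=train[group_col].map(means))
    # groups unseen in training fall back to the global training mean
    test = test.assign(group_mean=test[group_col].map(means).fillna(means.mean()))
    return train, test

# hypothetical data; "category" / "target" stand in for real columns
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "a", "b"],
    "target":   [1.0, 2.0, 3.0, 5.0, 3.0, 4.0],
})

for tr_idx, te_idx in KFold(n_splits=3).split(df):
    tr, te = add_groupby_stats(df.iloc[tr_idx], df.iloc[te_idx],
                               "category", "target")
```

done this way, no fold's held-out rows ever contribute to the statistics used to encode them, which is exactly the leakage the current whole-dataset computation allows.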
steps:
1. manually run xgboost on the raw dataset.
2. manually run xgboost on the results from data-formatter. this should help us narrow down whether the error is coming from data-formatter or xgboost.
3. run again with groupBy removed.
yeah, assuming it's something in data-formatter, just follow the standard debugging process: comment out the parts we think might be introducing the error, run it, and see if it does any better.