there's a bug somewhere that's killing the accuracy.
scores on homesite and telstra are lagging dramatically behind what others are getting with xgboost alone.
my initial reaction was that we needed to tweak the parameters we're tuning for each algorithm, but i don't think that alone would explain the huge gap between others' scores and mine.
i have a feeling it's something in data-formatter.
i am introducing overfitting at the moment by calculating summary statistics on the entire dataset, rather than on each fold for cross-validation.
this is particularly true for the groupBy columns. i think the missing-value imputation script is probably alright, but groupBy is probably introducing a lot of overfitting.
there could also just be a bug in data-formatter somewhere. in particular, check that train and test have the same columns in the same order. they should, but since the test dataset may or may not include the output column or any ignored columns, we might be off by one.
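a quick sanity check along these lines might look like the sketch below — the frames and the "label" output column are hypothetical stand-ins, and the real check would run against data-formatter's actual output:

```python
import pandas as pd

def check_alignment(train, test, output_col=None):
    # the output column may legitimately be absent from test, so skip it
    train_cols = [c for c in train.columns if c != output_col]
    test_cols = list(test.columns)
    if train_cols != test_cols:
        missing = set(train_cols) - set(test_cols)
        extra = set(test_cols) - set(train_cols)
        raise ValueError(
            f"train/test columns disagree: missing from test={missing}, "
            f"extra in test={extra} (both empty means same columns, wrong order)")

# hypothetical frames; "label" stands in for whatever the output column is
train = pd.DataFrame({"x1": [1], "x2": [2], "label": [0]})
test = pd.DataFrame({"x1": [1], "x2": [2]})
check_alignment(train, test, output_col="label")  # passes silently
```

comparing ordered lists rather than sets is deliberate: it catches the off-by-one shift as well as outright missing columns.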
if possible, look into calculating stats on each cv fold individually. this would apply just to groupBy i guess.
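a minimal sketch of what per-fold groupBy stats could look like — the "category"/"target" columns and the mean aggregation are illustrative, not data-formatter's actual behavior. the point is that the group means are computed on the training fold only and then mapped onto the held-out fold:

```python
import pandas as pd
from sklearn.model_selection import KFold

def add_groupby_stats(train, test, group_col, target_col):
    # group means computed on the training fold only, then mapped onto both
    means = train.groupby(group_col)[target_col].mean()
    train = train.assign(group_mean=train[group_col].map(means))
    # groups unseen in training fall back to the global training mean
    test = test.assign(group_mean=test[group_col].map(means).fillna(means.mean()))
    return train, test

# hypothetical data; "category" / "target" stand in for real columns
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "a", "b"],
    "target":   [1.0, 2.0, 3.0, 5.0, 3.0, 4.0],
})

for tr_idx, te_idx in KFold(n_splits=3).split(df):
    tr, te = add_groupby_stats(df.iloc[tr_idx], df.iloc[te_idx],
                               "category", "target")
```

done this way, no fold's held-out rows ever contribute to the statistics used to encode them, which is exactly the leakage the current whole-dataset computation allows.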
steps:
1. manually run xgboost on the raw dataset.
2. manually run xgboost on the results from data-formatter. this should help us narrow down whether the error is coming from data-formatter or xgboost.
3. run again with groupBy removed.
yeah, assuming it's something in data-formatter, just follow the standard debugging process: comment out the parts we think might be introducing the error, run it, and see if it does any better.