tl;dr

I noticed some unexpected behavior of the permutation importance when a binary response variable is modeled with a regression random forest. Variables that were highly important according to other importance metrics (e.g., mean minimum tree depth, large differences in predicted value across a gradient of the variable, number of times used as a root, the cross-validated importance from `spatialRF::rf_importance()`) showed up as strongly negative in the standard `$variable.importance` score.
Some details
I built some {ranger} models directly to try to suss this out, and I think I've identified that the problem arises when a binary response is treated as a regression problem.
My (naive) understanding is that the `class.weights` argument of `ranger()` is the best way to account for class imbalance given a binary (or other categorical) response. I believe that the {spatialRF} machinery (e.g., using `spatialRF::case_weights()`) passes that information along to `case.weights` instead of `class.weights`.
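For intuition about why those two arguments are not interchangeable, here is a minimal sketch of the analogous distinction in scikit-learn (an analogy only, not the {ranger} internals: in ranger, `case.weights` weight the sampling of observations for each tree, while `class.weights` weight the outcome classes in the splitting rule):

```python
# Sketch: per-class weighting vs. per-observation weighting.
# scikit-learn analogy -- class_weight is one weight per outcome class,
# sample_weight is one weight per row of the training data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 1.0).astype(int)  # imbalanced 0/1 response (~16% ones)

# Per-class weighting: upweight the rare class in the fitting criterion.
clf_class = RandomForestClassifier(
    n_estimators=50, class_weight={0: 1.0, 1: 5.0}, random_state=1
).fit(X, y)

# Per-observation weighting: expand the class weights to one weight per row.
w = np.where(y == 1, 5.0, 1.0)
clf_case = RandomForestClassifier(n_estimators=50, random_state=1).fit(
    X, y, sample_weight=w
)
```

Both approaches upweight the rare class, but they enter the algorithm at different points, so they need not produce the same forest or the same importance scores.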
I am having a hard time understanding exactly how `case.weights` and `class.weights` are used inside `ranger()`. However, when I build a {ranger} model directly with a binary response and treat it as a classification problem (rather than regression), the permutation importance tracks much better with the other measures of variable importance I listed above. That makes me suspect this is a fundamental issue that comes up when (inappropriately??) treating a binary response as a regression problem and using `case.weights` to try to account for class imbalance.
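As background on why permutation importance can go negative at all: it is the drop in a model's score after randomly shuffling one predictor, so for a predictor the model does not really use, the shuffled score can come out better by chance, yielding a negative value. A minimal illustration (a scikit-learn sketch, not the {ranger} pipeline) of a regression forest fit to a 0/1 response:

```python
# Permutation importance of a regression forest fit to a binary (0/1) response.
# Column 0 carries the signal; column 1 is pure noise, so its importance
# hovers near zero and can dip below it.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(float)  # binary response treated as numeric

reg = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
imp = permutation_importance(reg, X, y, n_repeats=20, random_state=1)
# imp.importances_mean[0] (signal) is much larger than
# imp.importances_mean[1] (noise).
```

Small negative values for weak predictors are expected noise; *strongly* negative values for predictors that other metrics rank as important, as described above, are the surprising part.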
Anyway, I'm still trying to read more to better understand the implications for building the model but I thought I'd flag it for now!
[edit: I'm pasting in some of my investigation code in case that's useful...]
library(spatialRF)
library(ranger)

plant_richness_df$response_binomial <- ifelse(
  plant_richness_df$richness_species_vascular > 5000,
  1,
  0
)

case.wgts <- spatialRF::case_weights(
  data = plant_richness_df,
  dependent.variable.name = "response_binomial"
)

predictor.variable.names <- colnames(plant_richness_df)[5:21]
# Regression problem with binary response and using case.weights
fm1 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = plant_richness_df[["response_binomial"]],
  data = plant_richness_df,
  classification = FALSE,
  probability = FALSE,
  case.weights = case.wgts,
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm1$variable.importance))

# Classification problem with a factor as response variable, and using case.weights
fm2 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  data = plant_richness_df,
  classification = TRUE,
  probability = FALSE,
  case.weights = case.wgts,
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm2$variable.importance))

# Probability estimation problem with a factor as response variable, and using case.weights
fm3 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  data = plant_richness_df,
  classification = FALSE,
  probability = TRUE,
  case.weights = case.wgts,
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm3$variable.importance))

# Probability estimation with a factor as response variable, and using class.weights
fm4 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  data = plant_richness_df,
  classification = FALSE,
  probability = TRUE,
  class.weights = unique(case.wgts),
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm4$variable.importance))

# Probability estimation with a factor as response variable, and using both class.weights and case.weights
fm5 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  data = plant_richness_df,
  classification = FALSE,
  probability = TRUE,
  case.weights = case.wgts,
  class.weights = unique(case.wgts),
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm5$variable.importance))
# spatialRF
fm6 <- spatialRF::rf(
  data = plant_richness_df,
  dependent.variable.name = "response_binomial",
  predictor.variable.names = predictor.variable.names,
  seed = 1
)
as.data.frame(sort(fm6$variable.importance))  # the {spatialRF} version creates the same model as fm1
as.data.frame(sort(fm1$variable.importance))