Regression with random forests plot predicted vs observed #456

fconstancias · 2025-01-22T14:13:30Z

Many thanks for developing this package! I would like to predict gingival inflammation marker together with salivary microbiota. I have been using trans_classifier() which is super cool. https://chiliubio.github.io/microeco_tutorial/model-based-class.html#trans_classifier-class

I have first binned my inflammation variable into categories (High, medium, low) and it seems it can do a good job.
Is there any easy way to extract which taxa are associated with which groups? Is it possible to add additional metadata for the model, in addition to the taxa/OTU (e.g., blood marker, age, ...) ?

I would like to avoid using the classification and directly try to predict the inflammation variable? I can perform logistic_regression with the rf method of trans_classifier() and was able to obtain IncNodePurity scores and %IncMSE for the relevant taxa.

Is there a way to plot the predicted vs observed values?
Is it possible to also include additional non microbiota data into the model (e.g., blood marker, age, ...) ?
How can I assess if the variable (taxa) selected by the model are positively or negatively associated with the variable?

Many thanks!

Flo

The text was updated successfully, but these errors were encountered:

ChiLiubio · 2025-01-23T07:36:44Z

Hi @fconstancias
Thank you for the appreciation!
For "Is there any easy way to extract which taxa are associated with which groups", I think an easy way is to use trans_abund or trans_diff class to show the abundance.

# first prepare  the input data
library(magrittr)
library(microeco)
tmp <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S)
tmp$tidy_dataset()
tmp$sample_table <- data.frame(tmp$sample_table, env_data_16S[rownames(tmp$sample_table), ])

# classification
t1 <- trans_classifier$new(dataset = tmp, y.response = "Saline", x.predictors = "Genus")
t1$cal_preProcess(method = c("center", "scale", "nzv"))
t1$cal_feature_sel(boruta.maxRuns = 300, boruta.pValue = 0.01)
t1$cal_split(prop.train = 3/4)
t1$set_trainControl()
t1$cal_train()
t1$cal_predict()
t1$cal_feature_imp(rf_feature_sig = TRUE)
t1$plot_feature_imp(show_sig_group = TRUE, coord_flip = FALSE, width = 0.6, add_sig = TRUE, group_aggre = FALSE, group_two_sep = FALSE)

# select the taxonomic abundance
tmp2 <- clone(tmp)
tmp2$taxa_abund$Genus %<>% .[rownames(t1$res_feature_imp)[1:20], ]
# trans_abund
t2 <- trans_abund$new(dataset = tmp2, taxrank = "Genus", ntaxa = 15)
t2$plot_box(group = "Saline", xtext_angle = 30)
# or trans_diff
t2 <- trans_diff$new(dataset = tmp2, method = "rf", group = "Saline", taxa_level = "Genus")
t2$plot_diff_abund(plot_type = "barerrorbar", y_start = -0.8, errorbar_addpoint = FALSE, bar_alpha = 0.6)

If the response variable is numeric class, the model will automatically use regression instead of classification.
I will consider adding a way to compare the predicted and observed values. Here is the step to merge them into a table.

# regression  pH
t1 <- trans_classifier$new(dataset = tmp, y.response = "pH", x.predictors = "Genus")
t1$cal_split(prop.train = 3/4)
t1$set_trainControl()
t1$cal_train()
t1$res_train
t1$cal_predict()
t1$cal_feature_imp()

predict_value <- t1$res_predict
obs_value <- t1$data_test[names(predict_value), 1]
predict_obs <- data.frame(predict = predict_value, observe = obs_value)
predict_obs

To include additional non microbiota data into the model, the best way is to directly adding them into the prepared data (i.e. data_feature) before the cal_split function step.

t1 <- trans_classifier$new(dataset = tmp, y.response = "pH", x.predictors = "Genus")
View(t1$data_feature)
# please merge them into t1$data_feature

For 'assess if the variable (taxa) selected by the model are positively or negatively associated with the variable?', I think it will be easy to use trans_env class to perform a correlation.

# sort and select some taxa from last pH part
top_20 <- rownames(t1$res_feature_imp[order(t1$res_feature_imp[, 1], decreasing = TRUE), , drop = FALSE])[1:20]

t2 <- trans_env$new(dataset = tmp, add_data = env_data_16S[, 4:11])
t2$cal_cor(use_data = "other", other_taxa = gsub("`", "", top_20))
# use t2$res_cor or show them
t2$plot_cor()

Please feel free to tell me if a step is not clear.

Best,
Chi

fconstancias · 2025-01-24T14:08:32Z

That's excellent. Thanks!

fconstancias · 2025-01-28T10:00:43Z

sort and select some taxa from last pH part

top_20 <- rownames(t1$res_feature_imp[order(t1$res_feature_imp[, 1], decreasing = TRUE), , drop = FALSE])[1:20]

t2 <- trans_env$new(dataset = tmp, add_data = env_data_16S[, 4:11])
t2$cal_cor(use_data = "other", other_taxa = gsub("`", "", top_20))

use t2$res_cor or show them

t2$plot_cor()

I think partial dependence plots e.g., using the pdp package could be a more direct way to extract this information from the random forest model.

also see:
https://www.rdocumentation.org/packages/pdp/versions/0.8.1/topics/partial
https://microbiome.github.io/course_2022_FindingPheno/supervised-learning.html

ChiLiubio · 2025-01-29T05:51:27Z

Hi @fconstancias
Thank you very much for your sharing. Indeed, the method using the pdp package for partial dependence plot is more straightforward and detailed. I will consider how to better integrate it into the pipeline. It seems that the two approaches are not conflicting, because analyzing the correlations using all the features directly can also reveal the patterns of changes with just one time. Then, applying the aforementioned method to some selected features of interest in-depth is also feasible.

Best,
Chi

ChiLiubio added the good first issue Good for newcomers label Jan 23, 2025

fconstancias closed this as completed Jan 24, 2025

fconstancias reopened this Jan 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression with random forests plot predicted vs observed #456

Regression with random forests plot predicted vs observed #456

fconstancias commented Jan 22, 2025

ChiLiubio commented Jan 23, 2025

fconstancias commented Jan 24, 2025

fconstancias commented Jan 28, 2025

sort and select some taxa from last pH part

use t2$res_cor or show them

ChiLiubio commented Jan 29, 2025

Regression with random forests plot predicted vs observed #456

Regression with random forests plot predicted vs observed #456

Comments

fconstancias commented Jan 22, 2025

ChiLiubio commented Jan 23, 2025

fconstancias commented Jan 24, 2025

fconstancias commented Jan 28, 2025

sort and select some taxa from last pH part

use t2$res_cor or show them

ChiLiubio commented Jan 29, 2025