Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression with random forests plot predicted vs observed #456

Open
fconstancias opened this issue Jan 22, 2025 · 4 comments
Open

Regression with random forests plot predicted vs observed #456

fconstancias opened this issue Jan 22, 2025 · 4 comments
Labels
good first issue Good for newcomers

Comments

@fconstancias
Copy link

Hi @ChiLiubio,

Many thanks for developing this package! I would like to predict gingival inflammation marker together with salivary microbiota. I have been using trans_classifier() which is super cool. https://chiliubio.github.io/microeco_tutorial/model-based-class.html#trans_classifier-class

I have first binned my inflammation variable into categories (High, medium, low) and it seems it can do a good job.
Is there any easy way to extract which taxa are associated with which groups? Is it possible to add additional metadata for the model, in addition to the taxa/OTU (e.g., blood marker, age, ...) ?

I would like to avoid using the classification and directly try to predict the inflammation variable? I can perform logistic_regression with the rf method of trans_classifier() and was able to obtain IncNodePurity scores and %IncMSE for the relevant taxa.

  • Is there a way to plot the predicted vs observed values?
  • Is it possible to also include additional non microbiota data into the model (e.g., blood marker, age, ...) ?
  • How can I assess if the variable (taxa) selected by the model are positively or negatively associated with the variable?

Many thanks!

Flo

@ChiLiubio
Copy link
Owner

Hi @fconstancias
Thank you for the appreciation!
For "Is there any easy way to extract which taxa are associated with which groups", I think an easy way is to use trans_abund or trans_diff class to show the abundance.

# first prepare  the input data
library(magrittr)
library(microeco)
tmp <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S)
tmp$tidy_dataset()
tmp$sample_table <- data.frame(tmp$sample_table, env_data_16S[rownames(tmp$sample_table), ])

# classification
t1 <- trans_classifier$new(dataset = tmp, y.response = "Saline", x.predictors = "Genus")
t1$cal_preProcess(method = c("center", "scale", "nzv"))
t1$cal_feature_sel(boruta.maxRuns = 300, boruta.pValue = 0.01)
t1$cal_split(prop.train = 3/4)
t1$set_trainControl()
t1$cal_train()
t1$cal_predict()
t1$cal_feature_imp(rf_feature_sig = TRUE)
t1$plot_feature_imp(show_sig_group = TRUE, coord_flip = FALSE, width = 0.6, add_sig = TRUE, group_aggre = FALSE, group_two_sep = FALSE)

# select the taxonomic abundance
tmp2 <- clone(tmp)
tmp2$taxa_abund$Genus %<>% .[rownames(t1$res_feature_imp)[1:20], ]
# trans_abund
t2 <- trans_abund$new(dataset = tmp2, taxrank = "Genus", ntaxa = 15)
t2$plot_box(group = "Saline", xtext_angle = 30)
# or trans_diff
t2 <- trans_diff$new(dataset = tmp2, method = "rf", group = "Saline", taxa_level = "Genus")
t2$plot_diff_abund(plot_type = "barerrorbar", y_start = -0.8, errorbar_addpoint = FALSE, bar_alpha = 0.6)

If the response variable is numeric class, the model will automatically use regression instead of classification.
I will consider adding a way to compare the predicted and observed values. Here is the step to merge them into a table.

# regression  pH
t1 <- trans_classifier$new(dataset = tmp, y.response = "pH", x.predictors = "Genus")
t1$cal_split(prop.train = 3/4)
t1$set_trainControl()
t1$cal_train()
t1$res_train
t1$cal_predict()
t1$cal_feature_imp()

predict_value <- t1$res_predict
obs_value <- t1$data_test[names(predict_value), 1]
predict_obs <- data.frame(predict = predict_value, observe = obs_value)
predict_obs

To include additional non microbiota data into the model, the best way is to directly adding them into the prepared data (i.e. data_feature) before the cal_split function step.

t1 <- trans_classifier$new(dataset = tmp, y.response = "pH", x.predictors = "Genus")
View(t1$data_feature)
# please merge them into t1$data_feature

For 'assess if the variable (taxa) selected by the model are positively or negatively associated with the variable?', I think it will be easy to use trans_env class to perform a correlation.

# sort and select some taxa from last pH part
top_20 <- rownames(t1$res_feature_imp[order(t1$res_feature_imp[, 1], decreasing = TRUE), , drop = FALSE])[1:20]

t2 <- trans_env$new(dataset = tmp, add_data = env_data_16S[, 4:11])
t2$cal_cor(use_data = "other", other_taxa = gsub("`", "", top_20))
# use t2$res_cor or show them
t2$plot_cor()

Please feel free to tell me if a step is not clear.

Best,
Chi

@ChiLiubio ChiLiubio added the good first issue Good for newcomers label Jan 23, 2025
@fconstancias
Copy link
Author

That's excellent. Thanks!

@fconstancias
Copy link
Author

sort and select some taxa from last pH part

top_20 <- rownames(t1$res_feature_imp[order(t1$res_feature_imp[, 1], decreasing = TRUE), , drop = FALSE])[1:20]

t2 <- trans_env$new(dataset = tmp, add_data = env_data_16S[, 4:11])
t2$cal_cor(use_data = "other", other_taxa = gsub("`", "", top_20))

use t2$res_cor or show them

t2$plot_cor()

I think partial dependence plots e.g., using the pdp package could be a more direct way to extract this information from the random forest model.

also see:
https://www.rdocumentation.org/packages/pdp/versions/0.8.1/topics/partial
https://microbiome.github.io/course_2022_FindingPheno/supervised-learning.html

@fconstancias fconstancias reopened this Jan 28, 2025
@ChiLiubio
Copy link
Owner

Hi @fconstancias
Thank you very much for your sharing. Indeed, the method using the pdp package for partial dependence plot is more straightforward and detailed. I will consider how to better integrate it into the pipeline. It seems that the two approaches are not conflicting, because analyzing the correlations using all the features directly can also reveal the patterns of changes with just one time. Then, applying the aforementioned method to some selected features of interest in-depth is also feasible.

Best,
Chi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants