-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
type_density()
using the new type_data
/type_draw
system
#243
Comments
Nice. Was just thinking about this! The existing density logic is quite complicated, because we make special allowances for |
Yes, I think that small amount of breakage is a good idea while the package is still young. We want to start with strong standards. |
I've run up against this issue a few times lately, so will take a crack at it before submitting v0.3.0 to CRAN. (If it turns out to be too complicated for solving in a reasonable time, I'll just submit the new release without it. But hopefully there's a simple solution.) |
@zeileis a question for you (and picking up part of a discussion originally started in #271): Q: How do you think the joint bandwidth should be calculated for grouped density plots? To my mind there are four options:
Here's a quick manual implementation of each of these four options, using an pals = palette.colors(4, "R4")
# Plot area
plot(density(mtcars$wt), ylim = c(0, 1.1), type = "n")
grid()
# Option A
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x), col = pals[1]))
# Option B
bb = density(mtcars$wt)$bw
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = bb), col = pals[2]))
# Option C
cc = mean(sapply(split(mtcars$wt, mtcars$am), \(x) density(x)$bw))
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = cc), col = pals[3], lty = 2))
# Option D
dd = weighted.mean(
sapply(split(mtcars$wt, mtcars$am), \(x) density(x)$bw),
sapply(split(mtcars$wt, mtcars$am), length)
)
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = dd), col = pals[4], lty = 4))
# Legend
legend(
"topleft",
legend = c("A) individual", "B) full", "C) mean", "D) weighted.mean"),
lty = c(1,1,4,2),
col = pals
) Right now, we are using (C), but I'm not sure that it is optimal. FWIW, this is the equivalent ggplot2 result. It looks like they're just taking the simplest approach (A) and not using any kind of joint bandwidth. library(ggplot2)
theme_set(theme_linedraw())
ggplot(mtcars, aes(wt, group = am)) +
geom_density() +
scale_y_continuous(breaks = seq(0, 1, by = 0.2), limits = c(0, 1.1)) +
theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(linetype = "32")) |
Thanks for the write-up and apologies for the 12" version of Simon & Garfunkel playing over the last weeks. (You know...The Sound of Silence...) I took some time off over the holidays. Coming back to this issue now I played around with your examples above as well as some artificial data I simulated from various normal distributions with different means and/or variances. Based on this my impression is that only A or D should be used. B is problematic in settings where I have subgroups with similar variances but (very) different means. Computing the bandwidth from the joint sample will lead to inflated values, e.g.,
This will dampen the density curve as you can also see in the red line for your C is clearly inferior to D when the subgroups have very different sizes because small groups might change the bandwidth estimates too much. For A vs. D I think that A is probably the safer default. In cases where the bandwidth from the different groups are roughly similar, strategy D will stabilize the bandwidth estimates. But in cases where the bandwidth is very different (e.g., because variances are very different), D might be a poor compromise and using separate bandwidths from A would be better. So, in short, my current thinking is that A should be the default but it would be nice if D were easy to do as well. |
@zeileis love the thoroughness. Very, very good stuff. |
Thinking some more about this: B may be a useful strategy when the groups overlap a lot, i.e., all have similar means and variances. So adding this as another option might also be nice to have sometimes... |
For posterity, the dev version of tinyplot now defaults to (A), but makes options (B) and (D) available to users via the library(tinyplot)
tinytheme("basic")
tinyplot(~wt | factor(am), data = mtcars, type = "density") # "none" (default)
tinyplot_add(type = type_density(joint.bw = "full"), lty = 2) # full dataset
tinyplot_add(type = type_density(joint.bw = "owm"), lty = 3) # obs-weighted mean
legend("topright", c("(A) None", "(B) Full", "(D) OWM"), lty = 1:3, title = "Joint BW") Created on 2025-01-08 with reprex v2.1.1 |
Density plots are coded differently than all other types.
I won't have time to convert them now, but opening this issue because it would eventually be a good idea to bring them in conformity.
The text was updated successfully, but these errors were encountered: