Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

type_density() using the new type_data/type_draw system #243

Closed
vincentarelbundock opened this issue Nov 7, 2024 · 9 comments · Fixed by #284
Closed

type_density() using the new type_data/type_draw system #243

vincentarelbundock opened this issue Nov 7, 2024 · 9 comments · Fixed by #284

Comments

@vincentarelbundock
Copy link
Collaborator

Density plots are coded differently than all other types.

I won't have time to convert them now, but opening this issue because it would eventually be a good idea to bring them in conformity.

@grantmcdermott
Copy link
Owner

Nice. Was just thinking about this!

The existing density logic is quite complicated, because we make special allowances for tinyplot(density(x)). I'm not sure that this exception is worth it anymore, so it might be worth axing and enforcing tinyplot(x, type = "density") to maintain internal simplicity. Will investigate when I have time.

@vincentarelbundock
Copy link
Collaborator Author

Yes, I think that small amount of breakage is a good idea while the package is still young. We want to start with strong standards.

@grantmcdermott
Copy link
Owner

I've run up against this issue a few times lately, so will take a crack at it before submitting v0.3.0 to CRAN.

(If it turns out to be too complicated for solving in a reasonable time, I'll just submit the new release without it. But hopefully there's a simple solution.)

@grantmcdermott
Copy link
Owner

@zeileis a question for you (and picking up part of a discussion originally started in #271):

Q: How do you think the joint bandwidth should be calculated for grouped density plots?

To my mind there are four options:

  • (A) Don't do any joint bandwidth calculation (i.e., just use the individual bandwidths calculated independently for each sub group).
  • (B) Re-use the same bandwidth that we would obtain from the full dataset.
  • (C) Take the mean of the individual bandwidths.
  • (D) Take the weighted mean of the individual bandwidths (where the weights indicate relative number of observations in each sub group).

Here's a quick manual implementation of each of these four options, using an mtcars demo example.

pals = palette.colors(4, "R4")

# Plot area
plot(density(mtcars$wt), ylim = c(0, 1.1), type = "n")
grid()

# Option A
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x), col = pals[1]))

# Option B
bb = density(mtcars$wt)$bw
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = bb), col = pals[2]))

# Option C
cc = mean(sapply(split(mtcars$wt, mtcars$am), \(x) density(x)$bw))
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = cc), col = pals[3], lty = 2))

# Option D
dd = weighted.mean(
  sapply(split(mtcars$wt, mtcars$am), \(x) density(x)$bw),
  sapply(split(mtcars$wt, mtcars$am), length)
)
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = dd), col = pals[4], lty = 4))

# Legend
legend(
  "topleft",
  legend = c("A) individual", "B) full", "C) mean", "D) weighted.mean"),
  lty = c(1,1,4,2),
  col = pals
)

Right now, we are using (C), but I'm not sure that it is optimal.

FWIW, this is the equivalent ggplot2 result. It looks like they're just taking the simplest approach (A) and not using any kind of joint bandwidth.

library(ggplot2)
theme_set(theme_linedraw())

ggplot(mtcars, aes(wt, group = am)) +
  geom_density() +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2), limits = c(0, 1.1)) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(linetype = "32"))

@zeileis
Copy link
Collaborator

zeileis commented Jan 8, 2025

Thanks for the write-up and apologies for the 12" version of Simon & Garfunkel playing over the last weeks. (You know...The Sound of Silence...) I took some time off over the holidays.

Coming back to this issue now I played around with your examples above as well as some artificial data I simulated from various normal distributions with different means and/or variances. Based on this my impression is that only A or D should be used.

B is problematic in settings where I have subgroups with similar variances but (very) different means. Computing the bandwidth from the joint sample will lead to inflated values, e.g.,

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100, mean = 10)
list(x1 = x1, x2 = x2, joint = c(x1, x2)) |> sapply(bw.nrd0)
##        x1        x2     joint 
## 0.3170624 0.3080127 1.5674067 

This will dampen the density curve as you can also see in the red line for your mtcars example.

C is clearly inferior to D when the subgroups have very different sizes because small groups might change the bandwidth estimates too much.

For A vs. D I think that A is probably the safer default. In cases where the bandwidth from the different groups are roughly similar, strategy D will stabilize the bandwidth estimates. But in cases where the bandwidth is very different (e.g., because variances are very different), D might be a poor compromise and using separate bandwidths from A would be better.

So, in short, my current thinking is that A should be the default but it would be nice if D were easy to do as well.

@grantmcdermott
Copy link
Owner

Excellent, thanks for the deep dive @zeileis! I'll aim to incorporate those changes into #284 when I get a chance.

@vincentarelbundock
Copy link
Collaborator Author

@zeileis love the thoroughness. Very, very good stuff.

@zeileis
Copy link
Collaborator

zeileis commented Jan 9, 2025

Thinking some more about this: B may be a useful strategy when the groups overlap a lot, i.e., all have similar means and variances. So adding this as another option might also be nice to have sometimes...

@grantmcdermott
Copy link
Owner

For posterity, the dev version of tinyplot now defaults to (A), but makes options (B) and (D) available to users via the type_density(joint.bw = <arg>) argument.

library(tinyplot)
tinytheme("basic")

tinyplot(~wt | factor(am), data = mtcars, type = "density")   # "none" (default)
tinyplot_add(type = type_density(joint.bw = "full"), lty = 2) # full dataset
tinyplot_add(type = type_density(joint.bw = "owm"), lty = 3)  # obs-weighted mean

legend("topright", c("(A) None", "(B) Full", "(D) OWM"), lty = 1:3, title = "Joint BW")

Created on 2025-01-08 with reprex v2.1.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants