`type_density()` using the new `type_data`/`type_draw` system #243

vincentarelbundock · 2024-11-07T22:07:09Z

Density plots are coded differently than all other types.

I won't have time to convert them now, but opening this issue because it would eventually be a good idea to bring them in conformity.

grantmcdermott · 2024-11-07T22:11:15Z

Nice. Was just thinking about this!

The existing density logic is quite complicated, because we make special allowances for tinyplot(density(x)). I'm not sure that this exception is worth it anymore, so it might be worth axing and enforcing tinyplot(x, type = "density") to maintain internal simplicity. Will investigate when I have time.

vincentarelbundock · 2024-11-08T01:03:27Z

Yes, I think that small amount of breakage is a good idea while the package is still young. We want to start with strong standards.

grantmcdermott · 2024-11-12T17:01:32Z

I've run up against this issue a few times lately, so will take a crack at it before submitting v0.3.0 to CRAN.

(If it turns out to be too complicated for solving in a reasonable time, I'll just submit the new release without it. But hopefully there's a simple solution.)

grantmcdermott · 2024-12-21T17:59:50Z

@zeileis a question for you (and picking up part of a discussion originally started in #271):

Q: How do you think the joint bandwidth should be calculated for grouped density plots?

To my mind there are four options:

(A) Don't do any joint bandwidth calculation (i.e., just use the individual bandwidths calculated independently for each sub group).
(B) Re-use the same bandwidth that we would obtain from the full dataset.
(C) Take the mean of the individual bandwidths.
(D) Take the weighted mean of the individual bandwidths (where the weights indicate relative number of observations in each sub group).

Here's a quick manual implementation of each of these four options, using an mtcars demo example.

pals = palette.colors(4, "R4")

# Plot area
plot(density(mtcars$wt), ylim = c(0, 1.1), type = "n")
grid()

# Option A
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x), col = pals[1]))

# Option B
bb = density(mtcars$wt)$bw
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = bb), col = pals[2]))

# Option C
cc = mean(sapply(split(mtcars$wt, mtcars$am), \(x) density(x)$bw))
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = cc), col = pals[3], lty = 2))

# Option D
dd = weighted.mean(
  sapply(split(mtcars$wt, mtcars$am), \(x) density(x)$bw),
  sapply(split(mtcars$wt, mtcars$am), length)
)
lapply(split(mtcars$wt, mtcars$am), \(x) lines(density(x, bw = dd), col = pals[4], lty = 4))

# Legend
legend(
  "topleft",
  legend = c("A) individual", "B) full", "C) mean", "D) weighted.mean"),
  lty = c(1,1,4,2),
  col = pals
)

Right now, we are using (C), but I'm not sure that it is optimal.

FWIW, this is the equivalent ggplot2 result. It looks like they're just taking the simplest approach (A) and not using any kind of joint bandwidth.

library(ggplot2)
theme_set(theme_linedraw())

ggplot(mtcars, aes(wt, group = am)) +
  geom_density() +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2), limits = c(0, 1.1)) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(linetype = "32"))

zeileis · 2025-01-08T17:52:14Z

Thanks for the write-up and apologies for the 12" version of Simon & Garfunkel playing over the last weeks. (You know...The Sound of Silence...) I took some time off over the holidays.

Coming back to this issue now I played around with your examples above as well as some artificial data I simulated from various normal distributions with different means and/or variances. Based on this my impression is that only A or D should be used.

B is problematic in settings where I have subgroups with similar variances but (very) different means. Computing the bandwidth from the joint sample will lead to inflated values, e.g.,

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100, mean = 10)
list(x1 = x1, x2 = x2, joint = c(x1, x2)) |> sapply(bw.nrd0)
##        x1        x2     joint 
## 0.3170624 0.3080127 1.5674067

This will dampen the density curve as you can also see in the red line for your mtcars example.

C is clearly inferior to D when the subgroups have very different sizes because small groups might change the bandwidth estimates too much.

For A vs. D I think that A is probably the safer default. In cases where the bandwidth from the different groups are roughly similar, strategy D will stabilize the bandwidth estimates. But in cases where the bandwidth is very different (e.g., because variances are very different), D might be a poor compromise and using separate bandwidths from A would be better.

So, in short, my current thinking is that A should be the default but it would be nice if D were easy to do as well.

grantmcdermott · 2025-01-08T19:48:00Z

Excellent, thanks for the deep dive @zeileis! I'll aim to incorporate those changes into #284 when I get a chance.

vincentarelbundock · 2025-01-08T20:39:47Z

@zeileis love the thoroughness. Very, very good stuff.

zeileis · 2025-01-09T00:47:04Z

Thinking some more about this: B may be a useful strategy when the groups overlap a lot, i.e., all have similar means and variances. So adding this as another option might also be nice to have sometimes...

grantmcdermott · 2025-01-09T05:42:08Z

For posterity, the dev version of tinyplot now defaults to (A), but makes options (B) and (D) available to users via the type_density(joint.bw = <arg>) argument.

library(tinyplot)
tinytheme("basic")

tinyplot(~wt | factor(am), data = mtcars, type = "density")   # "none" (default)
tinyplot_add(type = type_density(joint.bw = "full"), lty = 2) # full dataset
tinyplot_add(type = type_density(joint.bw = "owm"), lty = 3)  # obs-weighted mean

legend("topright", c("(A) None", "(B) Full", "(D) OWM"), lty = 1:3, title = "Joint BW")

^{Created on 2025-01-08 with reprex v2.1.1}

grantmcdermott mentioned this issue Nov 13, 2024

type_ridge() #252

Merged

grantmcdermott mentioned this issue Dec 20, 2024

CRAN 0.3.0 release #280

Open

grantmcdermott mentioned this issue Dec 27, 2024

Add type_density() function #284

Merged

grantmcdermott closed this as completed in #284 Jan 9, 2025

grantmcdermott mentioned this issue Jan 26, 2025

ylevels reordering in type_ridge() #270

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`type_density()` using the new `type_data`/`type_draw` system #243

`type_density()` using the new `type_data`/`type_draw` system #243

vincentarelbundock commented Nov 7, 2024

grantmcdermott commented Nov 7, 2024

vincentarelbundock commented Nov 8, 2024

grantmcdermott commented Nov 12, 2024

grantmcdermott commented Dec 21, 2024

zeileis commented Jan 8, 2025

grantmcdermott commented Jan 8, 2025

vincentarelbundock commented Jan 8, 2025

zeileis commented Jan 9, 2025

grantmcdermott commented Jan 9, 2025

type_density() using the new type_data/type_draw system #243

type_density() using the new type_data/type_draw system #243

Comments

vincentarelbundock commented Nov 7, 2024

grantmcdermott commented Nov 7, 2024

vincentarelbundock commented Nov 8, 2024

grantmcdermott commented Nov 12, 2024

grantmcdermott commented Dec 21, 2024

zeileis commented Jan 8, 2025

grantmcdermott commented Jan 8, 2025

vincentarelbundock commented Jan 8, 2025

zeileis commented Jan 9, 2025

grantmcdermott commented Jan 9, 2025

`type_density()` using the new `type_data`/`type_draw` system #243

`type_density()` using the new `type_data`/`type_draw` system #243