Density scaling and bandwidth in type_ridge() #271

Open
zeileis opened this issue Nov 27, 2024 · 6 comments · May be fixed by #270

Comments

@zeileis
Collaborator

zeileis commented Nov 27, 2024

In get_density() within type_ridge(), the density estimates are scaled to the same maximum, which I believe is incorrect. Also, the bandwidths are computed separately by default, whereas ggridges employs a joint bandwidth estimate, which typically seems to work better. But maybe I'm overlooking something here, Vincent @vincentarelbundock?

For illustration let's consider a sample with two groups with very different variances:

set.seed(0)
d <- data.frame(
  y = rep(1, 100),
  x = c(rnorm(50), rnorm(50, mean = 3, sd = 0.2)),
  z = factor(rep(1:2, each = 50))
)

Using ggridges, a joint bandwidth of 0.218 is selected and the resulting densities have very different maxima, as you would expect if both should have area 1.

library("ggplot2")
library("ggridges")
ggplot(d, aes(x = x, y = z)) + geom_density_ridges() + theme_minimal()
ggplot(d, aes(x = x, y = y, group = z)) + geom_density_ridges() + theme_minimal()

tinyplot-ridge1
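
The joint bandwidth can be approximated in base R. This is a minimal sketch, assuming the joint value is the plain (unweighted) mean of the per-group `bw.nrd0()` bandwidths; the names `bw_group` and `bw_joint` are made up for illustration and are not part of ggridges or tinyplot.

```r
# Same simulated data as above (y is omitted because it is constant)
set.seed(0)
d <- data.frame(
  x = c(rnorm(50), rnorm(50, mean = 3, sd = 0.2)),
  z = factor(rep(1:2, each = 50))
)

# One bandwidth per group, using R's default rule of thumb
bw_group <- tapply(d$x, d$z, bw.nrd0)

# Assumed joint rule: unweighted mean of the group bandwidths
bw_joint <- mean(bw_group)

print(bw_group)
print(bw_joint)
```

The high-variance group gets a much larger individual bandwidth than the low-variance group, and the joint value lies between the two.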

In contrast, because get_density() scales the density to have the same maximum, the second group has a much smaller area than the first in tinyplot(..., type = "ridge").

library("tinyplot")
tinyplot(z ~ x, data = d, type = "ridge", grid = TRUE)
tinyplot(y ~ x | z, data = d, type = "ridge", grid = TRUE)

tinyplot-ridge2
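
The effect of the rescaling on the areas can be checked numerically without any plotting. The sketch below is not the get_density() code; it only assumes that "scaling to the same maximum" means dividing each density by its own peak, and the helper `area()` is a hypothetical trapezoidal-rule function written for this illustration.

```r
set.seed(0)
d <- data.frame(
  x = c(rnorm(50), rnorm(50, mean = 3, sd = 0.2)),
  z = factor(rep(1:2, each = 50))
)

# Kernel density per group with the same fixed bandwidth
dens <- lapply(split(d$x, d$z), density, bw = 0.218)

# Trapezoidal approximation of the area under an estimated density
area <- function(dd) sum(diff(dd$x) * (head(dd$y, -1) + tail(dd$y, -1)) / 2)

# Unscaled densities: both areas are close to 1
area_raw <- sapply(dens, area)

# Assumed rescaling: divide by the peak so that every curve has maximum 1
area_scaled <- sapply(dens, function(dd) area(dd) / max(dd$y))

print(round(rbind(area_raw, area_scaled), 3))
```

After rescaling, the narrow group (2) ends up with a clearly smaller area than the wide group (1), which matches what the ridge plot shows.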

However, if I drop the scaling in line https://github.com/grantmcdermott/tinyplot/blob/main/R/type_ridge.R#L99 and select the same bandwidth of 0.218, then I get virtually the same result as in ggridges.

tinyplot(z ~ x, data = d, type = type_ridge(bw = 0.218), grid = TRUE)
tinyplot(y ~ x | z, data = d, type = type_ridge(bw = 0.218), grid = TRUE)

tinyplot-ridge3

Note that type = "density" agrees here and chooses the same joint bandwidth without applying any scaling of the maximum:

tinyplot(~ x | z, data = d, type = "density", grid = TRUE)

tinyplot-ridge4

@vincentarelbundock
Collaborator

Oh, that's interesting. I chose this purely for visual reasons and had no principled justification. Sorry!

I'll defer to you on best defaults.

@zeileis
Collaborator Author

zeileis commented Nov 28, 2024

OK, good, thanks for the quick feedback. Removing the scaling to the same maximum is straightforward.

Regarding the bandwidth: Both ggridges and type_density() seem to use the average of the individual bandwidths per group, see: https://github.com/wilkelab/ggridges/blob/master/R/stats.R#L109-L112 and https://github.com/grantmcdermott/tinyplot/blob/main/R/type_density.R#L83-L86

As the code is so similar: Is the type_density code inspired by ggridges or are they both inspired by something else? Why is this simply using the mean rather than the weighted mean (so that larger groups would receive more weight)?

Should we then always enforce that the same bandwidth is used throughout? Or should we allow a different bandwidth per ridgeline? If the latter, how should we specify this: with an additional argument, or does anyone have a better idea?

@zeileis zeileis linked a pull request Nov 28, 2024 that will close this issue
@grantmcdermott
Owner

> As the code is so similar: Is the type_density code inspired by ggridges or are they both inspired by something else?

I wrote (what eventually became) the type_density code a long time ago, so I can't recall my exact thinking. But I think you may be correct in that I compared what I was doing with the ggridges code to make sure they were consistent. IIRC that was prompted by your suggestion to use a common bandwidth across groups. See point 2 here.

> Why is this simply using the mean rather than the weighted mean (so that larger groups would receive more weight)?

I don't have a good reason. I think I just did what was expedient. Should we switch to weighted.mean (both for type_density and type_ridge)?

@zeileis
Collaborator Author

zeileis commented Dec 20, 2024

OK, thanks for the explanation. In any case, I think that type_density and type_ridge should use the same default bandwidth. And I lean towards weighted.mean but I haven't explored this more systematically to check whether it really works better in unbalanced data.
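
To make the difference between the two candidates concrete, here is a small sketch with deliberately unbalanced groups. The group sizes, the seed, and the names `bws`, `bw_mean`, and `bw_weighted` are illustrative assumptions; this is not the type_density() or type_ridge() implementation and it does not settle which rule works better in practice.

```r
set.seed(1)
# Unbalanced example: a large, wide group and a small, narrow group
x <- list(
  a = rnorm(200),
  b = rnorm(20, sd = 0.2)
)

# Individual rule-of-thumb bandwidths and group sizes
bws <- vapply(x, bw.nrd0, numeric(1))
n <- lengths(x)

# Unweighted mean: both groups count equally
bw_mean <- mean(bws)

# Weighted mean: larger groups pull the joint bandwidth towards their own
bw_weighted <- weighted.mean(bws, w = n)

print(bws)
print(c(mean = bw_mean, weighted = bw_weighted))
```

In this setup the weighted mean is pulled towards the bandwidth of the large group `a`, so it is larger than the unweighted mean.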

@grantmcdermott
Owner

Note: with #284 we ultimately opted for individual bandwidths in (regular) grouped density plots, albeit with the possibility for users to override this via the joint.bw argument.

Should we update the type_ridge code to do so as well?

@zeileis
Collaborator Author

zeileis commented Jan 9, 2025

Yes, type_density() and type_ridge() should be consistent in their handling of the bandwidths, I think.

I'll try to have a closer look at the details in type_density() later tonight and then follow up again.
