Use Freedman–Diaconis, Scott's, or Sturges' rule to calculate histogram size for numeric attributes #11

Open
DrAndiLowe opened this issue Jun 11, 2018 · 0 comments

DrAndiLowe commented Jun 11, 2018

Your paper, DataSynthesizer: Privacy-Preserving Synthetic Datasets, says in section 3.1.3:

When invoked in independent attribute mode, DataDescriber performs frequency-based estimation of the unconditioned probability distributions of the attributes. The distribution is captured by bar charts for categorical attributes, and by histograms for numerical attributes. Precision of the histograms can be refined by a parameter named histogram size, which represents the number of bins and is set to 20 by default.

After checking the code for DataDescriber, I confirmed that this statement is also true for running in correlated attribute mode. Where does the figure 20 come from? Was this an arbitrary choice, or was there some motivation for this specific value?

Might I suggest that you use a variation of the Freedman–Diaconis rule instead? Or Scott's, or Sturges'? (More info here.) I believe this will noticeably improve how closely the synthesised distributions match the original ones.

To demonstrate this, here are plots of two attributes from my dataset, age of person and date of measured age (expressed as an integer count of days since 1900-01-01), with histogram_size = 20:

[Image: defaultagedate_compare_histograms]
[Image: defaultprofileage_compare_histograms]

The synthesised data doesn't look so good.

However, if I use a variation of the Freedman–Diaconis rule to choose the number of bins (8), this is what I get:

[Image: defaultagedate_compare_histograms]
[Image: defaultprofileage_compare_histograms]

Which looks much better!

Code for calculating the number of bins for a 1D histogram is available in NumPy: numpy.histogram, whose bins argument accepts the rule names 'fd', 'scott', and 'sturges'.
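
As a rough illustration of how a data-driven bin count could be derived per numeric attribute, here is a minimal sketch using NumPy's built-in estimators. The helper name `suggest_histogram_size` and the sample data are hypothetical and not part of DataSynthesizer; the sketch uses NumPy's standard rules, not the unspecified "variation" of Freedman–Diaconis mentioned above, so the bin counts it produces may differ from the 8 reported there.

```python
import numpy as np


def suggest_histogram_size(values, rule="fd"):
    """Hypothetical helper: number of bins chosen by one of NumPy's rules.

    rule="fd"      Freedman-Diaconis: bin width = 2 * IQR / n**(1/3)
    rule="scott"   Scott: bin width ~= 3.49 * std / n**(1/3)
    rule="sturges" Sturges: number of bins = ceil(log2(n) + 1)
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing entries
    # histogram_bin_edges returns len(bins) + 1 edges
    edges = np.histogram_bin_edges(values, bins=rule)
    return len(edges) - 1


if __name__ == "__main__":
    # Made-up stand-in for a numeric attribute such as age.
    rng = np.random.default_rng(0)
    ages = rng.normal(loc=40, scale=12, size=1000)
    for rule in ("fd", "scott", "sturges"):
        print(rule, suggest_histogram_size(ages, rule))
```

The returned value could then be passed wherever histogram_size is currently fixed at 20. Whether to compute it once per dataset or separately per attribute is a design choice this sketch leaves open.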

@DrAndiLowe DrAndiLowe changed the title Use Freedman–Diaconis rule to calculate histogram size for numeric attributes Use Freedman–Diaconis, Scott's, or Sturges' rule to calculate histogram size for numeric attributes Jun 13, 2018