Your paper, *DataSynthesizer: Privacy-Preserving Synthetic Datasets*, says in section 3.1.3:

> When invoked in independent attribute mode, DataDescriber performs frequency-based estimation of the unconditioned probability distributions of the attributes. The distribution is captured by bar charts for categorical attributes, and by histograms for numerical attributes. Precision of the histograms can be refined by a parameter named histogram size, which represents the number of bins and is set to 20 by default.
After checking the code for DataDescriber, I confirmed that this statement also holds when running in correlated attribute mode. Where does the figure of 20 come from? Was this an arbitrary choice, or was there some motivation for this specific value?
Might I suggest that you use a variation of the Freedman–Diaconis rule instead? Or Scott's, or Sturges'? (More info here.) I believe this will lead to significant improvements in the fidelity of the synthesised data.
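For concreteness, here's a minimal sketch of the Freedman–Diaconis rule; the `fd_num_bins` helper is my own name for illustration, not part of DataDescriber:

```python
import numpy as np

def fd_num_bins(values):
    """Hypothetical helper: bin count from the Freedman-Diaconis rule.

    Bin width h = 2 * IQR(x) / n^(1/3); the number of bins is the data
    range divided by h, rounded up.
    """
    values = np.asarray(values, dtype=float)
    q75, q25 = np.percentile(values, [75, 25])
    h = 2.0 * (q75 - q25) / len(values) ** (1.0 / 3.0)
    if h <= 0:  # degenerate data (e.g. a constant attribute): fall back to one bin
        return 1
    return int(np.ceil((values.max() - values.min()) / h))
```

Because the rule scales the bin width with the interquartile range, it adapts to the spread of each attribute instead of forcing a fixed count of 20 on everything.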
To demonstrate this, here are plots of two attributes from my dataset: a person's age, and the date the age was measured (expressed as an integer count of days since 1900-01-01), with `histogram_size = 20`:
The synthesised data doesn't look so good.
However, if I use a variation of the Freedman–Diaconis rule to choose the number of bins (8), this is what I get:
Which looks much better!
Code for calculating the number of bins for a 1D histogram is available in NumPy (not SciPy): `numpy.histogram` accepts `bins='fd'`, `'scott'`, or `'sturges'` and chooses the bin count for you.
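For example, a quick sketch with made-up sample data (the `ages` array here is invented purely for illustration):

```python
import numpy as np

# Made-up sample data standing in for a numerical attribute.
rng = np.random.default_rng(0)
ages = rng.normal(loc=45, scale=12, size=1000)

# NumPy picks the bin count; 'fd', 'scott', and 'sturges' are all accepted.
counts, bin_edges = np.histogram(ages, bins='fd')
print(len(bin_edges) - 1)  # number of bins chosen by the Freedman-Diaconis rule
```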