Your paper, *DataSynthesizer: Privacy-Preserving Synthetic Datasets*, says in section 3.1.3:

> When invoked in independent attribute mode, DataDescriber performs frequency-based estimation of the unconditioned probability distributions of the attributes. The distribution is captured by bar charts for categorical attributes, and by histograms for numerical attributes. Precision of the histograms can be refined by a parameter named histogram size, which represents the number of bins and is set to 20 by default.
After checking the code for DataDescriber, I confirmed that this statement also holds when running in correlated attribute mode. Where does the figure of 20 come from? Was this an arbitrary choice, or was there some motivation for this specific value?
Might I suggest that you use a variation of the Freedman–Diaconis rule instead? Or Scott's, or Sturges'? (More info here.) I believe this will lead to significant improvements in the fidelity of the synthesised data.
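For concreteness, here's a minimal sketch of the Freedman–Diaconis rule; the `fd_num_bins` helper is my own name for illustration, not part of DataDescriber:

```python
import numpy as np

def fd_num_bins(values):
    """Hypothetical helper: bin count from the Freedman-Diaconis rule.

    Bin width h = 2 * IQR(x) / n^(1/3); the number of bins is the data
    range divided by h, rounded up.
    """
    values = np.asarray(values, dtype=float)
    q75, q25 = np.percentile(values, [75, 25])
    h = 2.0 * (q75 - q25) / len(values) ** (1.0 / 3.0)
    if h <= 0:  # degenerate data (e.g. a constant attribute): fall back to one bin
        return 1
    return int(np.ceil((values.max() - values.min()) / h))
```

Because the rule scales the bin width with the interquartile range, it adapts to the spread of each attribute instead of forcing a fixed count of 20 on everything.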
To demonstrate this, here are plots of two attributes from my dataset: a person's age, and the date the age was measured (expressed as an integer count of days since 1900-01-01), with `histogram_size = 20`:
The synthesised data doesn't look so good.
However, if I use a variation of the Freedman–Diaconis rule to choose the number of bins (8), this is what I get:
Which looks much better!
Code for calculating the number of bins for a 1D histogram is available in NumPy (not SciPy): `numpy.histogram` accepts `bins='fd'`, `'scott'`, or `'sturges'` and chooses the bin count for you.
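For example, a quick sketch with made-up sample data (the `ages` array here is invented purely for illustration):

```python
import numpy as np

# Made-up sample data standing in for a numerical attribute.
rng = np.random.default_rng(0)
ages = rng.normal(loc=45, scale=12, size=1000)

# NumPy picks the bin count; 'fd', 'scott', and 'sturges' are all accepted.
counts, bin_edges = np.histogram(ages, bins='fd')
print(len(bin_edges) - 1)  # number of bins chosen by the Freedman-Diaconis rule
```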