pysum: summarize pandas dataframes

pysum takes a pandas dataframe (and a few other arguments to customize the output) and creates a markdown, html, or xlsx report with a summary of each of the variables in the dataframe.

The program iterates over the columns of the dataframe, computes summary statistics for each column based on its datatype, and writes them to a table.
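
The column-by-column dispatch described above can be sketched in plain Python. This is a simplified illustration, not pysum's actual code: a dict of lists stands in for a dataframe, None stands in for a missing value, and the names column_kind and summarize_columns are hypothetical.

```python
from numbers import Number


def column_kind(values):
    """Classify a column as 'numeric' or 'character', ignoring missing (None) entries."""
    valid = [v for v in values if v is not None]
    # bool is a Number subclass in Python; this sketch treats it as non-numeric.
    if valid and all(isinstance(v, Number) and not isinstance(v, bool) for v in valid):
        return "numeric"
    return "character"


def summarize_columns(table):
    """Walk the columns in order and build one summary row per column."""
    rows = []
    for number, (name, values) in enumerate(table.items(), start=1):
        kind = column_kind(values)
        valid = [v for v in values if v is not None]
        if kind == "numeric":
            detail = f"min={min(valid)}, max={max(valid)}"
        else:
            detail = f"{len(set(valid))} distinct values"
        rows.append({"no": number, "variable": name, "type": kind, "summary": detail})
    return rows


# Hypothetical two-column table; None marks a missing entry.
table = {
    "sepal-length": [5.1, 4.9, None, 6.3],
    "class": ["setosa", "setosa", "virginica", None],
}
rows = summarize_columns(table)
for row in rows:
    print(row)
```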

Inputs

The function takes the following arguments:

  1. dataframe: pandas dataframe. No default. The passed dataframe must also have a name attribute that carries the name of the dataframe. See Examples for clarification.
  2. round_digits: Integer. Number of digits to which reported numbers are rounded. Default is 2.
  3. var_numbers: Boolean. Whether or not to add a column indicating the column number. Default is true.
  4. missing_col: Boolean. Adds a column that reports the proportion missing. Default is true.
  5. max_distinct_values: Numeric. The maximum number of values to display frequencies for. If a variable has more distinct values than this number, the remaining frequencies are reported as a whole, along with the number of additional distinct values. Default is 10.
  6. max_string_width: Integer. Limits the number of characters to display in the frequency tables. Default is 25.
  7. output_type: String. The format of the output file: xlsx, html, or markdown. Default is html.
  8. output_file: String. The path and filename to which the script should write the results. Default is summary.html in the local directory.
  9. append: Boolean. Whether to append the results to an existing file (true) or overwrite it (false). Default is true.
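
To make the append behaviour concrete, here is a minimal sketch for text output (html or markdown). It is an assumption about the semantics, not pysum's implementation: write_report is a hypothetical helper, and xlsx output would need a different mechanism than plain file modes.

```python
import os
import tempfile


def write_report(text, output_file="summary.html", append=True):
    """Write text to output_file, appending when append is True, else overwriting."""
    mode = "a" if append else "w"
    with open(output_file, mode, encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "summary.html")
    write_report("<p>first</p>\n", path)   # file does not exist yet, so it is created
    write_report("<p>second</p>\n", path)  # append=True keeps the first report
    with open(path, encoding="utf-8") as handle:
        appended = handle.read()
    write_report("<p>third</p>\n", path, append=False)  # overwriting discards earlier content
    with open(path, encoding="utf-8") as handle:
        overwritten = handle.read()

print(appended)
print(overwritten)
```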

The html output also requires a custom.css file in the local folder.

Output

The output is an xlsx, html, or markdown file. For numeric columns, it reports by default the mean, standard deviation, minimum, maximum, median, IQR, number of distinct values, percentage valid, and percentage missing.

Definitions of Terms in the Output

  1. Valid = entries with non-missing values
  2. mean (sd) = mean (standard deviation)
  3. min = minimum
  4. med = median
  5. max = maximum
  6. IQR = Interquartile range
  7. CV = Coefficient of variation
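
The definitions above can be checked by hand with the standard library. The sketch below is illustrative only and makes assumptions pysum may not share: it uses the sample standard deviation, linear-interpolation quartiles for the IQR, CV = sd / mean, and None as the missing-value marker. The name numeric_summary is hypothetical.

```python
import statistics


def numeric_summary(values, round_digits=2):
    """Compute the numeric statistics listed above for one column."""
    valid = [v for v in values if v is not None]
    total = len(values)
    mean = statistics.mean(valid)
    sd = statistics.stdev(valid)  # sample standard deviation (n - 1 denominator)
    # "inclusive" quartiles interpolate linearly between the observed min and max.
    q1, _, q3 = statistics.quantiles(valid, n=4, method="inclusive")
    return {
        "mean": round(mean, round_digits),
        "sd": round(sd, round_digits),
        "min": min(valid),
        "med": round(statistics.median(valid), round_digits),
        "max": max(valid),
        "IQR": round(q3 - q1, round_digits),
        "CV": round(sd / mean, round_digits),
        "distinct": len(set(valid)),
        "valid_pct": round(100 * len(valid) / total, round_digits),
        "missing_pct": round(100 * (total - len(valid)) / total, round_digits),
    }


values = [2.0, 4.0, 4.0, 5.0, 7.0, None, 8.0]
stats = numeric_summary(values)
print(stats)
```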

For character columns, it reports the frequencies of up to max_distinct_values values, followed by the number of remaining distinct values and their combined percentage. By default, it also reports the percentage of observations that are valid and the percentage that are missing.
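
A rough sketch of such a frequency table, using collections.Counter. The function name frequency_table, the "[N others]" label, and the choice to compute percentages over valid (non-missing) entries are assumptions made for illustration, not taken from pysum.

```python
from collections import Counter


def frequency_table(values, max_distinct_values=10, max_string_width=25):
    """Return (label, count, percent) rows, lumping the less common values together."""
    valid = [str(v) for v in values if v is not None]
    counts = Counter(valid)
    top = counts.most_common(max_distinct_values)
    rows = [
        (label[:max_string_width], count, round(100 * count / len(valid), 2))
        for label, count in top
    ]
    extra = len(counts) - len(top)  # distinct values that were not displayed
    if extra > 0:
        other = len(valid) - sum(count for _, count in top)
        rows.append((f"[{extra} others]", other, round(100 * other / len(valid), 2)))
    return rows


values = ["a"] * 5 + ["b"] * 3 + ["c", "d", None]
table_rows = frequency_table(values, max_distinct_values=2)
for row in table_rows:
    print(row)
```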

Limitations: Dates are parsed as characters by default. They are best handled as numeric values, but given the variety of formats in which dates appear, no standard support is offered for now.
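
One possible workaround is to convert a date column to numbers before summarizing it. The sketch below assumes the dates are ISO 8601 strings (YYYY-MM-DD) and uses None for missing values; dates_to_ordinals is a hypothetical helper, and other date formats would need their own parsing.

```python
from datetime import date


def dates_to_ordinals(values):
    """Convert ISO 8601 date strings to day numbers; keep None as missing."""
    return [date.fromisoformat(v).toordinal() if v is not None else None for v in values]


raw = ["2021-01-01", "2021-01-31", None, "2021-03-01"]
ordinals = dates_to_ordinals(raw)

# The numeric column now supports min, max, and range like any other number.
valid = [v for v in ordinals if v is not None]
span_days = max(valid) - min(valid)
print(ordinals, span_days)
```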

Running the Script

Install the requirements:

pip install -r requirements.txt

You also need pandoc to be installed on your machine.
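
A quick way to confirm that pandoc is reachable from Python is to look it up on the PATH. This check is a convenience sketch and not part of pysum; pandoc_available is a hypothetical name.

```python
import shutil


def pandoc_available():
    """Return True if a pandoc executable can be found on the PATH."""
    return shutil.which("pandoc") is not None


if not pandoc_available():
    print("pandoc was not found on the PATH; install it before generating reports.")
```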

Examples

Iris data:

import pandas
import pysum

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

# Pass name of the dataset; required
dataset.name = 'iris'

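# HTML report using the defaults (appends to summary.html)
pysum.summarizeDF(dataset)

# xlsx and markdown reports, overwriting any existing output file
pysum.summarizeDF(dataset, output_type="xlsx", append=False)
pysum.summarizeDF(dataset, output_type="markdown", append=False)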

Markdown Output, HTML Output and XLSX Output

Attribution

The package is based on https://github.com/dcomtois/summarytools