Welcome to advanced-value-counts

advanced-value-counts is a Python package containing the AdvancedValueCounts class, which builds on pandas' .value_counts() and .groupby() and on seaborn to summarize the counts of a (categorical) column in a pandas DataFrame. The package is most useful when you want the counts of a column after a grouping: df.groupby(groupby_col)[column].value_counts(). See Usage for how to use AdvancedValueCounts. Read this Medium article or consult this notebook for an explanation of the added value of this package.
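For reference, the plain pandas calls that the package builds on look roughly like this. This is a minimal sketch on made-up toy data (not the Titanic dataset used in Usage), and it uses pandas only, not AdvancedValueCounts itself.

```python
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({
    "CabinArea": ["A", "A", "B", "B", "B", None],
    "Title": ["Mr.", "Mrs.", "Mr.", "Mr.", "Miss.", "Mr."],
})

# Counts of a single column; dropna=False keeps missing values visible
counts = df["Title"].value_counts(dropna=False)

# Counts of Title within each CabinArea.
# Note: groupby drops rows whose group key is missing by default.
grouped_counts = df.groupby("CabinArea")["Title"].value_counts()

print(counts)
print(grouped_counts)
```

The grouped result only contains raw counts per group and silently leaves out the row with a missing CabinArea; ratios, an '_other' group, and totals have to be added by hand.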

Table of contents:

Installation for users using PyPI
Installation for users without PyPI
Usage
Installation for contributors

Installation for users using PyPI

pip install advanced-value-counts

If errors surface, upgrade pip and setuptools:

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade setuptools

Installation for users without PyPI

git clone https://github.com/sTomerG/advanced-value-counts.git
cd advanced-value-counts
pip install -e .
# optional but potentially crucial
pip install -r requirements/requirements.txt

To test whether the installation was successful, run the following in the advanced-value-counts directory (DeprecationWarnings are expected):

pytest
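Independently of the test suite, a quick way to check that the package can be found by the current Python environment is sketched below. It only uses the standard library; the import name advanced_value_counts is the one used in the Usage section.

```python
import importlib.util

# Look up the package without actually importing it;
# find_spec returns None when the package is not installed
spec = importlib.util.find_spec("advanced_value_counts")
installed = spec is not None

print("advanced_value_counts importable:", installed)
```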

Usage

The example below uses a modified version of the Titanic dataset from Kaggle, which can be found in this Git repository here.

The code of this notebook can be found here.

from advanced_value_counts.avc import AdvancedValueCounts
import pandas as pd

# read in the data file
df = pd.read_csv('../tests/data/titanic.csv', usecols=['CabinArea','Title'])
df.head()

   CabinArea  Title
0        NaN    Mr.
1          C   Mrs.
2        NaN  Miss.
3          C   Mrs.
4        NaN    Mr.

# create an instance of AdvancedValueCounts
avc = AdvancedValueCounts(df=df, column='Title')

# print the AdvancedValueCounts DataFrame
avc.avc_df

              ratio  count
Title
Mr.        0.580247    517
Miss.      0.204265    182
Mrs.       0.140292    125
Master.    0.044893     40
_na        0.007856      7
Rev.       0.006734      6
Major.     0.002245      2
Col.       0.002245      2
Mlle.      0.002245      2
Countess.  0.001122      1
Capt.      0.001122      1
Ms.        0.001122      1
Sir.       0.001122      1
Lady.      0.001122      1
Mme.       0.001122      1
Don.       0.001122      1
Jonkheer.  0.001122      1

Set min_group_count to 5 to merge small groups into an '_other' group:

avc.min_group_count = 5
avc.avc_df

            ratio  count
Title
Mr.      0.580247    517
Miss.    0.204265    182
Mrs.     0.140292    125
Master.  0.044893     40
_other   0.015713     14
_na      0.007856      7
Rev.     0.006734      6

Parameters of the AdvancedValueCounts class to adjust for small groups for a single column:

dropna: bool = False
min_group_count: int = 1 # does not affect NA or the '_other' group
min_group_ratio: float = 0 # does not affect NA or the '_other' group
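To make the idea behind min_group_count concrete, here is a minimal pandas-only sketch. The helper counts_with_other is hypothetical and is not the package's implementation; it assumes that groups with fewer rows than min_group_count are merged into '_other' and that the NA row is left untouched, as the notes above state.

```python
import pandas as pd


def counts_with_other(series, min_group_count=1, na_label="_na", other_label="_other"):
    """Hypothetical helper, not part of advanced-value-counts."""
    # Count values, labelling missing values so they get their own row
    counts = series.fillna(na_label).value_counts()
    # Assumption: groups smaller than min_group_count are merged,
    # except for the NA row
    small = (counts < min_group_count) & (counts.index != na_label)
    result = counts[~small].copy()
    if small.any():
        result[other_label] = counts[small].sum()
    result = result.sort_values(ascending=False)
    return pd.DataFrame({"ratio": result / result.sum(), "count": result})


# Toy data, made up for illustration
titles = pd.Series(["Mr."] * 6 + ["Miss."] * 3 + ["Col.", "Capt.", None])
summary = counts_with_other(titles, min_group_count=2)
print(summary)
```

In this toy example the two single-row titles end up in one '_other' row, while the single missing value keeps its own '_na' row.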

It is also possible to combine column with the parameter groupby_col: str = None to mimic the behaviour of df.groupby(groupby_col)[column].value_counts():

avc_grouped = AdvancedValueCounts(df=df, column='Title', groupby_col='CabinArea')
avc_grouped.avc_df

                   count  subgroup_ratio  subgr_r_diff_subgr_all  r_vs_total
CabinArea Title
A         Col.         1        0.066667                0.064422    0.001122
          Lady.        1        0.066667                0.065544    0.001122
          Master.      1        0.066667                0.021773    0.001122
          Mr.         11        0.733333                0.153086    0.012346
          Sir.         1        0.066667                0.065544    0.001122
...       ...        ...             ...                     ...         ...
_na       Mrs.        81        0.117904               -0.022388    0.090909
          Ms.          1        0.001456                0.000333    0.001122
          Rev.         6        0.008734                0.002000    0.006734
          _na          4        0.005822               -0.002034    0.004489
          _total     687        1.000000                0.000000    0.771044

74 rows × 4 columns
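The meaning of the ratio columns can be read off the numbers above: for Mr. in cabin area A, subgr_r_diff_subgr_all is 0.733333 - 0.580247 = 0.153086 (the ratio within the group minus the ratio over all rows), and r_vs_total is 11 / 891 = 0.012346. The sketch below reproduces these definitions with plain pandas on made-up toy data. The definitions are inferred from the output, not taken from the package's source, so treat them as an assumption.

```python
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({
    "CabinArea": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "Title": ["Mr.", "Mr.", "Mrs.", "Mr.", "Miss.", "Miss.", "Miss.", "Mrs."],
})

# count: number of rows per (CabinArea, Title) pair
out = df.groupby("CabinArea")["Title"].value_counts().rename("count").to_frame()

# subgroup_ratio: share of each Title within its own CabinArea
group_totals = out.groupby(level="CabinArea")["count"].transform("sum")
out["subgroup_ratio"] = out["count"] / group_totals

# subgr_r_diff_subgr_all: subgroup_ratio minus the Title's ratio over all rows
overall_ratio = df["Title"].value_counts(normalize=True)
title_level = out.index.get_level_values("Title")
out["subgr_r_diff_subgr_all"] = (
    out["subgroup_ratio"] - overall_ratio.reindex(title_level).to_numpy()
)

# r_vs_total: share of the (CabinArea, Title) pair among all rows
out["r_vs_total"] = out["count"] / len(df)

print(out)
```

A positive subgr_r_diff_subgr_all means the title is over-represented in that cabin area compared to the whole dataset; a negative value means it is under-represented.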

To get a better overview of the data, set attributes to adjust group size and round the ratios

avc_grouped.min_group_ratio = 0.05
avc_grouped.min_subgroup_count = 5
avc_grouped.round_ratio = 3
avc_grouped.avc_df

                   count  subgroup_ratio  subgr_r_diff_subgr_all  r_vs_total
CabinArea Title
B         Miss.       14           0.298                   0.094       0.016
          Mr.         16           0.340                  -0.240       0.018
          Mrs.        10           0.213                   0.073       0.011
          _na          1           0.021                   0.013       0.001
          _other       6           0.127                   0.111       0.007
          _total      47           1.000                   0.000       0.053
C         Miss.       12           0.203                  -0.001       0.013
          Mr.         29           0.492                  -0.088       0.033
          Mrs.        14           0.237                   0.097       0.016
          _na          1           0.017                   0.009       0.001
          _other       3           0.051                   0.035       0.003
          _total      59           1.000                   0.000       0.066
_all      Master.     40           0.045                     NaN       0.045
          Miss.      182           0.204                     NaN       0.204
          Mr.        517           0.580                     NaN       0.580
          Mrs.       125           0.140                     NaN       0.140
          Rev.         6           0.007                     NaN       0.007
          _na          7           0.008                     NaN       0.008
          _other      14           0.016                     NaN       0.016
          _total     891           1.000                     NaN       1.000
_na       Master.     33           0.048                   0.003       0.037
          Miss.      135           0.197                  -0.007       0.152
          Mr.        424           0.617                   0.037       0.476
          Mrs.        81           0.118                  -0.022       0.091
          Rev.         6           0.009                   0.002       0.007
          _na          4           0.006                  -0.002       0.004
          _other       4           0.006                  -0.010       0.004
          _total     687           1.000                   0.000       0.771
_other    Master.      5           0.051                   0.006       0.006
          Miss.       21           0.214                   0.010       0.024
          Mr.         48           0.490                  -0.090       0.054
          Mrs.        20           0.204                   0.064       0.022
          _na          1           0.010                   0.002       0.001
          _other       3           0.031                   0.015       0.003
          _total      98           1.000                   0.000       0.110

Parameters of the AdvancedValueCounts class to adjust for group size in a grouped-by AdvancedValueCounts:

# for groupby_col:
dropna: bool = False
max_groups: int = None # does not affect NA or the '_other' group
min_group_count: int = 1 # does not affect NA or the '_other' group
min_group_ratio: float = 0 # does not affect NA or the '_other' group

# for column:
dropna: bool = False
max_subgroups: int = None # does not affect NA or the '_other' group
min_subgroup_count: int = 1 # does not affect NA or the '_other' group
min_subgroup_ratio: float = 0 # does not affect NA or the '_other' group
min_subgroup_ratio_vs_total: float = 0 # does not affect NA or the '_other' group

To get a plot of the AdvancedValueCounts.avc_df:

avc_grouped.get_plot(normalize=True) # normalize=True is the default

To get a DataFrame without the summary statistics such as '_all' and '_total':

avc_grouped.unsummerized_df

                   count  subgroup_ratio  subgr_r_diff_subgr_all  r_vs_total
CabinArea Title
B         Miss.       14           0.298                   0.094       0.016
          Mr.         16           0.340                  -0.240       0.018
          Mrs.        10           0.213                   0.073       0.011
          _na          1           0.021                   0.013       0.001
          _other       6           0.127                   0.111       0.007
C         Miss.       12           0.203                  -0.001       0.013
          Mr.         29           0.492                  -0.088       0.033
          Mrs.        14           0.237                   0.097       0.016
          _na          1           0.017                   0.009       0.001
          _other       3           0.051                   0.035       0.003
_na       Master.     33           0.048                   0.003       0.037
          Miss.      135           0.197                  -0.007       0.152
          Mr.        424           0.617                   0.037       0.476
          Mrs.        81           0.118                  -0.022       0.091
          Rev.         6           0.009                   0.002       0.007
          _na          4           0.006                  -0.002       0.004
          _other       4           0.006                  -0.010       0.004
_other    Master.      5           0.051                   0.006       0.006
          Miss.       21           0.214                   0.010       0.024
          Mr.         48           0.490                  -0.090       0.054
          Mrs.        20           0.204                   0.064       0.022
          _na          1           0.010                   0.002       0.001
          _other       3           0.031                   0.015       0.003
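In effect, this drops every row that belongs to the '_all' group or carries the '_total' label. A minimal pandas-only sketch of that filtering step is shown below, using a small hand-built frame with a few counts taken from the output above. It illustrates the idea and is not the package's own code.

```python
import pandas as pd

# Small hand-built frame with the same index layout as avc_df
index = pd.MultiIndex.from_tuples(
    [("B", "Mr."), ("B", "_total"), ("_all", "Mr."), ("_all", "_total")],
    names=["CabinArea", "Title"],
)
summary_df = pd.DataFrame({"count": [16, 47, 517, 891]}, index=index)

# Flag the summary rows on each index level
is_all = summary_df.index.get_level_values("CabinArea") == "_all"
is_total = summary_df.index.get_level_values("Title") == "_total"

# Keep only the rows that are neither '_all' nor '_total'
plain_df = summary_df[~(is_all | is_total)]
print(plain_df)
```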

Installation for contributors

git clone https://github.com/sTomerG/advanced-value-counts.git
cd advanced-value-counts
python3 -m venv .venv

Activate the virtual environment

Windows:

.\.venv\Scripts\activate

Linux / MacOS:

source .venv/bin/activate

Install requirements

python -m pip install --upgrade pip
pip install -r requirements/requirements.txt

Test if everything works properly

(DeprecationWarnings are expected)

With tox

tox

Without tox

pip install -e .
pytest
