FEAT Add the AggJoiner and AggTarget transformers #600

Merged
merged 54 commits into from
Oct 10, 2023
Changes from 47 commits
72b4cbc
first draft
Vincent-Maladiere Jun 13, 2023
5a3a7d6
apply Gael feedback skip-ci
Vincent-Maladiere Jun 15, 2023
1058d17
ci skip
Vincent-Maladiere Jun 15, 2023
6f7b5a4
[ci skip]
Vincent-Maladiere Jun 15, 2023
bd2a533
add tests
Vincent-Maladiere Jun 16, 2023
f3eb33d
install and run pre-commit
Vincent-Maladiere Jun 16, 2023
94a6820
update changelog
Vincent-Maladiere Jun 16, 2023
3d25d20
add JoinAggregator to api.rst
Vincent-Maladiere Jun 16, 2023
fbff72c
add movielens dataset fetchers
Vincent-Maladiere Jun 17, 2023
1c79ff5
add string 'X' option for tables
Vincent-Maladiere Jun 17, 2023
05aa9fe
first iteration of the exemple 07
Vincent-Maladiere Jun 17, 2023
c8716b6
re-enable polars boolean flag
Vincent-Maladiere Jul 17, 2023
bc30b59
fix suffixes test
Vincent-Maladiere Jul 17, 2023
1fd9ad3
add an example to the docstring
Vincent-Maladiere Jul 17, 2023
0c542a4
add pandas (full) and polars (partial, WIP) support for agg joiner
Vincent-Maladiere Jul 21, 2023
b6fc4b7
add movielens example
Vincent-Maladiere Jul 21, 2023
c302e91
update tests
Vincent-Maladiere Jul 27, 2023
14b57fd
update CHANGES.rst
Vincent-Maladiere Jul 27, 2023
a68d786
Merge branch 'main' into add_join_agg
Vincent-Maladiere Jul 27, 2023
e9bf803
fix docstring format
Vincent-Maladiere Jul 27, 2023
cdecbb4
fix SyntaxError bad escape
Vincent-Maladiere Jul 27, 2023
d8f8928
make polars optional for testing
Vincent-Maladiere Jul 27, 2023
d5ea436
add tests for movielens fetching
Vincent-Maladiere Jul 27, 2023
bf87eb5
update movielens test
Vincent-Maladiere Jul 27, 2023
7ecf65d
try to add as_posix to debug fetching test
Vincent-Maladiere Jul 27, 2023
9430f04
fix tests for polars
Vincent-Maladiere Jul 27, 2023
71e43d2
add pyarrow for testing, used with polars
Vincent-Maladiere Jul 27, 2023
0776bb1
fix pyarrow not installed
Vincent-Maladiere Jul 27, 2023
60a5a98
fix polars and pyarrow dependency, hide tests that require polars
Vincent-Maladiere Jul 27, 2023
089f023
Update skrub/_agg_joiner.py
Vincent-Maladiere Jul 30, 2023
a0b2bf8
Update examples/07_join_aggregation.py
Vincent-Maladiere Jul 30, 2023
a5b4e3f
Apply suggestions from code review
Vincent-Maladiere Aug 23, 2023
2b58374
Apply suggestions from code review
Vincent-Maladiere Aug 23, 2023
a6270c3
apply reviews
Vincent-Maladiere Aug 28, 2023
ffc5723
Merge branch 'main' into add_join_agg
Vincent-Maladiere Aug 28, 2023
705f45f
update CI to add polars
Vincent-Maladiere Aug 28, 2023
15e245f
fix typo in install.sh
Vincent-Maladiere Aug 28, 2023
0304a14
add aggregate and join functions for polars and pandas
Vincent-Maladiere Sep 5, 2023
9f5d10e
fix pandas and polars xref
Vincent-Maladiere Sep 5, 2023
db7f239
Merge branch 'add_polars_pandas_utils' into add_join_agg
Vincent-Maladiere Sep 5, 2023
2315ace
Update CHANGES.rst
Vincent-Maladiere Sep 6, 2023
0cffc69
apply suggestions from review
Vincent-Maladiere Sep 6, 2023
3df128b
single item oriented api
Vincent-Maladiere Sep 7, 2023
272bb95
Merge branch 'add_polars_pandas_utils' into add_join_agg
Vincent-Maladiere Sep 7, 2023
00ff1b7
improve first-lines docstring and see also
Vincent-Maladiere Sep 7, 2023
f1c5070
fix doc due to seaborn versioning misalignment
Vincent-Maladiere Sep 7, 2023
3a8f0ce
revert seaborn barplot rendering
Vincent-Maladiere Sep 7, 2023
4945b56
Merge branch 'main' into add_join_agg
Vincent-Maladiere Sep 25, 2023
ebe81f7
reuse the operations parsing logic from _pandas in _agg_joiner
Vincent-Maladiere Sep 25, 2023
7fd73b2
apply discussion #751
Vincent-Maladiere Sep 25, 2023
a12f60b
update the example
Vincent-Maladiere Sep 25, 2023
6343f88
enhance doc
Vincent-Maladiere Oct 2, 2023
95b8e4a
Merge branch 'main' into add_join_agg
Vincent-Maladiere Oct 4, 2023
48c6734
update the example
Vincent-Maladiere Oct 6, 2023
2 changes: 2 additions & 0 deletions .github/workflows/testing.yml
@@ -52,6 +52,8 @@ jobs:
- dependencies-version: "dev"
- dependencies-version: "dev, pyarrow"
python-version: "3.11"
- dependencies-version: "dev, polars"
python-version: "3.11"
- dependencies-version: "dev, min-py310"
python-version: "3.10"
dependencies-version-type: "minimal"
15 changes: 14 additions & 1 deletion CHANGES.rst
@@ -15,6 +15,11 @@ development and backward compatibility is not ensured.
Major changes
-------------

* :func:`dataframe.pd_join`, :func:`dataframe.pd_aggregate`,
:func:`dataframe.pl_join` and :func:`dataframe.pl_aggregate`
are now available in the dataframe submodule.
:pr:`733` by :user:`Vincent Maladiere <Vincent-Maladiere>`

* :class:`FeatureAugmenter` is renamed to :class:`Joiner`.
:pr:`674` by :user:`Jovan Stojanovic <jovan-stojanovic>`

@@ -32,6 +37,14 @@ Major changes
* Parallelized the :class:`GapEncoder` column-wise. Parameters `n_jobs` and `verbose`
added to the signature. :pr:`582` by :user:`Lilian Boulard <LilianBoulard>`

* New experimental feature :class:`AggJoiner`, a transformer performing
aggregation on auxiliary tables followed by left-joining on a base table.
:pr:`600` by :user:`Vincent Maladiere <Vincent-Maladiere>`.

* New experimental feature :class:`AggTarget`, a transformer performing
aggregation on the target y, followed by left-joining on a base table.
:pr:`600` by :user:`Vincent Maladiere <Vincent-Maladiere>`.

* Parallelized the :func:`deduplicate` function. Parameter `n_jobs`
added to the signature. :pr:`618` by :user:`Jovan Stojanovic <jovan-stojanovic>`
and :user:`Lilian Boulard <LilianBoulard>`
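
The :class:`AggJoiner` and :class:`AggTarget` entries in this hunk both describe an aggregate-then-left-join pattern. A minimal sketch of that pattern in plain pandas follows. It does not use the skrub API: the tables, the column names (`user_id`, `rating`), the suffixes and the helper `agg_then_left_join` are all hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical base table: one row per user.
main = pd.DataFrame({"user_id": [1, 2, 3], "age": [31, 45, 27]})

# Hypothetical auxiliary table: several rows per user, none for user 3.
aux = pd.DataFrame(
    {"user_id": [1, 1, 2, 2, 2], "rating": [4.0, 2.0, 5.0, 3.0, 4.0]}
)


def agg_then_left_join(main, aux, key, col, operation, suffix):
    """Aggregate `col` of `aux` per `key`, then left-join the result onto `main`."""
    agg = (
        aux.groupby(key, as_index=False)[col]
        .agg(operation)
        .rename(columns={col: f"{col}_{operation}{suffix}"})
    )
    # A left join keeps every row of `main`; keys absent from `aux` get NaN.
    return main.merge(agg, on=key, how="left")


# Aggregation on an auxiliary table, followed by a left join on the base table.
joined = agg_then_left_join(main, aux, "user_id", "rating", "mean", "_aux")

# Aggregation on the target y, followed by a left join on the base table.
X = pd.DataFrame({"user_id": [1, 1, 2, 3]})
y = pd.Series([10.0, 20.0, 30.0, 40.0], name="y")
target_table = X[["user_id"]].assign(y=y.to_numpy())
with_target = agg_then_left_join(X, target_table, "user_id", "y", "mean", "_target")
```

In both cases the base table keeps its row count, and each row gains one aggregated feature looked up by its key.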
@@ -187,7 +200,7 @@ Major changes
:pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`

* New experimental feature: :class:`FeatureAugmenter`, a transformer
- that augments with :func:`fuzzy_join` the number of features in a main table by using information from auxilliary tables.
+ that augments with :func:`fuzzy_join` the number of features in a main table by using information from auxiliary tables.
:pr:`409` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* Unnecessary API has been made private: everything (files, functions, classes)
2 changes: 1 addition & 1 deletion build_tools/github/install.sh
@@ -13,7 +13,7 @@ fi

pip install --progress-bar off --only-binary :all: --no-binary liac-arff --upgrade ".[$DEPS_VERSION]"

- if [[ "$DEPS_VERSION" != *"pyarrow"* ]]; then
+ if [[ "$DEPS_VERSION" != *"pyarrow"* && "$DEPS_VERSION" != *"polars"* ]]; then
# Since pyarrow is a dependency of pandas, we need to uninstall it explicitly
pip uninstall -y pyarrow
fi
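
The condition that mentions both `pyarrow` and `polars` skips the uninstall step whenever the requested extras contain either name. As a reading aid only (the CI script itself is Bash), the same substring test can be sketched in Python. The function name `should_uninstall_pyarrow` is hypothetical, and the sample values are the `dependencies-version` strings visible in the `testing.yml` hunk.

```python
def should_uninstall_pyarrow(deps_version: str) -> bool:
    """Mirror the shell test: true unless the extras mention pyarrow or polars."""
    return "pyarrow" not in deps_version and "polars" not in deps_version


# Sample values taken from the CI matrix shown in this diff.
decisions = {
    deps: should_uninstall_pyarrow(deps)
    for deps in ["dev", "dev, pyarrow", "dev, polars", "dev, min-py310"]
}
```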
44 changes: 44 additions & 0 deletions doc/api.rst
@@ -32,6 +32,9 @@ This page lists all available functions and classes of `skrub`.
:nosignatures:

Joiner
AggJoiner
AggTarget


.. raw:: html

@@ -84,6 +87,46 @@ This page lists all available functions and classes of `skrub`.

deduplicate

.. raw:: html

<h2>Dataframes operations</h2>

.. autosummary::
:toctree: generated/
:template: function.rst
:nosignatures:
:caption: DataFrames operations

dataframe.get_df_namespace

.. raw:: html

<h3>Pandas</h3>

.. autosummary::
:toctree: generated/
:template: function.rst
:nosignatures:
:caption: Pandas operations

dataframe.is_pandas
dataframe.pd_aggregate
dataframe.pd_join

.. raw:: html

<h3>Polars</h3>

.. autosummary::
:toctree: generated/
:template: function.rst
:nosignatures:
:caption: Polars operations

dataframe.is_polars
dataframe.pl_aggregate
dataframe.pl_join

.. raw:: html

<h2>Data download and generation</h2>
@@ -102,6 +145,7 @@ This page lists all available functions and classes of `skrub`.
datasets.fetch_traffic_violations
datasets.fetch_drug_directory
datasets.fetch_world_bank_indicator
datasets.fetch_movielens
datasets.fetch_ken_table_aliases
datasets.fetch_ken_types
datasets.fetch_ken_embeddings