FEAT Add the AggJoiner and AggTarget transformers #600

Merged
merged 54 commits into from
Oct 10, 2023

Member

@Vincent-Maladiere Vincent-Maladiere commented Jun 13, 2023

What does this PR introduce?

This draft proposes a first POC for the Join Aggregator. It aims at aggregating auxiliary tables before merging them on the base table, in a 1:N fashion.

Its API follows the same logic as the Feature Augmenter's:

agg_joiner = AggJoiner(
    tables=[
        (aux_1_large, "Country Name", ["GDP per capita (current US$)"]),
        (aux_2_large, "Country Name", ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode", "value_counts", "hist(4)"],
)
agg_joiner.fit_transform(df)

It currently supports Pandas DataFrames, Polars DataFrames, and Polars LazyFrames, so it can run lazily with Polars! My idea is to preserve the dataframe module of the input to avoid confusion, and to reject inputs that mix backends. The Polars dependency is of course optional.
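One way to enforce the single-backend rule described above is to compare the top-level module of every table's type. This is only a sketch: `get_backend` and `check_same_backend` are hypothetical helper names, not functions from this PR.

```python
def get_backend(table):
    """Return the top-level module that defines the table's type."""
    return type(table).__module__.split(".")[0]


def check_same_backend(main, aux_tables):
    """Return the shared backend name, or raise if the tables mix backends."""
    backends = {get_backend(table) for table in [main, *aux_tables]}
    if len(backends) != 1:
        raise TypeError(
            f"All tables must come from one dataframe module, got {sorted(backends)}."
        )
    return backends.pop()
```

With pandas inputs this should return `"pandas"`, and Polars DataFrames and LazyFrames should both resolve to `"polars"`, so eager and lazy Polars tables are not treated as a mix.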

My next step will be to benchmark the RAM consumption and time to run for these 3 different dataframes on bigger datasets.

Edit: Here is a demo of the Join Aggregator applied to feature engineering for RecSys within a Kaggle Competition!


How is it implemented?

The API of Join Aggregator is straightforward and relies on specialized implementations of an abstract AssemblingEngine class to handle both Pandas and Polars dataframes, namely PandasAssemblingEngine and PolarsAssemblingEngine.

In the absence of fully fledged tests, here is a working example:

import numpy as np
import pandas as pd
from skrub.datasets import fetch_world_bank_indicator
from skrub._join_aggregator import JoinAggregator


main = pd.read_csv(
    "https://raw.githubusercontent.com/dirty-cat/datasets/master/data/Happiness_report_2022.csv",
    thousands=",",
)
main = main[["Country", "Happiness score"]]

aux_1 = fetch_world_bank_indicator(indicator_id="NY.GDP.PCAP.CD").X
aux_2 = fetch_world_bank_indicator("SP.DYN.LE00.IN").X

# Duplicate rows to create 1:N conditions
def augmente(df, id_col, val_cols, n_repeat=5):
    dfs = []
    for val_col in val_cols:
        for id, el in df[[id_col, val_col]].values:
            repeated = np.random.normal(el, scale=el/100, size=n_repeat)
            df_ = pd.DataFrame({id_col: id, val_col: repeated})
            dfs.append(df_)
    return pd.concat(dfs)

aux_1_large = augmente(aux_1, "Country Name", ["GDP per capita (current US$)"])
aux_2_large = augmente(aux_2, "Country Name", ["Life expectancy at birth, total (years)"])

# Add a categorical column, arbitrarily
aux_2_large["country"] = aux_2_large["Country Name"]


# Pandas
join_agg = JoinAggregator(
    tables=[
        (aux_1_large, ["Country Name"], ["GDP per capita (current US$)"]),
        (aux_2_large, ["Country Name"], ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode"],
)
pandas_out = join_agg.fit_transform(main)

# Polars, eager
import polars as pl

join_agg = JoinAggregator(
    tables=[
        (pl.DataFrame(aux_1_large), ["Country Name"], ["GDP per capita (current US$)"]),
        (pl.DataFrame(aux_2_large), ["Country Name"], ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode"],
)
polars_eager_out = join_agg.fit_transform(pl.DataFrame(main))

# Polars, lazy
join_agg = JoinAggregator(
    tables=[
        (pl.DataFrame(aux_1_large).lazy(), ["Country Name"], ["GDP per capita (current US$)"]),
        (pl.DataFrame(aux_2_large).lazy(), ["Country Name"], ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode"],
)
polars_lazy_out = join_agg.fit_transform(pl.DataFrame(main).lazy())

WDYT?

cc @strayMat and his super helpful aggregation implementations here and here :)

Member

@GaelVaroquaux GaelVaroquaux left a comment

Thanks for the PR. This is exciting!!

I must confess that I am having a hard time reviewing this PR, as the GitHub diff is riddled with warnings from codecov complaining about missing test coverage.

The user-level API feels right to me, although without using it, I don't have a perfect feeling of the user experience.

I think that we should try to find a consistent naming across the FeatureAugmenter and the JoinAggregator. This may call for renaming the FeatureAugmenter

We need to write tests, as these will help us feel if we have the right API for the internal components: if things are easy to test, it's a good sign.

It would be nice to have an example where we don't duplicate rows to create the 1-to-many relation. Maybe @jovan-stojanovic can help here. Ideally, this example should also show that the aggregation is beneficial for prediction, and we should add it to the PR so that we can comment on it.

Points for later:

Actual support of polars will require us to support it in every single function and class of skrub (else the user will be confused). This will be a bit of work; in particular, it will require defining impedance matching / adaptation for many functionalities of pandas / polars.

I wonder if the pre-aggregation strategy is the right one. If my external table has many more entries on the common key than the main table, pre-aggregation will lead me to compute many aggregates that I don't need. We should note this and potentially address it later.

operators of the respective module.
"""
@classmethod
def get_for(cls, tables):
Member

I would prefer a function (a simple function, not a method or classmethod) as a factory to instantiate the engine. It leads to simpler code.

return num_ops, categ_ops


class AssemblingEngine:
Member

Maybe this should be called a "Dispatcher", as IMHO this construct is related to dispatching and the Dispatcher pattern.

Member Author

Hmm, what would this Dispatcher do if, according to your suggestion above, we replace get_for with a function?

Member

The "get_for" would be a function "make_agg_dispatcher" that would return a dispatcher. Not very different from what you currently have, but as a function rather than a classmethod
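The factory-function suggestion could look roughly like the sketch below. Only the name `make_agg_dispatcher` comes from the comment above; the two dispatcher classes, the `_DISPATCHERS` registry, and the module-based lookup are assumptions made for illustration.

```python
class PandasAggDispatcher:
    """Placeholder for the pandas-specific aggregation and join logic."""

    backend = "pandas"


class PolarsAggDispatcher:
    """Placeholder for the polars-specific aggregation and join logic."""

    backend = "polars"


# Hypothetical registry keyed by the top-level module of the table type.
_DISPATCHERS = {"pandas": PandasAggDispatcher, "polars": PolarsAggDispatcher}


def make_agg_dispatcher(tables):
    """Plain function (not a classmethod) returning the dispatcher for `tables`."""
    modules = {type(table).__module__.split(".")[0] for table in tables}
    if len(modules) != 1:
        raise TypeError(f"Expected tables from a single module, got {sorted(modules)}.")
    (module,) = modules
    if module not in _DISPATCHERS:
        raise TypeError(f"Unsupported dataframe module: {module!r}.")
    return _DISPATCHERS[module]()
```

Because the factory lives at module level, the dispatcher classes no longer need to know about each other, and the lookup can be tested without instantiating a transformer.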



def pandas_get_agg_ops(cols, agg_ops):
pandas_ops_mapping = {
Member

I think that you should make those dictionaries globals defined outside the function (and thus given all-caps names such as "PANDAS_CAT_OPS_MAPPING"). In the long term, having them accessible to other functions can be useful (not least for testing).

Member

Once this is done, I believe that the present function can be inlined where it is called, as it is very short (many very short functions make code harder to read, as they require memorizing a lot of indirections).
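As a rough illustration of the two suggestions above (module-level constants, then inlining the lookup), here is a backend-free sketch. The mapping names and their contents are hypothetical, and plain Python callables stand in for the pandas or polars operations the PR actually maps to.

```python
import statistics

# Module-level constants: importable from tests and from other functions.
# Hypothetical contents; the real mappings would point to pandas / polars ops.
NUM_OPS_MAPPING = {"mean": statistics.mean, "min": min, "max": max}
CATEG_OPS_MAPPING = {"mode": statistics.mode}


def aggregate(values, agg_ops, ops_mapping=NUM_OPS_MAPPING):
    """Apply each requested operation to `values` and return the results by name."""
    unknown = [op for op in agg_ops if op not in ops_mapping]
    if unknown:
        raise ValueError(f"Unsupported operations: {unknown}")
    # The lookup is short enough to be inlined here rather than kept in a helper.
    return {op: ops_mapping[op](values) for op in agg_ops}
```

A test can then assert directly on `NUM_OPS_MAPPING` or `CATEG_OPS_MAPPING` without calling any transformer code.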

Member Author

To stay consistent, we'd like to also inline the {pandas, polars}_get_agg_ops and {pandas, polars}_split_num_categ_cols functions.

However, some of these functions are long and might clutter the logic of the calling method.

Maybe we should define these functions as staticmethod? So that things are close to each other.

WDYT?

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email

@Vincent-Maladiere
Member Author

Thanks for this detailed feedback!

I must confess that I am having a hard time reviewing this PR, as the GitHub diff is riddled with warnings from codecov complaining about missing test coverage.

Whoops sorry I didn't notice. Adding skip-ci in my commit messages should help here.

I think that we should try to find a consistent naming across the FeatureAugmenter and the JoinAggregator. This may call for renaming the FeatureAugmenter

Yes, that sounds good. Something like "FuzzyJoiner"?

We need to write tests, as these will help us feel if we have the right API for the internal components: if things are easy to test, it's a good sign.

Yes, that's next on my todo :)

It would be nice to have an example where we don't duplicate rows to create the 1-to-many relation.

I just made a demo on Kaggle, the base table is huge (30M rows) and needs Polars lazy mode :)

When using models like LambdaMart or BoostingTrees for Recommender Systems, you always end up aggregating the base table | user_id | product_id | timestamp | by user_id and also by product_id before joining these two aggregated tables back to the main one: this is an ideal and very useful application for the Join Aggregator!

Please have a look at it:

https://www.kaggle.com/code/vincentmaladiere/h-m-recsys-feature-engineering-with-skrub-polars

Of course, we also need a proper example in the documentation, as you mentioned. Why not use a lighter version of MovieLens and perform recommendations in our documentation? :)

Points for later:

Actual support of polars will require us to support it in every single function and class of skrub (else the user will be confused). This will be a bit of work; in particular, it will require defining impedance matching / adaptation for many functionalities of pandas / polars.

Yes, how do you rank it as a priority? There is so much value in actually offering this, I'd be thrilled to start thinking about it very soon.

I wonder if the pre-aggregation strategy is the right one. If my external table has many more entries on the common key than the main table, pre-aggregation will lead me to compute many aggregates that I don't need. We should note this and potentially address it later.

Very good point. We could filter our tables before aggregation on the right keys during fit!
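The filtering idea can be sketched with plain pandas, outside of any transformer. The tables and values below are made up for illustration, and this is not the code of the PR.

```python
import pandas as pd

# Toy tables with made-up values: the aux table has keys the main table never uses.
main = pd.DataFrame({"Country": ["France", "Italy"]})
aux = pd.DataFrame(
    {
        "Country Name": ["France", "France", "Italy", "Japan", "Japan"],
        "GDP": [40.0, 42.0, 35.0, 39.0, 41.0],
    }
)

# Keep only the aux rows whose key appears in the main table, then aggregate.
needed = aux[aux["Country Name"].isin(main["Country"])]
aggregated = needed.groupby("Country Name", as_index=False)["GDP"].mean()

# Left join so that every row of the main table is preserved.
out = main.merge(aggregated, left_on="Country", right_on="Country Name", how="left")
print(out)
```

One caveat with doing this during fit: keys that only show up at transform time would have no aggregate, so they would need to be handled separately, for example by leaving their aggregated columns empty.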

@Vincent-Maladiere
Member Author

No, staticmethods are to be used only when they are important for inheritance reasons. Classes are not to be confused with modules. Classes serve to define inheritance. Objects serve to associate functions to data (attributes). Modules serve to group code (symbols) together.

What are you suggesting then?

@Vincent-Maladiere
Member Author

Vincent-Maladiere commented Jun 15, 2023

I've just noticed that our TableVectorizer already handles Polars input héhé (not lazily though)

import pandas as pd
import polars as pl
from skrub import TableVectorizer

main = pd.read_csv(
    "https://raw.githubusercontent.com/dirty-cat/datasets/master/data/Happiness_report_2022.csv",
    thousands=",",
)

tv = TableVectorizer()
tv.fit_transform(
    pl.DataFrame(main)
)

This is thanks to a fortunate cast to pandas during the fit_transform

        # Convert to pandas DataFrame if not already.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        else:
            # Create a copy to avoid altering the original data.
            X = X.copy()

A pleasant surprise 😄

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email

Member

@jovan-stojanovic jovan-stojanovic left a comment

Thanks for this great new feature @Vincent-Maladiere! I have a few comments on the example.

Member

@jeromedockes jeromedockes left a comment

just a first batch of small comments

@jeromedockes
Member

Maybe we should define these functions as staticmethod? So that things are close to each other.
No, staticmethods are to be used only when they are important for inheritance reasons. Classes are not to be confused with modules. Classes serve to define inheritance. Objects serve to associate functions to data (attributes). Modules serve to group code (symbols) together.

just for the sake of argument, having a class, say skrub.DataFrame, would allow us to:

  • move around the methods together with the dataframe, without having to call get_df_namespace every time we need a specialized function
  • describe explicitly the interface that _polars and _pandas must implement
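To make the second bullet concrete, an abstract base class is one way to state the interface each backend must implement. This is a sketch under assumptions: `DataFrameOps`, `PandasOps`, `IncompleteOps`, and the two method names are hypothetical, not the API discussed in the PR.

```python
from abc import ABC, abstractmethod


class DataFrameOps(ABC):
    """Hypothetical interface that each backend wrapper would have to implement."""

    def __init__(self, df):
        # The wrapped dataframe travels together with its specialized methods.
        self.df = df

    @abstractmethod
    def aggregate(self, key, cols, agg_ops):
        """Group by `key` and apply `agg_ops` to `cols`."""

    @abstractmethod
    def left_join(self, other, main_key, aux_key):
        """Left-join `other` onto this table."""


class PandasOps(DataFrameOps):
    """Pandas flavour, relying on the public groupby / merge API."""

    def aggregate(self, key, cols, agg_ops):
        return PandasOps(self.df.groupby(key)[cols].agg(agg_ops))

    def left_join(self, other, main_key, aux_key):
        return PandasOps(
            self.df.merge(other.df, left_on=main_key, right_on=aux_key, how="left")
        )


class IncompleteOps(DataFrameOps):
    """Forgets `left_join`, so Python refuses to instantiate it."""

    def aggregate(self, key, cols, agg_ops):
        raise NotImplementedError
```

Instantiating `IncompleteOps` raises a `TypeError`, which is how the abstract class makes a missing backend method visible early.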

@Vincent-Maladiere
Member Author

You're right. But it feels like we're going to reimplement the dataframe API in some way, WDYT?

@jeromedockes
Member

You're right. But it feels like we're going to reimplement the dataframe API in some way, WDYT?

I agree but isn't this already what we're doing, just with the dataframe submodules instead of DataFrame subclasses?

it is definitely true it feels like we're doing something too similar to ibis's or the dataframe API's objectives; what I understood from IRL discussions is we want to do it for a much more restricted/specialized set of operations until the Dataframe API covers all we need in skrub but I may have misunderstood

@Vincent-Maladiere
Member Author

I agree but isn't this already what we're doing, just with the dataframe submodules instead of DataFrame subclasses?

I understand your point, we're indeed already simulating our tiny dataframe API.

it is definitely true it feels like we're doing something too similar to ibis's or the dataframe API's objectives; what I understood from IRL discussions is we want to do it for a much more restricted/specialized set of operations until the Dataframe API covers all we need in skrub but I may have misunderstood

Yes, you're right, this is where we're heading. So I agree with what you say on the skrub.DataFrame, provided it doesn't add too much complexity and cost on our side.

@Vincent-Maladiere
Member Author

@GaelVaroquaux, does this new version match your requirements?

@Vincent-Maladiere
Member Author

I guess we can merge this now since we've converged on the design. #734 lists the next TODOs.

Member

@GaelVaroquaux GaelVaroquaux left a comment

Very very cool.

I left one small inline comment (I hope that it won't be too much to address), and I have a couple of comments on the docs / website:

Front page:

Replace on the front page:

"Joiner, a transformer for joining multiple tables together."

by

"Joiner, AggJoiner, transformers for joining multiple tables together."

Assembling docs

Rework the assembling narrative docs a tiny bit to list the AggJoiner and
AggTarget.

I think that the way that I would do this is by adding to the section
"Joining external tables for machine learning": where the Joiner is
mentioned, I would make a list with Joiner, AggJoiner and AggTarget,
quickly giving the differences between these.

aux_key="userId",
cols=timestamp_cols,
main_key="userId",
suffix="_user",
Member

Is specifying the suffix necessary for the example to work (not only here, but all over the example)?

If yes, can we think of a default for the suffix that makes it work here (and is not too strange / magic)

If no, I think that we should remove it from the example, to make the example a bit lighter.

Member Author

This is a relatively complete example. In most simple settings, users won't need to set a suffix.

When a suffix is set, the logic is the following:

  1. This suffix is added to each column of the aux table after aggregating it.
  2. Then, the join procedure is performed between the main table and the aggregated aux table, using the default suffix values, e.g., ("_x", "_y") for pandas.

Therefore, we won't have errors if we don't set the suffix in this example. However, we will have columns suffixed with variations of _x and _y, which will be hard to decipher.
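The collision described above can be reproduced with plain pandas by chaining two aggregate-then-join steps by hand. This is a sketch with made-up data; the `aggregate` helper and the `rating_mean` column name are assumptions, not the naming used by AggJoiner.

```python
import pandas as pd

# Toy ratings table standing in for the example's data (made-up values).
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2],
        "movieId": [10, 20, 10],
        "rating": [5.0, 3.0, 2.0],
    }
)
main = ratings[["userId", "movieId"]]


def aggregate(aux, key, suffix=""):
    """Average `rating` per `key` and append `suffix` to the new column name."""
    agg = aux.groupby(key, as_index=False)["rating"].mean()
    return agg.rename(columns={"rating": f"rating_mean{suffix}"})


# Without suffixes, the second join collides and pandas falls back to _x / _y.
collided = main.merge(aggregate(ratings, "userId"), on="userId", how="left")
collided = collided.merge(aggregate(ratings, "movieId"), on="movieId", how="left")
print(list(collided.columns))

# With explicit suffixes, the column names say which key each aggregate uses.
named = main.merge(aggregate(ratings, "userId", "_user"), on="userId", how="left")
named = named.merge(aggregate(ratings, "movieId", "_movie"), on="movieId", how="left")
print(list(named.columns))
```

The first result ends with `rating_mean_x` and `rating_mean_y`, while the second ends with `rating_mean_user` and `rating_mean_movie`, which is the readability gain the explicit suffixes buy.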

Note that if we input several tables without setting suffixes, we will automatically generate suffixes using the index of the auxiliary tables (_1, _2, ...).

However, if we call several AggJoiner like in this example, I'm afraid we can't make good default suffixes.

WDYT?

Member

I think the meaningful suffixes "_movie" and "_user" actually help understand what we are doing and what we see in the cells' outputs (although that part is a bit hard still because the tables are quite wide)

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 2, 2023 via email

@Vincent-Maladiere
Member Author

Vincent-Maladiere commented Oct 2, 2023

This is a relatively complete example. In most simple settings, users won't need to set a suffix.
OK, but in the current situation the website will convey the impression that it is very complicated to use and will scare people away. I'm trying to act on this as much as I can by removing every possible element of complexity. It's a pity that we cannot have heuristics that avoid collisions and that we need to set suffixes in this example.

We have heuristics that avoid collisions, but these will be very un-informative in this example (what is _x, what is _y). We could do magic stuff with metaclasses or caching to replace those with _1 and _2, but I'm not eager to go down that road.

@Vincent-Maladiere
Member Author

Alternatively, we can simplify the example by removing one of the AggJoiner and AggTarget and hope it does not degrade the already weak performance too much.

@jeromedockes
Member

degrade the already weak performance too much.

as the baseline have you tried the same estimator but without joining any auxiliary tables?

@Vincent-Maladiere
Member Author

degrade the already weak performance too much.

as the baseline have you tried the same estimator but without joining any auxiliary tables?

Yes, it gives chance-level performance, with zero R2.
We already have some baselines that bring predictive power.

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 2, 2023 via email

@Vincent-Maladiere
Member Author

Here is a quick ablation study.

Full pipeline

pipeline = make_pipeline(
    table_vectorizer,
    agg_joiner_user,
    agg_joiner_movie,
    agg_target_user,
    agg_target_movie,
    HistGradientBoostingRegressor(learning_rate=0.1, max_depth=4, max_iter=40),
)

Without agg_joiner_movie and agg_joiner_user

Without agg_joiner_movie and agg_joiner_user and agg_target_user

Without agg_joiner_movie and agg_joiner_user and agg_target_movie


Conclusions:

  • neither AggJoiner improves performance
  • both AggTarget play a significant role

So, we can remove the AggJoiner, but we have to keep both AggTarget.
Are we happy with this simplification? I know it doesn't showcase AggJoiner anymore, but we can find another example that will.

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 4, 2023 via email

@Vincent-Maladiere
Member Author

It's done! LMK what you think :)

@Vincent-Maladiere
Member Author

@GaelVaroquaux should we merge this now?

Member

@GaelVaroquaux GaelVaroquaux left a comment

LGTM. Merging, thanks!

@GaelVaroquaux GaelVaroquaux merged commit f332ca6 into skrub-data:main Oct 10, 2023
@GaelVaroquaux
Member

🎉

@Vincent-Maladiere Vincent-Maladiere deleted the add_join_agg branch November 9, 2023 16:34
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants