FEAT Add the AggJoiner and AggTarget transformers #600
Conversation
Thanks for the PR. This is exciting!!
I must confess that I am having a hard time reviewing this PR, as the github diff is riddled with warnings from codecov complaining about missing test coverage.
The user-level API feels right to me, although without using it, I don't have a perfect feeling of the user experience.
I think that we should try to find a consistent naming across the FeatureAugmenter and the JoinAggregator. This may call for renaming the FeatureAugmenter
We need to write tests, as these will help us feel if we have the right API for the internal components: if things are easy to test, it's a good sign.
It would be nice to have an example where we don't duplicate rows to create the one-to-many relation. Maybe @jovan-stojanovic can help here. Ideally, this example should also show that the aggregation is beneficial for prediction, and we should add it to the PR so that we can comment on it.
Points for later:
Actual support of polars will require us to support it in every single function and class of skrub (else the user will be confused). This will be a bit of work; in particular it will require defining impedance matching / adaptation for many functionalities of pandas / polars.
I wonder if the pre-aggregation strategy is the right one. If my external table has many more entries on the common key than the main table, pre-aggregation will lead me to compute many aggregates that I don't need. We should note this and potentially address it later.
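One way to avoid that wasted work, sketched here in plain pandas with made-up toy tables (not skrub code): filter the auxiliary table down to the keys that actually appear in the main table before aggregating.

```python
import pandas as pd

main = pd.DataFrame({"userId": [1, 2]})
aux = pd.DataFrame(
    {"userId": [1, 1, 2, 3, 4], "price": [10.0, 20.0, 30.0, 40.0, 50.0]}
)

# Restrict to keys present in the main table, so we never compute
# aggregates that the subsequent join would discard anyway.
filtered = aux[aux["userId"].isin(main["userId"])]
agg = filtered.groupby("userId")["price"].mean().reset_index()
print(agg)
#    userId  price
# 0       1   15.0
# 1       2   30.0
```

Whether the filter pays off depends on the key cardinalities, so it would likely be an opt-in optimization rather than the default.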
skrub/_join_aggregator.py (outdated)
```python
    operators of the respective module.
    """
    @classmethod
    def get_for(cls, tables):
```
I would prefer a function (a simple function, not a method or classmethod) as a factory to instantiate the engine. It leads to simpler code.
skrub/_join_aggregator.py (outdated)
```python
    return num_ops, categ_ops


class AssemblingEngine:
```
Maybe this should be called a "Dispatcher", as IMHO this construct is related to dispatching and the Dispatcher pattern.
Hmm, what would this Dispatcher do if, according to your suggestion above, we replace `get_for` with a function?
The `get_for` would become a function `make_agg_dispatcher` that returns a dispatcher. Not very different from what you currently have, but as a function rather than a classmethod.
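A minimal sketch of that shape (hypothetical names and method bodies, not skrub's actual API): the factory is a plain module-level function that picks the engine from the backend of the input tables.

```python
import pandas as pd

# Hypothetical sketch: a plain function as the factory, instead of a
# classmethod on the engine base class.
class PandasAssemblingEngine:
    def agg(self, table, key):
        return table.groupby(key).mean()

class PolarsAssemblingEngine:
    def agg(self, table, key):
        return table.group_by(key).mean()

def make_agg_dispatcher(tables):
    """Return the engine matching the dataframe backend of `tables`."""
    module = type(tables[0]).__module__
    if module.startswith("pandas"):
        return PandasAssemblingEngine()
    if module.startswith("polars"):
        return PolarsAssemblingEngine()
    raise TypeError(f"Unsupported dataframe type: {type(tables[0])!r}")

engine = make_agg_dispatcher([pd.DataFrame({"k": [1, 1], "v": [2.0, 4.0]})])
```

The function keeps the dispatch logic in one place without tying it to the class hierarchy.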
skrub/_join_aggregator.py (outdated)
```python
def pandas_get_agg_ops(cols, agg_ops):
    pandas_ops_mapping = {
```
I think that you should make those dictionaries globals defined outside the function (and thus given all-caps names, e.g. `PANDAS_CAT_OPS_MAPPING`). In the long term, having them accessible to other functions can be useful (let alone for testing).
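For instance, a hedged sketch of the suggested layout (the mapping names and operation sets are illustrative, not skrub's actual ones):

```python
import numpy as np

# Module-level globals: importable by other functions and directly testable.
PANDAS_NUM_OPS_MAPPING = {
    "mean": np.mean,
    "std": np.std,
    "sum": np.sum,
}
PANDAS_CAT_OPS_MAPPING = {
    "mode": lambda s: s.mode().iat[0],
    "count": len,
}

def get_agg_ops(cols, agg_ops, mapping=PANDAS_NUM_OPS_MAPPING):
    """Resolve operation names to callables, failing loudly on unknown names."""
    unknown = set(agg_ops) - set(mapping)
    if unknown:
        raise ValueError(f"Unknown aggregation operations: {sorted(unknown)}")
    return {col: [mapping[op] for op in agg_ops] for col in cols}
```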
Once this is done, I believe that the present function can be inlined into where it is called, as it is very short (many very short functions make code harder to read, as they require memorizing a lot of indirections).
To stay consistent, we'd like to also inline the `{pandas, polars}_get_agg_ops` and `{pandas, polars}_split_num_categ_cols` functions. However, some of these functions are long and might clutter the logic of the calling method. Maybe we should define these functions as staticmethods? So that things are close to each other. WDYT?
> Maybe we should define these functions as staticmethods? So that things are close to each other.
No, staticmethods are to be used only when they are important for inheritance reasons. Classes are not to be confused with modules. Classes serve to define inheritance. Objects serve to associate functions to data (attributes). Modules serve to group code (symbols) together.
Thanks for this detailed feedback!
Whoops sorry I didn't notice. Adding
Yes, that sounds good. Something like "FuzzyJoiner"?
Yes, that's next on my todo :)
I just made a demo on Kaggle: the base table is huge (30M rows) and needs Polars lazy mode :) When using models like LambdaMART or boosting trees for recommender systems, you always end up aggregating the base table. Please have a look at it: https://www.kaggle.com/code/vincentmaladiere/h-m-recsys-feature-engineering-with-skrub-polars

Of course, we also need a proper example in the documentation, as you mentioned. Why not use a lighter version of MovieLens and perform recommendations in our documentation? :)

> Points for later:
Yes, how do you rank it as a priority? There is so much value in actually offering this, I'd be thrilled to start thinking about it very soon.
Very good point. We could filter our tables before aggregation on the right keys during
What are you suggesting then?
I've just noticed that our TableVectorizer already handles Polars input héhé (not lazily though):

```python
import pandas as pd
import polars as pl

from skrub import TableVectorizer

main = pd.read_csv(
    "https://raw.githubusercontent.com/dirty-cat/datasets/master/data/Happiness_report_2022.csv",
    thousands=",",
)
tv = TableVectorizer()
tv.fit_transform(
    pl.DataFrame(main)
)
```

This is thanks to a fortunate cast to pandas:

```python
# Convert to pandas DataFrame if not already.
if not isinstance(X, pd.DataFrame):
    X = pd.DataFrame(X)
else:
    # Create a copy to avoid altering the original data.
    X = X.copy()
```

A pleasant surprise 😄
> I've just noticed that our TableVectorizer already handles Polars input héhé
Good. As we move to supporting polars, we are going to have to: first, test aggressively that all our objects and functions do support polars; second, optimize them to avoid casting when not necessary (as it will induce copies); and third, support lazy polars.
> Yes, that sounds good. Something like "FuzzyJoiner"?
Maybe just "Joiner"? I think that we will probably want to add fuzziness to almost everything that we do, so we shouldn't start putting it in every name.
> I just made a demo on Kaggle, the base table is huge (30M rows) and needs Polars lazy mode :)
> https://www.kaggle.com/code/vincentmaladiere/h-m-recsys-feature-engineering-with-skrub-polars
Nice! Great demonstration of the vision for skrub, this is material for an absolutely great example. Any chance that we can make this light and fast enough to be in our documentation? I would love to have it in our main docs, and to progressively build our features and API to keep making this example simpler and simpler.
> Yes, how do you rank it as a priority? There is so much value in actually offering this, I'd be thrilled to start thinking about it very soon.
Let's merge the present PR first :). After this, why not :).
> What are you suggesting then?
Plain functions for now, and long term maybe two modules
Thanks for this great new feature @Vincent-Maladiere! I have a few comments on the example.
just a first batch of small comments
just for the sake of argument, having a class, say
You're right. But it feels like we're going to reimplement the dataframe API in some way, WDYT?
I agree, but isn't this already what we're doing, just with the

it is definitely true it feels like we're doing something too similar to ibis's or the dataframe API's objectives; what I understood from IRL discussions is that we want to do it for a much more restricted/specialized set of operations until the Dataframe API covers all we need in skrub, but I may have misunderstood
I understand your point, we're indeed already simulating our tiny dataframe API.
Yes, you're right, this is where we're heading. So I agree with what you say on the
@GaelVaroquaux, does this new version match your requirements?
I guess we can merge this now since we've converged on the design. #734 lists the next TODOs.
Very very cool.
I left one small inline comment (I hope that it won't be too much to address), and I have a couple of comments on the docs / website:
Front page:
Replace on the front page:
"Joiner, a transformer for joining multiple tables together."
by
"Joiner, AggJoiner, transformers for joining multiple tables together."
Assembling docs
Rework a tiny bit the assembling narrative docs to list the AggJoiner and AggTarget. The way that I would do this is by adding to the section "Joining external tables for machine learning": where the Joiner is mentioned, I would do a list with Joiner, AggJoiner and AggTarget, quickly giving the differences between these.
examples/08_join_aggregation.py (outdated)
```python
    aux_key="userId",
    cols=timestamp_cols,
    main_key="userId",
    suffix="_user",
```
Is specifying the suffix necessary for the example to work (not only here, but all over the example)?
If yes, can we think of a default for the suffix that makes it work here (and is not too strange / magic)?
If no, I think that we should remove it from the example, to make the example a bit lighter.
This is a relatively complete example. In most simple settings, users won't need to set a suffix.
When a suffix is set, the logic is the following:
- The suffix is added to each column of the `aux` table after aggregating it.
- Then, the join is performed between the main table and the aggregated aux table, using the default suffix values, e.g., `("_x", "_y")` for pandas.

Therefore, we won't have errors if we don't set the suffix in this example. However, we will have columns suffixed with variations of `_x` and `_y`, which will be hard to decipher.
Note that if we input several tables without setting suffixes, we will automatically generate suffixes using the index of the auxiliary tables (`_1`, `_2`, ...).
However, if we call several `AggJoiner`s like in this example, I'm afraid we can't make good default suffixes.
WDYT?
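The fallback behaviour being described can be seen with plain pandas on toy data (made-up column names):

```python
import pandas as pd

main = pd.DataFrame({"userId": [1, 2], "rating": [3.0, 4.5]})
agg = pd.DataFrame({"userId": [1, 2], "rating": [3.5, 4.0]})

# Without explicit suffixes, pandas falls back to ("_x", "_y") for
# overlapping column names, which is hard to decipher once several
# joins pile up.
joined = main.merge(agg, on="userId")
print(list(joined.columns))  # ['userId', 'rating_x', 'rating_y']
```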
I think the meaningful suffixes "_movie" and "_user" actually help understand what we are doing and what we see in the cells' outputs (although that part is a bit hard still because the tables are quite wide)
> This is a relatively complete example. In most simple settings, users won't need to set a suffix.
OK, but in the current situation the website will convey the impression that it is very complicated to use and will scare people away. I'm trying to act on this as much as I can by removing every possible element of complexity.
It's a pity that we cannot have heuristics that avoid collisions and that we need to set suffixes in this example.
We have heuristics that avoid collisions, but these will be very uninformative in this example (what is
Alternatively, we can simplify the example by removing one of the AggJoiner and AggTarget and hope it does not degrade the already weak performance too much.
As a baseline, have you tried the same estimator but without joining any auxiliary tables?
Yes, it gives random performance, with zero R².
> Alternatively, we can simplify the example by removing one of the AggJoiner and AggTarget and hope it does not degrade the already weak performance too much.
If the performance doesn't degrade too much, I think that this would be great.
So, we can remove the `AggJoiner`, but we have to keep both `AggTarget`s.
Are we happy with this simplification? I know it doesn't showcase `AggJoiner` anymore, but we can find another example that will.
Yes!! Thank you for doing this.
Maybe add a note in the example saying that if we wanted to aggregate on a value of X rather than y, we have the AggJoiner.
It's done! LMK what you think :)
@GaelVaroquaux should we merge this now?
LGTM. Merging, thanks!
🎉
What does this PR introduce?
This draft proposes a first POC for the Join Aggregator. It aims at aggregating auxiliary tables before merging them with the base table, in a 1:N fashion.
Its API follows a logic consistent with that of the FeatureAugmenter:
It currently supports Pandas DataFrames, Polars DataFrames, and Polars LazyFrames, and can therefore be run lazily with Polars! My idea is to preserve the dataframe module of the input to avoid confusion, and to refuse dataframes that mix backends. The Polars dependency is of course optional.
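A minimal sketch of that mixed-backend check (a hypothetical helper, not the actual implementation):

```python
import pandas as pd

def check_consistent_backend(tables):
    """Hypothetical helper: refuse inputs that mix dataframe backends."""
    backends = {type(t).__module__.split(".")[0] for t in tables}
    if len(backends) > 1:
        raise TypeError(
            "All tables must come from the same dataframe library, "
            f"got: {sorted(backends)}"
        )
    return backends.pop()

backend = check_consistent_backend([pd.DataFrame({"a": [1]})])
print(backend)  # pandas
```

Inspecting the top-level module name keeps the check cheap and avoids importing polars when it isn't installed.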
My next step will be to benchmark the RAM consumption and runtime for these 3 different dataframe types on bigger datasets.
Edit: Here is a demo of the Join Aggregator applied to feature engineering for RecSys within a Kaggle Competition!
How is it implemented?
The API of the Join Aggregator is straightforward and relies on specialized implementations of an abstract `AssemblingEngine` class to handle both Pandas and Polars dataframes, namely `PandasAssemblingEngine` and `PolarsAssemblingEngine`. In the absence of fully fledged tests, here is a working example:
WDYT?
cc @strayMat and his super helpful aggregation implementations here and here :)