ENH make Random*Sampler accept dask array and dataframe #777
base: master
Conversation
Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-11-08 19:22:54 UTC
This pull request introduces 2 alerts when merging 7aae9d9 into edd7522 - view on LGTM.com new alerts:
Looks good overall. I think the comment about `dask.compute(...)` rather than `x.compute(), y.compute()` is the most important. Other than that, I tried to share some of the difficulties I've run into with Dask-ML, but things look nice overall.
imblearn/dask/_support.py (Outdated)

```python
_REGISTERED_DASK_CONTAINER = []

try:
    from dask import array, dataframe
```
People can have just `dask[array]` installed (not dataframe), so it's possible to have the `array` import succeed but the `dataframe` import fail. So if you want to support that case, those would need to be in separate try/except blocks.

Maybe you instead want `from dask import is_dask_collection`? That's a bit broader, though (it also covers anything implementing dask's collection interface, like `dask.Bag`, `xarray.DataArray`, and `xarray.Dataset`).
That seems to be what I wanted :)
```python
if hasattr(y, "unique"):
    labels = np.asarray(y.unique())
else:
    labels = np.unique(y).compute()
```
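The branch above relies only on duck typing, so it can be sketched without dask installed. `unique_labels` and `FakeSeries` are hypothetical names for illustration; for a real dask array, the `np.unique` result would additionally need a `.compute()` call:

```python
import numpy as np

def unique_labels(y):
    # pandas/dask Series expose .unique(); plain arrays go through
    # np.unique (a dask array would still need .compute() afterwards).
    if hasattr(y, "unique"):
        return np.asarray(y.unique())
    return np.unique(y)

class FakeSeries:
    # Stand-in for a pandas/dask Series, just to exercise the duck typing.
    def unique(self):
        return [2, 0, 1]
```

A NumPy array has no `unique` method, so it takes the `np.unique` path, while the Series-like object is handled through its own `unique()`.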
I've struggled with this check in dask-ml. Depending on where it's called, it's potentially very expensive (you might be loading a ton of data just to check if it's multi-label, and then loading it again to do the training). Whenever possible, it's helpful to provide an option to skip this check by having the user specify it when creating the estimator, or in a keyword to `fit` (dunno if that applies here).
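One way to sketch the suggested opt-out is an estimator flag consulted before the expensive check. The class and parameter names below are purely illustrative, not imblearn's actual API:

```python
import numpy as np

class HypotheticalSampler:
    """Sketch: let the user skip an expensive target check at construction."""

    def __init__(self, validate_target=True):
        self.validate_target = validate_target

    def fit_resample(self, X, y):
        if self.validate_target:
            # Potentially expensive with lazy containers: this would
            # materialize all of y just to count the classes.
            if len(np.unique(np.asarray(y))) < 2:
                raise ValueError("y needs at least two classes")
        # ... the actual resampling would happen here ...
        return X, y
```

Users who already know their target is well-formed can pass `validate_target=False` and avoid materializing `y` twice.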
I thought about it. Do you think that having a context manager outside would make sense:

```python
with set_config(avoid_check=True):
    # some imblearn/scikit-learn/dask code
```

Though, we might get into trouble with issues related to scikit-learn/scikit-learn#18736. It might just be easier to have an optional class parameter that applies only for dask arrays.
```python
force_all_finite=False,
```

```python
if is_dask_container(y) and hasattr(y, "to_dask_array"):
    y = y.to_dask_array()
    y.compute_chunk_sizes()
```
In Dask-ML we (@stsievert I think? maybe me?) prefer to have the user do this: https://github.com/dask/dask-ml/blob/7e11ce1505a485104e02d49a3620c8264e63e12e/dask_ml/utils.py#L166-L173. If you're just fitting the one estimator, then this is probably equivalent. If you're doing something like a `cross_val_score`, then I think this would end up loading data multiple times just to compute the chunk sizes.
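For reference, whether chunk sizes are unknown can be detected without executing any computation, because dask encodes unknown sizes as NaN inside an array's `.chunks` tuples. A small stdlib-only sketch (the helper name is made up):

```python
import math

def has_unknown_chunk_sizes(chunks):
    # A dask array's .chunks is a tuple of per-dimension tuples of sizes,
    # e.g. ((2, 2), (3,)); NaN marks sizes unknown until the graph runs.
    return any(math.isnan(size) for dim in chunks for size in dim)
```

An estimator can use a check like this to raise early and ask the user to call `compute_chunk_sizes()` themselves, instead of silently triggering the computation.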
This is something that I was unsure of, here. If I recall, the issue was that I could not call `ravel` on the Series, and this was the easiest way to always have an array and convert back to a Series reusing the metadata. However, if we assume that the checks are too expensive to be done in a distributed setting, we don't need to call the check below and we can directly pass the Series and handle it during the resampling. So we have fewer safeguards, but at least it is more performant, which is something you probably want in a distributed setting.
imblearn/utils/_validation.py (Outdated)

```python
if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()
```
As written, this will fully execute the task graph of `y` twice: once to compute `unique` and once to compute `counts`.

Suggested change:

```python
if is_dask_container(unique):
    unique, counts = dask.compute(unique, counts)
```

You'll need to import dask.
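To illustrate why the single batched compute matters, here is a dask-free toy in which a counter stands in for task-graph execution. Separate "computes" re-run the shared work; deriving both results from one materialization, as `dask.compute(unique, counts)` does for a shared graph, runs it once:

```python
from collections import Counter

calls = {"executions": 0}

def materialize_y():
    # Stand-in for executing the task graph behind a dask-backed y.
    calls["executions"] += 1
    return [1, 0, 1, 1, 0]

# Two separate "compute" calls: the underlying work runs twice.
unique = sorted(set(materialize_y()))
counts = Counter(materialize_y())
two_passes = calls["executions"]

# Batched, like dask.compute(unique, counts): the shared work runs once.
calls["executions"] = 0
y = materialize_y()
unique, counts = sorted(set(y)), Counter(y)
one_pass = calls["executions"]
```

With real dask collections the batching happens inside the scheduler, which merges the two task graphs and evaluates their common ancestors a single time.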
This pull request introduces 5 alerts when merging d4aabf8 into edd7522 - view on LGTM.com new alerts:
This pull request introduces 5 alerts when merging 58acdf2 into edd7522 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging e54c772 into edd7522 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging 167fc2a into edd7522 - view on LGTM.com new alerts:
Force-pushed from 167fc2a to ec4a717, then from ec4a717 to 20b44c6.
Codecov Report: Attention: Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master     #777      +/-   ##
==========================================
+ Coverage   96.55%   98.18%   +1.63%
==========================================
  Files          82       94      +12
  Lines        5140     5900     +760
  Branches        0      515     +515
==========================================
+ Hits         4963     5793     +830
+ Misses        177      100      -77
- Partials        0        7       +7
```

View full report in Codecov by Sentry.
This pull request introduces 2 alerts when merging 20b44c6 into edd7522 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging a6e975b into edd7522 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging 456c3eb into edd7522 - view on LGTM.com new alerts:
@glemaitre Why didn't you merge this branch with master? Everything seems alright, doesn't it?
It is one year old, so I don't recall the details. It was only a POC to see what the issues to deal with would be. It would need further work to go ahead.
I think it would be a pity if it doesn't go in. Regarding this comment, we can't really do much about it except avoiding calling this method. Happy to help if there is anything else that needs to be done :)
POC to see if we can make at least the `RandomUnderSampler` and `RandomOverSampler` accept some dask array and dataframe.

Note: