[QST] Is there any feature for undersampling or oversampling like scikit-learn-contrib/imbalanced-learn? #4362

Closed
rrfaria opened this issue Nov 14, 2021 · 6 comments
Labels
? - Needs Triage, question

Comments


rrfaria commented Nov 14, 2021

I could not find anything equivalent to imbalanced-learn in cuML.
I'm using something like this:

from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
...
# x holds the features, y holds the class labels
x_resampled, y_resampled = SMOTE().fit_resample(x, y)
# or, for undersampling:
x_resampled, y_resampled = NearMiss().fit_resample(x, y)

But I would like to use cuML to speed this up, because with a large amount of data it takes a long time.

Is there any method I could use to do this?

rrfaria added the "? - Needs Triage" and "question" labels Nov 14, 2021
Member

beckernick commented Nov 15, 2021

Hi @rrfaria. Today, there isn't a simple way to do this.

We're excited about this use case, as we've also seen that nuanced oversampling and undersampling on CPUs can be very time-consuming.

We're currently working with the imbalanced-learn maintainers on a pull request that would allow you to use cuML estimators with imbalanced-learn, like this:

import cuml
from imblearn.over_sampling import SMOTE
...
# Pass a cuML nearest-neighbors estimator so the neighbor search runs on the GPU
nn = cuml.neighbors.NearestNeighbors()
x_resampled, y_resampled = SMOTE(k_neighbors=nn).fit_resample(x, y)

If accelerated imbalanced-learn is important for your work, it would be great if you could comment on this imbalanced-learn issue to indicate your interest in this effort.
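To illustrate why the neighbor search is the piece worth swapping out, here is a minimal, standard-library-only sketch of the SMOTE idea: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbors. This is not imbalanced-learn's implementation; `brute_force_kneighbors` and `smote_sketch` are hypothetical names used only for illustration, and the brute-force search stands in for whatever neighbors estimator is plugged in.

```python
import math
import random


def brute_force_kneighbors(points, k):
    """For each point, return the indices of its k nearest other points.

    This O(n^2) search is the expensive step; it plays the role that a
    pluggable nearest-neighbors estimator would take over.
    """
    result = []
    for i, p in enumerate(points):
        by_distance = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(by_distance[:k])
    return result


def smote_sketch(minority, n_new, k=5, kneighbors=brute_force_kneighbors, seed=0):
    """Create n_new synthetic points by interpolating toward random neighbors."""
    rng = random.Random(seed)
    neighbors = kneighbors(minority, k)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))   # pick a minority sample
        j = rng.choice(neighbors[i])       # pick one of its neighbors
        gap = rng.random()                 # position along the segment i -> j
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

The interpolation loop is cheap; the cost grows with the neighbor search, which is why handing that step to a GPU estimator is where a speedup would be expected.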

Author

rrfaria commented Nov 16, 2021

Thank you so much.
It will help a lot.
Let me know if I can help with anything.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@antno1000

Any update?

@beckernick
Member

The relevant code has been merged into imbalanced-learn, so the code snippet above now works when using imbalanced-learn built from source. It's not yet available in pip/conda installations of imbalanced-learn, but will be in the next release.

Based on initial testing, it's possible to achieve large speedups on samplers as data sizes grow.

I'm going to close this issue. If you build imbalanced-learn from source and run into any issues using it with cuML, please feel free to re-open this issue.
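For anyone trying this across machines with and without a GPU, one possible pattern is to pick the neighbors estimator at runtime and fall back to scikit-learn when cuML is not installed. This is only a sketch: `pick_neighbors_backend` and `make_neighbors` are hypothetical helper names, and `n_neighbors=6` rests on the assumption (worth verifying against your imbalanced-learn version) that SMOTE discards the first returned neighbor, the sample itself, so six neighbors would correspond to the default `k_neighbors=5`.

```python
import importlib.util


def pick_neighbors_backend():
    """Return 'cuml' if it is importable, else 'sklearn', else None."""
    for name in ("cuml", "sklearn"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def make_neighbors(n_neighbors=6):
    """Build a NearestNeighbors estimator from the best available backend.

    Assumption: n_neighbors includes the query sample itself.
    """
    backend = pick_neighbors_backend()
    if backend == "cuml":
        from cuml.neighbors import NearestNeighbors
    elif backend == "sklearn":
        from sklearn.neighbors import NearestNeighbors
    else:
        return None
    return NearestNeighbors(n_neighbors=n_neighbors)


backend = pick_neighbors_backend()
# With an imbalanced-learn build that accepts estimator objects:
# nn = make_neighbors()
# x_resampled, y_resampled = SMOTE(k_neighbors=nn).fit_resample(x, y)
```

Keeping the choice in one helper means the resampling call itself stays identical on CPU-only and GPU machines.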
