-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Switch estimator operator logical checks to be interface based rather than inheritance based #856
Comments
We will be happy to integrate such changes. We already adopt this strategy in Regarding taking profit from other array-like libraries (daks, cuML), we are also happy to look at it. I think that we still should make it optional (we don't want to force a user to install any of the dask or cuML by default). When working on #777 where I wanted to check what it takes to support dask array for Bottom line, we would be happy to have the contributions, until we can get a couple of tests to check that everything work with the backend on one of the CI build. |
Thank you for the positive response! I will open up a pull request for further discussion. |
It would be awesome if I could use cuml because for now with a hige dataset in a standard method I took too much to oversampling thank you so much |
General Introduction and Relevance
Currently, many imbalanced-learn samplers require an estimator explicitly inherits from Scikit-learn Mixin classes in order to determine the nature of an estimator operator; often this is achieved with the method check_neighbors_object() .
This Scikit-learn inheritance based programming model provides consistency, but also limits flexibility. Switching these checks to be duck-typing based rather than inheritance based would still preserve consistency, but also allow users to use other scikit-learn API-compliant libraries with imbalanced-learn that cannot directly subclass scikit-learn, such as cuML . Using cuML with imbalanced-learn can be significantly faster, as shown in the examples below that achieve roughly 180x and 680x speedups with the example configuration.
Additional Context
A key limitation is that users cannot cleanly integrate estimators from libraries that do not directly depend on scikit-learn, but enforce the same API contract.
The use of a duck typing based programming model would allow imbalanced-learn to be more flexible; duck typing in PyData is becoming increasingly powerful as more libraries are standardizing on numpy and scikit-learn like API contracts rather than inheritance (Dask, xarray, Pangeo, CuPy, RAPIDS, and more).
The TPOT community recently adopted duck-typing to allow GPU-accelerated CUML estimators and have seen great results: https://github.com/EpistasisLab/tpot/blob/master/tutorials/Higgs_Boson.ipynb
Proposal
In check_neighbors_object(), convert the Mixin class check to instead check key attributes that determine whether an object is a certain type of estimator:
Instead of KNeighborsMixin, check for the key attributes that determine whether an object is KNeighborsMixin-like
If there is potential interest, I'd be happy to open up a pull request for further discussion.
Potential Impact
Adopting duck typing for these checks would immediately allow using cuML (a GPU-accelerated machine learning library with a scikit-learn-like API) with imbalanced-learn. It would also open the door to other libraries (current and future), without requiring imbalanced-learn to have any knowledge or involvement.
As a concrete example of the benefit, using cuML estimators on the GPU instead of scikit-learn can provide a large speedup. Faster estimators can make a significant difference in the total runtime. The benchmark graphs below provide a couple small examples (reproducing gist provided at the bottom) which correspond to 180x and 680x speedups using cuML on a GPU vs the default scikit-learn with CPU cores.
There are a number of samplers that this proposed change would affect; allowing for the integration of estimators that do not directly subclass scikit-learn. This list includes: ADASYN, SMOTE, BorderlineSMOTE, SMOTENC, EditedNearestNeighbours, and RepeatedEditedNearestNeighbours samplers.
Samplers, such as CondensedNearestNeighbour, KMeansSMOTE, SMOTEN, SMOTEENN, and others, incorporate additional steps in their _validate_estimator() methods, and will require additional modifications to enable their seamless integration with estimators that do not directly subclass scikit-learn.
Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698, 2.2 GHz, 16-cores & NVIDIA V100 32 GB GPU
Benchmarking Code:
https://gist.github.com/NV-jpt/276be3fe57b0ca384dbdabeba4a7e643
The text was updated successfully, but these errors were encountered: