[ENH] Switch estimator operator logical checks to be interface based rather than inheritance based #856
Comments
We will be happy to integrate such changes. We already adopt this strategy in […]. Regarding benefiting from other array-like libraries (dask, cuML), we are also happy to look at it. I think that we should still make it optional (we don't want to force a user to install dask or cuML by default). When working on #777, I wanted to check what it takes to support dask arrays for […]. Bottom line: we would be happy to have the contributions, as long as we can get a couple of tests to check that everything works with the backend on one of the CI builds.
Thank you for the positive response! I will open up a pull request for further discussion.
It would be awesome if I could use cuML: right now, oversampling a huge dataset with a standard method takes far too long. Thank you so much!
General Introduction and Relevance
Currently, many imbalanced-learn samplers require that an estimator explicitly inherit from scikit-learn Mixin classes in order to determine what kind of estimator it is; this check is often performed in the function check_neighbors_object().
This scikit-learn inheritance-based programming model provides consistency, but it also limits flexibility. Switching these checks from inheritance to duck typing would still preserve consistency, while also allowing users to bring other scikit-learn-API-compliant libraries that cannot directly subclass scikit-learn, such as cuML, to imbalanced-learn. Using cuML with imbalanced-learn can be significantly faster, as shown in the examples below, which achieve roughly 180x and 680x speedups with the example configuration.
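To make the limitation concrete, here is a minimal sketch. It assumes cuML is installed with a CUDA GPU available; note also that the KNeighborsMixin import path is private and has moved between scikit-learn versions, so this assumes a recent layout:

```python
# Illustration only: a cuML estimator implements the k-NN interface but is
# rejected by an inheritance check, because it does not (and cannot)
# subclass scikit-learn's KNeighborsMixin.
from sklearn.neighbors._base import KNeighborsMixin  # private path; version-dependent
from sklearn.neighbors import NearestNeighbors
from cuml.neighbors import NearestNeighbors as cuNearestNeighbors

sk_nn = NearestNeighbors()
cu_nn = cuNearestNeighbors()

print(isinstance(sk_nn, KNeighborsMixin))  # True: subclass, accepted today
print(isinstance(cu_nn, KNeighborsMixin))  # False: rejected today, despite...
print(hasattr(cu_nn, "kneighbors"))        # True: ...exposing the same interface
```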
Additional Context
A key limitation is that users cannot cleanly integrate estimators from libraries that do not directly depend on scikit-learn but that enforce the same API contract.
A duck-typing-based programming model would allow imbalanced-learn to be more flexible; duck typing in PyData is becoming increasingly powerful as more libraries standardize on NumPy- and scikit-learn-like API contracts rather than inheritance (Dask, xarray, Pangeo, CuPy, RAPIDS, and more).
The TPOT community recently adopted duck typing to allow GPU-accelerated cuML estimators and has seen great results: https://github.com/EpistasisLab/tpot/blob/master/tutorials/Higgs_Boson.ipynb
Proposal
In check_neighbors_object(), replace the Mixin class check with a check for the key attributes that determine whether an object behaves like a certain type of estimator:
Instead of checking isinstance(obj, KNeighborsMixin), check for the key attributes that make an object KNeighborsMixin-like, as in the sketch below.
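As a minimal sketch of the idea (not the exact imbalanced-learn implementation; the attribute names and error handling here are assumptions), the duck-typed check could look like this:

```python
from numbers import Integral

from sklearn.base import clone
from sklearn.neighbors import NearestNeighbors


def check_neighbors_object(nn_name, nn_object, additional_neighbor=0):
    """Return a nearest-neighbors estimator from an int or a KNN-like object."""
    if isinstance(nn_object, Integral):
        # Plain integer: build a default scikit-learn estimator, as today.
        return NearestNeighbors(n_neighbors=nn_object + additional_neighbor)
    # Interface check instead of `isinstance(nn_object, KNeighborsMixin)`:
    # any estimator exposing the k-NN interface (e.g. cuml.neighbors.
    # NearestNeighbors) is accepted, not only scikit-learn subclasses.
    if all(hasattr(nn_object, attr) for attr in ("kneighbors", "kneighbors_graph")):
        return clone(nn_object)
    raise ValueError(
        f"{nn_name} must be an int or an estimator implementing "
        "`kneighbors` and `kneighbors_graph`."
    )
```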
If there is potential interest, I'd be happy to open up a pull request for further discussion.
Potential Impact
Adopting duck typing for these checks would immediately allow using cuML (a GPU-accelerated machine learning library with a scikit-learn-like API) with imbalanced-learn. It would also open the door to other libraries, current and future, without requiring any knowledge of or involvement from imbalanced-learn.
As a concrete example of the benefit, using cuML estimators on a GPU instead of scikit-learn can provide a large speedup, and faster estimators can significantly reduce total runtime. The two small benchmark examples referenced below (reproducing gist linked at the bottom) correspond to 180x and 680x speedups using cuML on a GPU versus the default scikit-learn on CPU cores.
There are a number of samplers that this proposed change would affect, allowing the integration of estimators that do not directly subclass scikit-learn. This list includes the ADASYN, SMOTE, BorderlineSMOTE, SMOTENC, EditedNearestNeighbours, and RepeatedEditedNearestNeighbours samplers.
Other samplers, such as CondensedNearestNeighbour, KMeansSMOTE, SMOTEN, and SMOTEENN, incorporate additional steps in their _validate_estimator() methods and will require further modifications before they can seamlessly integrate with estimators that do not directly subclass scikit-learn. A hedged usage sketch of what the change would enable follows.
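For illustration, here is what usage could look like once the duck-typed checks are in place. This assumes a CUDA GPU and a cuML installation, and the parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from cuml.neighbors import NearestNeighbors

# Imbalanced toy data: roughly 5% minority class.
X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# GPU-backed k-NN handed to SMOTE for its neighbor search. SMOTE queries
# k + 1 neighbors internally (each sample plus its k neighbors), hence
# n_neighbors=6 to match the default k_neighbors=5.
gpu_knn = NearestNeighbors(n_neighbors=6)
X_res, y_res = SMOTE(k_neighbors=gpu_knn).fit_resample(X, y)
```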
Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698 (2.2 GHz, 16 cores) and an NVIDIA V100 32 GB GPU
Benchmarking Code:
https://gist.github.com/NV-jpt/276be3fe57b0ca384dbdabeba4a7e643