[MRG] update user guide, docstrings and comments from undersampling methods #853

Closed
wants to merge 9 commits into from
102 changes: 60 additions & 42 deletions doc/under_sampling.rst
@@ -125,22 +125,24 @@ It would also work with pandas dataframe::
>>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
>>> df_resampled.head() # doctest: +SKIP

:class:`NearMiss` adds some heuristic rules to select samples
:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of
heuristic which can be selected with the parameter ``version``::
:class:`NearMiss` undersamples data based on heuristic rules to select the
observations :cite:`mani2003knn`. :class:`NearMiss` implements 3 different
methods to undersample, which can be selected with the parameter ``version``::

>>> from imblearn.under_sampling import NearMiss
>>> nm1 = NearMiss(version=1)
>>> X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]

As later stated in the next section, :class:`NearMiss` heuristic rules are
based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin``
from scikit-learn. The former parameter is used to compute the average distance
to the neighbors while the latter is used for the pre-selection of the samples
of interest.

:class:`NearMiss` heuristic rules are based on the nearest neighbors algorithm.
Therefore, the parameters ``n_neighbors`` and ``n_neighbors_ver3`` accept either
an integer giving the size of the neighbourhood to explore or a classifier derived
from ``KNeighborsMixin`` from scikit-learn. The parameter ``n_neighbors`` is
used to compute the average distance to the neighbors, while ``n_neighbors_ver3``
is used for the pre-selection of the samples from the majority class and is only
relevant for version 3. More details about NearMiss are given in the next section.
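
For instance, a minimal sketch of passing an explicit nearest neighbors estimator
instead of an integer could look as follows (the neighbourhood sizes used here are
purely illustrative)::

>>> from sklearn.neighbors import NearestNeighbors
>>> from imblearn.under_sampling import NearMiss
>>> nn = NearestNeighbors(n_neighbors=3)
>>> nm3 = NearMiss(version=3, n_neighbors=nn, n_neighbors_ver3=3)
>>> X_resampled_nm3, y_resampled = nm3.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP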

Mathematical formulation
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -175,19 +177,16 @@ is the largest.
:scale: 60
:align: center

In the next example, the different :class:`NearMiss` variant are applied on the
previous toy example. It can be seen that the decision functions obtained in
In the next example, the different :class:`NearMiss` variants are applied on the
previous toy example. We can see that the decision functions obtained in
each case are different.

When under-sampling a specific class, NearMiss-1 can be altered by the presence
of noise. In fact, it will implied that samples of the targeted class will be
selected around these samples as it is the case in the illustration below for
the yellow class. However, in the normal case, samples next to the boundaries
will be selected. NearMiss-2 will not have this effect since it does not focus
on the nearest samples but rather on the farthest samples. We can imagine that
the presence of noise can also altered the sampling mainly in the presence of
marginal outliers. NearMiss-3 is probably the version which will be less
affected by noise due to the first step sample selection.
When under-sampling a specific class, NearMiss-1 can be affected by noise. In
fact, samples of the targeted class located around observations from the minority
class tend to be selected, as shown in the illustration below (see yellow class).
NearMiss-2 might be less affected by noise as it does not focus on the nearest
samples but rather on the farthest samples. NearMiss-3 is probably the version
least affected by noise, thanks to the first step of sample selection.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
:target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -198,7 +197,7 @@ Cleaning under-sampling techniques
----------------------------------

Cleaning under-sampling techniques do not allow specifying the number of
samples to have in each class. In fact, each algorithm implement an heuristic
samples to have in each class. In fact, each algorithm implements a heuristic
which will clean the dataset.

.. _tomek_links:
@@ -214,20 +213,20 @@ defined such that for any sample :math:`z`:

d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)

where :math:`d(.)` is the distance between the two samples. In some other
words, a Tomek's link exist if the two samples are the nearest neighbors of
each other. In the figure below, a Tomek's link is illustrated by highlighting
the samples of interest in green.
where :math:`d(.)` is the distance between the two samples. In other words,
a Tomek's link exists if the two samples are each other's nearest neighbors
but belong to different classes. In the figure below, a Tomek's link is illustrated
by highlighting the samples of interest in green.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
:target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
:scale: 60
:align: center
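
To make the definition concrete, the following minimal sketch looks for pairs of
mutual nearest neighbors belonging to different classes; ``X_toy`` and ``y_toy``
are a small hypothetical array, not the dataset used in the rest of this guide::

>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> X_toy = np.array([[0.0], [0.1], [1.0], [1.1], [2.0]])
>>> y_toy = np.array([0, 0, 0, 1, 1])
>>> nn = NearestNeighbors(n_neighbors=2).fit(X_toy)
>>> nearest = nn.kneighbors(X_toy, return_distance=False)[:, 1]
>>> [(i, int(j)) for i, j in enumerate(nearest)
...  if nearest[j] == i and y_toy[i] != y_toy[j] and i < j]
[(2, 3)]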

The parameter ``sampling_strategy`` control which sample of the link will be
The parameter ``sampling_strategy`` controls which sample of the Tomek link will be
removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
remove the sample from the majority class. Both samples from the majority and
minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
remove the sample from the majority class. However, both the samples from the majority
and minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
figure illustrates this behaviour.
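
As a purely illustrative example (the resulting class counts depend on the dataset
and are therefore not shown here), removing both samples of each link could be
requested as follows::

>>> from imblearn.under_sampling import TomekLinks
>>> tl = TomekLinks(sampling_strategy='all')
>>> X_resampled, y_resampled = tl.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP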

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
@@ -311,15 +310,19 @@ Condensed nearest neighbors and derived algorithms

:class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
iteratively decide if a sample should be removed or not
:cite:`hart1968condensed`. The algorithm is running as followed:
:cite:`hart1968condensed`. The algorithm runs as follows:

1. Get all minority samples in a set :math:`C`.
2. Add a sample from the targeted class (class to be under-sampled) in
:math:`C` and all other samples of this class in a set :math:`S`.
3. Go through the set :math:`S`, sample by sample, and classify each sample
using a 1 nearest neighbor rule.
4. If the sample is misclassified, add it to :math:`C`, otherwise do nothing.
5. Reiterate on :math:`S` until there is no samples to be added.
3. Train a 1-KNN (1 nearest neighbor classifier) on :math:`C`.
4. Go through the samples in set :math:`S`, sample by sample, and classify each one
using the 1 nearest neighbor rule (trained in 3).
5. If the sample is misclassified, add it to :math:`C`, and go to step 6.
6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.

The final dataset is :math:`C`, containing all observations from the minority class and
those from the majority class that were misclassified by the successive 1-KNN classifiers.
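
The loop above can be summarised with the following minimal sketch. It is only an
illustration of the procedure, not the actual imbalanced-learn implementation, and
``X_toy``/``y_toy`` are a hypothetical binary dataset in which class 0 is
under-sampled::

>>> import numpy as np
>>> from sklearn.neighbors import KNeighborsClassifier
>>> rng = np.random.RandomState(0)
>>> X_toy, y_toy = rng.randn(40, 2), np.array([0] * 30 + [1] * 10)
>>> C_idx = list(np.flatnonzero(y_toy == 1))    # step 1: all minority samples
>>> maj_idx = np.flatnonzero(y_toy == 0)
>>> C_idx.append(maj_idx[0])                    # step 2: one seed from the majority class
>>> S_idx = list(maj_idx[1:])
>>> knn = KNeighborsClassifier(n_neighbors=1).fit(X_toy[C_idx], y_toy[C_idx])
>>> for idx in S_idx:                           # steps 3 to 6
...     if knn.predict(X_toy[[idx]])[0] != y_toy[idx]:
...         C_idx.append(idx)                   # misclassified samples join C
...         knn = knn.fit(X_toy[C_idx], y_toy[C_idx])  # retrain the 1-KNN on the enlarged C
>>> X_res, y_res = X_toy[C_idx], y_toy[C_idx]   # the condensed dataset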

The :class:`CondensedNearestNeighbour` can be used in the following manner::

@@ -329,23 +332,38 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 24), (2, 115)]

However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
is sensitive to noise and will add noisy samples.
However, as illustrated in the figure below, :class:`CondensedNearestNeighbour`
is sensitive to noise and may select noisy samples.

In an attempt to remove noisy observations, :class:`OneSidedSelection`
first finds the observations that are hard to classify and then uses
:class:`TomekLinks` to remove the noisy ones :cite:`hart1968condensed`.
:class:`OneSidedSelection` runs as follows:

In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
remove noisy samples :cite:`hart1968condensed`. In addition, the 1 nearest
neighbor rule is applied to all samples and the one which are misclassified
will be added to the set :math:`C`. No iteration on the set :math:`S` will take
place. The class can be used as::
1. Get all minority samples in a set :math:`C`.
2. Add a sample from the targeted class (class to be under-sampled) in
:math:`C` and all other samples of this class in a set :math:`S`.
3. Train a 1-KNN on :math:`C`.
4. Using a 1 nearest neighbor rule trained in 3, classify all samples in
set :math:`S`.
5. Add all misclassified samples to :math:`C`.
6. Remove Tomek Links from :math:`C`.

The final dataset is :math:`C`, once the Tomek's links have been removed, containing
all observations from the minority class, plus the majority observations that were
added at random, plus all those from the majority class that were misclassified by
the 1-KNN. Note that, differently from :class:`CondensedNearestNeighbour`,
:class:`OneSidedSelection` does not retrain the KNN after each misclassified sample.
It uses the single 1-KNN trained in step 3 to classify all samples from the majority
class in one pass. The class can be used as::

>>> from imblearn.under_sampling import OneSidedSelection
>>> oss = OneSidedSelection(random_state=0)
>>> X_resampled, y_resampled = oss.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 174), (2, 4404)]

Our implementation offer to set the number of seeds to put in the set :math:`C`
originally by setting the parameter ``n_seeds_S``.
Our implementation offers the possibility to set the number of observations
to put at random in the set :math:`C` through the parameter ``n_seeds_S``.
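
For instance, a purely illustrative call adding 10 random majority observations to
the initial set :math:`C` (the value 10 is arbitrary) would be::

>>> oss = OneSidedSelection(random_state=0, n_seeds_S=10)
>>> X_resampled, y_resampled = oss.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP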

:class:`NeighbourhoodCleaningRule` will focus on cleaning the data rather than
condensing them :cite:`laurikkala2001improving`. Therefore, it will use the
@@ -47,7 +47,10 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
be used.

n_seeds_S : int, default=1
Number of samples to extract in order to build the set S.
Number of samples from the majority class to add randomly to the set
containing all minority observations before training the first KNN model. In
the original implementation this is 1, but more samples can be added with this
parameter.

{n_jobs}

@@ -70,13 +73,13 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
-----
The method is based on [1]_.

Supports multi-class resampling. A one-vs.-rest scheme is used when
Supports multi-class resampling. A one-vs.-one scheme is used when
sampling a class as proposed in [1]_.

References
----------
.. [1] P. Hart, "The condensed nearest neighbor rule,"
In Information Theory, IEEE Transactions on, vol. 14(3),
.. [1] P. Hart, "The condensed nearest neighbor rule",
in Information Theory, IEEE Transactions on, vol. 14(3),
pp. 515-516, 1968.

Examples
@@ -124,7 +127,7 @@ def _validate_estimator(self):
else:
raise ValueError(
f"`n_neighbors` has to be a int or an object"
f" inhereited from KNeighborsClassifier."
f" inherited from KNeighborsClassifier."
f" Got {type(self.n_neighbors)} instead."
)

@@ -168,7 +171,8 @@ def _fit_resample(self, X, y):
# Check each sample in S if we keep it or drop it
for idx_sam, (x_sam, y_sam) in enumerate(zip(S_x, S_y)):

# Do not select sample which are already well classified
# Do not select samples which are already well classified
# (or were already randomly selected to be part of C)
if idx_sam in good_classif_label:
continue

@@ -177,7 +181,7 @@ def _fit_resample(self, X, y):
x_sam = x_sam.reshape(1, -1)
pred_y = self.estimator_.predict(x_sam)

# If the prediction do not agree with the true label
# If the prediction does not agree with the true label
# append it in C_x
if y_sam != pred_y:
# Keep the index for later
@@ -191,9 +195,9 @@ def _fit_resample(self, X, y):
# fit a knn on C
self.estimator_.fit(C_x, C_y)

# This experimental to speed up the search
# Classify all the element in S and avoid to test the
# well classified elements
# This is experimental to speed up the search
# Classify all the elements in S and avoid testing the
# correctly classified elements
pred_S_y = self.estimator_.predict(S_x)
good_classif_label = np.unique(
np.append(idx_maj_sample, np.flatnonzero(pred_S_y == S_y))
38 changes: 21 additions & 17 deletions imblearn/under_sampling/_prototype_selection/_nearmiss.py
@@ -36,20 +36,24 @@ class NearMiss(BaseUnderSampler):

n_neighbors : int or estimator object, default=3
If ``int``, size of the neighbourhood to consider to compute the
average distance to the minority point samples. If object, an
estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the k_neighbors.
By default, it will be a 3-NN.
average distance to the minority samples. If object, an estimator
that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin`
that will be used to find the k_neighbors. By default, it considers
the 3 closest neighbours.

n_neighbors_ver3 : int or estimator object, default=3
If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This
parameter correspond to the number of neighbours selected create the
subset in which the selection will be performed. If object, an
estimator that inherits from
NearMiss version 3 starts with a phase of under-sampling where it selects
those observations from the majority class that are the closest neighbors
to the minority class.

If ``int``, indicates the number of neighbours to be selected in
the first step, i.e. the subset in which the final selection will be performed.
If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the k_neighbors.
By default, it will be a 3-NN.
find the k_neighbors. By default, the 3 closest neighbours to the
minority observations will be selected.

Only used in version 3.

{n_jobs}

@@ -75,7 +79,7 @@ class NearMiss(BaseUnderSampler):
References
----------
.. [1] I. Mani, I. Zhang. "kNN approach to unbalanced data distributions:
a case study involving information extraction," In Proceedings of
a case study involving information extraction", in Proceedings of
workshop on learning from imbalanced datasets, 2003.

Examples
@@ -125,15 +129,15 @@ def _selection_dist_based(
Associated label to X.

dist_vec : ndarray, shape (n_samples, )
The distance matrix to the nearest neigbour.
The distance matrix to the nearest neighbor.

num_samples: int
The desired number of samples to select.

key : str or int,
The target class.

sel_strategy : str, optional (default='nearest')
sel_strategy : str, default='nearest'
Strategy to select the samples. Either 'nearest' or 'farthest'

Returns
@@ -169,13 +173,13 @@ def _selection_dist_based(
reverse=sort_way,
)

# Throw a warning to tell the user that we did not have enough samples
# to select and that we just select everything
# Raise a warning to tell the user that there were not enough samples
# to select from and thus, that all samples will be selected
if len(sorted_idx) < num_samples:
warnings.warn(
"The number of the samples to be selected is larger"
" than the number of samples available. The"
" balancing ratio cannot be ensure and all samples"
" balancing ratio cannot be ensured and all samples"
" will be returned."
)

@@ -41,11 +41,14 @@ class OneSidedSelection(BaseCleaningSampler):
nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors. If `None`, a
:class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rules will
:class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rule will
be used.

n_seeds_S : int, default=1
Number of samples to extract in order to build the set S.
Number of samples from the majority class to add randomly to the set
containing all minority observations before training the first KNN model. In
the original implementation this is 1, but more samples can be added with this
parameter.

{n_jobs}

@@ -71,7 +74,7 @@ class OneSidedSelection(BaseCleaningSampler):
References
----------
.. [1] M. Kubat, S. Matwin, "Addressing the curse of imbalanced training
sets: one-sided selection," In ICML, vol. 97, pp. 179-186, 1997.
sets: one-sided selection", in ICML, vol. 97, pp. 179-186, 1997.

Examples
--------
@@ -150,8 +153,9 @@ def _fit_resample(self, X, y):
C_x = _safe_indexing(X, C_indices)
C_y = _safe_indexing(y, C_indices)

# create the set S with removing the seed from S
# since that it will be added anyway
# create the set S with all samples of the current class
# except the seed samples, since they were
# already added to C_x
idx_maj_extracted = np.delete(idx_maj, sel_idx_maj, axis=0)
S_x = _safe_indexing(X, idx_maj_extracted)
S_y = _safe_indexing(y, idx_maj_extracted)