[MRG] update user guide, docstrings and comments from undersampling methods #853

Closed
wants to merge 9 commits into from
102 changes: 60 additions & 42 deletions doc/under_sampling.rst
@@ -125,22 +125,24 @@ It would also work with pandas dataframe::
>>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
>>> df_resampled.head() # doctest: +SKIP

:class:`NearMiss` adds some heuristic rules to select samples
:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of
heuristic which can be selected with the parameter ``version``::
:class:`NearMiss` undersamples data based on heuristic rules to select the
observations :cite:`mani2003knn`. :class:`NearMiss` implements 3 different
methods to undersample, which can be selected with the parameter ``version``::

>>> from imblearn.under_sampling import NearMiss
>>> nm1 = NearMiss(version=1)
>>> X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]

As later stated in the next section, :class:`NearMiss` heuristic rules are
based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin``
from scikit-learn. The former parameter is used to compute the average distance
to the neighbors while the latter is used for the pre-selection of the samples
of interest.

:class:`NearMiss` heuristic rules are based on the nearest neighbors algorithm.
Therefore, the parameters ``n_neighbors`` and ``n_neighbors_ver3`` accept either
an integer giving the size of the neighbourhood to explore or a classifier derived
from ``KNeighborsMixin`` from scikit-learn. The parameter ``n_neighbors`` is
used to compute the average distance to the neighbors, while ``n_neighbors_ver3``
is used for the pre-selection of the samples from the majority class and is only
relevant for version 3. More details about NearMiss are given in the next section.
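
For instance, a minimal sketch of passing an explicit nearest neighbors estimator
instead of an integer could look as follows (the neighbourhood sizes used here are
purely illustrative)::

>>> from sklearn.neighbors import NearestNeighbors
>>> from imblearn.under_sampling import NearMiss
>>> nn = NearestNeighbors(n_neighbors=3)
>>> nm3 = NearMiss(version=3, n_neighbors=nn, n_neighbors_ver3=3)
>>> X_resampled_nm3, y_resampled = nm3.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP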

Mathematical formulation
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -175,19 +177,16 @@ is the largest.
:scale: 60
:align: center

In the next example, the different :class:`NearMiss` variant are applied on the
previous toy example. It can be seen that the decision functions obtained in
In the next example, the different :class:`NearMiss` variants are applied on the
previous toy example. We can see that the decision functions obtained in
each case are different.

When under-sampling a specific class, NearMiss-1 can be altered by the presence
of noise. In fact, it will implied that samples of the targeted class will be
selected around these samples as it is the case in the illustration below for
the yellow class. However, in the normal case, samples next to the boundaries
will be selected. NearMiss-2 will not have this effect since it does not focus
on the nearest samples but rather on the farthest samples. We can imagine that
the presence of noise can also altered the sampling mainly in the presence of
marginal outliers. NearMiss-3 is probably the version which will be less
affected by noise due to the first step sample selection.
When under-sampling a specific class, NearMiss-1 can be affected by noise. In
fact, samples of the targeted class located around observations from the minority
class tend to be selected, as shown in the illustration below (see yellow class).
NearMiss-2 might be less affected by noise as it does not focus on the nearest
samples but rather on the farthest samples. NearMiss-3 is probably the version
least affected by noise, thanks to the first step of sample selection.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
:target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -198,7 +197,7 @@ Cleaning under-sampling techniques
----------------------------------

Cleaning under-sampling techniques do not allow specifying the number of
samples to have in each class. In fact, each algorithm implement an heuristic
samples to have in each class. In fact, each algorithm implements a heuristic
which will clean the dataset.

.. _tomek_links:
@@ -214,20 +213,20 @@ defined such that for any sample :math:`z`:

d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)

where :math:`d(.)` is the distance between the two samples. In some other
words, a Tomek's link exist if the two samples are the nearest neighbors of
each other. In the figure below, a Tomek's link is illustrated by highlighting
the samples of interest in green.
where :math:`d(.)` is the distance between the two samples. In other words,
a Tomek's link exists if the two samples are each other's nearest neighbors
but belong to different classes. In the figure below, a Tomek's link is illustrated
by highlighting the samples of interest in green.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
:target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
:scale: 60
:align: center
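
To make the definition concrete, the following minimal sketch looks for pairs of
mutual nearest neighbors belonging to different classes; ``X_toy`` and ``y_toy``
are a small hypothetical array, not the dataset used in the rest of this guide::

>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> X_toy = np.array([[0.0], [0.1], [1.0], [1.1], [2.0]])
>>> y_toy = np.array([0, 0, 0, 1, 1])
>>> nn = NearestNeighbors(n_neighbors=2).fit(X_toy)
>>> nearest = nn.kneighbors(X_toy, return_distance=False)[:, 1]
>>> [(i, int(j)) for i, j in enumerate(nearest)
...  if nearest[j] == i and y_toy[i] != y_toy[j] and i < j]
[(2, 3)]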

The parameter ``sampling_strategy`` control which sample of the link will be
The parameter ``sampling_strategy`` controls which sample of the Tomek link will be
removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
remove the sample from the majority class. Both samples from the majority and
minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
remove the sample from the majority class. However, both the samples from the majority
and minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
figure illustrates this behaviour.
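
As a purely illustrative example (the resulting class counts depend on the dataset
and are therefore not shown here), removing both samples of each link could be
requested as follows::

>>> from imblearn.under_sampling import TomekLinks
>>> tl = TomekLinks(sampling_strategy='all')
>>> X_resampled, y_resampled = tl.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP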

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
@@ -311,15 +310,19 @@ Condensed nearest neighbors and derived algorithms

:class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
iteratively decide if a sample should be removed or not
:cite:`hart1968condensed`. The algorithm is running as followed:
:cite:`hart1968condensed`. The algorithm runs as follows:

1. Get all minority samples in a set :math:`C`.
2. Add a sample from the targeted class (class to be under-sampled) in
:math:`C` and all other samples of this class in a set :math:`S`.
3. Go through the set :math:`S`, sample by sample, and classify each sample
using a 1 nearest neighbor rule.
4. If the sample is misclassified, add it to :math:`C`, otherwise do nothing.
5. Reiterate on :math:`S` until there is no samples to be added.
3. Train a 1-KNN (1 nearest neighbor classifier) on :math:`C`.
4. Go through the samples in set :math:`S`, sample by sample, and classify each one
using the 1 nearest neighbor rule (trained in 3).
5. If the sample is misclassified, add it to :math:`C`, and go to step 6.
6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.

The final dataset is :math:`C`, containing all observations from the minority class and
those from the majority class that were misclassified by the successive 1-KNN classifiers.
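
The loop above can be summarised with the following minimal sketch. It is only an
illustration of the procedure, not the actual imbalanced-learn implementation, and
``X_toy``/``y_toy`` are a hypothetical binary dataset in which class 0 is
under-sampled::

>>> import numpy as np
>>> from sklearn.neighbors import KNeighborsClassifier
>>> rng = np.random.RandomState(0)
>>> X_toy, y_toy = rng.randn(40, 2), np.array([0] * 30 + [1] * 10)
>>> C_idx = list(np.flatnonzero(y_toy == 1))    # step 1: all minority samples
>>> maj_idx = np.flatnonzero(y_toy == 0)
>>> C_idx.append(maj_idx[0])                    # step 2: one seed from the majority class
>>> S_idx = list(maj_idx[1:])
>>> knn = KNeighborsClassifier(n_neighbors=1).fit(X_toy[C_idx], y_toy[C_idx])
>>> for idx in S_idx:                           # steps 3 to 6
...     if knn.predict(X_toy[[idx]])[0] != y_toy[idx]:
...         C_idx.append(idx)                   # misclassified samples join C
...         knn = knn.fit(X_toy[C_idx], y_toy[C_idx])  # retrain the 1-KNN on the enlarged C
>>> X_res, y_res = X_toy[C_idx], y_toy[C_idx]   # the condensed dataset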

The :class:`CondensedNearestNeighbour` can be used in the following manner::

@@ -329,23 +332,38 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 24), (2, 115)]

However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
is sensitive to noise and will add noisy samples.
However, as illustrated in the figure below, :class:`CondensedNearestNeighbour`
is sensitive to noise and may select noisy samples.

In an attempt to remove noisy observations, :class:`OneSidedSelection`
first finds the observations that are hard to classify and then uses
:class:`TomekLinks` to remove the noisy ones :cite:`hart1968condensed`.
:class:`OneSidedSelection` runs as follows:

In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
remove noisy samples :cite:`hart1968condensed`. In addition, the 1 nearest
neighbor rule is applied to all samples and the one which are misclassified
will be added to the set :math:`C`. No iteration on the set :math:`S` will take
place. The class can be used as::
1. Get all minority samples in a set :math:`C`.
2. Add a sample from the targeted class (class to be under-sampled) in
:math:`C` and all other samples of this class in a set :math:`S`.
3. Train a 1-KNN on :math:`C`.
4. Using a 1 nearest neighbor rule trained in 3, classify all samples in
set :math:`S`.
5. Add all misclassified samples to :math:`C`.
6. Remove Tomek Links from :math:`C`.

The final dataset is :math:`C`, once the Tomek's links have been removed, containing
all observations from the minority class, plus the majority observations that were
added at random, plus all those from the majority class that were misclassified by
the 1-KNN. Note that, differently from :class:`CondensedNearestNeighbour`,
:class:`OneSidedSelection` does not retrain the KNN after each misclassified sample.
It uses the single 1-KNN trained in step 3 to classify all samples from the majority
class in one pass. The class can be used as::

>>> from imblearn.under_sampling import OneSidedSelection
>>> oss = OneSidedSelection(random_state=0)
>>> X_resampled, y_resampled = oss.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 174), (2, 4404)]

Our implementation offer to set the number of seeds to put in the set :math:`C`
originally by setting the parameter ``n_seeds_S``.
Our implementation offers the possibility to set the number of observations
to put at random in the set :math:`C` through the parameter ``n_seeds_S``.
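
For instance, a purely illustrative call adding 10 random majority observations to
the initial set :math:`C` (the value 10 is arbitrary) would be::

>>> oss = OneSidedSelection(random_state=0, n_seeds_S=10)
>>> X_resampled, y_resampled = oss.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP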

:class:`NeighbourhoodCleaningRule` will focus on cleaning the data rather than
condensing them :cite:`laurikkala2001improving`. Therefore, it will use the
@@ -47,7 +47,10 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
be used.

n_seeds_S : int, default=1
Number of samples to extract in order to build the set S.
Number of samples from the majority class to add randomly to the set
containing all minority observations before training the first KNN model. In
the original implementation this is 1, but more samples can be added with this
parameter.

{n_jobs}

@@ -70,13 +73,13 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
-----
The method is based on [1]_.

Supports multi-class resampling. A one-vs.-rest scheme is used when
Supports multi-class resampling. A one-vs.-one scheme is used when
sampling a class as proposed in [1]_.

References
----------
.. [1] P. Hart, "The condensed nearest neighbor rule,"
In Information Theory, IEEE Transactions on, vol. 14(3),
.. [1] P. Hart, "The condensed nearest neighbor rule",
in Information Theory, IEEE Transactions on, vol. 14(3),
pp. 515-516, 1968.

Examples
@@ -124,7 +127,7 @@ def _validate_estimator(self):
else:
raise ValueError(
f"`n_neighbors` has to be a int or an object"
f" inhereited from KNeighborsClassifier."
f" inherited from KNeighborsClassifier."
f" Got {type(self.n_neighbors)} instead."
)

@@ -168,7 +171,8 @@ def _fit_resample(self, X, y):
# Check each sample in S if we keep it or drop it
for idx_sam, (x_sam, y_sam) in enumerate(zip(S_x, S_y)):

# Do not select sample which are already well classified
# Do not select samples which are already well classified
# (or were already randomly selected to be part of C)
if idx_sam in good_classif_label:
continue

@@ -177,7 +181,7 @@ def _fit_resample(self, X, y):
x_sam = x_sam.reshape(1, -1)
pred_y = self.estimator_.predict(x_sam)

# If the prediction do not agree with the true label
# If the prediction does not agree with the true label
# append it in C_x
if y_sam != pred_y:
# Keep the index for later
@@ -191,9 +195,9 @@ def _fit_resample(self, X, y):
# fit a knn on C
self.estimator_.fit(C_x, C_y)

# This experimental to speed up the search
# Classify all the element in S and avoid to test the
# well classified elements
# This is experimental to speed up the search
# Classify all the elements in S and avoid testing the
# correctly classified elements
pred_S_y = self.estimator_.predict(S_x)
good_classif_label = np.unique(
np.append(idx_maj_sample, np.flatnonzero(pred_S_y == S_y))
38 changes: 21 additions & 17 deletions imblearn/under_sampling/_prototype_selection/_nearmiss.py
@@ -36,20 +36,24 @@ class NearMiss(BaseUnderSampler):

n_neighbors : int or estimator object, default=3
If ``int``, size of the neighbourhood to consider to compute the
average distance to the minority point samples. If object, an
estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the k_neighbors.
By default, it will be a 3-NN.
average distance to the minority samples. If object, an estimator
that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin`
that will be used to find the k_neighbors. By default, it considers
the 3 closest neighbours.

n_neighbors_ver3 : int or estimator object, default=3
If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This
parameter correspond to the number of neighbours selected create the
subset in which the selection will be performed. If object, an
estimator that inherits from
NearMiss version 3 starts with a phase of under-sampling where it selects
those observations from the majority class that are the closest neighbors
to the minority class.

If ``int``, indicates the number of neighbours to be selected in
the first step, i.e. the subset in which the final selection will be performed.
If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the k_neighbors.
By default, it will be a 3-NN.
find the k_neighbors. By default, the 3 closest neighbours to the
minority observations will be selected.

Only used in version 3.

{n_jobs}

@@ -75,7 +79,7 @@ class NearMiss(BaseUnderSampler):
References
----------
.. [1] I. Mani, I. Zhang. "kNN approach to unbalanced data distributions:
a case study involving information extraction," In Proceedings of
a case study involving information extraction", in Proceedings of
workshop on learning from imbalanced datasets, 2003.

Examples
@@ -125,15 +129,15 @@ def _selection_dist_based(
Associated label to X.

dist_vec : ndarray, shape (n_samples, )
The distance matrix to the nearest neigbour.
The distance matrix to the nearest neighbor.

num_samples: int
The desired number of samples to select.

key : str or int,
The target class.

sel_strategy : str, optional (default='nearest')
sel_strategy : str, default='nearest'
Strategy to select the samples. Either 'nearest' or 'farthest'

Returns
@@ -169,13 +173,13 @@ def _selection_dist_based(
reverse=sort_way,
)

# Throw a warning to tell the user that we did not have enough samples
# to select and that we just select everything
# Raise a warning to tell the user that there were not enough samples
# to select from and thus, that all samples will be selected
if len(sorted_idx) < num_samples:
warnings.warn(
"The number of the samples to be selected is larger"
" than the number of samples available. The"
" balancing ratio cannot be ensure and all samples"
" balancing ratio cannot be ensured and all samples"
" will be returned."
)

@@ -41,11 +41,14 @@ class OneSidedSelection(BaseCleaningSampler):
nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors. If `None`, a
:class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rules will
:class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rule will
be used.

n_seeds_S : int, default=1
Number of samples to extract in order to build the set S.
Number of samples from the majority class to add randomly to the set
containing all minority observations before training the first KNN model. In
the original implementation this is 1, but more samples can be added with this
parameter.

{n_jobs}

@@ -71,7 +74,7 @@ class OneSidedSelection(BaseCleaningSampler):
References
----------
.. [1] M. Kubat, S. Matwin, "Addressing the curse of imbalanced training
sets: one-sided selection," In ICML, vol. 97, pp. 179-186, 1997.
sets: one-sided selection", in ICML, vol. 97, pp. 179-186, 1997.

Examples
--------
@@ -150,8 +153,9 @@ def _fit_resample(self, X, y):
C_x = _safe_indexing(X, C_indices)
C_y = _safe_indexing(y, C_indices)

# create the set S with removing the seed from S
# since that it will be added anyway
# create the set S with all samples of the current class
# except the seed samples, since they were
# already added to C_x
idx_maj_extracted = np.delete(idx_maj, sel_idx_maj, axis=0)
S_x = _safe_indexing(X, idx_maj_extracted)
S_y = _safe_indexing(y, idx_maj_extracted)