diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 13798ad78..ff502aa44 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -125,9 +125,9 @@ It would also work with pandas dataframe::

    >>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
    >>> df_resampled.head()  # doctest: +SKIP

-:class:`NearMiss` adds some heuristic rules to select samples
-:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of
-heuristic which can be selected with the parameter ``version``::
+:class:`NearMiss` undersamples data based on heuristic rules to select the
+observations :cite:`mani2003knn`. :class:`NearMiss` implements 3 different
+methods to undersample, which can be selected with the parameter ``version``::

    >>> from imblearn.under_sampling import NearMiss
    >>> nm1 = NearMiss(version=1)
@@ -135,12 +135,14 @@ heuristic which can be selected with the parameter ``version``::
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 64), (2, 64)]

-As later stated in the next section, :class:`NearMiss` heuristic rules are
-based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
-and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin``
-from scikit-learn. The former parameter is used to compute the average distance
-to the neighbors while the latter is used for the pre-selection of the samples
-of interest.
+
+:class:`NearMiss` heuristic rules are based on the nearest neighbors algorithm.
+Therefore, the parameters ``n_neighbors`` and ``n_neighbors_ver3`` accept either
+integers giving the size of the neighbourhood to explore or a classifier derived
+from scikit-learn's ``KNeighborsMixin``. The parameter ``n_neighbors`` is used to
+compute the average distance to the neighbors, while ``n_neighbors_ver3`` is used
+for the pre-selection of the samples from the majority class and only applies to
+version 3. More details about NearMiss are given in the next section.

 Mathematical formulation
 ^^^^^^^^^^^^^^^^^^^^^^^^
@@ -175,19 +177,16 @@ is the largest.
    :scale: 60
    :align: center

-In the next example, the different :class:`NearMiss` variant are applied on the
-previous toy example. It can be seen that the decision functions obtained in
+In the next example, the different :class:`NearMiss` variants are applied to the
+previous toy example. We can see that the decision functions obtained in
 each case are different.

-When under-sampling a specific class, NearMiss-1 can be altered by the presence
-of noise. In fact, it will implied that samples of the targeted class will be
-selected around these samples as it is the case in the illustration below for
-the yellow class. However, in the normal case, samples next to the boundaries
-will be selected. NearMiss-2 will not have this effect since it does not focus
-on the nearest samples but rather on the farthest samples. We can imagine that
-the presence of noise can also altered the sampling mainly in the presence of
-marginal outliers. NearMiss-3 is probably the version which will be less
-affected by noise due to the first step sample selection.
+When under-sampling a specific class, NearMiss-1 can be affected by noise. In
+fact, samples of the targeted class located around observations from the minority
+class tend to be selected, as shown in the illustration below (see yellow class).
+NearMiss-2 might be less affected by noise as it does not focus on the nearest
+samples but rather on the farthest samples. NearMiss-3 is probably the version
+which will be less affected by noise due to the first step of sample selection.
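+
+As discussed above, the neighbourhood parameters accept either integers or
+scikit-learn nearest-neighbors estimators. A minimal sketch with version 3,
+assuming the toy dataset ``X``, ``y`` created earlier; the values used here are
+only illustrative::
+
+   >>> from sklearn.neighbors import NearestNeighbors
+   >>> nm3 = NearMiss(version=3, n_neighbors=3,
+   ...                n_neighbors_ver3=NearestNeighbors(n_neighbors=3))
+   >>> X_resampled, y_resampled = nm3.fit_resample(X, y)
+   >>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP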

 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
    :scale: 60
    :align: center

@@ -198,7 +197,7 @@ Cleaning under-sampling techniques
 ----------------------------------

 Cleaning under-sampling techniques do not allow to specify the number of
-samples to have in each class. In fact, each algorithm implement an heuristic
+samples to have in each class. In fact, each algorithm implements a heuristic
 which will clean the dataset.

 .. _tomek_links:

@@ -214,20 +213,20 @@ defined such that for any sample :math:`z`:

    d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)

-where :math:`d(.)` is the distance between the two samples. In some other
-words, a Tomek's link exist if the two samples are the nearest neighbors of
-each other. In the figure below, a Tomek's link is illustrated by highlighting
-the samples of interest in green.
+where :math:`d(.)` is the distance between the two samples. In other words,
+a Tomek's link exists if two samples are nearest neighbors of each other but
+belong to different classes. In the figure below, a Tomek's link is illustrated
+by highlighting the samples of interest in green.

 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
    :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
    :scale: 60
    :align: center

-The parameter ``sampling_strategy`` control which sample of the link will be
+The parameter ``sampling_strategy`` controls which sample of the Tomek link will be
 removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
-remove the sample from the majority class. Both samples from the majority and
-minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
+remove the sample from the majority class. However, samples from both the majority
+and minority classes can be removed by setting ``sampling_strategy`` to ``'all'``. The
 figure illustrates this behaviour.

 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
    :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
    :scale: 60
    :align: center

@@ -311,15 +310,19 @@ Condensed nearest neighbors and derived algorithms
 :class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
 iteratively decide if a sample should be removed or not
-:cite:`hart1968condensed`. The algorithm is running as followed:
+:cite:`hart1968condensed`. The algorithm runs as follows:

 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
    :math:`C` and all other samples of this class in a set :math:`S`.
-3. Go through the set :math:`S`, sample by sample, and classify each sample
-   using a 1 nearest neighbor rule.
-4. If the sample is misclassified, add it to :math:`C`, otherwise do nothing.
-5. Reiterate on :math:`S` until there is no samples to be added.
+3. Train a 1-KNN (1 nearest neighbor classifier) on :math:`C`.
+4. Go through the samples in set :math:`S`, sample by sample, and classify each
+   one using the 1-KNN trained in step 3.
+5. If the sample is misclassified, add it to :math:`C` and re-train the 1-KNN.
+6. Repeat steps 4 and 5 until all observations in :math:`S` have been examined.
+
+The final dataset is :math:`C`, containing all observations from the minority
+class, the sample(s) added at random from the targeted class, and those samples
+from the targeted class that were misclassified by the successive 1-KNN models.

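+A schematic sketch of this loop is shown below; it is not the actual library
+implementation, and it assumes that ``C_x``, ``C_y`` and ``S_x``, ``S_y`` are
+NumPy arrays holding the two sets described above::
+
+   import numpy as np
+   from sklearn.neighbors import KNeighborsClassifier
+
+   knn = KNeighborsClassifier(n_neighbors=1)
+   knn.fit(C_x, C_y)                                # step 3: train a 1-KNN on C
+   for x_s, y_s in zip(S_x, S_y):                   # step 4: go through S
+       if knn.predict(x_s.reshape(1, -1))[0] != y_s:  # misclassified by the 1-KNN
+           C_x = np.vstack([C_x, x_s])              # step 5: add the sample to C
+           C_y = np.append(C_y, y_s)
+           knn.fit(C_x, C_y)                        # re-train before continuing
+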
 The :class:`CondensedNearestNeighbour` can be used in the following manner::

    >>> cnn = CondensedNearestNeighbour(random_state=0)
    >>> X_resampled, y_resampled = cnn.fit_resample(X, y)
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 24), (2, 115)]

-However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
-is sensitive to noise and will add noisy samples.
+However, as illustrated in the figure below, :class:`CondensedNearestNeighbour`
+is sensitive to noise and may select noisy samples.
+
+In an attempt to remove noisy observations, :class:`OneSidedSelection`
+will first find the observations that are hard to classify, and then will use
+:class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.
+:class:`OneSidedSelection` runs as follows:
+
-In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
-remove noisy samples :cite:`hart1968condensed`. In addition, the 1 nearest
-neighbor rule is applied to all samples and the one which are misclassified
-will be added to the set :math:`C`. No iteration on the set :math:`S` will take
-place. The class can be used as::
+1. Get all minority samples in a set :math:`C`.
+2. Add a sample from the targeted class (class to be under-sampled) in
+   :math:`C` and all other samples of this class in a set :math:`S`.
+3. Train a 1-KNN on :math:`C`.
+4. Using the 1-KNN trained in step 3, classify all samples in set :math:`S`.
+5. Add all misclassified samples to :math:`C`.
+6. Remove Tomek's links from :math:`C`.
+
+The final dataset is :math:`C`, containing all observations from the minority
+class, plus the observations from the majority class that were added at random,
+plus all those from the majority class that were misclassified by the 1-KNN.
+Note that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+does not re-train the KNN after each misclassified sample. It uses a single KNN
+to classify all samples from the majority class in one pass. The class can be
+used as::

    >>> from imblearn.under_sampling import OneSidedSelection
    >>> oss = OneSidedSelection(random_state=0)
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 174), (2, 4404)]

-Our implementation offer to set the number of seeds to put in the set :math:`C`
-originally by setting the parameter ``n_seeds_S``.
+Our implementation offers the possibility to set the number of observations
+to put at random in the set :math:`C` through the parameter ``n_seeds_S``.
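+
+For instance, more observations from the targeted class can be seeded into
+:math:`C`; a minimal sketch, where the value 10 is only illustrative::
+
+   >>> oss10 = OneSidedSelection(random_state=0, n_seeds_S=10)
+   >>> X_resampled, y_resampled = oss10.fit_resample(X, y)
+   >>> print(sorted(Counter(y_resampled).items()))  # doctest: +SKIP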

 :class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
 condensing them :cite:`laurikkala2001improving`. Therefore, it will used the
diff --git a/imblearn/under_sampling/_prototype_selection/_condensed_nearest_neighbour.py b/imblearn/under_sampling/_prototype_selection/_condensed_nearest_neighbour.py
index 738110cae..93302860a 100644
--- a/imblearn/under_sampling/_prototype_selection/_condensed_nearest_neighbour.py
+++ b/imblearn/under_sampling/_prototype_selection/_condensed_nearest_neighbour.py
@@ -47,7 +47,10 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
         be used.

     n_seeds_S : int, default=1
-        Number of samples to extract in order to build the set S.
+        Number of samples from the majority class to add randomly to the set
+        with all minority observations before training the first KNN model.
+        In the original implementation this is 1, but more samples can be
+        added with this parameter.

     {n_jobs}

@@ -70,13 +73,13 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
     -----
     The method is based on [1]_.

-    Supports multi-class resampling. A one-vs.-rest scheme is used when
+    Supports multi-class resampling. A one-vs.-one scheme is used when
     sampling a class as proposed in [1]_.

     References
     ----------
-    .. [1] P. Hart, "The condensed nearest neighbor rule,"
-       In Information Theory, IEEE Transactions on, vol. 14(3),
+    .. [1] P. Hart, "The condensed nearest neighbor rule",
+       in Information Theory, IEEE Transactions on, vol. 14(3),
        pp. 515-516, 1968.

     Examples
@@ -124,7 +127,7 @@ def _validate_estimator(self):
         else:
             raise ValueError(
                 f"`n_neighbors` has to be a int or an object"
-                f" inhereited from KNeighborsClassifier."
+                f" inherited from KNeighborsClassifier."
                 f" Got {type(self.n_neighbors)} instead."
             )

@@ -168,7 +171,8 @@ def _fit_resample(self, X, y):

             # Check each sample in S if we keep it or drop it
             for idx_sam, (x_sam, y_sam) in enumerate(zip(S_x, S_y)):
-                # Do not select sample which are already well classified
+                # Do not select samples which are already well classified
+                # (or were already randomly selected to be part of C)
                 if idx_sam in good_classif_label:
                     continue

@@ -177,7 +181,7 @@ def _fit_resample(self, X, y):
                 x_sam = x_sam.reshape(1, -1)
                 pred_y = self.estimator_.predict(x_sam)

-                # If the prediction do not agree with the true label
+                # If the prediction does not agree with the true label
                 # append it in C_x
                 if y_sam != pred_y:
                     # Keep the index for later
@@ -191,9 +195,9 @@ def _fit_resample(self, X, y):
                     # fit a knn on C
                     self.estimator_.fit(C_x, C_y)

-                    # This experimental to speed up the search
-                    # Classify all the element in S and avoid to test the
-                    # well classified elements
+                    # This is experimental to speed up the search
+                    # Classify all the elements in S and avoid testing the
+                    # correctly classified elements
                     pred_S_y = self.estimator_.predict(S_x)
                     good_classif_label = np.unique(
                         np.append(idx_maj_sample, np.flatnonzero(pred_S_y == S_y))
diff --git a/imblearn/under_sampling/_prototype_selection/_nearmiss.py b/imblearn/under_sampling/_prototype_selection/_nearmiss.py
index ec3f33cfe..0050c96df 100644
--- a/imblearn/under_sampling/_prototype_selection/_nearmiss.py
+++ b/imblearn/under_sampling/_prototype_selection/_nearmiss.py
@@ -36,20 +36,24 @@ class NearMiss(BaseUnderSampler):

     n_neighbors : int or estimator object, default=3
         If ``int``, size of the neighbourhood to consider to compute the
-        average distance to the minority point samples. If object, an
-        estimator that inherits from
-        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the k_neighbors.
-        By default, it will be a 3-NN.
+        average distance to the minority samples. If object, an estimator
+        that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin`
+        that will be used to find the k_neighbors. By default, it considers
+        the 3 closest neighbours.

     n_neighbors_ver3 : int or estimator object, default=3
-        If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This
-        parameter correspond to the number of neighbours selected create the
-        subset in which the selection will be performed. If object, an
-        estimator that inherits from
+        NearMiss version 3 starts with a phase of under-sampling where it
+        selects those observations from the majority class that are the
+        closest neighbors to the minority class.
+
+        If ``int``, indicates the number of neighbours to be selected in the
+        first step; these neighbours form the subset in which the selection
+        will be performed.
+        If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the k_neighbors.
-        By default, it will be a 3-NN.
+        find the k_neighbors. By default, the 3 closest neighbours to the
+        minority observations will be selected.
+
+        Only used in version 3.

     {n_jobs}

@@ -75,7 +79,7 @@ class NearMiss(BaseUnderSampler):
     References
     ----------
     .. [1] I. Mani, I. Zhang. "kNN approach to unbalanced data distributions:
-       a case study involving information extraction," In Proceedings of
+       a case study involving information extraction", in Proceedings of
        workshop on learning from imbalanced datasets, 2003.

     Examples
@@ -125,7 +129,7 @@ def _selection_dist_based(
             Associated label to X.

         dist_vec : ndarray, shape (n_samples, )
-            The distance matrix to the nearest neigbour.
+            The distance to the nearest neighbour of each sample.

         num_samples: int
             The desired number of samples to select.
@@ -133,7 +137,7 @@
         key : str or int,
             The target class.

-        sel_strategy : str, optional (default='nearest')
+        sel_strategy : str, default='nearest'
             Strategy to select the samples. Either 'nearest' or 'farthest'

         Returns
@@ -169,13 +173,13 @@
             reverse=sort_way,
         )

-        # Throw a warning to tell the user that we did not have enough samples
-        # to select and that we just select everything
+        # Issue a warning to tell the user that there were not enough samples
+        # to select from and that all samples will be returned
        if len(sorted_idx) < num_samples:
            warnings.warn(
                "The number of the samples to be selected is larger"
                " than the number of samples available. The"
-                " balancing ratio cannot be ensure and all samples"
+                " balancing ratio cannot be ensured and all samples"
                " will be returned."
            )
diff --git a/imblearn/under_sampling/_prototype_selection/_one_sided_selection.py b/imblearn/under_sampling/_prototype_selection/_one_sided_selection.py
index 305abec0b..84daa6195 100644
--- a/imblearn/under_sampling/_prototype_selection/_one_sided_selection.py
+++ b/imblearn/under_sampling/_prototype_selection/_one_sided_selection.py
@@ -41,11 +41,14 @@ class OneSidedSelection(BaseCleaningSampler):
         nearest neighbors. If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
         find the nearest-neighbors. If `None`, a
-        :class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rules will
+        :class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rule will
         be used.

     n_seeds_S : int, default=1
-        Number of samples to extract in order to build the set S.
+        Number of samples from the majority class to add randomly to the set
+        with all minority observations before training the first KNN model.
+        In the original implementation this is 1, but more samples can be
+        added with this parameter.

     {n_jobs}

@@ -71,7 +74,7 @@ class OneSidedSelection(BaseCleaningSampler):
     References
     ----------
     .. [1] M. Kubat, S. Matwin, "Addressing the curse of imbalanced training
-       sets: one-sided selection," In ICML, vol. 97, pp. 179-186, 1997.
+       sets: one-sided selection", in ICML, vol. 97, pp. 179-186, 1997.
     Examples
     --------
@@ -150,8 +153,9 @@ def _fit_resample(self, X, y):
             C_x = _safe_indexing(X, C_indices)
             C_y = _safe_indexing(y, C_indices)

-            # create the set S with removing the seed from S
-            # since that it will be added anyway
+            # create the set S with all samples of the current class
+            # except the randomly selected seed samples, since those
+            # were already added to C_x
             idx_maj_extracted = np.delete(idx_maj, sel_idx_maj, axis=0)
             S_x = _safe_indexing(X, idx_maj_extracted)
             S_y = _safe_indexing(y, idx_maj_extracted)
diff --git a/imblearn/under_sampling/_prototype_selection/_tomek_links.py b/imblearn/under_sampling/_prototype_selection/_tomek_links.py
index c3d84b61a..4d9f05cbc 100644
--- a/imblearn/under_sampling/_prototype_selection/_tomek_links.py
+++ b/imblearn/under_sampling/_prototype_selection/_tomek_links.py
@@ -54,7 +54,7 @@ class TomekLinks(BaseCleaningSampler):

     References
     ----------
-    .. [1] I. Tomek, "Two modifications of CNN," In Systems, Man, and
+    .. [1] I. Tomek, "Two modifications of CNN", in Systems, Man, and
        Cybernetics, IEEE Transactions on, vol. 6, pp 769-772, 1976.

     Examples
@@ -91,10 +91,10 @@ def is_tomek(y, nn_index, class_type):
     ----------
     y : ndarray of shape (n_samples,)
         Target vector of the data set, necessary to keep track of whether a
-        sample belongs to minority or not.
+        sample belongs to the minority class or not.

     nn_index : ndarray of shape (len(y),)
-        The index of the closes nearest neighbour to a sample point.
+        Index of the nearest neighbour of each sample.

     class_type : int or str
         The label of the minority class.

@@ -102,21 +102,24 @@ def is_tomek(y, nn_index, class_type):
     Returns
     -------
     is_tomek : ndarray of shape (len(y), )
-        Boolean vector on len( # samples ), with True for majority samples
+        Boolean vector of len( # samples ), with True for majority samples
         that are Tomek links.
     """
     links = np.zeros(len(y), dtype=bool)

-    # find which class to not consider
+    # find which class not to consider
     class_excluded = [c for c in np.unique(y) if c not in class_type]

-    # there is a Tomek link between two samples if they are both nearest
-    # neighbors of each others.
+    # there is a Tomek link between two samples if they are nearest
+    # neighbors of each other and belong to different classes
     for index_sample, target_sample in enumerate(y):
         if target_sample in class_excluded:
             continue

         if y[nn_index[index_sample]] != target_sample:
+            # corroborate that they are neighbours of each other:
+            # (if A's closest neighbour is B, but B's closest neighbour
+            # is C, then A and B are not a Tomek link)
             if nn_index[nn_index[index_sample]] == index_sample:
                 links[index_sample] = True