Merge pull request #96 from jbytecode/paper
Update paper and bibtex
EssamWisam authored Mar 15, 2024
2 parents 45e66a9 + f8266cd commit 63174a9
Showing 2 changed files with 29 additions and 11 deletions.
22 changes: 18 additions & 4 deletions paper.bib
@@ -1,3 +1,17 @@
@article{julia,
doi = {10.1137/141000671},
url = {https://doi.org/10.1137%2F141000671},
year = 2017,
month = {jan},
publisher = {Society for Industrial {\&} Applied Mathematics ({SIAM})},
volume = {59},
number = {1},
pages = {65--98},
author = {Jeff Bezanson and Alan Edelman and Stefan Karpinski and Viral B. Shah},
title = {Julia: A Fresh Approach to Numerical Computing},
journal = {{SIAM} Review}
}

@Inbook{Cunningham:2008,
author="Cunningham, P{\'a}draig
and Cord, Matthieu
@@ -67,7 +81,7 @@ @inproceedings{Kubt:1997
}

@article{Chawla:2002,
-title={SMOTE: Synthetic Minority Over-sampling Technique},
+title={{SMOTE}: Synthetic Minority Over-sampling Technique},
author={N. Chawla and K. Bowyer and Lawrence O. Hall and W. Philip Kegelmeyer},
doi={10.1613/jair.953},
journal={ArXiv},
@@ -133,7 +147,7 @@ @article{Hart:1968
}

@article{Lematre:2016,
-title={Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},
+title={Imbalanced-learn: A {P}ython Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},
author={Guillaume Lema{\^i}tre and Fernando Nogueira and Christos K. Aridas},
journal={ArXiv},
year={2016},
@@ -143,7 +157,7 @@ @article{Lematre:2016


@article{Kovács:2019,
-title = {Smote-variants: A python implementation of 85 minority oversampling techniques},
+title = {Smote-variants: A {P}ython implementation of 85 minority oversampling techniques},
journal = {Neurocomputing},
volume = {366},
pages = {352-354},
@@ -158,7 +172,7 @@ @article{Kovács:2019

@online{DataCamp:2023,
author = {Bekhruz Tuychiev},
-title = {The Rise of Julia},
+title = {The Rise of {J}ulia},
year = {2023},
url = {https://www.datacamp.com/blog/the-rise-of-julia-is-it-worth-learning-in-2022},
note = {Accessed on Oct 11, 2023}
18 changes: 11 additions & 7 deletions paper.md
@@ -35,6 +35,17 @@ Given a set of observations that each belong to a certain class, supervised clas

In various real-world scenarios where supervised classification is employed, such as those pertaining to the detection of particular conditions like fraud, faults, pollution, or rare diseases, a severe discrepancy between the number of observations in each class can occur. This is known as class imbalance. This poses a problem if assumptions inherent in the classification model imply hindered performance when the model is trained on imbalanced data as is commonly the case [@Ali:2015]. Two prevalent strategies for mitigating class imbalance, when it poses a problem to the classification model, involve either increasing the representation of less frequently occurring classes through oversampling or reducing instances of more frequently occurring classes through undersampling. It may also be possible to achieve even greater performance by combining both approaches in a sequential pipeline [@Zeng:2016] or by undersampling the data multiple times and training the classification model on each resampled dataset to form an ensemble model that aggregates results from different model instances [@Liu:2009]. Contrary to undersampling, oversampling or their combination, the ensemble approach possesses the ability to address class imbalance while making use of the entire dataset and without generating synthetic data.



# Statement of Need

A substantial body of literature in the field of machine learning and statistics is devoted to addressing the class imbalance issue. This predicament has often been aptly labeled the "curse of class imbalance," as noted in [@Picek:2018] and [@Kubt:1997] which follows from the pervasive nature of the issue across diverse real-world applications and its pronounced severity; a classifier may incur an extraordinarily large performance penalty in response to training on imbalanced data.

The literature encompasses a myriad of oversampling and undersampling techniques to approach the class imbalance issue. These include SMOTE [@Chawla:2002] which operates by generating synthetic examples along the lines joining existing points, SMOTE-N and SMOTE-NC [@Chawla:2002] which are variants of SMOTE that can deal with categorical data. The sheer number of SMOTE variants makes them a body of literature on their own. Notably, the most widely cited variant of SMOTE is BorderlineSMOTE [@Han:2005]. Other well-established oversampling techniques include RWO [@Zhang:2014] and ROSE [@Menardi:2012] which operate by estimating probability densities and sampling from them to generate synthetic points. On the other hand, the literature also encompasses many undersampling techniques. Cluster undersampling [@Lin:2016] and condensed nearest neighbors [@Hart:1968] are two prominent examples which attempt to reduce the number of points while preserving the structure or classification of the data. Furthermore, methods that combine oversampling and undersampling such as SMOTETomek [@Zeng:2016] are also present. The motivation behind these methods is that when undersampling is not random, it can filter out noisy or irrelevant oversampled data. Lastly, resampling with ensemble learning has also been presented in the literature with EasyEnsemble being the most well-known approach of that type [@Liu:2009].


The existence of a toolbox with techniques that harness this wealth of research is imperative to the development of novel approaches to the class imbalance problem and for machine learning research broadly. Aside from addressing class imbalance in a general machine learning research setting, the toolbox can help in class imbalance research settings by making it possible to juxtapose different methods, compose them together, or form variants of them without having to reimplement them from scratch. In prevalent programming languages, such as Python, a variety of such toolboxes already exist, such as imbalanced-learn [@Lematre:2016] and SMOTE-variants [@Kovács:2019]. Meanwhile, Julia [@julia], a well-known programming language with over 40M downloads [@DataCamp:2023], has been lacking a similar toolbox to address the class imbalance issue in general multi-class and heterogeneous data settings. This has served as the primary motivation for the creation of the `Imbalance.jl` toolbox.

# Imbalance.jl

In this work, we present `Imbalance.jl`, a software toolbox implemented in the Julia programming language that offers over 10 well-established techniques that help address the class imbalance issue. Additionally, we present a companion package, `MLJBalancing.jl`, which: (i) facilitates the inclusion of resampling methods in pipelines with classification models via the `BalancedModel` construct; and (ii) implements a general version of the EasyEnsemble algorithm presented in [@Liu:2009].
@@ -110,14 +121,7 @@ This set of design principles is also satisfied by `Imbalance.jl`. Implemented t
The `Imbalance.jl` documentation indeed satisfies this set of design principles. Methods are associated with examples that can be copy-pasted, examples that demonstrate the operation of the technique visually, and possibly, examples that use it with a real-world dataset to improve the performance of a classification model.


# Statement of Need

A substantial body of literature in the field of machine learning and statistics is devoted to addressing the class imbalance issue. This predicament has often been aptly labeled the "curse of class imbalance," as noted in [@Picek:2018] and [@Kubt:1997] which follows from the pervasive nature of the issue across diverse real-world applications and its pronounced severity; a classifier may incur an extraordinarily large performance penalty in response to training on imbalanced data.

The literature encompasses a myriad of oversampling and undersampling techniques to approach the class imbalance issue. These include SMOTE [@Chawla:2002] which operates by generating synthetic examples along the lines joining existing points, SMOTE-N and SMOTE-NC [@Chawla:2002] which are variants of SMOTE that can deal with categorical data. The sheer number of SMOTE variants makes them a body of literature on their own. Notably, the most widely cited variant of SMOTE is BorderlineSMOTE [@Han:2005]. Other well-established oversampling techniques include RWO [@Zhang:2014] and ROSE [@Menardi:2012] which operate by estimating probability densities and sampling from them to generate synthetic points. On the other hand, the literature also encompasses many undersampling techniques. Cluster undersampling [@Lin:2016] and condensed nearest neighbors [@Hart:1968] are two prominent examples which attempt to reduce the number of points while preserving the structure or classification of the data. Furthermore, methods that combine oversampling and undersampling such as SMOTETomek [@Zeng:2016] are also present. The motivation behind these methods is that when undersampling is not random, it can filter out noisy or irrelevant oversampled data. Lastly, resampling with ensemble learning has also been presented in the literature with EasyEnsemble being the most well-known approach of that type [@Liu:2009].


The existence of a toolbox with techniques that harness this wealth of research is imperative to the development of novel approaches to the class imbalance problem and for machine learning research broadly. Aside from addressing class imbalance in a general machine learning research setting, the toolbox can help in class imbalance research settings by making it possible to juxtapose different methods, compose them together, or form variants of them without having to reimplement them from scratch. In prevalent programming languages, such as Python, a variety of such toolboxes already exist, such as imbalanced-learn [@Lematre:2016] and SMOTE-variants [@Kovács:2019]. Meanwhile, Julia, a well-known programming language with over 40M downloads [@DataCamp:2023], has been lacking a similar toolbox to address the class imbalance issue in general multi-class and heterogeneous data settings. This has served as the primary motivation for the creation of the `Imbalance.jl` toolbox.


## Author Contributions