[model:component] Add sampling techniques to address imbalanced dataset #4283

benjaminmah · 2024-06-28T12:21:54Z

Resolves #4281.

Investigating and adding sampling techniques (e.g. SMOTE, SMOTEENN, random undersampling) to address the imbalanced dataset of bugs.

benjaminmah · 2024-06-28T12:25:37Z

Still a WIP. Metrics collected from SMOTE can be found here: metrics.log

Takeaways:

SMOTE increases the training time to around 2 hours (up from the current 30-40 minutes).
Precision and accuracy are extremely low with SMOTE applied.

This is most likely due to the huge difference in the number of bugs across products and components (1000+ vs. 20). SMOTE oversamples every minority class until it matches the size of the majority class, so the ratio of synthetic data to real data becomes very large.
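To make the synthetic-to-real ratio concrete, here is a rough sketch of the arithmetic. The class names and counts are hypothetical, loosely based on the 1000+ vs. 20 figures above, and the capped variant is only an illustration of one possible mitigation, not something this PR implements.

```python
from collections import Counter


def balanced_targets(counts):
    """Per-class sample counts when every class is grown to the majority size."""
    majority = max(counts.values())
    return {label: majority for label in counts}


def capped_targets(counts, max_multiplier):
    """Grow each class by at most max_multiplier, never past the majority size."""
    majority = max(counts.values())
    return {
        label: min(majority, n * max_multiplier)
        for label, n in counts.items()
    }


def synthetic_fraction(counts, targets):
    """Fraction of the resampled training set that is synthetic."""
    real = sum(counts.values())
    total = sum(targets.values())
    return (total - real) / total


# Hypothetical component counts (assumption, not taken from the real dataset).
counts = Counter({"big_component": 1000, "small_a": 20, "small_b": 20})

full = balanced_targets(counts)
capped = capped_targets(counts, max_multiplier=3)

print("balanced:", full, round(synthetic_fraction(counts, full), 3))
print("capped:  ", capped, round(synthetic_fraction(counts, capped), 3))
```

With these made-up numbers, full balancing makes roughly two thirds of the training set synthetic, while a 3x cap keeps it under a tenth. A dict like `capped` has the shape that imbalanced-learn's `SMOTE` accepts for `sampling_strategy` (class label mapped to the desired number of samples), so it could be tried there; whether it actually improves the metrics here is untested.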

benjaminmah added 2 commits June 25, 2024 16:39

Added SMOTE oversampling and random undersampling

63bcd8f

Added SMOTE

f450947
