Enhance: Fix Data Quality with Outlier Handling and Improved Missing Value Treatment #207
Description
This pull request enhances the Synthetic Data Generator (SDG) framework, focusing on data quality and the handling of outliers and missing values. The key changes include:
Introduction of OutlierTransformer: A new transformer class designed to handle outliers in the data by converting them to specified fill values. This class is equipped to manage outliers in both integer and float columns, replacing them with default fill values (0 for integers and 0.0 for floats).
Enhancements to NonValueTransformer: The NonValueTransformer class has been updated to better handle missing values in a DataFrame. It now differentiates between numeric and non-numeric columns, filling missing values in numeric columns with specified numeric defaults (0 for integers, 0.0 for floats) and non-numeric columns with a default string ('NAN_VALUE').
Documentation Updates: Comprehensive docstrings have been added to both the OutlierTransformer and NonValueTransformer classes, providing clear descriptions of their functionalities, attributes, and methods.
Manager Registration: The OutlierTransformer has been registered with the DataProcessorManager, ensuring it can be utilized within the SDG framework.
Regex Inspector Parameter Update: A minor update to the Regex Inspector's fit method to change the parameter name from raw_data to input_raw_data for clarity and consistency.
DiscreteTransformer Registration: DiscreteTransformer is currently disabled.
Test Cases for OutlierTransformer: Added test cases to validate the functionality of the OutlierTransformer, including handling of outliers in integer and float columns.
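As a rough illustration of the fill behaviour described in the list above, here is a minimal pandas sketch. It is not the code in this pull request: the function names `fill_missing` and `replace_outliers` are hypothetical, and the 1.5 × IQR rule used to flag outliers is an assumption, since the description does not say how OutlierTransformer detects them. Only the fill values (0, 0.0, and 'NAN_VALUE') come from the description.

```python
import pandas as pd

# Fill values named in the PR description.
INT_FILL = 0
FLOAT_FILL = 0.0
STR_FILL = "NAN_VALUE"


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill missing values by column type."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Only nullable integer columns (e.g. "Int64") can hold missing values.
            out[col] = out[col].fillna(INT_FILL)
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = out[col].fillna(FLOAT_FILL)
        else:
            # This sketch treats every other column as string-like.
            out[col] = out[col].fillna(STR_FILL)
    return out


def replace_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: replace numeric outliers with the fill value.

    Assumption: a value is an outlier when it lies more than k * IQR
    below the first quartile or above the third quartile.
    """
    out = df.copy()
    for col in out.columns:
        s = out[col]
        is_bool = pd.api.types.is_bool_dtype(s)
        if is_bool or not pd.api.types.is_numeric_dtype(s):
            continue  # leave non-numeric columns untouched
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        is_outlier = (s < q1 - k * iqr) | (s > q3 + k * iqr)
        fill = INT_FILL if pd.api.types.is_integer_dtype(s) else FLOAT_FILL
        out[col] = s.mask(is_outlier, fill)
    return out


if __name__ == "__main__":
    raw = pd.DataFrame({"score": [1.5, None], "name": ["a", None]})
    print(fill_missing(raw))

    noisy = pd.DataFrame({"count": [10, 12, 11, 13, 1000]})
    print(replace_outliers(noisy))
```

Replacing an outlier with 0 keeps the column's dtype but can itself distort the distribution, so the real transformer may expose the fill value as a configurable attribute.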
Motivation and Context
This change is required to enhance the robustness and reliability of the SDG, particularly in scenarios where data contains outliers or missing values.
By introducing the OutlierTransformer and enhancing the NonValueTransformer, we ensure that the generated synthetic data is of higher quality, suitable for a wider range of applications, and more representative of real-world data anomalies.
How has this been tested?
The changes have been thoroughly tested using automated test cases. Specifically:
Types of changes
Checklist: