Add ConstInspector and ConstValueTransformer for Handling Constant Columns #202

Merged on Jul 31, 2024 (32 commits)

MooooCat
Contributor

@MooooCat MooooCat commented Jul 12, 2024

Description

This pull request introduces several enhancements and fixes to the Synthetic Data Generator (SDG) framework, focusing on the handling of constant columns in tabular data. The changes include:

  • Addition of a ConstInspector class to identify columns with constant values in a DataFrame.
  • Implementation of a ConstValueTransformer class to transform and reverse transform data by replacing specified columns with constant values.
  • Updates to metadata handling to include constant columns.
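
The idea behind the first two items can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the SDG implementation: the function names are hypothetical, a plain dict of column lists stands in for a DataFrame, and it assumes the forward step drops constant columns while the reverse step re-inserts them with the recorded value (the description above only says columns are replaced with constant values).

```python
from typing import Any, Dict, List

# Stand-in for a DataFrame: column name -> list of values (assumption for this sketch).
Table = Dict[str, List[Any]]


def find_const_columns(table: Table) -> Dict[str, Any]:
    """Return {column: value} for every non-empty column with a single distinct value."""
    const_columns: Dict[str, Any] = {}
    for name, values in table.items():
        if values and all(v == values[0] for v in values):
            const_columns[name] = values[0]
    return const_columns


def drop_const_columns(table: Table, const_columns: Dict[str, Any]) -> Table:
    """Forward step (assumed): remove constant columns before fitting a model."""
    return {name: values for name, values in table.items() if name not in const_columns}


def restore_const_columns(table: Table, const_columns: Dict[str, Any]) -> Table:
    """Reverse step (assumed): re-attach each constant column, filled with its value."""
    n_rows = len(next(iter(table.values()))) if table else 0
    restored = dict(table)
    for name, value in const_columns.items():
        restored[name] = [value] * n_rows
    return restored


# Usage: "country" is constant in the original data, "age" is not.
original = {"age": [39, 50, 38], "country": ["US", "US", "US"]}
consts = find_const_columns(original)
reduced = drop_const_columns(original, consts)

# Pretend a model produced two synthetic rows for the remaining column.
synthetic = {"age": [41, 47]}
restored = restore_const_columns(synthetic, consts)
```

Recording the constant value at inspection time means the generative model never has to learn a column that carries no variation, and the reverse step can reproduce it exactly for any number of sampled rows.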

Motivation and Context

This change is required to improve the quality and utility of the synthetic data generated by the SDG framework.

By identifying and handling constant columns, we ensure that the synthetic data maintains the integrity of the original data.

This enhancement also addresses the need for more robust data transformation capabilities, allowing for more accurate and controlled generation of synthetic data.

How has this been tested?

The changes have been thoroughly tested using unit tests that cover the new functionality introduced by ConstInspector and ConstValueTransformer.

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

MooooCat and others added 23 commits July 12, 2024 17:19
The management of metadata fields may be flawed, necessitating an examination of the eq method or the manner in which fields are retrieved. We will open a separate pull request to address this issue.
Erroneous references to certain pytest.fixture instances in the tests can be resolved by using deepcopy.
…s to ensure they are comprehensive and reflect the latest functionality.
@MooooCat
Contributor Author

We have observed that different Metadata instances initialized within the same function, or within the same batch of unit tests, seem to interfere with each other, leading to inaccurate table metadata. This might be a bug, and we should create a separate Issue and PR to address it.

For example, consider the error in tests/data_models/test_metadata.py::test_demo_multi_table_data_metadata_parent. This test is intended for a multi-table dataset, but the metadata includes columns from the single-table dataset adult.csv, i.e. {'workclass', 'fnlwgt', 'age'}. This issue could be caused by the metadata or the inspector.
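
The root cause has not been identified here. One common way this kind of cross-object interference arises in Python is that two objects hold a reference to the same mutable container, which is also the situation the deepcopy fix for fixtures guards against. The sketch below is purely illustrative: `FakeMetadata` and its attributes are hypothetical and are not SDG's Metadata class.

```python
import copy


class FakeMetadata:
    """Hypothetical stand-in for a metadata object (not SDG's Metadata)."""

    def __init__(self, column_list):
        # Keeps a reference to the caller's set instead of copying it.
        self.column_list = column_list

    def add_column(self, name):
        self.column_list.add(name)


# Imagine this set comes from a fixture that is reused across tests.
shared_columns = {"id"}

# Without a copy, both objects mutate the same underlying set.
single = FakeMetadata(shared_columns)
single.add_column("age")
multi = FakeMetadata(shared_columns)
leaked = "age" in multi.column_list  # state from `single` shows up in `multi`

# With deepcopy, each object gets its own independent set.
base_columns = {"id"}
single_safe = FakeMetadata(copy.deepcopy(base_columns))
single_safe.add_column("age")
multi_safe = FakeMetadata(copy.deepcopy(base_columns))
isolated = "age" not in multi_safe.column_list
```

If the real problem has this shape, columns from a single-table dataset could appear in metadata built later for a multi-table dataset, which would match the symptom described above. Confirming that requires inspecting how Metadata and the inspectors store their fields.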

@MooooCat MooooCat merged commit 7d37e58 into main Jul 31, 2024
12 checks passed
@MooooCat MooooCat deleted the feature-handle-const-columns branch August 22, 2024 02:12