Introduce LLM-based single-table model. #129

Merged
merged 38 commits into from
Feb 20, 2024

MooooCat (Contributor) commented Jan 31, 2024

Description

Large language models (LLMs) have long been used to understand and generate various types of data. They can also generate tabular data, and they offer some capabilities that traditional approaches (GAN-based or statistical methods) cannot provide.

In this PR, we introduce sdgx.models.LLM.single_table.SingleTableGPT.SingleTableGPTModel, our first synthetic data generation model that integrates an LLM.

Motivation and Context

Compared with existing models, SingleTableGPTModel implements two new features:

  • Generation without Data: No training data is required; synthetic data can be generated from metadata alone.
  • Off-Table Feature Inference: New columns are inferred from the existing data in the table and the knowledge the LLM already has.

In addition, SingleTableGPTModel can generate data directly, without complicated and time-consuming steps such as manual labeling and feature engineering. This saves operators time and lets them focus on creative work.
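To illustrate the two features above, here is a minimal sketch of how a prompt could be assembled from column metadata plus a list of off-table features. It is not the code in this PR: the function name `build_prompt`, its parameters, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def build_prompt(
    columns: Sequence[str],
    off_table_features: Sequence[str],
    n_rows: int,
    sample_rows: Sequence[Dict[str, object]] = (),
) -> str:
    """Assemble a plain-text prompt (hypothetical helper, not the sdgx API).

    columns            -- column names taken from the table metadata
    off_table_features -- extra columns the LLM is asked to infer
    sample_rows        -- optional existing rows; empty means "generation without data"
    """
    # Off-table features are appended after the known columns, skipping duplicates.
    extra = [c for c in off_table_features if c not in columns]
    all_columns: List[str] = list(columns) + extra

    lines = [
        f"Generate {n_rows} rows of synthetic tabular data as CSV.",
        "Columns: " + ", ".join(all_columns),
    ]
    if extra:
        lines.append("Infer these columns from the others: " + ", ".join(extra))
    if sample_rows:
        # Only included when real data is available.
        lines.append("Example rows:")
        for row in sample_rows:
            lines.append(", ".join(str(row[c]) for c in columns))
    return "\n".join(lines)


# Metadata only: no sample rows are passed, so the prompt contains no real data.
prompt = build_prompt(["age", "occupation"], ["income_level"], n_rows=5)
```

With no sample rows the prompt carries only column names, which corresponds to generation without data; passing a name in `off_table_features` asks the model for a column that the table does not contain.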

How has this been tested?

We currently provide some test cases at tests/models/test_singletableGPT.py. The test file includes recorded content returned by GPT, so the unit tests do not call GPT repeatedly and avoid consuming a large number of tokens.

I will continue to improve these test cases.
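As a rough sketch of the testing approach described above (replaying a recorded reply instead of calling GPT), the example below parses a canned CSV response through a stand-in for the model call. The names `RECORDED_RESPONSE`, `parse_rows`, `generate_rows`, and `fake_llm` are hypothetical, and the recorded text is made up for illustration; the real test file may be organized differently.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical recorded reply. In a real test this would be text captured
# once from the model and stored alongside the test.
RECORDED_RESPONSE = """age,occupation,income_level
39,Engineer,high
25,Student,low
"""


def parse_rows(response_text: str) -> List[Dict[str, str]]:
    """Parse a CSV-formatted reply into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(response_text.strip()))
    return [dict(row) for row in reader]


def generate_rows(prompt: str, ask_llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Send the prompt through `ask_llm` and parse whatever text comes back."""
    return parse_rows(ask_llm(prompt))


def fake_llm(prompt: str) -> str:
    """Stand-in for the network call: always returns the recorded reply."""
    return RECORDED_RESPONSE


# No request is sent and no tokens are consumed.
rows = generate_rows("any prompt", fake_llm)
```

Passing the model call in as a function keeps the parsing logic testable on its own; a test only needs to swap in `fake_llm`.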

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@hitsz-ids hitsz-ids deleted a comment from sweep-ai bot Jan 31, 2024
@hitsz-ids hitsz-ids deleted a comment from sweep-ai bot Jan 31, 2024
@hitsz-ids hitsz-ids deleted a comment from sweep-ai bot Jan 31, 2024
codecov-commenter commented Jan 31, 2024

Codecov Report

Attention: 64 lines in your changes are missing coverage. Please review.

Comparison is base (c55e340) 80.35% compared to head (a043c5c) 79.87%.

Files                                  Patch %   Lines
sdgx/models/LLM/single_table/gpt.py    71.21%    59 Missing ⚠️
sdgx/models/LLM/base.py                87.80%    5 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #129      +/-   ##
==========================================
- Coverage   80.35%   79.87%   -0.48%     
==========================================
  Files          66       69       +3     
  Lines        3003     3250     +247     
==========================================
+ Hits         2413     2596     +183     
- Misses        590      654      +64     
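
The percentages in the table follow from the hit and line counts. The short check below reproduces them, assuming the report truncates to two decimals rather than rounding (an assumption that matches the figures shown, since 2596/3250 is about 79.877%). The helper name `coverage_bp` is hypothetical.

```python
def coverage_bp(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated (integer arithmetic)."""
    return hits * 10000 // lines


# Counts taken from the coverage diff above.
base_hits, base_misses, base_lines = 2413, 590, 3003
head_hits, head_misses, head_lines = 2596, 654, 3250

# Hits plus misses account for every tracked line.
assert base_hits + base_misses == base_lines
assert head_hits + head_misses == head_lines

base = coverage_bp(base_hits, base_lines)   # 8035, i.e. 80.35%
head = coverage_bp(head_hits, head_lines)   # 7987, i.e. 79.87%
delta = head - base                         # -48, i.e. -0.48 points

summary = f"{base / 100:.2f}% -> {head / 100:.2f}% ({delta / 100:+.2f})"
print(summary)
```

The drop comes from the 247 added lines, of which 64 are not exercised by tests.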


MooooCat (Contributor Author) commented Jan 31, 2024

Some of the code comments are still incomplete at the moment; I will add them as soon as possible.

In addition, the unit test coverage is insufficient, so I will add more test cases.

After completing this I will set the PR status to Ready. Developers are also welcome to help me improve these two areas.

Collaborator

We should use snake_case for the filename, maybe test_singletable_gpt.py?

Contributor Author

This makes sense; I'll change the filename.

@MooooCat MooooCat marked this pull request as ready for review February 20, 2024 08:00
@MooooCat MooooCat requested a review from Z712023 February 20, 2024 08:00
@MooooCat MooooCat enabled auto-merge (squash) February 20, 2024 08:05
the metadata.
"""

off_table_features = []

Collaborator

The introduction of the variable off_table_features is an interesting idea. :)

@MooooCat MooooCat merged commit 269063d into main Feb 20, 2024
11 checks passed
@MooooCat MooooCat deleted the feature-LLM-models branch February 20, 2024 08:58
Wh1isper (Collaborator) commented

How about releasing 0.2.0 for it?

MooooCat (Contributor Author) commented

How about releasing 0.2.0 for it?

Good idea. I need to make a few more changes (Google Colab examples, README updates, ...) before releasing.
