Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Datatime formatter in small dataset and improve performace #244

Merged
merged 2 commits into from
Nov 22, 2024

Conversation

cyantangerine
Copy link
Contributor

Description

When trainning ctgan, we used the DataLoader to load data as chunk. While, due to the historical reasons, DatetimeFormatter using a simple list to format columns. When we using DataLoader, the formatter format the data by chunk, which lost it's index. So, when we concat the next formatted chunk column to chunk table, the table result (index beginning by 1*chunk_size) will be NaN becase it can not match the zero-based index of chunk column.

Motivation and Context

I changed the method of formatting, instead of using the 'apply' function of DataFrame/Series. It fixed the problem and improved the performance by using 'for' cycling.

How has this been tested?

A whole test for DatetimeFormatter has been given.

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@cyantangerine
Copy link
Contributor Author

cyantangerine commented Nov 22, 2024

I suggest that we should have a check for all code to replace 'for' to pandas.apply morely. I saw a lot of implement like this in code. It can improve the performance a lot.

@jalr4ever jalr4ever requested a review from Wh1isper November 22, 2024 10:14
@jalr4ever
Copy link
Collaborator

I suggest that we should have a check for all code to replace 'for' to pandas.apply morely. I saw a lot of implement like this in code. It can improve the performance a lot.

Thank you very much! Regarding this issue, I think it would be best to have a use case that can immediately demonstrate the system's performance problems, so we can then discuss whether to make this optimization. If there is interest in this performance issue, we can create an ISSUE for separate analysis and tracking.

Copy link
Collaborator

@Wh1isper Wh1isper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable for me, @jalr4ever what do you think?

@Wh1isper
Copy link
Collaborator

Wh1isper commented Nov 22, 2024

And yes, for is usually bad in performance. I've open an issue for it #245. If anyone has interested in it, feel free to draft a PR!

@Wh1isper Wh1isper changed the title BugFix: Datetime formatter go error when chunk_size > dataset rows. fix: Datatime formatter in small dataset and improve performace Nov 22, 2024
@Wh1isper Wh1isper merged commit 0fc9ea2 into hitsz-ids:main Nov 22, 2024
11 checks passed
@cyantangerine
Copy link
Contributor Author

@Wh1isper
Something to attention: It's a bug.

Due to the original method, when the dataset length is bigger than DataLoader chunksize, the result of processed data is FULL of NaN except the first chunk.

@cyantangerine
Copy link
Contributor Author

cyantangerine commented Nov 22, 2024

default chunk_size = 10000
so for 15000 length dataset, it has 5000 NaN
image

@cyantangerine cyantangerine deleted the bugfix-datetimeformatter branch November 22, 2024 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants