fix: Datetime formatter in small dataset and improve performance #244

cyantangerine · 2024-11-22T10:05:02Z

Description

When training CTGAN, we use the DataLoader to load data in chunks. For historical reasons, however, DatetimeFormatter formats columns using a plain list. With the DataLoader, the formatter processes the data chunk by chunk, and the formatted column loses its index. So, when the formatted column of a later chunk is concatenated back onto the chunk's table, the rows of that table (whose index starts at 1*chunk_size) become NaN because they cannot be matched against the zero-based index of the formatted column.
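The mismatch described above can be reproduced with plain pandas. This is only a minimal sketch under stated assumptions: the table, the column names (`id`, `created`, `created_ts`) and the list-based formatting step are hypothetical stand-ins, not the actual DatetimeFormatter code.

```python
import pandas as pd

# Hypothetical second chunk of a 15-row table read with a chunk size of 10:
# its index continues from the first chunk (10..14) instead of restarting at 0.
chunk = pd.DataFrame(
    {"id": range(10, 15), "created": ["2024-11-22"] * 5},
    index=range(10, 15),
)

# Formatting through a plain list drops the index, so the rebuilt Series
# is labelled 0..4.
timestamps = [pd.Timestamp(v).timestamp() for v in chunk["created"]]
formatted = pd.Series(timestamps, name="created_ts")

# Column-wise concat aligns on index labels; labels 10..14 find no partner.
result = pd.concat([chunk, formatted], axis=1)
nan_in_chunk_rows = int(result.loc[chunk.index, "created_ts"].isna().sum())
print(nan_in_chunk_rows)
```

Every row that belongs to the chunk ends up with NaN in the formatted column, while the values themselves land on unrelated labels 0..4.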

Motivation and Context

I changed the formatting method to use the 'apply' function of DataFrame/Series instead of a 'for' loop. This fixes the problem and improves performance.
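As a rough illustration of why an index-preserving transformation avoids the problem, the sketch below uses `Series.apply`, which returns a Series carrying the same index as its input. The chunk, the column names and the timestamp conversion are assumptions made for this example and are not taken from the PR's diff.

```python
import pandas as pd

# Hypothetical chunk whose index does not start at zero.
chunk = pd.DataFrame(
    {"created": ["2024-11-22", "2024-11-23"]},
    index=[10, 11],
)

# Series.apply keeps the input's index labels (10, 11) on the output.
formatted = chunk["created"].apply(lambda v: pd.Timestamp(v).timestamp())

# Because the labels still match, column-wise concat lines the rows up.
result = pd.concat([chunk, formatted.rename("created_ts")], axis=1)
print(result)
```

No reindexing step is needed afterwards, because the formatted column never leaves the chunk's index space.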

How has this been tested?

A full test for DatetimeFormatter has been added.

Types of changes

Maintenance (no change in code, maintain the project's CI, docs, etc.)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

cyantangerine · 2024-11-22T10:09:58Z

I suggest we review all of the code and replace 'for' loops with pandas.apply wherever possible. I saw a lot of implementations like this in the code. It could improve performance a lot.

jalr4ever · 2024-11-22T10:22:41Z

> I suggest we review all of the code and replace 'for' loops with pandas.apply wherever possible. I saw a lot of implementations like this in the code. It could improve performance a lot.

Thank you very much! Regarding this issue, I think it would be best to have a use case that directly demonstrates the system's performance problems, so that we can then discuss whether to make this optimization. If there is interest in this performance issue, we can create an issue for separate analysis and tracking.

Wh1isper

Looks reasonable to me. @jalr4ever, what do you think?

Wh1isper · 2024-11-22T10:38:42Z

And yes, 'for' is usually bad for performance. I've opened an issue for it: #245. If anyone is interested in it, feel free to draft a PR!

cyantangerine · 2024-11-22T10:54:10Z

@Wh1isper
Please note: this is a bug.

With the original method, when the dataset is longer than the DataLoader chunk size, the processed data is FULL of NaN except for the first chunk.

cyantangerine · 2024-11-22T10:56:08Z

The default chunk_size is 10000,
so a dataset of 15000 rows ends up with 5000 NaN.
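The figure above can be checked with a small simulation. This is a sketch under assumptions: the chunking loop, the `id` and `created` columns and the list-based formatter are hypothetical stand-ins for the DataLoader and the original DatetimeFormatter, chosen only to mimic the index behaviour described in this thread.

```python
import pandas as pd

CHUNK_SIZE = 10_000  # default chunk size quoted above
N_ROWS = 15_000      # dataset length from the example above

table = pd.DataFrame(
    {"id": range(N_ROWS), "created": ["2024-11-22 10:56:08"] * N_ROWS}
)

processed = []
for start in range(0, N_ROWS, CHUNK_SIZE):
    # Slicing keeps the original labels, e.g. 10000..14999 for chunk two.
    chunk = table.iloc[start:start + CHUNK_SIZE]
    # List-based formatting: the rebuilt Series is labelled from zero again.
    stamps = pd.Series(
        [pd.Timestamp(v).timestamp() for v in chunk["created"]],
        name="created",
    )
    rest = chunk.drop(columns="created")
    # Concat aligns on labels; keep only the rows that belong to the chunk.
    merged = pd.concat([rest, stamps], axis=1).loc[chunk.index]
    processed.append(merged)

nan_count = int(pd.concat(processed)["created"].isna().sum())
print(nan_count)
```

The first chunk is unaffected because its labels happen to start at zero; all 5000 rows of the second chunk lose their formatted value.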

cyantangerine and others added 2 commits November 22, 2024 17:03

bugfix: datetime_formatter error

7461c0d

[pre-commit.ci] auto fixes from pre-commit.com hooks

3176553

for more information, see https://pre-commit.ci

jalr4ever requested a review from Wh1isper November 22, 2024 10:14

Wh1isper reviewed Nov 22, 2024

cyantangerine mentioned this pull request Nov 22, 2024

Performance: reduce for cycles when handling dataframe #245

Open

Wh1isper approved these changes Nov 22, 2024

Wh1isper changed the title ~~BugFix: Datetime formatter go error when chunk_size > dataset rows.~~ fix: Datetime formatter in small dataset and improve performance Nov 22, 2024

Wh1isper merged commit 0fc9ea2 into hitsz-ids:main Nov 22, 2024
11 checks passed

cyantangerine deleted the bugfix-datetimeformatter branch November 22, 2024 11:03
