fix: Datetime formatter in small dataset and improve performance #244
I suggest we review the whole codebase and replace 'for' loops with pandas.apply wherever possible. I saw many implementations like this in the code, and changing them could improve performance considerably.
Thank you very much! Regarding this issue, I think it would be best to have a use case that immediately demonstrates the system's performance problems, so we can then discuss whether to make this optimization. If there is interest in this performance issue, we can create an issue for separate analysis and tracking.
Looks reasonable to me. @jalr4ever, what do you think?
And yes, |
@Wh1isper With the original method, when the dataset is longer than the DataLoader chunk size, the processed data is full of NaN everywhere except the first chunk.
Description
When training CTGAN, we use the DataLoader to load data in chunks. For historical reasons, DatetimeFormatter used a plain list to format columns. With the DataLoader, the formatter processes the data chunk by chunk, and the formatted column loses its index. As a result, when we concat the formatted column of a later chunk to the chunk table, whose index begins at 1*chunk_size, the result is NaN because that index cannot match the zero-based index of the formatted column.
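The index mismatch described above can be reproduced with plain pandas. This is only a minimal sketch, not the project's code: the column names, the values, and the chunk size of 3 are made up for illustration.

```python
import pandas as pd

# Hypothetical second chunk: its index starts at chunk_size (3 here), not 0.
chunk = pd.DataFrame({"id": [10, 11, 12]}, index=[3, 4, 5])

# Formatting through a plain list drops the index, so the rebuilt
# Series gets a fresh zero-based index (0, 1, 2).
formatted = pd.Series([1.0, 2.0, 3.0], name="ts")

# concat aligns on index labels; none of 0..2 match 3..5,
# so the rows that belong to the chunk get NaN in "ts".
broken = pd.concat([chunk, formatted], axis=1)

# Keeping the chunk's own index on the formatted column avoids the mismatch.
formatted_ok = pd.Series([1.0, 2.0, 3.0], name="ts", index=chunk.index)
fixed = pd.concat([chunk, formatted_ok], axis=1)

print(broken)
print(fixed)
```

The first chunk is unaffected because its index already starts at 0, which matches the behavior reported in the review comments.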
Motivation and Context
I changed the formatting method to use the 'apply' function of DataFrame/Series instead of a 'for' loop. This fixes the problem and improves performance compared with the 'for' loop.
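As a rough illustration of why an apply-based approach avoids the NaN problem, the sketch below converts a date column with `Series.apply`, which returns a result carrying the same index as its input. The helper `to_timestamp`, the date format, and the UTC assumption are hypothetical and are not taken from the DatetimeFormatter implementation.

```python
from datetime import datetime, timezone

import pandas as pd


def to_timestamp(value: str, fmt: str = "%Y-%m-%d") -> float:
    """Hypothetical helper: parse a date string and return a UTC epoch timestamp."""
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc).timestamp()


# Hypothetical later chunk whose index does not start at 0.
chunk = pd.DataFrame({"date": ["2023-01-01", "2023-01-02"]}, index=[3, 4])

# Series.apply keeps the original index, so assigning the result back
# aligns row by row and produces no NaN.
chunk["date"] = chunk["date"].apply(to_timestamp)

print(chunk)
```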
How has this been tested?
A full set of tests for DatetimeFormatter has been added.