[PERF] Improve performance of read_csv with memory_map=True when file encoding is UTF-8 #43787
Conversation
@michal-gh yeah if you can run the existing asv's and report here (and you might need to add one to cover this case). cc @twoertwein
You might want to add a whatsnew entry for this. This is quite an impressive speedup!
@skip_pyarrow
def test_readcsv_memmap_utf8(all_parsers):
    lines = []
    for lnum in range(0x20, 0x10080, 0x80):
Maybe add a comment explaining what these magic numbers represent or what the block is doing.
Added explanatory variables and a comment.
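For context, a rough sketch of what the revised test might look like with those explanatory names (the variable names `line_length`, `start_char`, and `end_char` are illustrative, and the `all_parsers` pytest fixture is assumed from pandas' parser test suite; this is not necessarily the exact code that was merged):

```python
import pandas as pd
import pandas._testing as tm


def test_readcsv_memmap_utf8(all_parsers):
    # Each line holds `line_length` consecutive Unicode code points, so the
    # file exercises characters whose UTF-8 encodings are 1, 2, 3 and 4 bytes.
    line_length = 0x80      # 128 code points per line
    start_char = 0x20       # first printable ASCII character (space)
    end_char = 0x10080      # a short way into Unicode Plane 1
    lines = []
    for lnum in range(start_char, end_char, line_length):
        line = "".join(chr(c) for c in range(lnum, lnum + line_length)) + "\n"
        try:
            line.encode("utf-8")
        except UnicodeEncodeError:
            # Surrogate code points cannot be encoded as UTF-8; skip them.
            continue
        lines.append(line)

    parser = all_parsers
    expected = pd.DataFrame(lines)
    with tm.ensure_clean("utf8test.csv") as path:
        expected.to_csv(path, index=False, header=False, encoding="utf-8")
        result = parser.read_csv(
            path, header=None, memory_map=True, encoding="utf-8"
        )
    tm.assert_frame_equal(result, expected)
```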
do we have an asv which covers this (if it's straightforward to do)?
960ddb5 to 944c73b
The last commit above includes an asv perf test. On my machine, running asv's io.csv returns:

I also corrected the whatsnew file.
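For reference, a sketch of what such an asv benchmark could look like in asv_bench/benchmarks/io/csv.py (class and method names here are illustrative assumptions, not necessarily those in the commit):

```python
import os

from pandas import DataFrame, concat, read_csv


class ReadCSVMemMapUTF8:
    # Hypothetical benchmark class; the committed benchmark may differ.
    fname = "__memmap_utf8__.csv"

    def setup(self):
        # Write a CSV whose fields span 1-, 2-, 3- and 4-byte UTF-8 sequences,
        # so decoding dominates the runtime of the memory-mapped read.
        lines = []
        for lnum in range(0x20, 0x10080, 0x80):
            line = "".join(chr(c) for c in range(lnum, lnum + 0x80))
            try:
                line.encode("utf-8")
            except UnicodeEncodeError:
                continue  # skip surrogate ranges
            lines.append(line)
        df = concat([DataFrame(lines)] * 100, ignore_index=True)
        df.to_csv(self.fname, index=False, header=False, encoding="utf-8")

    def time_read_memmapped_utf8(self):
        read_csv(self.fname, header=None, memory_map=True, encoding="utf-8")

    def teardown(self):
        os.remove(self.fname)
```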
thanks @michal-gh very nice!
This PR improves the performance of read_csv with memory_map=True and a UTF-8-encoded file by eliminating an unnecessary decode() call. The code below demonstrates the speed improvement:
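(The original benchmark snippet is not reproduced here; the following is a minimal sketch along the lines described, with the repetition count and file name chosen arbitrarily rather than taken from the PR.)

```python
import time

import pandas as pd

# Build a large UTF-8 CSV whose characters need 1, 2, 3 and 4 bytes to encode.
lines = []
for lnum in range(0x20, 0x10080, 0x80):
    line = "".join(chr(c) for c in range(lnum, lnum + 0x80))
    try:
        line.encode("utf-8")
    except UnicodeEncodeError:
        continue  # skip surrogate ranges
    lines.append(line)

df = pd.concat([pd.DataFrame(lines)] * 1000, ignore_index=True)
df.to_csv("utf8_bench.csv", index=False, header=False, encoding="utf-8")

start = time.perf_counter()
pd.read_csv("utf8_bench.csv", header=None, memory_map=True, encoding="utf-8")
print(f"memory_map=True read took {time.perf_counter() - start:.2f} s")
```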
On my machine the results are:

Without patch:
With patch:
The improved code runs in ~65% of the time of the current code.
The speedup depends on the contents of the file; the test code above creates a 182 MB file containing almost all of Unicode Plane 0 and several Plane 1 characters, UTF-8-encoded in 1, 2, 3 and 4 bytes, so in this respect it is a worst case. I also tested this patch on my 8.8 GB CSV file consisting of 1- and 2-byte encoded characters, and the code ran in ~75% of the time of the current code.