
[PERF] Improve performance of read_csv with memory_map=True when file encoding is UTF-8 #43787

Merged: 1 commit merged into pandas-dev:master on Oct 6, 2021

Conversation

michal-gh
Contributor

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

This PR improves the performance of read_csv with memory_map=True and a UTF-8-encoded file by eliminating an unnecessary decode() call. The code below demonstrates the speed improvement:

import timeit
import pandas as pd

def perftest_readcsv_memmap_utf8():
    # Build lines of consecutive Unicode characters, skipping ranges
    # (e.g. the surrogates) that cannot be encoded as UTF-8.
    lines = []
    for lnum in range(0x20, 0x10080, 0x80):
        line = "".join([chr(c) for c in range(lnum, lnum + 0x80)]) + "\n"
        try:
            line.encode("utf-8")
        except UnicodeEncodeError:
            continue
        lines.append(line)
    # Repeat the lines 1000 times to produce a ~182 MB CSV file.
    df = pd.DataFrame(lines)
    df = pd.concat([df for n in range(1000)], ignore_index=True)
    fname = "test_readcsv_utf8.csv"
    df.to_csv(fname, index=False, header=False, encoding="utf-8")
    ti_rep = 5
    ti_num = 10
    time_dfmem = timeit.repeat(
        f'dfnomem = pd.read_csv("{fname}", header=None, memory_map=True, engine="c")',
        setup="import pandas as pd",
        repeat=ti_rep, number=ti_num)
    print(f"Read CSV (memory_map=True), repeat={ti_rep}: {time_dfmem}")
    print(f"Median: {pd.Series(time_dfmem).median()}")

perftest_readcsv_memmap_utf8()

On my machine the results are:
Without patch:

Read CSV (memory_map=True), repeat=5: [8.107480439008214, 8.100845866953023, 8.135483622085303, 8.090781628969125, 8.068992758984677]
Median: 8.100845866953023

With patch:

Read CSV (memory_map=True), repeat=5: [5.280414769076742, 5.290980814024806, 5.127453051973134, 5.150275847059675, 5.276113986037672]
Median: 5.276113986037672

The improved code runs in ~65% of the time of the current code.
The speedup depends on the contents of the file: the test code above creates a 182 MB file containing almost all of Unicode Plane 0, plus several Plane 1 characters, UTF-8-encoded in 1, 2, 3 and 4 bytes; in this respect, it is a worst case. I also tested this patch on an 8.8 GB CSV file consisting of 1- and 2-byte-encoded characters, and the code ran in ~75% of the time of the current code.
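The 1-to-4-byte spread mentioned above can be checked directly in Python. A quick illustration (the sample characters here are my own choices, not taken from the PR):

```python
# UTF-8 encodes a code point in 1 to 4 bytes depending on its range.
samples = {
    "A": 1,           # U+0041, ASCII (Plane 0)
    "\u00e9": 2,      # é, Latin-1 Supplement (Plane 0)
    "\u20ac": 3,      # €, Currency Symbols (Plane 0)
    "\U0001d11e": 4,  # musical symbol G clef (Plane 1)
}
for ch, nbytes in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
    assert len(encoded) == nbytes
```

The old code path decoded the memory-mapped bytes regardless of encoding; when the file is already UTF-8 (the encoding the C parser consumes), that round trip is pure overhead, and the cost grows with the share of multi-byte characters.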

@jreback added the IO CSV (read_csv, to_csv), Performance (memory or execution speed), and Unicode (Unicode strings) labels on Sep 28, 2021

jreback commented Sep 28, 2021

@michal-gh yeah if you can run the existing asv's and report here (and you might need to add one to cover this case).

cc @twoertwein

@twoertwein (Member) left a comment

You might want to add a whatsnew entry for this. This is quite an impressive speedup!

    @skip_pyarrow
    def test_readcsv_memmap_utf8(all_parsers):
        lines = []
        for lnum in range(0x20, 0x10080, 0x80):
twoertwein (Member):
Maybe add a comment what these magic numbers represent or what the block is doing.
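A hypothetical version of what such a comment plus named constants could look like (the variable names below are my own, not necessarily those in the merged commit):

```python
# Sample every 128-code-point block from U+0020 to just past the end of
# Plane 0, skipping blocks that cannot be UTF-8-encoded (the surrogate
# range U+D800..U+DFFF).
first_cp = 0x20    # first printable code point (skip C0 control chars)
last_cp = 0x10080  # just beyond Plane 0, reaching into Plane 1
block_size = 0x80  # one 128-character block per line

lines = []
for lnum in range(first_cp, last_cp, block_size):
    line = "".join(chr(c) for c in range(lnum, lnum + block_size)) + "\n"
    try:
        line.encode("utf-8")
    except UnicodeEncodeError:
        continue  # block contains surrogates
    lines.append(line)
```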

michal-gh (Contributor, Author):
Added explanatory variables and comment.

@jreback (Contributor) left a comment

do we have an asv which covers this (if it's straightforward to do)?

doc/source/whatsnew/v1.3.4.rst (review comment, outdated and resolved)
@jreback jreback added this to the 1.4 milestone Oct 3, 2021

michal-gh commented Oct 5, 2021

The last commit above includes the asv perf test. On my machine, running the io.csv asv benchmarks returns:

       before           after         ratio
     [bd94bb12]       [52ac1bed]
     <perf-readcsv~2^2^2>       <perf-readcsv>
-      81.6±0.5ms       55.3±0.4ms     0.68  io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

I also corrected the whatsnew file.
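For reference, an asv benchmark covering this case might be shaped roughly as follows. This is a sketch under my own assumptions: the class and method names may differ from the benchmark added in this PR, and the data volume is deliberately reduced here so it runs quickly.

```python
import os

import pandas as pd


class ReadCSVMemMapUTF8:
    fname = "__memmap_utf8__.csv"

    def setup(self):
        # Lines spanning 1- to 4-byte UTF-8 characters, skipping
        # unencodable (surrogate) blocks.
        lines = []
        for lnum in range(0x20, 0x10080, 0x80):
            line = "".join(chr(c) for c in range(lnum, lnum + 0x80)) + "\n"
            try:
                line.encode("utf-8")
            except UnicodeEncodeError:
                continue
            lines.append(line)
        df = pd.concat([pd.DataFrame(lines)] * 20, ignore_index=True)
        df.to_csv(self.fname, index=False, header=False, encoding="utf-8")

    def time_read_memmapped_utf8(self):
        pd.read_csv(self.fname, header=None, memory_map=True, engine="c")

    def teardown(self):
        os.remove(self.fname)
```

asv calls `setup` before each timed run of `time_read_memmapped_utf8` and `teardown` after, so the file is created and removed around the measurement rather than inside it.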

@jreback jreback merged commit 8c201e2 into pandas-dev:master Oct 6, 2021

jreback commented Oct 6, 2021

thanks @michal-gh very nice!
