
[PERF] Improve performance of read_csv with memory_map=True when file encoding is UTF-8 #43787

Merged: 1 commit merged into pandas-dev:master on Oct 6, 2021

Conversation

michal-gh
Contributor

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

This PR improves the performance of read_csv with memory_map=True and a UTF-8-encoded file by eliminating an unnecessary decode() call. The code below demonstrates the speed improvement:

import timeit
import pandas as pd

def perftest_readcsv_memmap_utf8():
    # Build lines of consecutive Unicode characters, skipping ranges
    # (e.g. the surrogates) that cannot be encoded as UTF-8.
    lines = []
    for lnum in range(0x20, 0x10080, 0x80):
        line = "".join([chr(c) for c in range(lnum, lnum + 0x80)]) + "\n"
        try:
            line.encode("utf-8")
        except UnicodeEncodeError:
            continue
        lines.append(line)
    # Repeat the lines 1000 times to produce a ~182 MB CSV file.
    df = pd.DataFrame(lines)
    df = pd.concat([df for n in range(1000)], ignore_index=True)
    fname = "test_readcsv_utf8.csv"
    df.to_csv(fname, index=False, header=False, encoding="utf-8")
    ti_rep = 5
    ti_num = 10
    time_dfmem = timeit.repeat(
        f'dfnomem = pd.read_csv("{fname}", header=None, memory_map=True, engine="c")',
        setup="import pandas as pd",
        repeat=ti_rep, number=ti_num)
    print(f"Read CSV (memory_map=True), repeat={ti_rep}: {time_dfmem}")
    print(f"Median: {pd.Series(time_dfmem).median()}")

perftest_readcsv_memmap_utf8()

On my machine the results are:
Without patch:

Read CSV (memory_map=True), repeat=5: [8.107480439008214, 8.100845866953023, 8.135483622085303, 8.090781628969125, 8.068992758984677]
Median: 8.100845866953023

With patch:

Read CSV (memory_map=True), repeat=5: [5.280414769076742, 5.290980814024806, 5.127453051973134, 5.150275847059675, 5.276113986037672]
Median: 5.276113986037672

The improved code runs in ~65% of the time of the current code.
The speedup depends on the contents of the file: the test code above creates a 182 MB file containing almost all of Unicode Plane 0, plus several Plane 1 characters, UTF-8-encoded in 1, 2, 3 and 4 bytes; in this respect, it is a worst case. I also tested this patch on an 8.8 GB CSV file consisting of 1- and 2-byte-encoded characters, and the code ran in ~75% of the time of the current code.
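The 1-to-4-byte spread mentioned above can be checked directly in Python. A quick illustration (the sample characters here are my own choices, not taken from the PR):

```python
# UTF-8 encodes a code point in 1 to 4 bytes depending on its range.
samples = {
    "A": 1,           # U+0041, ASCII (Plane 0)
    "\u00e9": 2,      # é, Latin-1 Supplement (Plane 0)
    "\u20ac": 3,      # €, Currency Symbols (Plane 0)
    "\U0001d11e": 4,  # musical symbol G clef (Plane 1)
}
for ch, nbytes in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
    assert len(encoded) == nbytes
```

The old code path decoded the memory-mapped bytes regardless of encoding; when the file is already UTF-8 (the encoding the C parser consumes), that round trip is pure overhead, and the cost grows with the share of multi-byte characters.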

@jreback added the IO CSV (read_csv, to_csv), Performance (memory or execution speed), and Unicode (Unicode strings) labels on Sep 28, 2021

jreback commented Sep 28, 2021

@michal-gh yeah if you can run the existing asv's and report here (and you might need to add one to cover this case).

cc @twoertwein

@twoertwein (Member) left a comment

You might want to add a whatsnew entry for this. This is quite an impressive speedup!

    @skip_pyarrow
    def test_readcsv_memmap_utf8(all_parsers):
        lines = []
        for lnum in range(0x20, 0x10080, 0x80):
twoertwein (Member):
Maybe add a comment what these magic numbers represent or what the block is doing.
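A hypothetical version of what such a comment plus named constants could look like (the variable names below are my own, not necessarily those in the merged commit):

```python
# Sample every 128-code-point block from U+0020 to just past the end of
# Plane 0, skipping blocks that cannot be UTF-8-encoded (the surrogate
# range U+D800..U+DFFF).
first_cp = 0x20    # first printable code point (skip C0 control chars)
last_cp = 0x10080  # just beyond Plane 0, reaching into Plane 1
block_size = 0x80  # one 128-character block per line

lines = []
for lnum in range(first_cp, last_cp, block_size):
    line = "".join(chr(c) for c in range(lnum, lnum + block_size)) + "\n"
    try:
        line.encode("utf-8")
    except UnicodeEncodeError:
        continue  # block contains surrogates
    lines.append(line)
```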

michal-gh (Contributor, Author):
Added explanatory variables and comment.

@jreback (Contributor) left a comment

do we have an asv which covers this (if it's straightforward to do)?

doc/source/whatsnew/v1.3.4.rst (review comment, outdated and resolved)
@jreback jreback added this to the 1.4 milestone Oct 3, 2021

michal-gh commented Oct 5, 2021

The last commit above includes the asv perf test. On my machine, running the io.csv asv benchmarks returns:

       before           after         ratio
     [bd94bb12]       [52ac1bed]
     <perf-readcsv~2^2^2>       <perf-readcsv>
-      81.6±0.5ms       55.3±0.4ms     0.68  io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

I also corrected the whatsnew file.
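For reference, an asv benchmark covering this case might be shaped roughly as follows. This is a sketch under my own assumptions: the class and method names may differ from the benchmark added in this PR, and the data volume is deliberately reduced here so it runs quickly.

```python
import os

import pandas as pd


class ReadCSVMemMapUTF8:
    fname = "__memmap_utf8__.csv"

    def setup(self):
        # Lines spanning 1- to 4-byte UTF-8 characters, skipping
        # unencodable (surrogate) blocks.
        lines = []
        for lnum in range(0x20, 0x10080, 0x80):
            line = "".join(chr(c) for c in range(lnum, lnum + 0x80)) + "\n"
            try:
                line.encode("utf-8")
            except UnicodeEncodeError:
                continue
            lines.append(line)
        df = pd.concat([pd.DataFrame(lines)] * 20, ignore_index=True)
        df.to_csv(self.fname, index=False, header=False, encoding="utf-8")

    def time_read_memmapped_utf8(self):
        pd.read_csv(self.fname, header=None, memory_map=True, engine="c")

    def teardown(self):
        os.remove(self.fname)
```

asv calls `setup` before each timed run of `time_read_memmapped_utf8` and `teardown` after, so the file is created and removed around the measurement rather than inside it.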

@jreback jreback merged commit 8c201e2 into pandas-dev:master Oct 6, 2021

jreback commented Oct 6, 2021

thanks @michal-gh very nice!
