
BUG (Performance): Performance of to_csv varies significantly depending on when/how index is set #37484

Closed
2 of 3 tasks
DanielFEvans opened this issue Oct 29, 2020 · 6 comments · Fixed by #44943
Labels
IO CSV (read_csv, to_csv), Performance (memory or execution speed)
@DanielFEvans (Contributor):

  • I have checked that this issue has not already been reported.

There are some historic "to_csv is slow" reports, but none mention the specific behaviour I've seen with regard to indices.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import cProfile
import time
import pstats
from io import BytesIO, StringIO

import pandas as pd
import numpy as np

NUM_INDEX_COLS = 3
NUM_COLS = 50
NUM_ROWS = 1000000

# Generate a large-ish dummy dataframe
data = {}
idx_cols = []
for idx_id in range(NUM_INDEX_COLS):
    idx_name = f"Index_{idx_id}"
    idx_cols.append(idx_name)
    if idx_id == 0:
        data[idx_name] = np.random.randint(0, NUM_ROWS, NUM_ROWS)
    else:
        data[idx_name] = np.full(NUM_ROWS, 1, dtype=np.int64)  # np.int is deprecated; use np.int64

for col_id in range(NUM_COLS):
    col_name = f"Column_{col_id}"
    data[col_name] = np.random.uniform(0, 100000.0, NUM_ROWS)

source_df = pd.DataFrame(data)
source_df_indexed = source_df.set_index(idx_cols)

# Default index - fast
df1 = source_df.head(10000)
# MultiIndex from idx_cols, then take head - slow
df2 = source_df_indexed.head(10000)
# Take head, then set MultiIndex from idx_cols - fast
df3 = source_df.head(10000).set_index(idx_cols)
# Deepcopy of df2 - slow (docs say that Index is not copied, so not surprising)
df4 = df2.copy(deep=True)

# Will run each df twice, and take timings from second run.
dataframes = [df1, df1, df2, df2, df3, df3, df4, df4]

#### WRITE TO CSV WITH EXISTING INDEX
#
# df1: 0.53s, with no index set
# df2: 2.35s, with index set before calling .head()
# df3: 0.54s, with index set after calling .head()
# df4: 2.12s, deep copy of df2
#

for i, df in enumerate(dataframes):
    out = StringIO()
    t1 = time.time()
    df.to_csv(out)
    t2 = time.time()
    if (i % 2) == 1:
        print(t2-t1)

#### WRITE TO CSV WITH INCLUDED CALL TO RESET INDEX
#
# df1: 0.55s, with no index set
# df2: 0.53s, with index set before calling .head()
# df3: 0.54s, with index set after calling .head()
# df4: 0.54s, deep copy of df2

for i, df in enumerate(dataframes):
    out = StringIO()
    t1 = time.time()
    df.reset_index().to_csv(out)
    t2 = time.time()
    if (i % 2) == 1:
        print(t2-t1)

Problem description

The speed of to_csv varies significantly depending on whether an Index was set, and when/how that Index was set. This is not the case with e.g. to_parquet. The original case where I noticed this speed difference is not so trivial, but it involves a MultiIndexed dataframe that has been manipulated before being output to CSV, taking approximately 50 minutes to output a ~30GB file.

(Yes, CSV is slow and horrible for many other reasons even at the best of times, and we avoid it as much as possible.)

In the example above, I create a million row dataframe source_df, with three integer columns (two have the value 1 throughout), and 50 float columns. I then create four subsets of the first 10000 rows, doing different things with the index:

  • df1: Just the source dataframe, no index modification (so it'll have the default RangeIndex)
  • df2: Set the three integer columns as a MultiIndex, and then take .head(10000)
  • df3: Take .head(10000), and then set the three integer columns as a MultiIndex
  • df4: Deep copy of df2. (The Pandas documentation suggests that at least the index won't actually be copied).

I then time how long it takes to write each out to CSV (using a StringIO, to avoid IO speed concerns). df1 (no index) and df3 (index set after taking .head) are fast; df2 (index set before taking .head) and df4 (deepcopy of df2) are much slower. It appears that the index being a subset of a larger dataframe significantly slows down the writing.
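One plausible contributor (my assumption here, not something confirmed in this issue): slicing a MultiIndexed frame keeps every value of each index level from the parent frame, so index formatting may touch far more data than the rows actually being written. A minimal sketch:

```python
import numpy as np
import pandas as pd

n = 1_000
full = pd.DataFrame({
    "k1": np.arange(n),
    "k2": np.ones(n, dtype=np.int64),
    "val": np.random.uniform(0, 100.0, n),
}).set_index(["k1", "k2"])

head = full.head(10)

# The sliced frame holds 10 rows, but its MultiIndex still carries
# all 1000 values of the first level from the parent frame.
print(len(head))                                         # 10
print(len(head.index.levels[0]))                         # 1000
print(len(head.index.remove_unused_levels().levels[0]))  # 10
```

If this is indeed the mechanism, calling remove_unused_levels() on the index before to_csv would be another workaround worth trying alongside reset_index().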

I then do a separate test, whereby I reset the index of each dataframe immediately before writing. In this case, all write out as fast as df1/df3 in the original test.

I have also tested with to_parquet by changing to_csv to to_parquet and providing a BytesIO. All four dataframes are output in 0.08s, regardless of index origin and whether the index is reset or not.

Surprisingly (for me), profiling indicates that {method 'astype' of 'numpy.ndarray' objects} is where the extra time originates from. There aren't many extra calls - they just take much, much longer. If you're lucky, this might even be an upstream issue!
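The profiling can be reproduced with the cProfile/pstats imports from the code sample above; a condensed sketch (sizes reduced here for speed, so timings will be smaller than those quoted):

```python
import cProfile
import pstats
from io import StringIO

import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "k1": np.random.randint(0, n, n),
    "k2": np.ones(n, dtype=np.int64),
    "x": np.random.uniform(0, 100_000.0, n),
}).set_index(["k1", "k2"]).head(10_000)

profiler = cProfile.Profile()
profiler.enable()
df.to_csv(StringIO())
profiler.disable()

# Sort by internal time; on affected pandas versions, the
# {method 'astype' of 'numpy.ndarray' objects} entry appears near the top.
stats = pstats.Stats(profiler).sort_stats("tottime")
stats.print_stats(10)
```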

Summarised profiling output for df1, df2, and df3 in the don't-reset-index case below.


df1 - Default index, fast

         1928 function calls (1910 primitive calls) in 0.530 seconds

   Ordered by: internal time
   List reduced from 126 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12    0.285    0.024    0.285    0.024 {method 'astype' of 'numpy.ndarray' objects}
        6    0.242    0.040    0.242    0.040 {pandas._libs.writers.write_csv_rows}
       19    0.001    0.000    0.001    0.000 missing.py:193(_isna_ndarraylike)
        6    0.000    0.000    0.530    0.088 csvs.py:330(_save_chunk)
    20/13    0.000    0.000    0.000    0.000 {built-in method numpy.array}
  384/382    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      278    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        6    0.000    0.000    0.282    0.047 blocks.py:1938(to_native_types)
        1    0.000    0.000    0.530    0.530 csvs.py:313(_save)
      200    0.000    0.000    0.000    0.000 generic.py:10(_check)
       12    0.000    0.000    0.000    0.000 blocks.py:292(getitem_block)
       19    0.000    0.000    0.001    0.000 missing.py:130(_isna)
        6    0.000    0.000    0.000    0.000 generic.py:3551(_slice)
       12    0.000    0.000    0.000    0.000 range.py:697(__getitem__)
        6    0.000    0.000    0.000    0.000 managers.py:757(get_slice)
       19    0.000    0.000    0.000    0.000 dtypes.py:903(is_dtype)
       12    0.000    0.000    0.000    0.000 range.py:153(_data)
        7    0.000    0.000    0.004    0.001 base.py:984(_format_native_types)
        6    0.000    0.000    0.000    0.000 {built-in method numpy.arange}
        6    0.000    0.000    0.001    0.000 blocks.py:656(to_native_types)


df2 - MultiIndex set before head, slow

         10629 function calls (10518 primitive calls) in 2.135 seconds

   Ordered by: internal time
   List reduced from 182 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    1.630    0.082    1.630    0.082 {method 'astype' of 'numpy.ndarray' objects}
        5    0.247    0.049    0.247    0.049 {pandas._libs.writers.write_csv_rows}
   122/91    0.163    0.001    0.163    0.002 {built-in method numpy.array}
        6    0.042    0.007    1.588    0.265 base.py:954(to_native_types)
       15    0.027    0.002    0.027    0.002 {pandas._libs.lib.infer_dtype}
        1    0.016    0.016    2.135    2.135 csvs.py:313(_save)
       15    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_object_object}
       21    0.001    0.000    0.002    0.000 missing.py:193(_isna_ndarraylike)
2323/2321    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
        5    0.000    0.000    0.001    0.000 {pandas._libs.lib.fast_zip}
    30/15    0.000    0.000    0.192    0.013 base.py:293(__new__)
     1386    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       16    0.000    0.000    1.350    0.084 base.py:984(_format_native_types)
        5    0.000    0.000    2.119    0.424 csvs.py:330(_save_chunk)
        5    0.000    0.000    1.546    0.309 multi.py:1201(_format_native_types)
      910    0.000    0.000    0.000    0.000 generic.py:10(_check)
       90    0.000    0.000    0.000    0.000 base.py:498(_shallow_copy)
      105    0.000    0.000    0.000    0.000 base.py:463(_simple_new)
      171    0.000    0.000    0.000    0.000 common.py:1460(is_extension_array_dtype)
      217    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}

df3 - MultiIndex set after head, fast

         10629 function calls (10518 primitive calls) in 0.554 seconds

   Ordered by: internal time
   List reduced from 182 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    0.301    0.015    0.301    0.015 {method 'astype' of 'numpy.ndarray' objects}
        5    0.243    0.049    0.243    0.049 {pandas._libs.writers.write_csv_rows}
   122/91    0.002    0.000    0.002    0.000 {built-in method numpy.array}
       15    0.001    0.000    0.001    0.000 {pandas._libs.lib.infer_dtype}
       21    0.001    0.000    0.001    0.000 missing.py:193(_isna_ndarraylike)
2323/2321    0.000    0.000    0.001    0.000 {built-in method builtins.isinstance}
     1386    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        6    0.000    0.000    0.028    0.005 base.py:954(to_native_types)
    30/15    0.000    0.000    0.004    0.000 base.py:293(__new__)
        5    0.000    0.000    0.000    0.000 {pandas._libs.lib.fast_zip}
        1    0.000    0.000    0.554    0.554 csvs.py:313(_save)
        5    0.000    0.000    0.554    0.111 csvs.py:330(_save_chunk)
      910    0.000    0.000    0.000    0.000 generic.py:10(_check)
        5    0.000    0.000    0.027    0.005 multi.py:1201(_format_native_types)
       15    0.000    0.000    0.000    0.000 {pandas._libs.algos.take_1d_object_object}
      171    0.000    0.000    0.000    0.000 common.py:1460(is_extension_array_dtype)
       90    0.000    0.000    0.000    0.000 base.py:498(_shallow_copy)
      105    0.000    0.000    0.000    0.000 base.py:463(_simple_new)
      171    0.000    0.000    0.000    0.000 base.py:413(find)
       15    0.000    0.000    0.001    0.000 algorithms.py:1616(take_nd)

The simplest (but silliest) suggestion would just be to call .reset_index() within df.to_csv() if index=True is set. However, I suspect this indicates something more fundamental that can be improved.
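In the meantime, that workaround can be wrapped at the call site without waiting for a pandas change. to_csv_via_reset_index is a hypothetical helper name, and this sketch assumes all index levels are named, so that the header line matches what plain df.to_csv() would produce:

```python
import pandas as pd


def to_csv_via_reset_index(df, path_or_buf=None, **kwargs):
    # Materialise the index as ordinary columns, then write without an
    # index; for named index levels this produces the same CSV text as
    # df.to_csv(..., index=True) while avoiding the slow index path.
    if kwargs.pop("index", True):
        return df.reset_index().to_csv(path_or_buf, index=False, **kwargs)
    return df.to_csv(path_or_buf, index=False, **kwargs)


df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]}).set_index("a")
assert to_csv_via_reset_index(df) == df.to_csv()
```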

Expected Output

Speed of outputting to CSV does not vary significantly depending on how the index was created, as is the case when outputting to Parquet, etc.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1127.19.1.el7.x86_64
Version : #1 SMP Thu Aug 20 14:39:03 CDT 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 45.2.0
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.16.0
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : 3.5.2
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.0

@DanielFEvans added the Bug and Needs Triage labels on Oct 29, 2020
@DanielFEvans (Contributor, Author):

(I was unsure whether you'd consider this a Bug or an Enhancement request - in retrospect, perhaps the latter. Reclassify as you see fit!)

@jreback (Contributor) commented Oct 29, 2020:

It would be very helpful if you could test this on master.

@jreback (Contributor) commented Oct 29, 2020:

Also, a PR adding these as benchmarks would be great (I think benchmark coverage for to_csv is not huge today).

@DanielFEvans (Contributor, Author):

Apologies for the long delay - getting together a working Python 3.7+ environment put me off for a while. The PR above adds some benchmarks, and shows the behaviour still exists on master.

@trenton3983 commented Jan 26, 2021:
  • The following takes an inordinate amount of time:

import pandas as pd
from datetime import datetime

cols = 2000000
df = pd.DataFrame({'i': range(cols)}, index=[datetime(2020, 1, 1, 0, 0, 1)] * cols)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%dT%H:%M:%S')

  • This does not:

import pandas as pd
from datetime import datetime

cols = 2000000
df = pd.DataFrame({'i': range(cols)}, index=[datetime(2020, 1, 1, 0, 0, 1)] * cols)
df.index = df.index.strftime('%Y-%m-%dT%H:%M:%S')
df.to_csv('df.csv', sep=',')

  • An alternate workaround is to reset the index:

df.reset_index(inplace=True)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%dT%H:%M:%S', index=False)
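A quick equivalence check (a sketch with a small n for speed) confirms that the strftime workaround produces the same CSV text as letting to_csv apply date_format to the index:

```python
from datetime import datetime

import pandas as pd

n = 1_000
fmt = "%Y-%m-%dT%H:%M:%S"
df = pd.DataFrame({"i": range(n)}, index=[datetime(2020, 1, 1, 0, 0, 1)] * n)

# Let to_csv format the DatetimeIndex (the slow path on affected versions).
out_a = df.to_csv(date_format=fmt)

# Pre-format the index to strings, then write (the fast path).
df2 = df.copy()
df2.index = df2.index.strftime(fmt)
out_b = df2.to_csv()

print(out_a == out_b)  # True: both paths emit identical CSV text
```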

@jbrockmendel added the IO CSV and Performance labels and removed the Bug and Needs Triage labels on Jun 6, 2021
@jbrockmendel (Member):
Looks like a fairly simple fix to call index.strftime at [the appropriate place tbd] within to_csv
