
BUG (Performance): Performance of to_csv varies significantly depending on when/how index is set #37484

Closed
2 of 3 tasks
DanielFEvans opened this issue Oct 29, 2020 · 6 comments · Fixed by #44943
Labels
IO CSV (read_csv, to_csv), Performance (memory or execution speed)
@DanielFEvans (Contributor):

  • I have checked that this issue has not already been reported.

There are some historic "to_csv is slow" reports, but none mention the specific behaviour I've seen with regard to indices.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import cProfile
import time
import pstats
from io import BytesIO, StringIO

import pandas as pd
import numpy as np

NUM_INDEX_COLS = 3
NUM_COLS = 50
NUM_ROWS = 1000000

# Generate a large-ish dummy dataframe
data = {}
idx_cols = []
for idx_id in range(NUM_INDEX_COLS):
    idx_name = f"Index_{idx_id}"
    idx_cols.append(idx_name)
    if idx_id == 0:
        data[idx_name] = np.random.randint(0, NUM_ROWS, NUM_ROWS)
    else:
        data[idx_name] = np.full(NUM_ROWS, 1, dtype=np.int64)  # np.int is deprecated; use np.int64

for col_id in range(NUM_COLS):
    col_name = f"Column_{col_id}"
    data[col_name] = np.random.uniform(0, 100000.0, NUM_ROWS)

source_df = pd.DataFrame(data)
source_df_indexed = source_df.set_index(idx_cols)

# Default index - fast
df1 = source_df.head(10000)
# MultiIndex from idx_cols, then take head - slow
df2 = source_df_indexed.head(10000)
# Take head, then set MultiIndex from idx_cols - fast
df3 = source_df.head(10000).set_index(idx_cols)
# Deepcopy of df2 - slow (docs say that Index is not copied, so not surprising)
df4 = df2.copy(deep=True)

# Will run each df twice, and take timings from second run.
dataframes = [df1, df1, df2, df2, df3, df3, df4, df4]

#### WRITE TO CSV WITH EXISTING INDEX
#
# df1: 0.53s, with no index set
# df2: 2.35s, with index set before calling .head()
# df3: 0.54s, with index set after calling .head()
# df4: 2.12s, deep copy of df2
#

for i, df in enumerate(dataframes):
    out = StringIO()
    t1 = time.time()
    df.to_csv(out)
    t2 = time.time()
    if (i % 2) == 1:
        print(t2-t1)

#### WRITE TO CSV WITH INCLUDED CALL TO RESET INDEX
#
# df1: 0.55s, with no index set
# df2: 0.53s, with index set before calling .head()
# df3: 0.54s, with index set after calling .head()
# df4: 0.54s, deep copy of df2

for i, df in enumerate(dataframes):
    out = StringIO()
    t1 = time.time()
    df.reset_index().to_csv(out)
    t2 = time.time()
    if (i % 2) == 1:
        print(t2-t1)

Problem description

The speed of to_csv varies significantly depending on whether an Index was set, and when/how that Index was set. This is not the case with e.g. to_parquet. The original case where I noticed this speed difference is not so trivial, but it involves a MultiIndexed dataframe that has been manipulated before being output to CSV, taking approximately 50 minutes to output a ~30GB file.

(Yes, CSV is slow and horrible for many other reasons even at the best of times, and we avoid it as much as possible.)

In the example above, I create a million row dataframe source_df, with three integer columns (two have the value 1 throughout), and 50 float columns. I then create four subsets of the first 10000 rows, doing different things with the index:

  • df1: Just the source dataframe, no index modification (so it'll have the default RangeIndex)
  • df2: Set the three integer columns as a MultiIndex, and then take .head(10000)
  • df3: Take .head(10000), and then set the three integer columns as a MultiIndex
  • df4: Deep copy of df2. (The Pandas documentation suggests that at least the index won't actually be copied).

I then time how long it takes to write each out to CSV (using a StringIO, to avoid IO speed concerns). df1 (no index) and df3 (index set after taking .head) are fast; df2 (index set before taking .head) and df4 (deepcopy of df2) are much slower. It appears that the index being a subset of a larger dataframe significantly slows down the writing.
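One plausible contributor (my assumption here, not something confirmed in this issue): slicing a MultiIndexed frame keeps every value of each index level from the parent frame, so index formatting may touch far more data than the rows actually being written. A minimal sketch:

```python
import numpy as np
import pandas as pd

n = 1_000
full = pd.DataFrame({
    "k1": np.arange(n),
    "k2": np.ones(n, dtype=np.int64),
    "val": np.random.uniform(0, 100.0, n),
}).set_index(["k1", "k2"])

head = full.head(10)

# The sliced frame holds 10 rows, but its MultiIndex still carries
# all 1000 values of the first level from the parent frame.
print(len(head))                                         # 10
print(len(head.index.levels[0]))                         # 1000
print(len(head.index.remove_unused_levels().levels[0]))  # 10
```

If this is indeed the mechanism, calling remove_unused_levels() on the index before to_csv would be another workaround worth trying alongside reset_index().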

I then do a separate test, whereby I reset the index of each dataframe immediately before writing. In this case, all write out as fast as df1/df3 in the original test.

I have also tested with to_parquet by changing to_csv to to_parquet and providing a BytesIO. All four dataframes are output in 0.08s, regardless of index origin and whether the index is reset or not.

Surprisingly (for me), profiling indicates that {method 'astype' of 'numpy.ndarray' objects} is where the extra time originates from. There aren't many extra calls - they just take much, much longer. If you're lucky, this might even be an upstream issue!
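The profiling can be reproduced with the cProfile/pstats imports from the code sample above; a condensed sketch (sizes reduced here for speed, so timings will be smaller than those quoted):

```python
import cProfile
import pstats
from io import StringIO

import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "k1": np.random.randint(0, n, n),
    "k2": np.ones(n, dtype=np.int64),
    "x": np.random.uniform(0, 100_000.0, n),
}).set_index(["k1", "k2"]).head(10_000)

profiler = cProfile.Profile()
profiler.enable()
df.to_csv(StringIO())
profiler.disable()

# Sort by internal time; on affected pandas versions, the
# {method 'astype' of 'numpy.ndarray' objects} entry appears near the top.
stats = pstats.Stats(profiler).sort_stats("tottime")
stats.print_stats(10)
```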

Summarised profiling output for df1, df2, and df3 in the don't-reset-index case below.


df1 - Default index, fast

         1928 function calls (1910 primitive calls) in 0.530 seconds

   Ordered by: internal time
   List reduced from 126 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12    0.285    0.024    0.285    0.024 {method 'astype' of 'numpy.ndarray' objects}
        6    0.242    0.040    0.242    0.040 {pandas._libs.writers.write_csv_rows}
       19    0.001    0.000    0.001    0.000 missing.py:193(_isna_ndarraylike)
        6    0.000    0.000    0.530    0.088 csvs.py:330(_save_chunk)
    20/13    0.000    0.000    0.000    0.000 {built-in method numpy.array}
  384/382    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      278    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        6    0.000    0.000    0.282    0.047 blocks.py:1938(to_native_types)
        1    0.000    0.000    0.530    0.530 csvs.py:313(_save)
      200    0.000    0.000    0.000    0.000 generic.py:10(_check)
       12    0.000    0.000    0.000    0.000 blocks.py:292(getitem_block)
       19    0.000    0.000    0.001    0.000 missing.py:130(_isna)
        6    0.000    0.000    0.000    0.000 generic.py:3551(_slice)
       12    0.000    0.000    0.000    0.000 range.py:697(__getitem__)
        6    0.000    0.000    0.000    0.000 managers.py:757(get_slice)
       19    0.000    0.000    0.000    0.000 dtypes.py:903(is_dtype)
       12    0.000    0.000    0.000    0.000 range.py:153(_data)
        7    0.000    0.000    0.004    0.001 base.py:984(_format_native_types)
        6    0.000    0.000    0.000    0.000 {built-in method numpy.arange}
        6    0.000    0.000    0.001    0.000 blocks.py:656(to_native_types)


df2 - MultiIndex set before head, slow

         10629 function calls (10518 primitive calls) in 2.135 seconds

   Ordered by: internal time
   List reduced from 182 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    1.630    0.082    1.630    0.082 {method 'astype' of 'numpy.ndarray' objects}
        5    0.247    0.049    0.247    0.049 {pandas._libs.writers.write_csv_rows}
   122/91    0.163    0.001    0.163    0.002 {built-in method numpy.array}
        6    0.042    0.007    1.588    0.265 base.py:954(to_native_types)
       15    0.027    0.002    0.027    0.002 {pandas._libs.lib.infer_dtype}
        1    0.016    0.016    2.135    2.135 csvs.py:313(_save)
       15    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_object_object}
       21    0.001    0.000    0.002    0.000 missing.py:193(_isna_ndarraylike)
2323/2321    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
        5    0.000    0.000    0.001    0.000 {pandas._libs.lib.fast_zip}
    30/15    0.000    0.000    0.192    0.013 base.py:293(__new__)
     1386    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       16    0.000    0.000    1.350    0.084 base.py:984(_format_native_types)
        5    0.000    0.000    2.119    0.424 csvs.py:330(_save_chunk)
        5    0.000    0.000    1.546    0.309 multi.py:1201(_format_native_types)
      910    0.000    0.000    0.000    0.000 generic.py:10(_check)
       90    0.000    0.000    0.000    0.000 base.py:498(_shallow_copy)
      105    0.000    0.000    0.000    0.000 base.py:463(_simple_new)
      171    0.000    0.000    0.000    0.000 common.py:1460(is_extension_array_dtype)
      217    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}

df3 - MultiIndex set after head, fast

         10629 function calls (10518 primitive calls) in 0.554 seconds

   Ordered by: internal time
   List reduced from 182 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    0.301    0.015    0.301    0.015 {method 'astype' of 'numpy.ndarray' objects}
        5    0.243    0.049    0.243    0.049 {pandas._libs.writers.write_csv_rows}
   122/91    0.002    0.000    0.002    0.000 {built-in method numpy.array}
       15    0.001    0.000    0.001    0.000 {pandas._libs.lib.infer_dtype}
       21    0.001    0.000    0.001    0.000 missing.py:193(_isna_ndarraylike)
2323/2321    0.000    0.000    0.001    0.000 {built-in method builtins.isinstance}
     1386    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        6    0.000    0.000    0.028    0.005 base.py:954(to_native_types)
    30/15    0.000    0.000    0.004    0.000 base.py:293(__new__)
        5    0.000    0.000    0.000    0.000 {pandas._libs.lib.fast_zip}
        1    0.000    0.000    0.554    0.554 csvs.py:313(_save)
        5    0.000    0.000    0.554    0.111 csvs.py:330(_save_chunk)
      910    0.000    0.000    0.000    0.000 generic.py:10(_check)
        5    0.000    0.000    0.027    0.005 multi.py:1201(_format_native_types)
       15    0.000    0.000    0.000    0.000 {pandas._libs.algos.take_1d_object_object}
      171    0.000    0.000    0.000    0.000 common.py:1460(is_extension_array_dtype)
       90    0.000    0.000    0.000    0.000 base.py:498(_shallow_copy)
      105    0.000    0.000    0.000    0.000 base.py:463(_simple_new)
      171    0.000    0.000    0.000    0.000 base.py:413(find)
       15    0.000    0.000    0.001    0.000 algorithms.py:1616(take_nd)

The simplest (but silliest) suggestion would just be to call .reset_index() within df.to_csv() if index=True is set. However, I suspect this indicates something more fundamental that can be improved.
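In the meantime, that workaround can be wrapped at the call site without waiting for a pandas change. to_csv_via_reset_index is a hypothetical helper name, and this sketch assumes all index levels are named, so that the header line matches what plain df.to_csv() would produce:

```python
import pandas as pd


def to_csv_via_reset_index(df, path_or_buf=None, **kwargs):
    # Materialise the index as ordinary columns, then write without an
    # index; for named index levels this produces the same CSV text as
    # df.to_csv(..., index=True) while avoiding the slow index path.
    if kwargs.pop("index", True):
        return df.reset_index().to_csv(path_or_buf, index=False, **kwargs)
    return df.to_csv(path_or_buf, index=False, **kwargs)


df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]}).set_index("a")
assert to_csv_via_reset_index(df) == df.to_csv()
```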

Expected Output

Speed of outputting to CSV does not vary significantly depending on how the index was created, as is the case when outputting to Parquet, etc.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1127.19.1.el7.x86_64
Version : #1 SMP Thu Aug 20 14:39:03 CDT 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 45.2.0
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.16.0
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : 3.5.2
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.0

@DanielFEvans added the Bug and Needs Triage labels on Oct 29, 2020
@DanielFEvans (Contributor, Author):

(I was unsure whether you'd consider this a Bug or an Enhancement request - in retrospect, perhaps the latter. Reclassify as you see fit!)

@jreback (Contributor) commented Oct 29, 2020:

It would be very helpful if you could test this on master.

@jreback (Contributor) commented Oct 29, 2020:

Also, a PR adding these as benchmarks would be great (I think benchmark coverage for to_csv is not huge today).

@DanielFEvans (Contributor, Author):

Apologies for the long delay - getting together a working Python 3.7+ environment put me off for a while. The PR above adds some benchmarks, and shows the behaviour still exists on master.

@trenton3983 commented Jan 26, 2021:
  • The following takes an inordinate amount of time:

import pandas as pd
from datetime import datetime

cols = 2000000
df = pd.DataFrame({'i': range(cols)}, index=[datetime(2020, 1, 1, 0, 0, 1)] * cols)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%dT%H:%M:%S')

  • This does not:

import pandas as pd
from datetime import datetime

cols = 2000000
df = pd.DataFrame({'i': range(cols)}, index=[datetime(2020, 1, 1, 0, 0, 1)] * cols)
df.index = df.index.strftime('%Y-%m-%dT%H:%M:%S')
df.to_csv('df.csv', sep=',')

  • An alternate workaround is to reset the index:

df.reset_index(inplace=True)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%dT%H:%M:%S', index=False)
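A quick equivalence check (a sketch with a small n for speed) confirms that the strftime workaround produces the same CSV text as letting to_csv apply date_format to the index:

```python
from datetime import datetime

import pandas as pd

n = 1_000
fmt = "%Y-%m-%dT%H:%M:%S"
df = pd.DataFrame({"i": range(n)}, index=[datetime(2020, 1, 1, 0, 0, 1)] * n)

# Let to_csv format the DatetimeIndex (the slow path on affected versions).
out_a = df.to_csv(date_format=fmt)

# Pre-format the index to strings, then write (the fast path).
df2 = df.copy()
df2.index = df2.index.strftime(fmt)
out_b = df2.to_csv()

print(out_a == out_b)  # True: both paths emit identical CSV text
```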

@jbrockmendel added the IO CSV and Performance labels and removed the Bug and Needs Triage labels on Jun 6, 2021
@jbrockmendel (Member):
Looks like a fairly simple fix to call index.strftime at [the appropriate place tbd] within to_csv
