BUG (Performance): Performance of to_csv varies significantly depending on when/how index is set #37484
Comments
(I was unsure whether you'd consider this a Bug or an Enhancement request - in retrospect, perhaps the latter. Reclassify as you see fit!)
If you could test this on master, that would be very helpful.
Also, a PR adding these as benchmarks would be great (I think coverage for to_csv is not huge today).
Apologies for the long delay - getting together a working Python 3.7+ environment put me off for a while. The PR above adds some benchmarks, and shows the behaviour still exists in master.
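# Variant 1: let to_csv format the datetime index via date_format
import pandas as pd
from datetime import datetime
cols = 2000000  # number of rows, despite the name
df = pd.DataFrame({'i': range(cols)}, index=[datetime(2020,1,1,0,0,1)] * cols)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%dT%H:%M:%S')

# Variant 2: pre-format the index with strftime, then write without date_format
import pandas as pd
from datetime import datetime
cols = 2000000  # number of rows, despite the name
df = pd.DataFrame({'i': range(cols)}, index=[datetime(2020,1,1,0,0,1)] * cols)
df.index = df.index.strftime('%Y-%m-%dT%H:%M:%S')
df.to_csv('df.csv', sep=',')

# Variant 3: move the index into a regular column and write with index=False
df.reset_index(inplace=True)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%dT%H:%M:%S', index=False)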
Looks like a fairly simple fix: call index.strftime at [the appropriate place tbd] within to_csv.
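To check that pre-formatting the index is a safe substitute for date_format, the two write paths from the snippets above can be compared on the same frame. The following is only a minimal sketch: the helper names (write_with_date_format, write_with_strftime_index, timed) are hypothetical, the row count is scaled down, and it writes to a StringIO rather than a file. It is not the change proposed for pandas itself.

```python
import time
from datetime import datetime
from io import StringIO

import pandas as pd

FMT = "%Y-%m-%dT%H:%M:%S"


def write_with_date_format(df):
    """Let to_csv format the DatetimeIndex through date_format."""
    buf = StringIO()
    df.to_csv(buf, sep=",", date_format=FMT)
    return buf.getvalue()


def write_with_strftime_index(df):
    """Convert the index to strings first, then write without date_format."""
    out = df.copy()  # copy so the caller's frame keeps its DatetimeIndex
    out.index = out.index.strftime(FMT)
    buf = StringIO()
    out.to_csv(buf, sep=",")
    return buf.getvalue()


def timed(func, df):
    """Return (result, elapsed seconds) for a single call of func(df)."""
    start = time.perf_counter()
    result = func(df)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    rows = 200_000  # assumption: smaller than the 2,000,000 rows used above
    df = pd.DataFrame({"i": range(rows)}, index=[datetime(2020, 1, 1, 0, 0, 1)] * rows)
    text_a, secs_a = timed(write_with_date_format, df)
    text_b, secs_b = timed(write_with_strftime_index, df)
    print(f"date_format: {secs_a:.2f}s, strftime first: {secs_b:.2f}s, same output: {text_a == text_b}")
```

If the two strings match on your data, any timing difference between them comes from how the index is formatted rather than from what is written.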
There are some historic "to_csv is slow" reports, but none mention the specific behaviour I've seen with regard to indices.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
The speed of to_csv varies significantly depending on whether an Index was set, and when/how that Index was set. This is not the case with e.g. to_parquet. The original case where I noticed this speed difference is not so trivial, but it involves a MultiIndexed dataframe that has been manipulated before being output to CSV, taking approximately 50 minutes to output a ~30GB file. (Yes, CSV is slow and horrible for many other reasons even at the best of times, and we avoid it as much as possible.)
In the example above, I create a million row dataframe source_df, with three integer columns (two have the value 1 throughout), and 50 float columns. I then create four subsets of the first 10000 rows, doing different things with the index.
I then time how long it takes to write each out to CSV (using a StringIO, to avoid IO speed concerns). df1 (no index) and df3 (index set after taking .head) are fast; df2 (index set before taking .head) and df4 (deepcopy of df2) are much slower. It appears that the index being a subset of a larger dataframe significantly slows down the writing.
I then do a separate test, whereby I reset the index of each dataframe immediately before writing. In this case, all write out as fast as df1/df3 in the original test.
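The setup described above can be approximated as follows. This is a sketch reconstructed from the prose, not the original code sample: the column names (a, b, c, f0...), the choice to index on all three integer columns, and the helper names build_frames and time_to_csv are assumptions, and the demo uses fewer rows than the million described.

```python
import time
from copy import deepcopy
from io import StringIO

import numpy as np
import pandas as pd


def build_frames(n_rows=1_000_000, n_float_cols=50, n_head=10_000):
    """Build the four subsets described above from one large source frame."""
    rng = np.random.default_rng(0)
    data = {
        "a": np.ones(n_rows, dtype="int64"),  # constant 1 throughout
        "b": np.ones(n_rows, dtype="int64"),  # constant 1 throughout
        "c": np.arange(n_rows),               # varying integer column
    }
    for i in range(n_float_cols):
        data[f"f{i}"] = rng.random(n_rows)
    source_df = pd.DataFrame(data)

    keys = ["a", "b", "c"]  # assumption: index is set on the integer columns
    df1 = source_df.head(n_head)                  # no index set
    df2 = source_df.set_index(keys).head(n_head)  # index set before .head
    df3 = source_df.head(n_head).set_index(keys)  # index set after .head
    df4 = deepcopy(df2)                           # deepcopy of df2
    return {"df1": df1, "df2": df2, "df3": df3, "df4": df4}


def time_to_csv(df, reset_index=False):
    """Time one to_csv call into a StringIO, optionally resetting the index first."""
    if reset_index:
        df = df.reset_index()
    buf = StringIO()
    start = time.perf_counter()
    df.to_csv(buf)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Scaled down from the sizes described above; raise n_rows to match them.
    frames = build_frames(n_rows=200_000)
    for name, frame in frames.items():
        as_is = time_to_csv(frame)
        after_reset = time_to_csv(frame, reset_index=True)
        print(f"{name}: as-is {as_is:.3f}s, after reset_index {after_reset:.3f}s")
```

Single timings like these are noisy; repeating each measurement a few times and taking the minimum gives a steadier comparison.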
I have also tested with to_parquet by changing to_csv to to_parquet and providing a BytesIO. All four dataframes are output in 0.08s, regardless of index origin and whether the index is reset or not.
Surprisingly (for me), profiling indicates that {method 'astype' of 'numpy.ndarray' objects} is where the extra time originates from. There aren't many extra calls - they just take much, much longer. If you're lucky, this might even be an upstream issue!
Summarised profiling output for df1, df2, and df3 in the don't-reset-index case below.
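A profile of this kind can be gathered with the standard-library cProfile and pstats modules. The following is a minimal sketch under that assumption, not the script used for the figures in this report; profile_to_csv is a hypothetical helper name.

```python
import cProfile
import pstats
from io import StringIO

import pandas as pd


def profile_to_csv(df, top=10, sort_key="tottime"):
    """Profile a single df.to_csv call and return the top entries as text."""
    profiler = cProfile.Profile()
    buf = StringIO()
    profiler.enable()
    df.to_csv(buf)
    profiler.disable()

    report = StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # "tottime" ranks functions by time spent in the function body itself,
    # which is where a slow ndarray.astype call would surface.
    stats.sort_stats(sort_key).print_stats(top)
    return report.getvalue()


if __name__ == "__main__":
    demo = pd.DataFrame({"x": range(1000), "y": [0.5] * 1000}).set_index("x")
    print(profile_to_csv(demo))
```

Running it on each of the frames being compared and diffing the top few rows shows whether the call counts or the per-call times are what differ.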
The simplest (but silliest) suggestion would just be to call .reset_index() within df.to_csv() if index=True is set. However, I suspect this indicates something more fundamental that can be improved.
Expected Output
Speed of outputting to CSV does not vary significantly depending on how the index was created, as is the case when outputting to Parquet, etc.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : db08276
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1127.19.1.el7.x86_64
Version : #1 SMP Thu Aug 20 14:39:03 CDT 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.1.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 45.2.0
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.16.0
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : 3.5.2
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.0