
ENH: improve performance of df.to_csv GH3054 #3059

Merged
27 commits merged on Mar 19, 2013

Conversation

@ghost commented Mar 15, 2013

Needs more testing before merging.

Following SO question mentioned in #3054:

In [7]: import pandas as pd
   ...: def df_to_csv(df,fname):
   ...:     fh=open(fname,'w')
   ...:     fh.write(','.join(df.columns) + '\n')
   ...:     for row in df.itertuples(index=False):
   ...:         slist = [str(x) for x in row]
   ...:         ss = ','.join(slist) + "\n"
   ...:         fh.write(ss)
   ...:     fh.close()
   ...: 
   ...: aa=pd.DataFrame({'A':range(100000)})
   ...: aa['B'] = aa.A + 1.0
   ...: aa['C'] = aa.A + 2.0
   ...: aa['D'] = aa.A + 3.0
   ...: 
   ...: %timeit -r10 aa.to_csv('/tmp/junk1',index=False)   
   ...: %timeit -r10 df_to_csv(aa,'/tmp/junk2') 
   ...: from hashlib import sha1
   ...: print sha1(open("/tmp/junk1").read()).hexdigest()
   ...: print sha1(open("/tmp/junk2").read()).hexdigest()
current pandas   with PR   example code
2.3 s            1.29 s    1.28 s

wins:

  • convert numpy numerics to native types to eliminate expensive numpy-specific
    stringify calls (a rough sketch of the first two wins follows the list).
  • if the number of columns is < 10000, precompute the cols loop range rather
    than creating and walking a generator at each iteration of the inner loop.
  • some cargo cult stuff that's probably in the noise.
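
For illustration, the first two wins amount to roughly the following; write_rows is a hypothetical helper, not the PR's actual code:

import numpy as np

def write_rows(fh, rows, ncols):
    # Win 2: precompute the column index list once (cheap when ncols < 10000)
    # instead of re-creating and walking a generator for every row.
    col_range = list(range(ncols))
    for row in rows:
        # Win 1: .item() converts numpy scalars (np.int64, np.float64, ...) to
        # native Python objects, whose str() skips numpy's slower formatting path.
        vals = [row[i].item() if isinstance(row[i], np.generic) else row[i]
                for i in col_range]
        fh.write(','.join(str(v) for v in vals) + '\n')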

@jreback (Contributor) commented Mar 15, 2013

Case 2

aa=pd.DataFrame({'A':range(100000),'B' : pd.Timestamp('20010101')})
aa.ix[100:2000,'A'] = np.nan

Current pandas

In [4]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 1.57 s per loop

case 2

In [7]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 2.66 s per loop

With change (starting with yours + my patch)

In [3]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 825 ms per loop

case 2

In [5]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 1.96 s per loop

@ghost (Author) commented Mar 15, 2013

Thanks Jeff, the patch was garbled by GH. Can you pull this branch, add your patch
as a new commit and push to your fork? I'll pick up the commit and add it to the PR.

@jreback (Contributor) commented Mar 15, 2013

jreback@5c4d0bf

@ghost (Author) commented Mar 15, 2013

After jeff's commit:

current pandas   with PR   example code
2.3 s            0.58 s    1.28 s

@ghost (Author) commented Mar 15, 2013

L1389 sure looks like one loop too many.

@jreback (Contributor) commented Mar 15, 2013

I played around more with this, but all of the pre-allocation and array indexing seems to be working (even the map around asscalar makes a big diff), so unless we write this loop in Cython, I'm not sure how much better we can do.

@ghost (Author) commented Mar 16, 2013

Another 25%:

current pandas   with PR   example code
2.3 s            0.44 s    1.28 s

@jreback (Contributor) commented Mar 16, 2013

I get about 20% more with cython.....(on my test case that has lots of dates).....
jreback@5335a80
jreback@fa12e63

@jreback (Contributor) commented Mar 16, 2013

I think you win! (I get 0.47s with the same test, and 2.3s with current pandas) :(
hmm....though if I were to pre-allocate the rows... I am using a single row and overwriting (and copying it to the writer)

@ghost (Author) commented Mar 16, 2013

wait, I'm following your lead and cythonizing the latest python version. let's see...

@ghost (Author) commented Mar 16, 2013

Remember we're benchmarking on different machines, timings differ. I get 0.407 with your cython branch.

@ghost (Author) commented Mar 16, 2013

310ms

@jreback (Contributor) commented Mar 16, 2013

whoosh
it's the preallocation of rows
pretty good 8x speedup
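
In rough Python terms, the row pre-allocation looks like this (the actual change does it in Cython; write_body and its arguments are illustrative, not the committed code):

import csv

def write_body(writer, columns, nrows):
    # columns: per-column sequences already converted to native Python objects.
    # Allocate one row buffer up front and overwrite it in place, instead of
    # building a fresh list for every row before handing it to the csv writer.
    ncols = len(columns)
    row = [None] * ncols
    for i in range(nrows):
        for j in range(ncols):
            row[j] = columns[j][i]
        writer.writerow(row)

# e.g. write_body(csv.writer(open('out.csv', 'w')), [[1, 2], ['x', 'y']], nrows=2)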

@ghost (Author) commented Mar 16, 2013

hang on, I'm working on a linux kernel patch...

@ghost (Author) commented Mar 16, 2013

the list() cast in the call into cython eliminates some test failures that
occur when passing in the index ndarray; I have no idea why.

another 20% (relative):

current pandas   with PR   example code
2.3 s            0.35 s    1.28 s

any more tricks? probably heavily IO bound at this point.
edit: not really.

@jreback (Contributor) commented Mar 16, 2013

could try doing a tolist() on the numeric arrays (in the helper, when creating the series)
so that we can eliminate the np.asscalar test and conversions
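
For illustration, assuming a plain numeric column, ndarray.tolist() does the numpy-to-native conversion in one C-level pass, so the per-element scalar test/convert drops out of the write loop:

import numpy as np

col = np.arange(10.0)

# Per-element conversion: one Python-level call per value (what the
# np.asscalar path amounts to; .item() is the modern equivalent).
per_element = [x.item() for x in col]

# Whole-array conversion: a single C-level pass producing native floats,
# so the writer never has to test for or convert numpy scalars.
whole_array = col.tolist()

assert per_element == whole_array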

@ghost (Author) commented Mar 16, 2013

If it works, post the commit hash and i'll update.

@jreback (Contributor) commented Mar 16, 2013

try this out

jreback@d78f4f6

@ghost (Author) commented Mar 16, 2013

240ms. very nice.
But this doubles the memory footprint, doesn't it? Definitely add it as an option.

@jreback (Contributor) commented Mar 16, 2013

your wish is my command :) I had the same issue in HDFStore, so I chunked it

jreback@55adfb7
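
A minimal sketch of the chunked write, assuming an all-numeric frame; to_csv_chunked is a hypothetical helper, not the committed code:

def to_csv_chunked(df, path, chunksize=50000):
    # Convert and write one slice of rows at a time, so the tolist() copy
    # never holds more than chunksize rows' worth of native Python objects.
    with open(path, 'w') as fh:
        fh.write(','.join(str(c) for c in df.columns) + '\n')
        for start in range(0, len(df), chunksize):
            block = df.values[start:start + chunksize].tolist()
            for row in block:
                fh.write(','.join(str(v) for v in row) + '\n')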

@ghost (Author) commented Mar 16, 2013

very good, and that also bounds the memory used by the list() copy in the call to cython.

I hereby declare these games over. 10x will have to do...

👍

@jreback (Contributor) commented Mar 19, 2013

AWESOME!

@ghost deleted the GH3054/to_csv_perf branch March 19, 2013 12:18
@jreback (Contributor) commented Mar 19, 2013

Typos in RELEASE.rst: the 3059 link is pointing to 3039, and df.to_csv is misspelled:

•Improved performance of dv.to_csv() by up to 10x in some cases. (GH3059)

@ghost (Author) commented Mar 19, 2013

fixed.

@jreback (Contributor) commented Mar 19, 2013

fyi vs 0.10.1

Results:
                                            t_head  t_baseline      ratio
name                                                                     
frame_get_dtype_counts                      0.0988    217.2718     0.0005
frame_wide_repr                             0.5526    216.5370     0.0026
groupby_first_float32                       3.0029    341.3520     0.0088
groupby_last_float32                        3.1525    339.0419     0.0093
frame_to_csv2                             190.5260   2244.4260     0.0849
indexing_dataframe_boolean                 13.6755    126.9212     0.1077
write_csv_standard                         38.1940    234.2570     0.1630
frame_reindex_axis0                         0.3215      1.1042     0.2911
frame_to_csv_mixed                        369.0670   1123.0412     0.3286
frame_to_csv                              112.2720    226.7549     0.4951
frame_mult                                 22.6785     42.7152     0.5309
frame_add                                  24.3593     41.8012     0.5827
frame_reindex_upcast                       11.8235     17.0124     0.6950
frame_fancy_lookup_all                     15.0496     19.4497     0.7738

@ghost (Author) commented Mar 19, 2013

We should jigger the code so the first column of the test names spells "AwesomeRelease".

If I had to pick one thing, I'd say iloc/loc is the best addition in 0.11, though.
ix and I have been playing "20 questions" for far too long.

@jreback (Contributor) commented Mar 19, 2013

hahah....fyi I put up a commit on the centered mov stuff when you have a chance....all fixed except for rolling_count, for which I am not quite sure what the results should be...

@ghost (Author) commented Mar 19, 2013

btw, the final perf tweak was all about avoiding iteritems and its reliance
on icol. there might be a fast path hiding there which would get you another
0.0 entry on test_perf.sh.

@jreback (Contributor) commented Mar 19, 2013

I saw your data pre-allocation, did you change iteritems somewhere?

fyi, as a side issue, that's the problem with duplicate col names: essentially the col label maps to the column in the block, but there isn't a simple map. In theory it should be a positional map, but that's very tricky.

@ghost (Author) commented Mar 19, 2013

@jreback , can you follow up on the SO question, and make a pandas user happy?

@ghost (Author) commented Mar 19, 2013

Your refactor calculated series using iteritems, and that was the cause of the O(ncols)
behaviour I noted. Eliminating that in favor of yanking the data out directly from the blocks
turned that into ~ O(1), but breaks encapsulation. It would be nice to use a proper interface
rather than rifle the bowels of the underlying block manager.

Come to think of it, iteritems is col oriented; I wonder if all that could have
been avoided... with iterrows. oh boy.

@jreback (Contributor) commented Mar 19, 2013

iterrows does a lot of stuff....and the block based approach is better anyhow...I will answer the SO question

@ghost (Author) commented Mar 19, 2013

The tests I added for duplicate columns were buggy, and didn't catch the fact
that dupe columns are disabled for to_csv.

v0.10.1 to_csv() mangles dupe names into "dupe.1,dupe.2," etc. That's an ok workaround,
but what's the fundamental reason we can't just do it straight? is there one?

correction: it's read_csv that does the mangling.

@jreback (Contributor) commented Mar 19, 2013

the issue is that the blocks have a 2-d array of the items in a particular block; the mapping between where an item is in the block and where it is in the frame is dependent on a unique name for the columns (in the case of a mi this is fine of course).

There isn't a positional map (from columns in the frame to columns in the block). Keeping one is pretty tricky.

Wes has a fairly complicated routine to find the correct column even with duplicates, and it succeeds unless they are split across blocks (which is the case that I am testing here).

I suppose you could temporarily do a rename on the frame (no idea if that works), then iterate on the blocks, which would solve the problem. As I said, 0.10.1 actually prints the data out twice. I think raising is ok here; it's very unlikely to happen, and if it does you can just use a mi in the first place.

@ghost (Author) commented Mar 19, 2013

I don't follow. If you can display a repr for a frame with dupe columns, you can write it out to csv.

@jreback (Contributor) commented Mar 19, 2013

can't display a repr either......I will create an issue for this..

@jreback (Contributor) commented Mar 19, 2013

#3092

@ghost (Author) commented Mar 19, 2013

We're probably thinking of different things. what do you mean you can't repr a frame
with dupe columns?

In [19]: pd.DataFrame([[1,1]],columns=['a','a'])
Out[19]: 
   a  a
0  1  1

on the other hand

In [4]: df.to_csv("/tmp/a")
    807         if len(set(self.cols)) != len(self.cols):
--> 808             raise Exception("duplicate columns are not permitted in to_csv")
    809 
    810         self.colname_map = dict((k,i) for i,k in  enumerate(obj.columns))

Exception: duplicate columns are not permitted in to_csv

That's completely messed up.

@ghost (Author) commented Mar 19, 2013

I think I can rejig CSVWriter to do this properly, or at least discover what it is that I'm
failing to understand.

@jreback (Contributor) commented Mar 19, 2013

A bit contrived, but this is the case that's the issue (yours we should allow);
the problem is that detecting this case is hard.

In [32]: df1 = pd.DataFrame([[1]],columns=['a'],dtype='float64')

In [33]: df2 = pd.DataFrame([[1]],columns=['a'],dtype='int64')

In [34]: df3 = pd.concat([df1,df2],axis=1)

In [35]: df3.columns
Out[35]: Index([a, a], dtype=object)

In [36]: df3.index
Out[36]: Int64Index([0], dtype=int64)

In [37]: df3
Out[37]: ----------------------------------------------
Exception: ('Cannot have duplicate column names split across dtypes', u'occurred at index a')

@ghost (Author) commented Mar 19, 2013

So:

  • the test that raises an exception should be finer grained, and fail only if dupes
    are across blocks (rough sketch after this list).
  • I should fix up the way the data object is constructed to handle dupe columns

Then we can think about the general case.
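
The finer-grained check in the first bullet could be approximated via the public API, using the fact that same-named columns of different dtypes necessarily live in different blocks; dupes_span_blocks is a hypothetical sketch, not the eventual fix:

def dupes_span_blocks(df):
    # Heuristic: duplicate labels only trip up the block-based writer when the
    # duplicates live in different dtype blocks; same-named columns with
    # different dtypes are guaranteed to be split across blocks.
    dupe_names = set(df.columns[df.columns.duplicated()])
    for name in dupe_names:
        sub = df[name]  # selecting a duplicated label returns a DataFrame
        if sub.dtypes.nunique() > 1:
            return True
    return False

With a current pandas, df3 from the example above (a float 'a' concatenated with an int 'a') returns True, while a frame whose duplicate columns share a dtype returns False.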

@ghost restored the GH3054/to_csv_perf branch March 19, 2013 19:09
@ghost mentioned this pull request Mar 19, 2013
@ghost deleted the GH3054/to_csv_perf branch March 19, 2013 19:15
@ghost (Author) commented Mar 19, 2013

continued in #3095.

Thank you #3059, it's been fun.

@ghost (Author) commented Mar 19, 2013

dupe columns common case fixed in master via 1f138a4.

@ghost (Author) commented Mar 26, 2013

Changed the 'legacy' keyword to engine='python', to be consistent with c_parser,
in case it sticks around.

@dragoljub commented:

Guys, this is truly amazing work. With this CSV read/write performance, pandas truly has enterprise-leading I/O. Recently I was reading some zipped CSV files into a DataFrame at ~40MB/sec, thinking to myself how much faster this is than many distributed 'Hadoop' solutions I've seen... :)
