ENH: improve performance of df.to_csv GH3054 #3059
Case 2
Current pandas
case 2
With change (starting with yours + my patch)
case 2
|
Thanks jeff. the patch was garbled by GH. Can you pull this branch, add your patch |
After jeff's commit:
|
L1389 sure looks like one loop too many. |
I played around more with this, but all of the pre-allocation and array indexing seems to be working (even the map around asscalar makes a big difference), so unless we write this loop in cython, I'm not sure how much better we can do. |
Another 25%:
|
I get about 20% more with cython.....(on my test case that has lots of dates)..... |
I think you win! I get 0.47s with the same test (and 2.3s with current pandas) :( |
wait, I'm following your lead and cythonizing the latest python version. let's see... |
Remember we're benchmarking on different machines, timings differ. I get 0.407 with your cython branch. |
310ms |
whoosh |
hang on, I'm working on a linux kernel patch... |
then another 20% (relative):
any more tricks? probably heavily IO bound at this point. |
could try doing a tolist() on the numeric arrays (in helper, when creating the series) |
If it works, post the commit hash and i'll update. |
try this out |
240ms. very nice. |
…emory usage by writing in chunks
your wish is my command :) had the same issue in HDFStore, so chunked it |
very good, and that also bounds the memory used by the list() copy in the call to cython. I hereby declare these games over. 10x will have to do... 👍 |
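For readers following along, here is a minimal sketch of the chunking idea discussed above: rows are buffered and flushed in fixed-size batches, so only one chunk of converted Python objects is alive at a time. It uses only the standard-library `csv` module; the function name `write_rows_in_chunks` and the `chunksize` default are hypothetical and are not the code from this PR.

```python
import csv
import io


def write_rows_in_chunks(rows, out, chunksize=1000):
    """Write an iterable of rows to `out`, flushing every `chunksize` rows.

    Hypothetical helper for illustration only; it is not the pandas
    implementation. Returns the number of chunks written.
    """
    writer = csv.writer(out, lineterminator="\n")
    chunk = []
    n_chunks = 0
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunksize:
            # Peak memory is bounded by one chunk, not the whole table.
            writer.writerows(chunk)
            chunk = []
            n_chunks += 1
    if chunk:
        # Flush the final, possibly shorter, chunk.
        writer.writerows(chunk)
        n_chunks += 1
    return n_chunks


buf = io.StringIO()
n = write_rows_in_chunks(((i, i * 0.5) for i in range(10)), buf, chunksize=4)
```

The chunk size trades memory for per-call overhead: larger chunks mean fewer `writerows` calls but a bigger temporary list.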
AWESOME! |
typos in RELEASE.rst the 3059 link is pointing to 3039, and df.to_csv
|
fixed. |
fyi vs 0.10.1
|
We should jigger the code so the first column of the test names spells "AwesomeRelease". If I had to pick one thing, I'd say |
hahah....fyi I put up a commit on the centered mov stuff when you have a chance....all fixed except for rolling_count, which I am not quite sure what the results should be... |
btw, the final perf tweak was all about avoiding iteritems and its reliance
I saw your data pre-allocation, did you change iteritems somewhere? fyi....as a side issue, that's the problem in the duplicate col names, essentially the col label maps to the column in the block, but there isn't a simple map, in theory should be a positional map, but very tricky |
@jreback , can you follow up on the SO question, and make a pandas user happy? |
Your refactor calculated Come to think of it, iteritems is col oriented, I wonder if all that could have |
iterrows does a lot of stuff....and the block based approach is better anyhow...I will answer the so question |
The tests I added for duplicate columns were buggy, and didn't catch the fact that v0.10.1 to_csv() mangles dupe names into "dupe.1,dupe.2," etc. That's an ok workaround. Correction: it's read_csv that does the mangling. |
the issue is that the blocks have a 2-d of the items in a particular block, and the mapping between where an item is in the block and where it is in the frame is dependent on a unique name of the columns (in the case of a mi this is fine of course). There isn't a positional map (from columns in the frame to columns in the block). Keeping one is pretty tricky. Wes has a fairly complicated routine to find the correct column even with the duplicates, and it succeeds unless they are across blocks (which is the case that I am testing here). I suppose you could temporarily do a rename on the frame (no idea if that works), then iterate on the blocks, which would solve the problem. As I said, 0.10.1 actually prints the data out twice. I think raising is ok here: it's very unlikely to happen, and if it does you can just put a mi in the first place. |
I don't follow. If you can display a repr for a frame with dupe columns, you can write it out to csv. |
can't display a repr either......I will create an issue for this.. |
We're probably thinking of different things. what do you mean you can't repr a frame
on the other hand
That's completely messed up. |
I think I can rejig CSVWriter to do this properly, or at least discover what it is that I'm |
a bit contrived, but this is the case that's the issue (yours we should allow)
|
So:
Then we can think about the general case. |
dupe columns common case fixed in master via 1f138a4. |
Changed 'legacy' keyword to engine='python', to be consistent with c_parser. |
Guys, this is truly amazing work. With the performance of reading/writing CSVs, pandas truly has enterprise-leading I/O performance. Recently I was reading some zipped CSV files to DF at ~40MB/sec, thinking to myself how much faster this is than many distributed 'Hadoop' solutions I've seen... :) |
Needs more testing before merging.
Following SO question mentioned in #3054:
wins:
stringify calls.
cols
loop range rather than creating and walking a generator at each iteration of the inner loop.
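The last point can be sketched roughly as follows, assuming the data is held as a list of columns: the first version builds and walks a fresh generator for every row, while the second hoists a reusable `range` and a pre-allocated row buffer out of the inner loop. Both function names are hypothetical; this is a plain-Python illustration of the pattern, not the code from this PR.

```python
def format_rows_generator(columns, nrows):
    """Baseline: a new generator is created and walked for every row."""
    rows = []
    for i in range(nrows):
        rows.append([str(col[i]) for col in (c for c in columns)])
    return rows


def format_rows_range(columns, nrows):
    """Reuse one range object and one pre-allocated row buffer."""
    ncols = len(columns)
    col_range = range(ncols)   # created once, reused for every row
    row = [None] * ncols       # pre-allocated buffer, filled by index
    rows = []
    for i in range(nrows):
        for j in col_range:
            row[j] = str(columns[j][i])
        rows.append(list(row))  # copy, because the buffer is reused
    return rows


columns = [[1, 2], ["a", "b"]]
baseline = format_rows_generator(columns, 2)
hoisted = format_rows_range(columns, 2)
```

Both functions return the same rows; any speed difference depends on the interpreter and data shape, so it is worth timing on real data before relying on it.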