-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Read mutiple excel sheets in single API call #9450
Conversation
Does it make sense to return a panel instead of a dict of dicts? Just something to think about... |
@shoyer I think it should at least be an option? Personally I would rather get a dict than a Panel, but that is more because I never worked with Panels before. Plus, there is no requirement that there are common indices in the different sheets, so a Panel is not really desired in that case I think? @jnmclarty I think this is nice new functionality! To your questions:
|
where the keys match the sheet specified. Set to None to get | ||
all sheets, and use sheetnames as the dictionary keys. | ||
|
||
Undefined -> Defaults to 0 -> 1st sheet as a DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will have to use bullet points here to have this formatted nicely in the html docs.
@shoyer Thought about it, and my logic was the same as @jorisvandenbossche, plus, I'd wager that there are many times when the sheetnames are not something where an index makes sense. Eg. "Sheet3", "Sheet4", "Sheet5" have data, but Sheet1 and Sheet2 got used for aggregation/reporting and were renamed properly. Also, returning a dict forces the user to realize that the resulting dataframes are not ordered. And...final point...returning a panel, we would have to make an assumption about the order for the sheetname=None case. @jorisvandenbossche Thanks for the input. I'll give this another 32 hours or so to let other feedback trickle in, and then begin implementing your comments. |
this should for sure be a agree with @jorisvandenbossche |
Agreed, a dict makes more sense than a panel :). |
b1a15d2
to
8bf71cc
Compare
Rebased. Ready to merge. (Assuming Travis agrees) |
|
||
- The second argument is ``sheetname``, not to be confused with ``ExcelFile.sheet_names`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move these first 2 points to a separate note::
box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, commit on it's way.
@jnmclarty I am a bit -1 on accepting mixed int/str for So you would have to account for duplicates, and raise on ambiguity. I would maybe suggest making the user specify either all ints (e.g. position based), or all names (e.g. by name). |
It's currently just as ambiguous as it's always been. You could have made the same comment, about The moment we add exclusivity to the object type, THAT's when If you pass Let's say I've got a sheet named "MetaData", in my excel file (but I don't know the position), and I know I want the 5th sheet (but don't know the name). With my PR, you'd just go ['MetaData', 4]. Prohibiting mixing, you'd have no way of doing this (without more code) |
@jnmclarty ok as long as the behavior is well-defined and obvious. With a single sheetname it is either positional (if the name doesn't match), or is there (or a KeyError)? with multiples then you can get dups and/or ambiguity. Maybe add a couple of tests to cement the behavior (ok with accepting mixed if its useful). |
The bahviour is the same as before, just inside the list.
Assume the excel file has a sheet named "1", in the 1st position. I handled 2 of 3 duplicates that would result from The duplicates the user would get, are definitely not ambiguous (unless the user doesn't understand object types?) |
Oh, and @jreback my original tests were already written to ensure the mixed-mode int/strings. See line 447 of test_excel.py. |
Ok, @jreback I think we're good, see 3556753 and a73645b. I'll squash momentarily. |
3556753
to
3bb2f6b
Compare
@jnmclarty duplicates are ok, but a KeyError is still a problem e.g. a file with 2 elements, what should [1,2] do? raise a KeyError. also need a rebase. |
3bb2f6b
to
8fe4454
Compare
Implemented the key errors in d8775d7. While it is appropriate, but technically changes previous behaviour. |
5670823
to
ffa0dbf
Compare
Fixed the Dat_e_Frames, |
62d13ec
to
003bba1
Compare
Thanks @jreback, you're a machine. |
@jnmclarty hahha....happen to be home and its cold out! anyhow, lgtm. @jorisvandenbossche |
1783636
to
a443a5e
Compare
|
dfs = [tdf(s) for s in sheets] | ||
dfs = dict(zip(sheets,dfs)) | ||
|
||
w = pd.ExcelWriter(self.multisheet2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a temporary filepath here, see like this; you cannot assume that you can write to the test dirctory (e.g. the code); travis lets you , but on a regular install you normally cant'.
Ok, @jreback this PR is back to a state where it does not change any behaviour associated with any Exceptions anywhere. It merely extends the same behaviour that would have existed before this PR. So, I also implemented ensure_clean() in the one test. Acknowledge my latest, and I'll squash sometime within 5 minutes to 12h time. |
lgtm |
2ee006a
to
c92b87f
Compare
Explicitly, unless anybody else has any objections, this is ready to be merged. |
just needs a look for @jorisvandenbossche / @shoyer |
I don't use Excel for serious analytics (thank goodness!) so I don't have much to add here. My main thought is that this might be a good time to clean up the |
Can we merge this, and open an issue for the |
@jnmclarty sounds good to me. |
~~~~~~~~~~~~~~~~~~~ | ||
|
||
.. versionadded:: 0.16 | ||
``read_excel`` can read more than one sheet, by setting ``sheetname`` to either |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to add a blank line between .. versionadded:: 0.16
and the explanation (otherwise sphinx gives warnings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the same with the two occurences below
f043f78
to
e977432
Compare
@jorisvandenbossche all fixed, see e977432 (which, after I rebase momentarily, will be an outdated commit) |
e977432
to
d8a2893
Compare
ENH Read mutiple excel sheets in single API call
@jnmclarty very nice enhancement. thanks for this! pls have a look at the built docs and verify that everything looks kosher. If not, pls submit a followup! http://pandas-docs.github.io/pandas-docs-travis/ (usually takes an hour; these are built on the 2.7 build from travis) |
I don't think the index is built on the travis docs. Only the 'main' parts are built (e.g. API is also not built). These looks ok to me. |
Enables reading of multiple excel sheets in a single API call, reducing read time substantially.
Essentially, 2O(n) becomes O(1) + O(n).
Before
This PR
...but as a bonus...
Return Objects
For the sheetname argument, specifying an int or a string, gets you a DataFrame (same as before the PR). But, specify a list of strings/int, returns a dictionary of dataframes. Specify None, and get all sheets in dictionary of dataframes, where the keys are the sheetnames.
More Discussion
The first commit (2db986d) implements 100% of the functionality, while the next two commits (5c2304c and 1a53b01) layer on optional changes to the documentation and API (of the two, only one is actually mandatory, but keeping both is the most explicit). Since the argument, sheetname, in read_excel was not plural, and defaults to 0, to maintain backwards compatibility AND create the “read all sheets” functionality associated with the non-default of None, the documentation of the argument gets really awkward. Hence, creating read_excel_sheets with an argument sheetnames which defaults to None instead of 0.
Things I need feedback on:
Things I still have to do:
Note: There might have been another way of doing this, but I couldn't easily figure it out, (SO/google wasn't helpful). I just wanted to read my 623 sheets without it taking hours.