-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read from multiple <tbody> within a <table> #20891
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice change
pandas/io/html.py
Outdated
return self._parse_raw_data(res) | ||
raw_data = [] | ||
|
||
if len(tbodies) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For PEP8 compliance use the boolean-ness of the sequence, so just if tbodies:
pandas/tests/io/test_html.py
Outdated
</tbody> | ||
</table>''' | ||
expected = DataFrame({'A': [1, 3], 'B': [2, 4]}) | ||
result = self.read_html(StringIO(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the indexer to this line and make the subsequent line just tm.assert_frame_equal(result, expected)
. Obviously doesn't change result just reads better with the rest of the tests
Codecov Report
@@ Coverage Diff @@
## master #20891 +/- ##
==========================================
+ Coverage 91.78% 91.78% +<.01%
==========================================
Files 153 153
Lines 49337 49338 +1
==========================================
+ Hits 45285 45286 +1
Misses 4052 4052
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small changes. ping on green.
pandas/tests/io/test_html.py
Outdated
@@ -396,6 +396,34 @@ def test_empty_tables(self): | |||
res2 = self.read_html(StringIO(data2)) | |||
assert_framelist_equal(res1, res2) | |||
|
|||
def test_multiple_tbody(self): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number here, you can use use #
and not a triple-quotes here for the description of the test
@adamhooper do you know if / when you'll be able to update? I'm cutting the RC soon (hopefully a couple hours from now). If you're busy now I can push the changes and then we'll get this merged before the RC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging on green
Test failures from travis were unrelated. Fixing in #20906 |
Failures should be fixed in masater. |
Thanks @adamhooper! |
Er ... finally I'm available to fix things. You beat me to it -- thanks! |
Thank you! I'm glad we could get this in before the release.
…On Tue, May 1, 2018 at 2:05 PM, Adam Hooper ***@***.***> wrote:
Er ... finally I'm available to fix things. You beat me to it -- thanks!
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#20891 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIvxQ3bQ_7Hz1bb2hlEW2DG-4gkXgks5tuLIKgaJpZM4TtdTg>
.
|
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
git diff upstream/master -u -- "*.py" | flake8 --diff