read_html ignores paragraphs in table cells #24766
Comments
Can you write the exact expected output? |
I have updated the issue with the requested information. Couldn't find a way to remove the "Needs Info" label. |
Thanks. Can you check if the HTML parsing libraries (lxml, bs4) typically convert p tags to newlines? Do they provide options to do that? |
That wouldn't help as the below example shows:

import pandas as pd
html = """
<html>
<body>
<table>
<tr>
<td>
Field 1
Field 2
</td>
<td>
Value 1
Value 2
</td>
</tr>
</table>
</body>
</html>
"""
tables = pd.read_html(html)
print(tables[0].iat[0, 0])

Still returns "Field 1 Field 2" |
I'm just wondering if our behavior matches the expected behavior of the
underlying parsing libraries, and whether they have ways of dealing with
it. Presumably they've had requests for similar features around whitespace
normalization.
|
lxml respects whitespaces.

import pandas as pd
from lxml.etree import fromstring
from lxml.html import HTMLParser
html = """
<html>
<body>
<table>
<tr>
<td>
Field 1
Field 2
</td>
<td>
Value 1
Value 2
</td>
</tr>
</table>
</body>
</html>
"""
tables = pd.read_html(html)
print(tables[0].iat[0, 0])
parser = HTMLParser()
root = fromstring(html, parser)
for elem in root.iter('td'):
    print(repr(elem.text))

Result:

Field 1 Field 2
'\n Field 1\n \n Field 2\n '
'\n Value 1\n \n Value 2\n '
|
Thanks. Can you check if pandas explicitly strips / normalizes whitespace
in read_html then? If so, this would be a good parameter to add to
read_html.
|
Yes. In I think whether whitespaces are "cleaned up" (i.e., replaced with a single space character) should be an optional functionality. |
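To make the "cleaned up" behaviour concrete, here is a minimal sketch of whitespace collapsing applied to the raw cell text printed by lxml above. It is an illustration written for this thread, not the pandas source; the name `collapse_whitespace` and the regex are assumptions.

```python
import re

# Assumption: a regex-based normalizer of this general shape.
# This is illustrative only and is not copied from pandas.
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Strip both ends, then replace every run of whitespace with one space."""
    return _WHITESPACE_RUN.sub(" ", text.strip())


# Raw text of the first cell, as reported by lxml in the comment above.
raw_cell = "\n Field 1\n \n Field 2\n "
collapsed = collapse_whitespace(raw_cell)
print(repr(collapsed))  # 'Field 1 Field 2'
```

Once the newlines are collapsed, nothing in the resulting string marks where "Field 1" ends and "Field 2" begins, which is why the step would need to be optional for this use case.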
Thanks for investigating. I think an option to disable that behavior makes sense. You've given two examples now, one with newlines in the text, and one with |
I think adding an extra argument as a function that takes the raw text of a cell, and returns the "cleaned up" version would work best. Its default value would be |
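A rough sketch of the callable idea, using only the standard library rather than pandas internals. `CellCollector`, `parse_cells`, `clean_cell`, and `keep_lines` are hypothetical names, not an existing `read_html` parameter, and the whitespace-collapsing default is an assumption made for this example.

```python
import re
from html.parser import HTMLParser
from typing import Callable, List, Optional


def collapse(text: str) -> str:
    """Assumed default: strip and replace whitespace runs with one space."""
    return re.sub(r"\s+", " ", text.strip())


def keep_lines(text: str) -> str:
    """Alternative cleaner: trim each line but keep the line breaks."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


class CellCollector(HTMLParser):
    """Collects the raw text of each <td>/<th> and passes it to clean_cell."""

    def __init__(self, clean_cell: Callable[[str], str] = collapse) -> None:
        super().__init__()
        self.clean_cell = clean_cell
        self.rows: List[List[str]] = []
        self._buffer: Optional[List[str]] = None  # None means "outside a cell"

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buffer is not None:
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self.rows[-1].append(self.clean_cell("".join(self._buffer)))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def parse_cells(html: str, clean_cell: Callable[[str], str] = collapse) -> List[List[str]]:
    collector = CellCollector(clean_cell)
    collector.feed(html)
    collector.close()
    return collector.rows


HTML = (
    "<table><tr>"
    "<td>\nField 1\n\nField 2\n</td>"
    "<td>\nValue 1\n\nValue 2\n</td>"
    "</tr></table>"
)

default_rows = parse_cells(HTML)
kept_rows = parse_cells(HTML, clean_cell=keep_lines)
print(default_rows)  # [['Field 1 Field 2', 'Value 1 Value 2']]
print(kept_rows)     # [['Field 1\nField 2', 'Value 1\nValue 2']]
```

Passing a function instead of a boolean flag lets callers choose between collapsing, keeping line breaks, or returning the raw text untouched (`clean_cell=lambda s: s`) without further parameters.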
Hi, wondering if this issue was ever resolved? In my case, I have a |
still encountering this bug. |
@Derekt2 this is open |
Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766 (Co-authored-by: Romain Lebbadi-Breteau)
TST: Added a simple test for issue pandas-dev#24766 (Co-authored-by: Romain Lebbadi-Breteau, Fredrik Wallner)
Hi. I'm experiencing the same bug with newlines in cells. I see that there are some existing contributions, will they be merged? |
same |
take |
Code Sample, a copy-pastable example if possible
Problem description
In the current implementation, the p tags are ignored, and therefore it's not possible to infer that field 1 has value 1 and field 2 has value 2.
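As an illustration of the pairing this would allow, here is a minimal standard-library sketch that treats the end of each `<p>` inside a cell as a line break. The HTML is a made-up two-cell table, and `ParagraphAwareCells` is a hypothetical helper, not part of pandas.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphAwareCells(HTMLParser):
    """Collects <td> text, ending each <p> with a newline."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._parts: Optional[List[str]] = None  # None means "outside a cell"

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == "p":
            self._parts.append("\n")  # paragraph boundary becomes a line break
        elif tag == "td":
            lines = [line.strip() for line in "".join(self._parts).split("\n")]
            self.cells.append("\n".join(line for line in lines if line))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            # Collapse source whitespace so only </p> introduces newlines.
            self._parts.append(re.sub(r"\s+", " ", data))


# Hypothetical input: one row, two cells, two paragraphs per cell.
HTML = (
    "<table><tr>"
    "<td><p>Field 1</p><p>Field 2</p></td>"
    "<td><p>Value 1</p><p>Value 2</p></td>"
    "</tr></table>"
)

parser = ParagraphAwareCells()
parser.feed(HTML)
parser.close()

fields, values = (cell.split("\n") for cell in parser.cells)
pairs = dict(zip(fields, values))
print(pairs)  # {'Field 1': 'Value 1', 'Field 2': 'Value 2'}
```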
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None