read_html ignores paragraphs in table cells #24766

Open
sasan00 opened this issue Jan 14, 2019 · 16 comments · May be fixed by #59455
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@sasan00

sasan00 commented Jan 14, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

html = """
<html>
<body>
<table>
    <tr>
        <td>
            <p>Field 1</p>
            <p>Field 2</p>
        </td>
        <td>
            <p>Value 1</p>
            <p>Value 2</p>
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])

Problem description

In the current implementation, the p tags inside a table cell are ignored and the cell text comes back as a single line. It is therefore not possible to infer that Field 1 has Value 1 and Field 2 has Value 2.

Expected Output

tables[0].iat[0, 0] == 'Field 1\nField 2'
tables[0].iat[0, 1] == 'Value 1\nValue 2'

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Can you write the exact expected output?

@TomAugspurger TomAugspurger added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jan 14, 2019
@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Jan 14, 2019
@sasan00
Author

sasan00 commented Jan 17, 2019

I have updated the issue with the requested information. Couldn't find a way to remove the "Needs Info" label.

@TomAugspurger
Contributor

TomAugspurger commented Jan 17, 2019

Thanks. Can you check if the HTML parsing libraries (lxml, bs4) typically convert p tags to newlines? Do they provide options to do that?

@sasan00
Author

sasan00 commented Jan 17, 2019

That wouldn't help, as the example below shows:

import pandas as pd

html = """
<html>
<body>
<table>
    <tr>
        <td>
            Field 1
            
            Field 2
        </td>
        <td>
            Value 1
            
            Value 2
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])

This still returns "Field 1 Field 2".

@TomAugspurger
Contributor

TomAugspurger commented Jan 17, 2019 via email

@sasan00
Author

sasan00 commented Jan 17, 2019

lxml preserves the whitespace; it is pandas that collapses it:

import pandas as pd
from lxml.etree import fromstring
from lxml.html import HTMLParser

html = """
<html>
<body>
<table>
    <tr>
        <td>
            Field 1
            
            Field 2
        </td>
        <td>
            Value 1
            
            Value 2
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])
parser = HTMLParser()
root = fromstring(html, parser)
for elem in root.iter('td'):
    print(repr(elem.text))

Result:

Field 1 Field 2
'\n Field 1\n \n Field 2\n '
'\n Value 1\n \n Value 2\n '

@TomAugspurger
Contributor

TomAugspurger commented Jan 17, 2019 via email

@sasan00
Author

sasan00 commented Jan 17, 2019

Yes. In _parse_raw_data, _remove_whitespace is called on every cell of every row. Its regex argument defaults to _RE_WHITESPACE, which is re.compile(r'[\r\n]+|\s{2,}').

I think this whitespace "cleanup" (i.e., replacing each matched run with a single space character) should be optional.
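
For readers following along, here is a minimal sketch of that cleanup step using only the standard library. The pattern is the one quoted above; the names RE_WHITESPACE and remove_whitespace are stand-ins, and the assumption (taken from the description in this comment) is that the text is stripped and each match is replaced with a single space. It is not a copy of the pandas source.

```python
import re

# Same pattern as quoted in the comment above; the name is a stand-in.
RE_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def remove_whitespace(text, regex=RE_WHITESPACE):
    """Strip both ends, then replace every regex match with one space.

    Assumed behavior, based on the description in this thread.
    """
    return regex.sub(" ", text.strip())


# A cell whose two fields are separated only by a newline.
raw_cell = "\nField 1\nField 2\n"
cleaned = remove_whitespace(raw_cell)
print(repr(cleaned))  # 'Field 1 Field 2'
```

Because every newline run matches the first alternative of the pattern, no line break in a cell can survive this step, regardless of which parser produced the raw text.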

@TomAugspurger TomAugspurger removed the Needs Info Clarification about behavior needed to assess issue label Jan 17, 2019
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Jan 17, 2019
@TomAugspurger
Contributor

Thanks for investigating. I think an option to disable that behavior makes sense.

You've given two examples now, one with newlines in the text, and one with <p> tags. Do you expect to normalize the <p> tags to newlines, so that the two would give the same output? Do we have any prior art to copy here?

@sasan00
Author

sasan00 commented Jan 18, 2019

I think the best option is an extra argument: a function that takes the raw text of a cell and returns the "cleaned up" version. Its default value would be _remove_whitespace to ensure backwards compatibility.
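
A rough sketch of what such a callback design could look like, outside of pandas. Everything here is hypothetical: clean_rows, the cell_cleaner parameter, collapse_whitespace and keep_line_breaks are illustrative names, not existing or proposed pandas API.

```python
import re
from typing import Callable, List

_RE_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def collapse_whitespace(text: str) -> str:
    """Stand-in for the current default: newlines and whitespace runs become spaces."""
    return _RE_WHITESPACE.sub(" ", text.strip())


def keep_line_breaks(text: str) -> str:
    """Alternative cleaner: trim each line, drop empty lines, keep the breaks."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def clean_rows(
    rows: List[List[str]],
    cell_cleaner: Callable[[str], str] = collapse_whitespace,
) -> List[List[str]]:
    """Apply the cleaner to every cell; the default keeps today's behavior."""
    return [[cell_cleaner(cell) for cell in row] for row in rows]


raw = [["\n  Field 1\n  \n  Field 2\n", "\n  Value 1\n  \n  Value 2\n"]]
default_rows = clean_rows(raw)
preserved_rows = clean_rows(raw, cell_cleaner=keep_line_breaks)
print(preserved_rows)  # [['Field 1\nField 2', 'Value 1\nValue 2']]
```

Passing a callable rather than a boolean flag leaves room for other policies (for example, joining lines with "; ") without adding more parameters.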

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Jan 21, 2019
@mroeschke mroeschke added the Bug label May 7, 2020
@markmbaum

Hi, wondering if this issue was ever resolved? In my case, I have a <ul> inside the HTML table and all the elements of each list are squished together after the table is parsed by read_html.
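
Until an option exists in read_html, one possible workaround is to extract the cells yourself and decide where the line breaks go. The sketch below uses only the standard library's html.parser; CellTextParser, table_rows and the BLOCK_TAGS set are hypothetical names and choices made for illustration. It assumes well-formed markup with explicit closing tags and does not handle nested tables, colspan, or rowspan.

```python
import re
from html.parser import HTMLParser

# Tags treated as line breaks inside a cell; this set is an assumption.
BLOCK_TAGS = {"p", "li", "br", "div"}


class CellTextParser(HTMLParser):
    """Collect table cells as strings, one list of cells per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag in BLOCK_TAGS and self._cell is not None:
            self._cell.append("\n")  # mark a block boundary

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Split on block boundaries only, trim each piece, drop empties.
            parts = (p.strip() for p in "".join(self._cell).split("\n"))
            self._row.append("\n".join(p for p in parts if p))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in BLOCK_TAGS and self._cell is not None:
            self._cell.append("\n")

    def handle_data(self, data):
        if self._cell is not None:
            # Source whitespace (including newlines) collapses to one space.
            self._cell.append(re.sub(r"\s+", " ", data))


def table_rows(html):
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


html = (
    "<table><tr>"
    "<td><p>Field 1</p><p>Field 2</p></td>"
    "<td><ul><li>Value 1</li><li>Value 2</li></ul></td>"
    "</tr></table>"
)
rows = table_rows(html)
print(rows)  # [['Field 1\nField 2', 'Value 1\nValue 2']]
```

The resulting list of rows can be passed to pd.DataFrame(rows). Note that this sketch only breaks lines at block-level tags: plain newlines in the HTML source are still treated as ordinary whitespace, so the two examples earlier in this thread would not produce the same output.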

@Derekt2

Derekt2 commented Feb 18, 2021

still encountering this bug.

@jreback
Contributor

jreback commented Feb 19, 2021

@Derekt2 this is open
you are welcome to submit a pull request to patch

Derekt2 pushed a commit to Derekt2/pandas that referenced this issue Feb 20, 2021
Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766
fredrikw added a commit to fredrikw/pandas that referenced this issue Jun 4, 2021
RomainL972 added a commit to RomainL972/pandas that referenced this issue Feb 19, 2022
Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766

Co-authored-by: Romain Lebbadi-Breteau <[email protected]>
RomainL972 pushed a commit to RomainL972/pandas that referenced this issue Feb 19, 2022
RomainL972 added a commit to RomainL972/pandas that referenced this issue Feb 19, 2022
TST: Added a simple test for issue pandas-dev#24766

Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766

Co-authored-by: Romain Lebbadi-Breteau <[email protected]>
Co-authored-by: Fredrik Wallner <[email protected]>
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@SuryaThiru

Hi. I'm experiencing the same bug with newlines in cells. I see that there are some existing contributions, will they be merged?

@iamef

iamef commented Dec 8, 2023

same

@RomainL972

take

10 participants