Stop parsing HTML with regex #235

elad661 · 2013-04-20T08:10:50Z

Parsing HTML with regex is ugly and leads to a lot of problems, yet we do it far too often.

We should either use a dedicated HTML parser (say lxml with cssselect or libsoup) or switch to JSON-based APIs wherever we can.

elad661 · 2013-04-27T10:15:36Z

If we take fuckingweather as an example, using cssselect would change

    page = web.get("http://thefuckingweather.com/?where=%s" % (text))
    re_mark = re.compile('<p class="remark">(.*?)</p>')
    results = re_mark.findall(page)

to

    page = lxml.html.parse("http://thefuckingweather.com/?where=%s" % (text)).getroot()
    result = page.cssselect('p.remark')[0].text

This is simpler, and it will keep working even if more attributes are added to the <p>.

Therefore I'm in favor of introducing cssselect as a dependency.
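
To illustrate the robustness point, here is a minimal sketch that compares the regex from above with a real parser on two versions of the same markup. It uses the standard library's `html.parser` as a stand-in for lxml/cssselect so that it runs without extra dependencies; the sample HTML strings, `RemarkParser`, and `remarks_with_parser` are hypothetical names made up for this example, not part of the bot.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: the same paragraph before and after the site
# adds extra attributes to the <p> tag.
OLD_HTML = '<div><p class="remark">Bring a coat.</p></div>'
NEW_HTML = '<div><p id="r1" class="remark big">Bring a coat.</p></div>'

# The regex from the comment above: it only matches the exact opening tag.
REMARK_RE = re.compile('<p class="remark">(.*?)</p>')


class RemarkParser(HTMLParser):
    """Collect the text of every <p> whose class list contains "remark"."""

    def __init__(self):
        super().__init__()
        self.remarks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "remark" in classes:
            self._inside = True
            self.remarks.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.remarks[-1] += data


def remarks_with_parser(html):
    """Return the text of all matching paragraphs in the given HTML string."""
    parser = RemarkParser()
    parser.feed(html)
    parser.close()
    return parser.remarks


if __name__ == "__main__":
    for label, html in (("old", OLD_HTML), ("new", NEW_HTML)):
        print(label, "regex:", REMARK_RE.findall(html))
        print(label, "parser:", remarks_with_parser(html))
```

The regex finds the remark only in the old markup, while the parser-based version finds it in both, which is the same behaviour `cssselect('p.remark')` would give.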

elad661 · 2013-05-31T18:00:24Z

I guess we can close this.

embolalia added a commit that referenced this issue Apr 25, 2013

[wikipedia] Complete rewrite to use MediaWiki API

934a150

Issue #221 and #235

elad661 pushed a commit that referenced this issue May 31, 2013

[youtube] Use youtube JSON API instead of parsing ATOM XML with regex

bf338d8

kinda related to issue #235

elad661 closed this as completed May 31, 2013

maxpowa pushed a commit to maxpowa/Inumuta that referenced this issue Feb 20, 2015

[wikipedia] Complete rewrite to use MediaWiki API

645b288

Issue sopel-irc#221 and sopel-irc#235

maxpowa pushed a commit to maxpowa/Inumuta that referenced this issue Feb 20, 2015

[youtube] Use youtube JSON API instead of parsing ATOM XML with regex

7579d25

kinda related to issue sopel-irc#235
