Stop parsing HTML with regex #235

Closed
elad661 opened this issue Apr 20, 2013 · 2 comments

elad661 commented Apr 20, 2013

Parsing HTML with regex is ugly and leads to a lot of problems, yet we do it far too often.

We should either use a dedicated HTML parser (say lxml with cssselect or libsoup) or switch to JSON-based APIs wherever we can.

embolalia added a commit that referenced this issue Apr 25, 2013

elad661 commented Apr 27, 2013

If we take fuckingweather as an example, using cssselect would change

    import re
    page = web.get("http://thefuckingweather.com/?where=%s" % (text))
    # Matches only when the opening tag is exactly <p class="remark">
    re_mark = re.compile('<p class="remark">(.*?)</p>')
    results = re_mark.findall(page)

to

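    import lxml.html
    # Parse the page into an element tree, then take the first <p> with class "remark"
    page = lxml.html.parse("http://thefuckingweather.com/?where=%s" % (text)).getroot()
    result = page.cssselect('p.remark')[0].text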

This is simpler, and it will keep working even if more attributes are added to the <p>.

Therefore I'm in favor of introducing cssselect as a dependency.
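The robustness point can be checked with a small, self-contained sketch. It is an illustration under assumptions, not code from the project: the markup is made up rather than copied from the real page, and to stay dependency-free it uses the standard library's `html.parser` as a stand-in for lxml with cssselect. `RemarkParser`, `find_remarks`, and the sample strings are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Same pattern as the regex snippet: it only matches when the opening
# tag is written exactly as <p class="remark">.
RE_MARK = re.compile('<p class="remark">(.*?)</p>')


class RemarkParser(HTMLParser):
    """Collects the text of every <p> whose class list contains 'remark'."""

    def __init__(self):
        super().__init__()
        self.remarks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "remark" in classes:
            self._inside = True
            self.remarks.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.remarks[-1] += data


def find_remarks(html):
    parser = RemarkParser()
    parser.feed(html)
    parser.close()
    return parser.remarks


# Hypothetical markup: the second string adds an id and a second class.
original = '<p class="remark">Bring a coat</p>'
changed = '<p id="r1" class="remark big">Bring a coat</p>'

regex_original = RE_MARK.findall(original)
regex_changed = RE_MARK.findall(changed)
parser_original = find_remarks(original)
parser_changed = find_remarks(changed)
```

Here the regex finds the remark in `original` but nothing in `changed`, while the parser-based lookup returns the remark for both. With lxml, the `p.remark` selector would play the role of the class check in `handle_starttag`.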

elad661 pushed a commit that referenced this issue May 31, 2013

elad661 commented May 31, 2013

I guess we can close this.

elad661 closed this as completed May 31, 2013
maxpowa pushed a commit to maxpowa/Inumuta that referenced this issue Feb 20, 2015