Stop parsing HTML with regex #235
Comments
embolalia added a commit that referenced this issue Apr 25, 2013
If we take fuckingweather as an example, using cssselect would change

    page = web.get("http://thefuckingweather.com/?where=%s" % (text))
    re_mark = re.compile('<p class="remark">(.*?)</p>')
    results = re_mark.findall(page)

to

    page = lxml.html.parse("http://thefuckingweather.com/?where=%s" % (text)).getroot()
    result = page.cssselect('p.remark')[0].text

Which is simpler, and will keep working even if more attributes are added to the `<p>` tag. Therefore I'm in favor of introducing cssselect as a dependency.
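To make the fragility argument concrete, here is a minimal sketch. It assumes two hypothetical markup samples (`PLAIN` and `EXTRA`, not taken from the real site) and, so that it runs without extra dependencies, swaps in the standard library's `html.parser` in place of lxml/cssselect. The `RemarkParser` class and `find_remarks` helper are illustrative names only, not part of the project.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the same remark, with and without an extra attribute.
PLAIN = '<p class="remark">Nice weather</p>'
EXTRA = '<p id="r1" class="remark">Nice weather</p>'

# The regex from the comment above: it only matches the exact tag text.
re_mark = re.compile('<p class="remark">(.*?)</p>')


class RemarkParser(HTMLParser):
    """Collect the text of every <p> whose class list contains "remark"."""

    def __init__(self):
        super().__init__()
        self.remarks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "remark" in classes:
            self._inside = True
            self.remarks.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.remarks[-1] += data


def find_remarks(html):
    """Return the text of all matching paragraphs in an HTML string."""
    parser = RemarkParser()
    parser.feed(html)
    parser.close()
    return parser.remarks
```

With these samples, `re_mark.findall(EXTRA)` finds nothing because the added `id` attribute breaks the literal match, while `find_remarks(EXTRA)` still returns the remark. A CSS selector such as `p.remark` in cssselect would presumably behave like the parser here, since it matches on the class rather than on the exact tag text.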
I guess we can close this.
maxpowa pushed a commit to maxpowa/Inumuta that referenced this issue Feb 20, 2015
kinda related to issue sopel-irc#235
Parsing HTML with regex is ugly and leads to a lot of problems, yet we do it far too often.
We should either use a dedicated HTML parser (say lxml with cssselect or libsoup) or switch to JSON-based APIs wherever we can.