Fixed unwanted content in some websites #509
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #509   +/-   ##
=======================================
  Coverage   97.16%   97.16%
=======================================
  Files          22       22
  Lines        3425     3425
=======================================
  Hits         3328     3328
  Misses         97       97
Thanks, everything works.
Hey @adbar, I just tested it more, and I think
style="display: none" and style="display:none"
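The two spellings above differ only in whitespace, so a whitespace-tolerant check covers both. The following is a minimal sketch of that idea using only the standard library; `is_hidden` is a hypothetical helper for illustration, not a function from this PR or from the library.

```python
import re

# Hypothetical helper, for illustration only: it is not part of this PR.
# The pattern tolerates optional whitespace around the colon, so both
# "display: none" and "display:none" are recognised.
_HIDDEN = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def is_hidden(style):
    """Return True if an inline style value hides the element."""
    return bool(style) and _HIDDEN.search(style) is not None


print(is_hidden("display: none"))
print(is_hidden("display:none"))
print(is_hidden("display: block"))
```

An equivalent XPath-based rule would need either one expression per spelling or a whitespace-normalising step, which is the trade-off this sketch tries to make visible.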
Thanks for testing it, should I proceed with the PR as it is now, then? There is still an odd error in certain tests but it should be OK.
Yes, you can proceed with this one. I think the tests started failing when I did the rebase.
I don't understand the bug on two test series, it doesn't make sense. I'll just wait a bit and restart the tests, maybe the problem will be solved or propagated to all versions by then.
@felipehertzer The problem is solved, can I merge the PR?
@adbar Sure, all good for me.
I just removed an expression due to accuracy concerns, maybe "-comments" is just too broad. The rest works fine and even adds a bit of precision on my benchmark, thanks!
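To illustrate why a pattern like "-comments" can be too broad, here is a small sketch that assumes the expression was a substring test on the class attribute (similar in spirit to XPath `contains()`). The class names are invented for the example and do not come from any of the sites discussed here.

```python
# Hypothetical class names, chosen only to illustrate over-matching.
CLASS_NAMES = [
    "article-comments",     # an actual comment section
    "post-comments-list",   # another comment section
    "no-comments-article",  # article body whose comments are closed
]


def matches(pattern, class_name):
    """Substring test, similar in spirit to XPath contains(@class, pattern)."""
    return pattern in class_name


# The broad pattern also catches the article body.
broad = [name for name in CLASS_NAMES if matches("-comments", name)]

# A stricter rule (here: the class must end with the pattern) spares it,
# at the cost of missing "post-comments-list".
strict = [name for name in CLASS_NAMES if name.endswith("-comments")]

print(broad)
print(strict)
```

Neither rule is right for every site, which is consistent with checking such expressions against a benchmark before keeping them.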
Hey @adbar,
I checked the major Australian websites and added some classes and ids to xpaths.py to avoid extracting unwanted content.
Canberra Times: (It was getting the newsletter box)
https://urlis.net/jw9sm5d2
The Australian (It was getting the related stories and most popular stories)
https://urlis.net/69cnu4ze
Courier Mail (It was getting an AMP box at the bottom)
https://urlis.net/qeuknv7n
Daily Mail (It was getting an 'exclusive' box and the comments)
https://urlis.net/z6gmds29
AFR (It was getting the author description at the bottom of the article)
https://urlis.net/ms64hmi5
ABC (It was getting a form at the bottom)
https://urlis.net/2yakvo39
The Independent (It was showing comments)
https://urlis.net/rx3wq49x
Onya Magazine (It was getting the sidebar content instead of the body)
https://urlis.net/m5mdzetf
Thank you.
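The general approach described above, discarding elements whose class or id marks them as boilerplate, can be sketched as follows. This is a standard-library approximation only: xpaths.py expresses such rules as XPath expressions, and the marker strings below are hypothetical, since the exact classes and ids added in this PR are not listed in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical markers: the real entries added to xpaths.py are not shown here.
UNWANTED = ("newsletter", "related-stories", "most-popular")


def is_unwanted(elem):
    """True if the element's class or id contains one of the markers."""
    label = (elem.get("class", "") + " " + elem.get("id", "")).lower()
    return any(marker in label for marker in UNWANTED)


def prune(root):
    """Remove unwanted descendants of root and return root."""
    # Collect (parent, child) pairs first so the tree is not modified
    # while it is being iterated.
    doomed = [
        (parent, child)
        for parent in root.iter()
        for child in parent
        if is_unwanted(child)
    ]
    for parent, child in doomed:
        # Removing an element also drops its tail text, which is
        # acceptable for this sketch.
        parent.remove(child)
    return root


html = """<article>
  <p>Main story text.</p>
  <div class="newsletter-signup"><p>Subscribe now</p></div>
  <div id="most-popular"><p>Other headlines</p></div>
</article>"""

root = prune(ET.fromstring(html))
text = " ".join("".join(root.itertext()).split())
print(text)
```

Only the paragraph with the story text survives; the newsletter box and the "most popular" block are dropped before the text is collected.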