Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed unwanted content in some websites #509

Merged
merged 10 commits into from
Feb 27, 2024
Merged

Conversation

felipehertzer
Copy link
Contributor

@felipehertzer felipehertzer commented Feb 15, 2024

Hey @adbar,

I did a check in the majors Australian websites and I added some classes and ids to the xpaths.py to avoid get unwanted content.

Canberra Times: (It was getting the newsletter box)
https://urlis.net/jw9sm5d2

The Australian (It was getting the related stories and most popular stories)
https://urlis.net/69cnu4ze

Courier Mail (It was getting an Amp box in the bottom)
https://urlis.net/qeuknv7n

Daily Mail (It was getting an 'exclusive' box and the comments)
https://urlis.net/z6gmds29

AFR (It was getting the author description in the bottom of the article)
https://urlis.net/ms64hmi5

ABC (It was getting a form in the bottom)
https://urlis.net/2yakvo39

The independent (it was showing comments)
https://urlis.net/rx3wq49x

Onya Magazine (It was getting the siderbar content instead of body)
https://urlis.net/m5mdzetf

Thank you.

Copy link

codecov bot commented Feb 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.16%. Comparing base (20d291f) to head (f1fdea1).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #509   +/-   ##
=======================================
  Coverage   97.16%   97.16%           
=======================================
  Files          22       22           
  Lines        3425     3425           
=======================================
  Hits         3328     3328           
  Misses         97       97           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar
Copy link
Owner

adbar commented Feb 15, 2024

Thanks, everything works.

@adbar adbar mentioned this pull request Feb 15, 2024
@felipehertzer
Copy link
Contributor Author

felipehertzer commented Feb 19, 2024

Hey @adbar, I just tested it more, and I think or contains(@class, "-article") will create more problems than solve. Example of a few classes that will match in the real_words_test: 'print-article-meta', 'category-articles'.

@adbar
Copy link
Owner

adbar commented Feb 19, 2024

Thanks for testing it, should I process the PR as it is now then?

There is still an odd error in certain tests but it should be OK.

@felipehertzer
Copy link
Contributor Author

Yes you can proceed with this one, I think the tests started failing when I did the rebase.

@adbar
Copy link
Owner

adbar commented Feb 20, 2024

I don't understand the bug on two test series, it doesn't make sense. I'll just wait a bit a restart the tests, maybe the problem will be solved or propagated to all versions by then.

@adbar
Copy link
Owner

adbar commented Feb 26, 2024

@felipehertzer The problem is solved, can I merge the PR?

@felipehertzer
Copy link
Contributor Author

@adbar Sure, all good for me.

@adbar
Copy link
Owner

adbar commented Feb 27, 2024

I just removed an expression due to accuracy concerns, maybe "-comments" is just too broad. The rest work fine and even add a bit of precision on my benchmark, thanks!

@adbar adbar merged commit 0d7d5a6 into adbar:master Feb 27, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants