[SITES] patchwork.kernel.org #627

Open
palfrey opened this issue Mar 27, 2024 · 2 comments

@palfrey
Contributor

palfrey commented Mar 27, 2024

First please check that it is really an issue with the library, and not some special case of website:

[x] There is no paywall
[x] You do not have to be logged in to see the articles
[x] You tried using a common browser user agent in your configuration / call
[x] The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://patchwork.kernel.org/

Some sample URLs that I have tried

https://patchwork.kernel.org/project/linux-kselftest/cover/[email protected]/

The exact code I used to test this article/website

  from newspaper import Article

  # url is the sample URL listed above
  article = Article(url, fetch_images=False, follow_meta_refresh=True)
  article.download()
  article.parse()

What parts of the article are missing / not parsed correctly

[ ] Title
[x] Text Content
[ ] Publication Date
[ ] Authors
[ ] Images
[ ] Movies
[ ] Other, please specify:

Other information, remarks, messages, etc:

@gzanfardino

This doesn't really look like a newspaper. Is it within the project scope to support this kind of website?

@AndyTheFactory
Owner

The problem is that all replies are in divs with the "comment" class.

The module removes every node whose class matches, because on most news sites these nodes belong to the comments section below the article and are therefore excluded from the article body.

These classes are defined here:

self.remove_nodes_re = (
    "^side$|combx|retweet|mediaarticlerelated|menucontainer|"
    "navbar|storytopbar-bucket|utility-bar|inline-share-tools"
    "|comment|PopularQuestions|contact|foot|footer|Footer|footnote"
    "|cnn_strycaptiontxt|cnn_html_slideshow|cnn_strylftcntnt"
    "|links|meta$|shoutbox|sponsor"
    "|tags|socialnetworking|socialNetworking|cnnStryHghLght"
    "|cnn_stryspcvbx|^inset$|pagetools|post-attributes"
    "|welcome_form|contentTools2|the_answers"
    "|communitypromo|runaroundLeft|subscribe|vcard|articleheadings"
    "|date|^print$|popup|author-dropdown|tools|socialtools|byline"
    "|konafilter|KonaFilter|breadcrumbs|^fn$|wp-caption-text"
    "|legende|ajoutVideo|timestamp|js_replies"
)
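
To illustrate why a "comment" class gets caught, here is a minimal sketch using only Python's `re` module. It assumes the pattern is applied as an unanchored, case-insensitive search against a node's class or id value; how the library actually applies it may differ. The pattern is an abridged copy of the one quoted above, and `would_be_removed` is a hypothetical helper, not part of the library.

```python
import re

# Abridged copy of the pattern quoted above; the real list is longer.
REMOVE_NODES_RE = re.compile(
    "^side$|combx|retweet|navbar|comment|foot|footer|^inset$|js_replies",
    re.IGNORECASE,  # assumption: matching ignores case
)


def would_be_removed(attr_value: str) -> bool:
    """Hypothetical helper: does a class/id value match the removal pattern?"""
    # Unanchored alternatives such as "comment" match anywhere in the value,
    # while anchored ones such as "^side$" only match the exact value.
    return REMOVE_NODES_RE.search(attr_value) is not None


for value in ["comment", "comment-body", "article-content", "side", "sidebar"]:
    print(f"{value!r}: {would_be_removed(value)}")
```

Under these assumptions, any class containing the substring "comment" is flagged, which is consistent with the patchwork replies being dropped, while "sidebar" is not flagged because "^side$" is anchored at both ends.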
