[SITES] https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/ #643

Open
4 tasks done
palfrey opened this issue Jun 21, 2024 · 0 comments

palfrey (Contributor) commented Jun 21, 2024

First, please check that this is really an issue with the library and not a special case of the website:

  • There is no paywall
  • You do not have to be logged in to see the articles
  • You tried using a common browser user agent in your configuration / call
  • The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/

Some sample URLs that I have tried

https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/

The exact code I used to test this article/website

A standard article.parse() just gets a 403. Feeding in the raw HTML fetched with Playwright, on the other hand, raises the error shown under "Other information" (and I've checked that page.content() just returns HTML).

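        from newspaper import Article
        from playwright.sync_api import sync_playwright

        # Setup assumed for completeness; url and article were defined elsewhere in my code
        url = "https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/"
        article = Article(url)
        with sync_playwright() as p:
            browser = p.firefox.launch()
            page = browser.new_page()
            page.goto(url)
            # Hand the browser-rendered HTML to newspaper, then parse it
            article.html = page.content()
            article.parse()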

What parts of the article are missing / not parsed correctly

Everything, because article.parse() raises an exception before any field is extracted.

Other information, remarks, messages, etc:

<in my code as per the above>
    article.parse()
lib/python3.11/site-packages/newspaper/article.py:466: in parse
    authors = self.extractor.get_authors(self.doc)
lib/python3.11/site-packages/newspaper/extractors/content_extractor.py:59: in get_authors
    return self.author_extractor.parse(doc)
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in parse
    authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in <listcomp>
    authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

pattern = '[\n\t\r\xa0]', repl = ' '
string = {'biography': "<p>First published in 1869, <b><i>Nature</i></b> is the world's leading multidisciplinary science journ.../p>", 'contacts': [], 'contentful_id': '7Ek1B681o6mb6QOBg14RKO', 'mura_id': 'A7F2375E-BB3B-4896-8F706A83EEA765D7', ...}
count = 0, flags = 0

    def sub(pattern, repl, string, count=0, flags=0):
        """Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the Match object and must return
        a replacement string to be used."""
>       return _compile(pattern, flags).sub(repl, string, count)
E       TypeError: expected string or bytes-like object, got 'dict'

lib/python3.11/re/__init__.py:185: TypeError
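
The traceback suggests that at least one entry in the authors list is a dict (here, an author record with keys like 'biography' and 'contacts') rather than a string, so re.sub() rejects it. A minimal sketch of a defensive cleanup step follows. It is not the library's code: clean_authors is a hypothetical helper, and the "name" key is an assumption, since the dict in the traceback is truncated and its remaining keys are not visible.

```python
import re


def clean_authors(authors):
    """Normalize a mixed list of author entries to whitespace-cleaned strings.

    Hypothetical helper for illustration; not part of newspaper.
    """
    cleaned = []
    for entry in authors:
        # Assumption: dict entries may carry the display name under "name".
        if isinstance(entry, dict):
            entry = entry.get("name")
        # Skip anything that is still not a non-empty string.
        if not isinstance(entry, str) or not entry:
            continue
        # Same substitution as authors_extractor.py line 131 in the traceback.
        cleaned.append(re.sub("[\n\t\r\xa0]", " ", entry))
    return cleaned
```

With a guard like this, an author dict without a usable name would be dropped instead of aborting the whole parse. Whether dropping it or extracting some other field is the right behavior is a decision for the maintainers.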