The exact code I used to test this article/website
Standard `article.parse()` just gets a 403. Feeding in the raw HTML fetched with Playwright, on the other hand, raises the error below (and I've checked that `page.content()` just returns HTML).
**What parts of the article are missing / not parsed correctly**
Everything, because parsing raises an exception.
Other information, remarks, messages, etc:
<in my code as per the above>
article.parse()
lib/python3.11/site-packages/newspaper/article.py:466: in parse
authors = self.extractor.get_authors(self.doc)
lib/python3.11/site-packages/newspaper/extractors/content_extractor.py:59: in get_authors
return self.author_extractor.parse(doc)
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in parse
authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in <listcomp>
authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pattern = '[\n\t\r\xa0]', repl = ' '
string = {'biography': "<p>First published in 1869, <b><i>Nature</i></b> is the world's leading multidisciplinary science journ.../p>", 'contacts': [], 'contentful_id': '7Ek1B681o6mb6QOBg14RKO', 'mura_id': 'A7F2375E-BB3B-4896-8F706A83EEA765D7', ...}
count = 0, flags = 0
def sub(pattern, repl, string, count=0, flags=0):
"""Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the Match object and must return
a replacement string to be used."""
> return _compile(pattern, flags).sub(repl, string, count)
E TypeError: expected string or bytes-like object, got 'dict'
lib/python3.11/re/__init__.py:185: TypeError
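The traceback suggests that the author list contained a `dict` where the whitespace clean-up expected a string. The following is a minimal sketch of that failure and of one possible guard, not the library's actual fix: `clean_authors` and `raw_authors` are hypothetical names, the author name is made up, and the dict is an abbreviated stand-in for the one shown in the traceback.

```python
import re


def clean_authors(authors):
    """Normalise whitespace in author names, skipping non-string entries.

    Hypothetical helper for illustration; it is not part of newspaper.
    """
    cleaned = []
    for entry in authors:
        # Skip empty values and anything that is not a plain string (e.g. a dict).
        if not isinstance(entry, str) or not entry:
            continue
        cleaned.append(re.sub("[\n\t\r\xa0]", " ", entry))
    return cleaned


# Made-up input: one string author, one dict shaped like the value in the traceback.
raw_authors = ["Jane\xa0Doe", {"biography": "<p>...</p>", "contacts": []}, ""]

# The unguarded comprehension from the traceback fails on the dict entry.
try:
    [re.sub("[\n\t\r\xa0]", " ", x) for x in raw_authors if x]
except TypeError as exc:
    print(f"unguarded comprehension fails: {exc}")

# The guarded version keeps only the string entry.
print(clean_authors(raw_authors))
```

Dropping non-string entries is only one option; whether the dict carries a usable author name is not visible in the truncated traceback output, so this sketch does not try to extract one.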