Over here in the Media Cloud project we're seeing poor performance on the content extraction task for a variety of pages that include links to other "related" stories at the end of article content. Our use case is trying to extract only article content as text. Do you have advice on tweaks to make to improve that performance? This might be the opposite of #518, because we do not want related links as part of content.
Here's sample code with real examples, parsed in a way very similar to our usage. The function returns True if the supplied text is included in the extracted content, which is the erroneous result in our use case. Each of these incorrectly includes text from a "related links" type callout that appears after the article content. Any advice appreciated.
import requests
import trafilatura

MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'


def is_text_in_webpage_content(txt: str, url: str) -> bool:
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False)
    content_text = parsed['text']
    return txt in content_text


print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))
Hi @rahulbot, thanks for your feedback. I'll need to check the webpages and the current approach to see if I can find a way to exclude related links. It can be confusing since links are sometimes part of the article and sometimes not.
In the meantime, can you try using favor_precision=True on those pages? This option allows for more restrictive content filtering.
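For reference, here is a minimal sketch of how that suggestion could be tried. It assumes `trafilatura.extract` is called with `favor_precision=True`; the names `extract_text` and `leaked_snippets` are hypothetical helpers introduced only for illustration, and nothing here confirms that the stricter setting actually removes the related links on the pages listed above.

```python
from typing import Iterable, List, Optional


def extract_text(html: str, url: Optional[str] = None, precise: bool = False) -> str:
    """Return trafilatura's extracted main text, or '' if nothing was extracted."""
    # Imported lazily so leaked_snippets() stays usable without trafilatura installed.
    import trafilatura

    text = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_images=False,
        favor_precision=precise,  # stricter content filtering when True
    )
    return text or ""  # extract() returns None when no content is found


def leaked_snippets(content: str, snippets: Iterable[str]) -> List[str]:
    """Hypothetical helper: list the unwanted snippets still present in content."""
    return [s for s in snippets if s in content]
```

Calling `extract_text(req.text, url=url, precise=False)` and then `precise=True` on the same downloaded page, and passing each result to `leaked_snippets` with strings such as 'Thai Official', would show whether the stricter setting drops the related-links text for that page.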
adbar changed the title from "best approaches to removing related links at end of article/sidebar?" to "Removing related links at end of article/sidebar on news websites?" on May 6, 2024