Numerous duplicate posts in BBC feed #1745

Open
slammer99uk opened this issue Apr 15, 2024 · 2 comments
Comments

@slammer99uk

I leave my iMac on overnight to download articles every 60 minutes. I have noticed recently (for about 3 months) that duplicate or even triplicate posts keep appearing in the BBC news feed window. As far as I am aware, I changed nothing at my end.

Screenshot 2024-04-15 at 07 49 04

This is the RSS link: https://feeds.bbci.co.uk/news/world/rss.xml

@Eitot
Contributor

Eitot commented Jun 24, 2024

I can confirm this. I have checked the database and found that the articles are identical except for the GUID (globally unique identifier) that the feed itself provides. For instance:

SELECT title, message_id, link, date, revised_flag FROM messages WHERE title LIKE '%Belarus opposition leader warns%';
| title | message_id | link | date | revised_flag |
| --- | --- | --- | --- | --- |
| Belarus opposition leader warns Poland over borders | https://www.bbc.com/news/articles/cldddvpgk90o#0 | https://www.bbc.com/news/articles/cldddvpgk90o | 1719161442.0 | 0 |
| Belarus opposition leader warns Poland over borders | https://www.bbc.com/news/articles/cldddvpgk90o#1 | https://www.bbc.com/news/articles/cldddvpgk90o | 1719161442.0 | 0 |

Note the trailing #0 and #1 in the message_id column. I have checked the source of the RSS feed, and those GUID values originate there. According to the RSS specification, an aggregator ought to treat the provided GUID value as a string, as-is.
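For anyone who wants to check their own database for the same pattern, here is a minimal sketch in Python. It assumes a SQLite table named messages with only the five columns used in the query above (the real Vienna schema has more), and it builds a throwaway in-memory copy rather than touching the actual database. The third row and the function name find_duplicate_links are hypothetical, added only for illustration.

```python
import sqlite3


def find_duplicate_links(conn):
    """Return (link, copies, guids) for articles stored under more than one message_id."""
    return conn.execute(
        """
        SELECT link, COUNT(*) AS copies, GROUP_CONCAT(message_id, ' ') AS guids
        FROM messages
        GROUP BY link, title, date
        HAVING COUNT(*) > 1
        ORDER BY copies DESC
        """
    ).fetchall()


# In-memory stand-in for the database; the column set is an assumption
# taken from the query in this thread, not the full schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "title TEXT, message_id TEXT, link TEXT, date REAL, revised_flag INTEGER)"
)

base = "https://www.bbc.com/news/articles/cldddvpgk90o"
sample_rows = [
    # The two rows reported above: identical except for the GUID suffix.
    ("Belarus opposition leader warns Poland over borders", base + "#0", base, 1719161442.0, 0),
    ("Belarus opposition leader warns Poland over borders", base + "#1", base, 1719161442.0, 0),
    # Hypothetical unrelated article, which should not be flagged.
    ("Some other headline", "https://example.org/a#0", "https://example.org/a", 1719160000.0, 0),
]
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", sample_rows)

duplicates = find_duplicate_links(conn)
for link, copies, guids in duplicates:
    print(link, copies, guids)
```

Grouping on link, title and date together keeps the check conservative: an entry is only reported when everything except the GUID matches.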

@Eitot
Contributor

Eitot commented Jun 24, 2024

I don't think that this is something that can or ought to be fixed by Vienna. The entries appear to be identical except for the GUID value. Even the body text is identical. There is no other way to distinguish those articles. I don't understand why the BBC would publish the same entries multiple times. I haven't found a source for this practice of using #<number> either.
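To illustrate how little separates the two entries, the sketch below strips the URL fragment from each GUID with Python's urllib.parse.urldefrag and compares the result with the link. This is only a demonstration of the observation, not a proposed change to Vienna: treating the GUID as anything other than an opaque string would go against the specification. The helper name guid_without_fragment is hypothetical.

```python
from urllib.parse import urldefrag


def guid_without_fragment(guid):
    """Drop the '#<number>' suffix, assuming the GUID happens to be a URL."""
    return urldefrag(guid).url


link = "https://www.bbc.com/news/articles/cldddvpgk90o"
guids = [link + "#0", link + "#1"]

# Both GUIDs reduce to the same value, which is also the article link.
stripped = {guid_without_fragment(g) for g in guids}
print(stripped == {link})
```

Once the fragment is removed nothing remains to tell the entries apart, which matches what the database shows.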
