
New london.gov.uk breaks TheyWorkForYou questions scraper #1687

Open
ajparsons opened this issue Dec 1, 2022 · 4 comments
ajparsons (Contributor) commented Dec 1, 2022

London has a new website: https://www.london.gov.uk/

This breaks the scraper we were using to get Mayor's Questions for https://www.theyworkforyou.com/london/.

The new site doesn't have a page per session like the previous one did; to get the equivalent we would need to query date ranges through the search: https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer
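For illustration only, a minimal sketch of what date-range querying could look like; the parameter names and the result-link selector below are guesses, not documented parameters of the new search page, and would need checking against the live site:

```python
# Hypothetical sketch: "date_from"/"date_to" and the link selector are
# placeholders, not confirmed parameters of the new find-an-answer search.
import requests
from bs4 import BeautifulSoup

SEARCH_URL = ("https://www.london.gov.uk/who-we-are/"
              "what-london-assembly-does/questions-mayor/find-an-answer")

def question_links_for_range(date_from, date_to):
    """Fetch one date range from the search and return the question URLs found."""
    resp = requests.get(SEARCH_URL, params={"date_from": date_from, "date_to": date_to})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a["href"] for a in soup.select("a[href*='/find-an-answer/']")]
```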

The members feed comes via Wikidata and is unaffected.

As a first action, we should contact them, raise awareness of the issue, and see if we can get a nicer data feed to work with rather than writing a new scraper. Assigning myself to keep track.

ajparsons self-assigned this Dec 1, 2022
ajparsons (Contributor, Author) commented:

Have sent a message about a data feed.

ajparsons (Contributor, Author) commented:

Had a reply; they've suggested creating an RSS feed:

It’s a good suggestion re adding a feed for this information - while it sounds like this has happened as you were scraping our previous site for this data, our technical team have advised the best solution going forward is that we include Mayor’s Questions as an RSS feed, so that the data is available in a machine readable/accessible format.

Let us know if this will work for you and we can look into building that in over the next month or two. It would be good to understand what would be most useful for you in terms of the info provided.

I think this would work well for us? We'd still have to make a new scraper, but it should be more stable over the long term.

I can have a look at the fields the scraper was originally extracting and pass those back to them - do we have any other suggestions about the format? e.g. the ability to query by day/month?

dracos (Member) commented Jan 11, 2023

I guess the main issue is the one we had with their site: if it's an RSS feed of questions, how do we get the answers? Questions appear before there are answers, e.g. https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/lfb-staff-progression-2 (as opposed to https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/responding-climate-breakdown, which has an answer on the page).

If they just translated their search page into RSS, we'd get a feed of links to the questions and would then have to store/fetch them all every day to see whether each one had an answer in the HTML yet.

How would we know how far back the RSS feed went, and what it contained? I guess if it accepts parameters, and so is more an API that happens to output RSS, that might be okay (though would we still have to parse the speaker/question/answer out of the RSS? It's not a very rich output format, after all).

Ideally, each morning we'd want to be able to say "give us an RSS feed of anything that got an answer (or, even better, 'was updated') yesterday" and get back something with all of that in it. What the format is then doesn't really matter.
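As a rough sketch of the polling pattern described above - with the feed URL and the "does this page have an answer yet?" check both stand-ins until a real feed exists:

```python
# Sketch of the "poll the feed, re-check each question for an answer" pattern.
# FEED_URL and the has_answer() heuristic are assumptions, not a real API.
import feedparser
import requests

FEED_URL = "https://www.london.gov.uk/questions-mayor.rss"  # hypothetical

def question_urls(feed_url=FEED_URL):
    """Return every question link currently listed in the feed."""
    return [entry.link for entry in feedparser.parse(feed_url).entries]

def has_answer(url):
    """Placeholder check: does the question page mention an answer yet?"""
    return "Answered by" in requests.get(url).text

def daily_poll():
    for url in question_urls():
        if has_answer(url):
            print("answer available:", url)
```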

ajparsons (Contributor, Author) commented:

I've had a go at the scraper; the big inefficiency is having to requery all the non-answered questions. If we merge it, I'll go back to the London Assembly and see if we can still get a feed for that (it speeds us up and means fewer queries for them).
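To make the inefficiency concrete, the shape of that loop is roughly the following; the storage file and the answer check are stand-ins for whatever the real scraper uses:

```python
# Illustrative only: every run re-fetches every question still waiting for
# an answer, which is the inefficiency a feed of "updated yesterday" would avoid.
import json
from pathlib import Path

STATE_FILE = Path("unanswered.json")  # stand-in for the scraper's real storage

def requery_unanswered(fetch_and_check):
    """fetch_and_check(url) should return True once the page finally has an answer."""
    pending = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else []
    still_pending = [url for url in pending if not fetch_and_check(url)]
    STATE_FILE.write_text(json.dumps(still_pending))
```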
