
New london.gov.uk breaks TheyWorkForYou questions scraper #1687

Open
ajparsons opened this issue Dec 1, 2022 · 4 comments
ajparsons (Contributor) commented Dec 1, 2022

London has a new website: https://www.london.gov.uk/

This breaks the scraper we were using to get Mayor's Questions for https://www.theyworkforyou.com/london/.

The new site doesn't have a page per session like the previous one did; to get the equivalent we would need to query date ranges through the search: https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer
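For illustration only, a minimal sketch of what date-range querying could look like; the parameter names and the result-link selector below are guesses, not documented parameters of the new search page, and would need checking against the live site:

```python
# Hypothetical sketch: "date_from"/"date_to" and the link selector are
# placeholders, not confirmed parameters of the new find-an-answer search.
import requests
from bs4 import BeautifulSoup

SEARCH_URL = ("https://www.london.gov.uk/who-we-are/"
              "what-london-assembly-does/questions-mayor/find-an-answer")

def question_links_for_range(date_from, date_to):
    """Fetch one date range from the search and return the question URLs found."""
    resp = requests.get(SEARCH_URL, params={"date_from": date_from, "date_to": date_to})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a["href"] for a in soup.select("a[href*='/find-an-answer/']")]
```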

The members feed comes via Wikidata and is unaffected.

As a first action, we should contact them, raise awareness of the issue, and see if we can get a nicer data feed to work with rather than writing a new scraper. Assigning myself to keep track.

ajparsons self-assigned this Dec 1, 2022
ajparsons (Contributor, Author) commented:

Have sent a message about a data feed.

ajparsons (Contributor, Author) commented:

Had a reply; they've suggested creating an RSS feed:

It’s a good suggestion re adding a feed for this information - while it sounds like this has happened as you were scraping our previous site for this data, our technical team have advised the best solution going forward is that we include Mayor’s Questions as an RSS feed, so that the data is available in a machine readable/accessible format.

Let us know if this will work for you and we can look into building that in over the next month or two. It would be good to understand what would be most useful for you in terms of the info provided.

I think this would work well for us? We'd still have to make a new scraper, but it should be more stable over the long term.

I can have a look at the fields the scraper was originally extracting and pass those back to them - do we have any other suggestions about the format? e.g. the ability to query by day/month?

dracos (Member) commented Jan 11, 2023

I guess the main issue is the one we had with their site: if it's an RSS feed of questions, how do we get the answers? Questions appear before there are answers, e.g. https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/lfb-staff-progression-2 (as opposed to https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/responding-climate-breakdown, which has an answer on the page).

If they just translated their search page into RSS, we'd get a feed of links to the questions and would then have to store/fetch them all every day to see whether each one had an answer in the HTML yet.

How would we know how far back the RSS feed went, and what it contained? I guess if it accepts parameters, and so is more an API that happens to output RSS, that might be okay (though would we still have to parse the speaker/question/answer out of the RSS? It's not a very rich output format, after all).

Ideally, each morning we'd want to be able to say "give us an RSS feed of anything that got an answer (or, even better, 'was updated') yesterday" and get back something with all of that in it. What the format is then doesn't really matter.
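As a rough sketch of the polling pattern described above - with the feed URL and the "does this page have an answer yet?" check both stand-ins until a real feed exists:

```python
# Sketch of the "poll the feed, re-check each question for an answer" pattern.
# FEED_URL and the has_answer() heuristic are assumptions, not a real API.
import feedparser
import requests

FEED_URL = "https://www.london.gov.uk/questions-mayor.rss"  # hypothetical

def question_urls(feed_url=FEED_URL):
    """Return every question link currently listed in the feed."""
    return [entry.link for entry in feedparser.parse(feed_url).entries]

def has_answer(url):
    """Placeholder check: does the question page mention an answer yet?"""
    return "Answered by" in requests.get(url).text

def daily_poll():
    for url in question_urls():
        if has_answer(url):
            print("answer available:", url)
```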

ajparsons (Contributor, Author) commented:

I've had a go at the scraper; the big inefficiency is having to requery all the non-answered questions. If we merge it, I'll go back to the London Assembly and see if we can still get a feed for that (it speeds us up and means fewer queries for them).
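To make the inefficiency concrete, the shape of that loop is roughly the following; the storage file and the answer check are stand-ins for whatever the real scraper uses:

```python
# Illustrative only: every run re-fetches every question still waiting for
# an answer, which is the inefficiency a feed of "updated yesterday" would avoid.
import json
from pathlib import Path

STATE_FILE = Path("unanswered.json")  # stand-in for the scraper's real storage

def requery_unanswered(fetch_and_check):
    """fetch_and_check(url) should return True once the page finally has an answer."""
    pending = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else []
    still_pending = [url for url in pending if not fetch_and_check(url)]
    STATE_FILE.write_text(json.dumps(still_pending))
```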
