get_legistar_content_uris initial media URL extraction inaccurate for some events #145

gregoryfoster · 2023-11-27T17:34:50Z

Describe the Bug

In get_legistar_content_uris, the BeautifulSoup lookup that populates extract_url from a Legistar event page (legistar_ev[LEGISTAR_EV_SITE_URL]) misses available video links in some circumstances.

Expected Behavior

The City of Olympia has what appears to be a pretty standard Legistar implementation including Granicus-hosted media files. So I was surprised when the stock get_content_uris call didn't result in matches.

Here's an example Olympia Planning Commission event detail screen and the corresponding valid "Media" anchor tag:

<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo" onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536&amp;ID2=120417&amp;G=19510D34-31FB-48B8-9C02-4D026953451C&amp;Mode2=Video','video');return false;" href="#" style="color:Blue;font-family:Tahoma;font-size:10pt;">Media</a>
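For reference, here is a minimal sketch of how the relative Video.aspx URL could be pulled out of an onclick value like the one above, using only the standard library. The helper name video_url_from_onclick and the base URL https://example.legistar.com/ are placeholders for illustration and are not part of cdp-scrapers. Note that BeautifulSoup already decodes &amp; when it parses attributes, so the unescape call only matters when working from raw markup.

```python
import re
from html import unescape
from urllib.parse import parse_qs, urljoin, urlsplit

# Placeholder base URL (assumption); substitute the real Legistar site root.
BASE_URL = "https://example.legistar.com/"

# The onclick value copied from the Olympia "Media" anchor tag.
ONCLICK = (
    "window.open('Video.aspx?Mode=Granicus&amp;ID1=1536&amp;ID2=120417"
    "&amp;G=19510D34-31FB-48B8-9C02-4D026953451C&amp;Mode2=Video','video');"
    "return false;"
)


def video_url_from_onclick(onclick, base_url):
    """Return the absolute URL passed to window.open, or None if absent."""
    # Decode HTML entities, then capture the first quoted window.open argument.
    match = re.search(r"window\.open\('([^']+)'", unescape(onclick))
    if match is None:
        return None
    return urljoin(base_url, match.group(1))


url = video_url_from_onclick(ONCLICK, BASE_URL)
params = parse_qs(urlsplit(url).query)
print(url)
print(params["ID1"], params["ID2"])
```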

I identified three potential issues which could be addressed while hopefully not impacting existing matches in the wild. Here's the operative CDP code:

    extract_url = soup.find(
        "a",
        id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
        class_="videolink",
    )
    if extract_url is None:
        return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
    # the <a> tag will not have this attribute if there is no video
    if "onclick" not in extract_url.attrs:
        return (ContentUriScrapeResult.Status.ContentNotProvidedError, None)

1. videolink class - City of Olympia Media links do not have a videolink class assigned. Is this class required to differentiate links on other Legistar instances, or is the highly specific ID enough?
2. find returns only the first matching Media link - in the example provided, the first Media link is not associated with a video, so the entire event fails. You could use find_all and iterate through the results, but a different approach might be...
3. onclick is a distinguishing attribute - it is currently checked after the lookup to return a distinct error, but we could instead test for the presence of onclick in the lookup itself to identify a valid Media link directly.

Here's how I suggest modifying the code:

    extract_url = soup.find(
        "a",
        id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
        onclick=True,
    )
    if extract_url is None:
        return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
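To illustrate the difference, here is a small self-contained sketch comparing the three lookups against a trimmed-down stand-in for the Olympia markup. The HTML is an assumption made up for the demo (one hypVideo anchor without onclick, followed by one with it), not a copy of the real page; the find calls themselves mirror the snippets above.

```python
import re

from bs4 import BeautifulSoup

# Simplified stand-in markup (assumption): the first hypVideo anchor has no
# video attached, the second one carries the onclick handler.
HTML = """
<table>
  <tr><td><a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl04_hypVideo"
             style="color:Gray;">Not available</a></td></tr>
  <tr><td><a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo"
             onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536','video');return false;"
             href="#">Media</a></td></tr>
</table>
"""

VIDEO_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")
soup = BeautifulSoup(HTML, "html.parser")

# Current lookup: requires class="videolink", which neither anchor has.
current = soup.find("a", id=VIDEO_ID, class_="videolink")

# Dropping only the class filter returns the first anchor, which has no onclick.
first_by_id = soup.find("a", id=VIDEO_ID)

# Proposed lookup: onclick=True matches only anchors that have the attribute.
proposed = soup.find("a", id=VIDEO_ID, onclick=True)

print(current)
print("onclick" in first_by_id.attrs)
print(proposed.get_text())
```

With markup shaped like this, removing the class filter alone would still fail on the first anchor, which is why the onclick=True filter seems like the more robust change.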

Reproduction

You can see where the Event Gather workflow is failing on the cdp-usa-wa-olympia instance here; while not specifically pointing out this issue, this is the next hiccup:
https://github.com/CannObserv/cdp-usa-wa-city-olympia/actions/runs/6999433306/job/19038863304

If this change isn't apt to break anything, I'd much rather change things here than have to derive a dedicated scraper class (at least not yet) and then override get_legistar_content_uris in that file. I'm not sure how to get the Python import hierarchy to respect an override otherwise.
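For context, the fallback I would like to avoid looks roughly like the sketch below. All names here (stock_content_uris, BaseScraper, OlympiaScraper, the fallback_uri key) are stand-ins invented for illustration and are not the actual cdp-scrapers classes or signatures. The point is that a method calling a module-level helper resolves that helper in the module where the method is defined, so redefining the helper in my own file has no effect; the override has to happen on the method in a subclass.

```python
def stock_content_uris(event):
    """Stand-in for a library helper; pretend the stock extraction finds nothing."""
    return None


class BaseScraper:
    """Stand-in for a library scraper class that delegates to the helper."""

    def get_content_uris(self, event):
        return stock_content_uris(event)


class OlympiaScraper(BaseScraper):
    """Hypothetical instance-specific scraper that overrides the method."""

    def get_content_uris(self, event):
        # Try the stock behavior first, then fall back to custom handling.
        found = super().get_content_uris(event)
        if found is not None:
            return found
        return event.get("fallback_uri")


event = {"fallback_uri": "Video.aspx?Mode=Granicus"}
print(BaseScraper().get_content_uris(event))
print(OlympiaScraper().get_content_uris(event))
```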

Environment

OS Version: [e.g. macOS 11.3.1]
cdp-scrapers Version: [e.g. 0.5.0]

evamaxfield · 2023-12-02T20:47:06Z

I believe you have already made a PR for this so this will be closed when that PR is closed! Thanks again

gregoryfoster added the bug Something isn't working label Nov 27, 2023

gregoryfoster added a commit to gregoryfoster/cdp-scrapers that referenced this issue Dec 2, 2023

CouncilDataProject#145: get_legistar_content_uris - adjusting inital …

8c1403d

…media URL extraction

gregoryfoster linked a pull request Dec 2, 2023 that will close this issue

#145: bug/get_legistar_content_uris - adjusting inital media URL extraction #146

Open
