get_legistar_content_uris initial media URL extraction inaccurate for some events #145

Open
gregoryfoster opened this issue Nov 27, 2023 · 1 comment · May be fixed by #146
Labels
bug Something isn't working

Comments

@gregoryfoster

Describe the Bug

In get_legistar_content_uris, the BeautifulSoup lookup that populates extract_url from a Legistar event page (legistar_ev[LEGISTAR_EV_SITE_URL]) misses available video links in some circumstances.

Expected Behavior

The City of Olympia has what appears to be a pretty standard Legistar implementation, including Granicus-hosted media files, so I was surprised when the stock get_content_uris call returned no matches.

Here's an example Olympia Planning Commission event detail screen and the corresponding valid "Media" anchor tag:

<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo" onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536&amp;ID2=120417&amp;G=19510D34-31FB-48B8-9C02-4D026953451C&amp;Mode2=Video','video');return false;" href="#" style="color:Blue;font-family:Tahoma;font-size:10pt;">Media</a>
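For illustration, the video target is embedded in the onclick handler as the first argument to window.open. The sketch below pulls it out of the attribute text using only the standard library. It is not taken from cdp-scrapers; the helper name video_target is hypothetical, and how the library actually processes onclick is not shown in this issue.

```python
import html
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# onclick value copied from the Olympia "Media" anchor above (raw HTML, so
# ampersands are still written as &amp;).
RAW_ONCLICK = (
    "window.open('Video.aspx?Mode=Granicus&amp;ID1=1536&amp;ID2=120417"
    "&amp;G=19510D34-31FB-48B8-9C02-4D026953451C&amp;Mode2=Video','video');"
    "return false;"
)


def video_target(onclick: str) -> Optional[str]:
    """Hypothetical helper: return the first window.open() argument, if any."""
    match = re.search(r"window\.open\('([^']+)'", html.unescape(onclick))
    return match.group(1) if match else None


target = video_target(RAW_ONCLICK)
# The target is a relative URL; split off and decode its query string.
params = parse_qs(urlsplit(target).query)
print(target)
print(params["ID1"], params["ID2"])
```

An anchor whose onclick does not call window.open (or has no onclick at all) yields None here, which is the case the rest of this issue is concerned with.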

I identified three potential issues which could be addressed while hopefully not impacting existing matches in the wild. Here's the operative CDP code:

    extract_url = soup.find(
        "a",
        id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
        class_="videolink",
    )
    if extract_url is None:
        return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
    # the <a> tag will not have this attribute if there is no video
    if "onclick" not in extract_url.attrs:
        return (ContentUriScrapeResult.Status.ContentNotProvidedError, None)

  1. videolink class - City of Olympia Media links do not have a videolink class assigned. Is this class required to differentiate links on other Legistar instances, or is the highly specific ID enough?
  2. find only returns the first matching Media link - in the example provided, the first Media link is not associated with a video, so the entire event fails. You could do a find_all and iterate through the results, but a different approach might be...
  3. onclick is a distinguishing attribute - it is currently checked afterwards to return a distinct error, but we could instead test for the presence of onclick in the search itself to identify a valid Media link directly.
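
The first two observations can be reproduced in isolation. This is a minimal sketch, assuming bs4 is installed. The markup is a hypothetical, trimmed-down stand-in for the Olympia page: the ctl04 anchor is invented here to play the role of a Media link with no video, and the onclick value is shortened.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two anchors match the ID pattern, neither has the
# "videolink" class, and only the second carries an onclick handler.
HTML = """
<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl04_hypVideo" href="#">Media</a>
<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo"
   onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536','video');return false;"
   href="#">Media</a>
"""

VIDEO_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")
soup = BeautifulSoup(HTML, "html.parser")

# Issue 1: requiring class_="videolink" finds nothing on this markup.
with_class = soup.find("a", id=VIDEO_ID, class_="videolink")

# Issue 2: without the class filter, find() stops at the first anchor,
# which has no onclick and so would be reported as "no video".
first_match = soup.find("a", id=VIDEO_ID)

# The find_all alternative: scan every candidate and keep those with onclick.
with_video = [a for a in soup.find_all("a", id=VIDEO_ID) if "onclick" in a.attrs]

print(with_class)
print("onclick" in first_match.attrs)
print([a["id"] for a in with_video])
```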

Here's how I suggest modifying the code:

    extract_url = soup.find(
        "a",
        id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
        onclick=True,
    )
    if extract_url is None:
        return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
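
One possible side effect of filtering on onclick in the search itself: a page whose Media links all lack onclick would now be reported as UnrecognizedPatternError rather than ContentNotProvidedError. If that distinction matters, a fallback lookup could preserve it. The sketch below is one way to do that, assuming bs4 is installed. It is not the cdp-scrapers implementation: find_media_link is a hypothetical helper, the markup is invented, and the status strings are stand-ins for the ContentUriScrapeResult.Status members.

```python
import re

from bs4 import BeautifulSoup

VIDEO_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")

# Hypothetical pages: one Media link without video, and one page with both
# a link without video and a link with an onclick handler.
NO_VIDEO = '<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl04_hypVideo" href="#">Media</a>'
WITH_VIDEO = NO_VIDEO + (
    '<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo" '
    "onclick=\"window.open('Video.aspx?Mode=Granicus','video');return false;\" "
    'href="#">Media</a>'
)


def find_media_link(page_html):
    """Return (status, tag); status strings are illustrative stand-ins."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Proposed lookup: first anchor matching the ID that also has onclick.
    link = soup.find("a", id=VIDEO_ID, onclick=True)
    if link is not None:
        return ("Ok", link)
    # Fallback: the pattern is present, but no link carries a video.
    if soup.find("a", id=VIDEO_ID) is not None:
        return ("ContentNotProvidedError", None)
    return ("UnrecognizedPatternError", None)


for page in (WITH_VIDEO, NO_VIDEO, "<p>no media links</p>"):
    print(find_media_link(page)[0])
```

The extra find call only runs when no video link exists, so the common path is still a single search.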

Reproduction

The Event Gather workflow run linked below shows where the cdp-usa-wa-olympia instance is failing. The log does not point at this issue specifically, but this is the next hiccup:
https://github.com/CannObserv/cdp-usa-wa-city-olympia/actions/runs/6999433306/job/19038863304

If this change isn't apt to break anything, I'd much rather change things here than have to derive a dedicated scraper class (at least not yet) and then override get_legistar_content_uris in that file. I'm not sure how to get the Python import hierarchy to respect an override otherwise.

Environment

  • OS Version: [e.g. macOS 11.3.1]
  • cdp-scrapers Version: [e.g. 0.5.0]
@gregoryfoster gregoryfoster added the bug Something isn't working label Nov 27, 2023
gregoryfoster added a commit to gregoryfoster/cdp-scrapers that referenced this issue Dec 2, 2023
@evamaxfield
Member

I believe you have already made a PR for this so this will be closed when that PR is closed! Thanks again
