Describe the Bug
In get_legistar_content_uris, the BeautifulSoup code to extract_url from a Legistar Event URL (legistar_ev[LEGISTAR_EV_SITE_URL]) misses available video links in some circumstances.
Expected Behavior
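The City of Olympia has what appears to be a pretty standard Legistar implementation including Granicus-hosted media files. So I was surprised when the stock get_content_uris call didn't result in matches.
Here's an example Olympia Planning Commission event detail screen and the corresponding valid "Media" anchor tag: [example not captured]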
I identified three potential issues which could be addressed while hopefully not impacting existing matches in the wild. Here's the operative CDP code:
extract_url = soup.find(
    "a",
    id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
    class_="videolink",
)
if extract_url is None:
    return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
# the <a> tag will not have this attribute if there is no video
if "onclick" not in extract_url.attrs:
    return (ContentUriScrapeResult.Status.ContentNotProvidedError, None)
videolink class - City of Olympia Media links do not have a videolink class assigned. Is this a requirement to differentiate links on other Legistar instances, or is the highly specific ID enough?
find only identifies the first Media link instance - and in the example provided, the first Media link is not associated with a video, therefore resulting in a failure for the entire event. You could do a find_all and iterate through, but a different approach might be...
onclick is a distinguishing attribute - while checked subsequently to provide a unique error, we could test for the presence of the onclick attribute to more quickly identify a valid Media link.
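The three observations can be reproduced on a small page. This is only a sketch: the markup below is hypothetical (it is not copied from the Olympia site), and the element IDs and onclick value are assumptions chosen to match the ID pattern in the CDP code.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two "Media" anchors match the ID pattern, neither has
# class="videolink", and only the second one carries an onclick handler.
SAMPLE_HTML = """
<table>
  <tr><td><a id="ctl00_ContentPlaceHolder1_hypVideo">Media</a></td></tr>
  <tr><td><a id="ctl00_ContentPlaceHolder1_gridMain_ctl04_hypVideo"
             onclick="window.open('Video.aspx'); return false;"
             href="#">Media</a></td></tr>
</table>
"""

VIDEO_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. Requiring class="videolink" finds nothing on this page.
with_class = soup.find("a", id=VIDEO_ID, class_="videolink")

# 2. Dropping the class filter returns the first anchor, which has no onclick.
first_by_id = soup.find("a", id=VIDEO_ID)

# 3. Filtering on the presence of onclick returns the anchor with a video.
with_onclick = soup.find("a", id=VIDEO_ID, onclick=True)

print(with_class)
print("onclick" in first_by_id.attrs)
print(with_onclick["id"])
```

Passing `onclick=True` asks BeautifulSoup to match any tag that has the attribute, regardless of its value, which is why the third query skips the first anchor.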
Here's how I suggest modifying the code:
extract_url = soup.find(
    "a",
    id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
    onclick=True,
)
if extract_url is None:
    return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
If this change isn't apt to break anything, I'd much rather change things here than have to derive a dedicated scraper class (at least not yet) and then override get_legistar_content_uris in that file. I'm not sure how to get the Python import hierarchy to respect an override otherwise.
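One side effect of filtering on onclick inside find is worth noting: a page whose Media link exists but has no onclick would now report UnrecognizedPatternError, where the current code reports ContentNotProvidedError. If keeping the two errors distinct matters, the find_all approach mentioned above could be sketched roughly as follows. The Status enum and the find_video_link helper are hypothetical stand-ins for illustration, not the actual cdp-scrapers code.

```python
import re
from enum import Enum

from bs4 import BeautifulSoup


class Status(Enum):
    # Hypothetical stand-in for ContentUriScrapeResult.Status.
    Ok = 0
    UnrecognizedPatternError = 1
    ContentNotProvidedError = 2


VIDEO_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")


def find_video_link(html: str):
    """Return (status, tag) for the first Media anchor that has an onclick."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = soup.find_all("a", id=VIDEO_ID)
    if not candidates:
        # No anchor matches the ID pattern at all.
        return (Status.UnrecognizedPatternError, None)
    for link in candidates:
        if "onclick" in link.attrs:
            return (Status.Ok, link)
    # Media anchors exist, but none of them points at a video.
    return (Status.ContentNotProvidedError, None)
```

This iterates over every matching anchor, so a leading Media link without a video no longer fails the whole event, and the two error cases stay separate.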
Environment
OS Version: [e.g. macOS 11.3.1]
cdp-scrapers Version: [e.g. 0.5.0]
Reproduction
You can see where the Event Gather workflow is failing on the cdp-usa-wa-olympia instance here; while not specifically pointing out this issue, this is the next hiccup: https://github.com/CannObserv/cdp-usa-wa-city-olympia/actions/runs/6999433306/job/19038863304