Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#145: bug/get_legistar_content_uris - adjusting inital media URL extraction #146

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 1 addition & 6 deletions cdp_scrapers/legistar_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -515,15 +515,10 @@ def get_legistar_content_uris(client: str, legistar_ev: dict) -> ContentUriScrap
# return false;"
# href="#" style="color:Blue;font-family:Tahoma;font-size:10pt;">Video</a>
extract_url = soup.find(
"a",
id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
class_="videolink",
"a", id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"), onclick=True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going by the PR description, that the videolink class filtering is too strict, I agree it should be OK to remove it. But adding this new criteria for the onclick attribute is incorrect, because we will return UnrecognizedPatternError when this result is None. That exception is meant to tell us that something has changed on these web pages.

You have the check for this onclick attribute below, where you correctly return ContentNotProvidedError to say video is not available.

i.e. I think we simply need to remove this onclick=True query term. I tested locally by checking out this branch, merging upstream/main then removing this onclick filtering. The test failed but because my connection got throttled and was forcibly reset by peer. 😅

)
if extract_url is None:
return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
# the <a> tag will not have this attribute if there is no video
if "onclick" not in extract_url.attrs:
return (ContentUriScrapeResult.Status.ContentNotProvidedError, None)

# NOTE: after this point, failing to scrape video url should raise an exception.
# we need to be alerted that we probabaly have a new web page structure.
Expand Down
Loading