Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#145: bug/get_legistar_content_uris - adjusting inital media URL extraction #146

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gregoryfoster
Copy link

Link to Relevant Issue

This pull request resolves #145.

Description of Changes

  1. Removes videolink class criteria - some media links do not have a videolink class assigned. The highly specific ID should be enough to distinguish potentially relevant links.
  2. The BeautifulSoup find call only identifies the first media link instance - an example provided in the issue shows the first media link is not always associated with a video, resulting in a failure to identify video for the entire event.
  3. onclick is a distinguishing attribute for links with video content - while checked subsequently to provide a unique error, this code tests for the presence of the onclick attribute to more quickly identify the first valid media link.

Copy link
Member

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me. I think tests are already failing but that is for a different reason. If no additional tests fail I think this is all good.

@evamaxfield
Copy link
Member

Hmmmm it seems like something new has broken with the king county scraper. cc @dphoria @gregoryfoster can you two look at this and let me know if this looks like a failure due to the change or a failure due to king county changing stuff on their side.

@dphoria
Copy link
Collaborator

dphoria commented Dec 2, 2023

This existing PR might resolve these failures.

@gregoryfoster
Copy link
Author

The modified code appears to have created an issue with the King County scraper, but I'm unclear on a few things to debug more - mainly I am unsure which events from June 2021, in what order, are being retrieved for testing.

In the second run of test_king_county_scrapers() in test_scrapers.py, we're making it as far as the king_county.get_events(start_date_time, end_date_time) line. The start_date_time is parametrized as 06.16.2021 with no time or timezone, so that's midnight - but not sure the TZ (local to test machine? UTC?).

Those values are passed to the default get_events in legistar_utils.py which is then shipped to get_legistar_events_for_timespan(). That's where the dates are formatted into an API request where begin (the start_date_time) is specified with an operand of ge for >=.

When I look at the June 2021 calendar for King County, there are multiple events on June 16th, but the expected video in the tests is for the Regional Transit Committee meeting at 3pm - the third meeting that day, but it looks like the first indexed by get_events. Could be that the unspecified timezone results in that being the first event indexed?

We're not making it that far to test it, I'm just wanting to be sure that we're supposed to be looking for that particular meeting so I can look more closely at the page HTML to see why the video anchor tag is not matching.

It's also possible we're throwing/returning the ContentUriScrapeResult.Status.UnrecognizedPatternError earlier if the API response doesn't include the LEGISTAR_EV_SITE_URL.

Also suspicious to me: there are three errors in the brower console when navigating to the actual media/video screen for the Regional Transit Committee meeting on the 16th, and manual download of the video doesn't work for me:
https://king.granicus.com/player/clip/8820?view_id=4&redirect=true&h=24ecedd7d1004e4ce1e224bb77f01041

If someone can confirm which meeting we're testing against, I'll look into this some more.

@dphoria dphoria self-requested a review December 7, 2023 21:08
@dphoria
Copy link
Collaborator

dphoria commented Dec 7, 2023

Sorry for this delay! I will do my best to get back to you this weekend.

@dphoria
Copy link
Collaborator

dphoria commented Dec 18, 2023

First of all, @gregoryfoster , I know talk is cheap but I want to apologize again for this delay.

As I suspected, the MR I referenced (#143) seems to resolve all test failures except the King County case that you mentioned. So, I think first step that I would suggest is for you to update this PR branch.

As for the now-failing King County case, I will comment my thoughts on the code.

"a",
id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
class_="videolink",
"a", id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"), onclick=True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going by the PR description, that the videolink class filtering is too strict, I agree it should be OK to remove it. But adding this new criteria for the onclick attribute is incorrect, because we will return UnrecognizedPatternError when this result is None. That exception is meant to tell us that something has changed on these web pages.

You have the check for this onclick attribute below, where you correctly return ContentNotProvidedError to say video is not available.

i.e. I think we simply need to remove this onclick=True query term. I tested locally by checking out this branch, merging upstream/main then removing this onclick filtering. The test failed but because my connection got throttled and was forcibly reset by peer. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get_legistar_content_uris initial media URL extraction inaccurate for some events
3 participants