#145: bug/get_legistar_content_uris - adjusting inital media URL extraction #146
Seems good to me. I think tests are already failing but that is for a different reason. If no additional tests fail I think this is all good.
Hmmmm, it seems like something new has broken with the King County scraper. cc @dphoria @gregoryfoster can you two look at this and let me know whether this looks like a failure due to the change or a failure due to King County changing stuff on their side?
This existing PR might resolve these failures.
The modified code appears to have created an issue with the King County scraper, but I'm unclear on a few things needed to debug further. Mainly, I am unsure which events from June 2021, in what order, are being retrieved for testing. In the second run of

Those values are passed to the default

When I look at the June 2021 calendar for King County, there are multiple events on June 16th, but the expected video in the tests is for the Regional Transit Committee meeting at 3pm - the third meeting that day, but it looks like the first indexed by

We're not making it that far to test it; I'm just wanting to be sure that we're supposed to be looking for that particular meeting so I can look more closely at the page HTML to see why the video anchor tag is not matching. It's also possible we're throwing/returning the

Also suspicious to me: there are three errors in the browser console when navigating to the actual media/video screen for the Regional Transit Committee meeting on the 16th, and manual download of the video doesn't work for me.

If someone can confirm which meeting we're testing against, I'll look into this some more.
Sorry for this delay! I will do my best to get back to you this weekend.
First of all, @gregoryfoster, I know talk is cheap but I want to apologize again for this delay. As I suspected, the MR I referenced (#143) seems to resolve all test failures except the King County case that you mentioned. So, the first step I would suggest is for you to update this PR branch. As for the now-failing King County case, I will comment my thoughts on the code.
-    "a",
-    id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
-    class_="videolink",
+    "a", id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"), onclick=True
Going by the PR description, that the videolink class filtering is too strict, I agree it should be OK to remove it. But adding this new criterion for the onclick attribute is incorrect, because we will return UnrecognizedPatternError when this result is None. That exception is meant to tell us that something has changed on these web pages.
You have the check for this onclick attribute below, where you correctly return ContentNotProvidedError to say video is not available.
i.e. I think we simply need to remove this onclick=True query term. I tested locally by checking out this branch, merging upstream/main, then removing this onclick filtering. The test failed, but only because my connection got throttled and was forcibly reset by peer. 😅
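To make the distinction in the review comment concrete, here is a minimal sketch of the two-stage check, assuming BeautifulSoup (`bs4`) is installed. The two exception classes are local stand-ins that only borrow the names used in this thread, and `extract_video_onclick` and the sample `onclick` value are hypothetical; none of this is the scraper's actual code.

```python
import re

from bs4 import BeautifulSoup


# Local stand-ins; the real project defines its own exception classes.
class UnrecognizedPatternError(Exception):
    """The page no longer looks the way the scraper expects."""


class ContentNotProvidedError(Exception):
    """The page looks as expected, but no video is attached."""


VIDEO_LINK_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")


def extract_video_onclick(html: str) -> str:
    """Hypothetical helper: return the onclick value of the video anchor."""
    soup = BeautifulSoup(html, "html.parser")

    # Stage 1: match on the ID only. A miss means the page layout changed.
    link = soup.find("a", id=VIDEO_LINK_ID)
    if link is None:
        raise UnrecognizedPatternError("no anchor with the expected video ID")

    # Stage 2: the anchor exists; a missing onclick means no video was posted.
    onclick = link.get("onclick")
    if not onclick:
        raise ContentNotProvidedError("video anchor present, but no video")
    return onclick
```

If `onclick=True` were part of the stage 1 query, a page whose anchor simply has no video would fall into the first branch and be reported as a layout change, which is the misclassification the comment above describes.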
Link to Relevant Issue
This pull request resolves #145.
Description of Changes
- videolink class criteria - some media links do not have a videolink class assigned. The highly specific ID should be enough to distinguish potentially relevant links.
- BeautifulSoup find call only identifies the first media link instance - an example provided in the issue shows the first media link is not always associated with a video, resulting in a failure to identify video for the entire event.
- onclick is a distinguishing attribute for links with video content - while checked subsequently to provide a unique error, this code tests for the presence of the onclick attribute to more quickly identify the first valid media link.
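The query behavior described above can be reproduced on a small, made-up page. This is only an illustrative sketch assuming `bs4` is installed; the anchor IDs, `onclick` value, and HTML are invented for the example and are not taken from a real Legistar page.

```python
import re

from bs4 import BeautifulSoup

VIDEO_LINK_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")

# Hypothetical page: two anchors match the ID pattern, only the second has video.
SAMPLE_HTML = """
<a id="ctl00_ContentPlaceHolder1_gridA_hypVideo">Not available</a>
<a id="ctl00_ContentPlaceHolder1_gridB_hypVideo" onclick="window.open('Video.aspx')">Video</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first ID match, which here carries no onclick attribute.
first_match = soup.find("a", id=VIDEO_LINK_ID)

# Adding onclick=True skips anchors that lack the attribute entirely.
first_with_onclick = soup.find("a", id=VIDEO_LINK_ID, onclick=True)

# find_all() returns every ID match, so the caller can inspect each one.
all_matches = soup.find_all("a", id=VIDEO_LINK_ID)

print(first_match.get("onclick"))
print(first_with_onclick["id"])
print(len(all_matches))
```

One design consideration: when `onclick=True` is in the query and `find` returns `None`, the caller can no longer tell "no matching anchor on the page" from "anchors exist but none has video". Iterating over `find_all` results keeps those two cases separate.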