-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generic: find a module for iframe/embeds using VALID_URL, eliminate static/redundant checks #6216
Conversation
…ble fury and consternation.
…ractor._VALID_URL duplication
As for me, this approach have just introduced unnecessary complexity and ruined the maintainability. And here is an uncomplete list of issues that immediately appear:
These are not embeds therefore should not be even tried.
|
1: It's checking <meta content="..." which may have been too greedy but I figured better safe than sorry, as some of the regex's did just that. I'm not sure if all use og:video? At any rate it doesn't match the images as no extractor matches an image in _VALID_URL 2: EMBED_VALID_URL was something I considered for a future update (to take precidence over _VALID_URL), as well as skipping any plugin with an _extract_url. 3: The order of what? It's processing iframes & embeds in the order they appear in the html, which I imagine is important. As is, generic simply gives up after the first static extension scraper finds a match, frequently when a single match is found (see #6054). This is something I'd like to fix, as youtube-dl is incapable of downloading videos because some of the scrapers use re.search vs re.findall, and it's winner takes all for the first extractor match (I had intended for my next pr to correct this). 4: This isn't meant to be a jack of all trades to replace all the extractor scrapers. Originally it was meant as a last attempt for embed src and iframe src, which is are far the most common places video urls lurk. Removing the repeated examples of re.search/findall iframes/embeds was an added bonus in my view, because:
which fails on
Even without removing the static re calls, as a fallback this will catch what they may miss due to quotes or an extractor that doesn't have lines added to generic 5: Those tests passed due to network failure, I will investigate |
Test 55 contains a bogus youtube embed. http://www.rtlnieuws.nl/nieuws/buitenland/aanslagen-kopenhagen This was concealed by the rtlnl check aborting further processing with
Restoring it to fix the test (I imagine the dl should fail unless -i specified) |
I think that we should reduce the code in generic and move the embed extraction to the extractors themselves, by using a staticmethod/classmethod similar to the one used for example for vimeo. With this approach only the extractors that implement it would be tried and the |
I agree that specialized scrapers should be in the extractor's class, hopefully with a common interface - so generic is not filled with foundstuff = FooIE._extract_embed(webpage); if foundstuff return. It would be good to try _VALID_URL iframes and embeds if no _EMBED_URLs are found because generic is headed to failure at this point, and the embed regexps are not updated. Case in point;
By far the biggest problem is generic's inability to handle more than 1 video: The vimeo._extract_vimeo_url function is broken: it uses re.search and causes GenericIE._real_extract to bail on the first match. Many pages have multiple videos. Some have multiple video providers. I attempted to fix vimeo when the re.search was coded into generic with #5971. The errant code was moved to vimeo.py without change in b407e17 while the PR sat there. I submitted a new PR #6054 which has not been commented on in 2 weeks. Generic is full of scrapers that cause youtube-dl to ignore all videos except the first. I was hoping to correct this. The next step was to change all of the returns to list.extend, and all the findalls to finditers so the position of each match.start() is used to return a sorted playlist. ie
...but a playlist following the same order (vimeo-youtube-youtube-vimeo) This is quite possible using regexpt's .start() functionality, so long as all the actors involved in scraping a generic page use it. , , but I may have assumed too much? Is youtube-dl intended to only grab 1 video? Or is youtube-dl not a project that encourages outside contributors? ( I ask the latter due to other PRs ) |
I don't think maintaining the order should be a high priority, after all it's the generic extractor. I have this opinion especially because there are so many ways to embed a video (iframes, javascript, meta elements, some random HTML node with specific attributes) that IMHO the simplest solution would be to do something like this: results = []
for ie in self._downloader._ies:
if hasattr(ie, '_extract_embed'):
embeds = ie._extract_embed(webpage, url)
if embeds:
results.extend(embeds)
I agree, It would work with my solution (and yours).
I'm with @dstftw, we shouldn't try to build _VALID_URLs that match embed urls because it makes maintaining them harder.
We encourage contributors (we have more that 1000 PR and a big fraction has been accepted), its just that mostly due to lack of time it may take a while until we look into PR (at least that's my case). Anyway, we gained commit access after we submitted PR and helped with some issues, so I don't think that discouraging external contributions would be reasonable. |
I agree. However - without fallbacks - this introduces new problems.
... How about this? Generic's url list is populated by: 1 ie._extract_embed(webpage, url): - specialized embed extraction functions (brightcove etc) 2 All generic, standard embeds (iframe / embed tags, etc.) with matching ie._EMBED_URL: 3: Actual generic stuff that isn't covered by IE IF nothing is found, as a last ditch effort before failure, the same list of iframe/embed/etc urls are checked against (all ies)._VALID_URL. If stuff is found, great, the user gets the video - and a warning about WhateverIE._EMBED_URL requiring an update. My reasoning behind having either _EMBED_URL or _extract_embed(): The majority of embeds are iframes/embeds/etc with urls both in the code, in the wild, and in how browsers handle them. The code for finding iframe and embed urls should be consistent, resilient to syntax differences, and not terminate after 1 match. As is, it is not. Correcting this will mitigate dev and user headaches. Yes, the check for iframes and embeds does not cover all cases, and should not become a mess of provider-specific additions to it's regex - that is what ie._extract_embed is for. |
Is it correct for GenericIE to return a URL or playlist from just the first match, as now and in the PR, or should it collect all the matching items? The order of certain tests is mentioned in comments, but for the rest there is some unexplained priority in the ordering. If the former, existing, behaviour is wanted, should it have an option to do the latter (say, |
*: _extract_plugin_embeds() checks for all possible extractors once for every iframe or embed in the page. This function will return a playlist even with multiple extractors (it will handle page with 2 vimeo and 1 dailymotion embeds) and runs after pre-existing extractor-specific checks fail, as a fallback and a replacement for:
*: 18 instances of
or a variation thereof have been removed, as _extract_plugin_embeds covers each one.
*: All generic tests that passed before pass, 19 tests are now handed via _extract_plugin_embeds, 5 fail but were already failing.
*: Still a WIP: It's verbose regardless of verbose setting, and can probably replace more static checks. Ideally the checks preceeding it should add to a list instead of returning, incase a page has embeds from multiple sources.