Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generic: find a module for iframe/embeds using VALID_URL, eliminate static/redundant checks #6216

Closed
wants to merge 10 commits into from

Conversation

atomicdryad
Copy link
Contributor

*: _extract_plugin_embeds() checks for all possible extractors once for every iframe or embed in the page. This function will return a playlist even with multiple extractors (it will handle page with 2 vimeo and 1 dailymotion embeds) and runs after pre-existing extractor-specific checks fail, as a fallback and a replacement for:

*: 18 instances of

# Look for foo
mobj = re.search( r'<iframe[^>]+src=(["\'])(?P<url>copy-of-foo._VALID_URL)\1', webpage)
if mobj is not None:
    return self.url_result(mobj.group('url'))

or a variation thereof have been removed, as _extract_plugin_embeds covers each one.

*: All generic tests that passed before pass, 19 tests are now handed via _extract_plugin_embeds, 5 fail but were already failing.

*: Still a WIP: It's verbose regardless of verbose setting, and can probably replace more static checks. Ideally the checks preceeding it should add to a list instead of returning, incase a page has embeds from multiple sources.

@dstftw
Copy link
Collaborator

dstftw commented Jul 13, 2015

As for me, this approach have just introduced unnecessary complexity and ruined the maintainability. And here is an uncomplete list of issues that immediately appear:

  1. This regex is too generic and matches pieces it should not match:
python test/test_download.py TestDownload.test_Generic_44
[generic] episodetwelve: Requesting header
[generic] episodetwelve: Downloading webpage
[generic] episodetwelve: Extracting information
[generic] EMBED [-none-] http://thedailyshow.mtvnimages.com/images/shows/tds/touts/TDS_PodcastWithoutJonStewart_500_ep12.jpg?crop=true
[generic] EMBED [-none-] http://1.images.thedailyshow.com/images/tds/TDS_Placeholder.jpg

These are not embeds therefore should not be even tried.
2. It fails since InfoExtractor.suitable in general is not intended to be used with embedded URLs, i.e. _VALID_URL is not _EMBED_VALID_URL.
3. It doesn't respect the order which may be important.
4. If one will eventually need to customize the embed regex of some extractor - he will end up extracting this embed from this generic mechanism otherwise this universal regex will become just to complex.
5. It fails some tests. I didn't check more, but two of three random test have failed (they pass ok on master):

python test/test_download.py TestDownload.test_Generic_46
python test/test_download.py TestDownload.test_Generic_55

@atomicdryad
Copy link
Contributor Author

1: It's checking <meta content="..." which may have been too greedy but I figured better safe than sorry, as some of the regex's did just that. I'm not sure if all use og:video? At any rate it doesn't match the images as no extractor matches an image in _VALID_URL

2: EMBED_VALID_URL was something I considered for a future update (to take precidence over _VALID_URL), as well as skipping any plugin with an _extract_url.

3: The order of what? It's processing iframes & embeds in the order they appear in the html, which I imagine is important. As is, generic simply gives up after the first static extension scraper finds a match, frequently when a single match is found (see #6054). This is something I'd like to fix, as youtube-dl is incapable of downloading videos because some of the scrapers use re.search vs re.findall, and it's winner takes all for the first extractor match (I had intended for my next pr to correct this).

4: This isn't meant to be a jack of all trades to replace all the extractor scrapers. Originally it was meant as a last attempt for embed src and iframe src, which is are far the most common places video urls lurk. Removing the repeated examples of re.search/findall iframes/embeds was an added bonus in my view, because:

  • It reduces complexity, as far as code size.

  • Easier maintenence; 1 regex to maintain, in the extractor's .py, vs said regex + a static string that must be found amongst a huge list of regexp searches.

  • Better compatibility; Many of the regexp strings broke on single quotes or attributes before src, like mtv's:

    r'<iframe src="

which fails on

iframe class='wat' src=' 

Even without removing the static re calls, as a fallback this will catch what they may miss due to quotes or an extractor that doesn't have lines added to generic

5: Those tests passed due to network failure, I will investigate

@atomicdryad
Copy link
Contributor Author

Test 55 contains a bogus youtube embed. http://www.rtlnieuws.nl/nieuws/buitenland/aanslagen-kopenhagen

This was concealed by the rtlnl check aborting further processing with

if matches:
    return _playlist_from_matches(matches, ie='RtlNl')

Restoring it to fix the test (I imagine the dl should fail unless -i specified)

@jaimeMF
Copy link
Collaborator

jaimeMF commented Jul 14, 2015

I think that we should reduce the code in generic and move the embed extraction to the extractors themselves, by using a staticmethod/classmethod similar to the one used for example for vimeo. With this approach only the extractors that implement it would be tried and the _VALID_URL doesn't need to also match the embed url.

@atomicdryad
Copy link
Contributor Author

I agree that specialized scrapers should be in the extractor's class, hopefully with a common interface - so generic is not filled with foundstuff = FooIE._extract_embed(webpage); if foundstuff return.

It would be good to try _VALID_URL iframes and embeds if no _EMBED_URLs are found because generic is headed to failure at this point, and the embed regexps are not updated. Case in point;

  • http://youtu.be/.... embed is broken; I had a dl failure today, that I corrected by removing youtube from this pr's extractor blacklist.

By far the biggest problem is generic's inability to handle more than 1 video:

The vimeo._extract_vimeo_url function is broken: it uses re.search and causes GenericIE._real_extract to bail on the first match. Many pages have multiple videos. Some have multiple video providers. I attempted to fix vimeo when the re.search was coded into generic with #5971. The errant code was moved to vimeo.py without change in b407e17 while the PR sat there. I submitted a new PR #6054 which has not been commented on in 2 weeks.

Generic is full of scrapers that cause youtube-dl to ignore all videos except the first. I was hoping to correct this. The next step was to change all of the returns to list.extend, and all the findalls to finditers so the position of each match.start() is used to return a sorted playlist. ie

  • a page with 4 videos ( vimeo-youtube-youtube-vimeo ) would not return;
  • first vimeo embed (as is now)
  • 2 vimeo embeds (if vimeo._extract_embed gets fixed)
  • 2 vimeo then 2 youtube embeds (all videos, but with the order scrambled)

...but a playlist following the same order (vimeo-youtube-youtube-vimeo)

This is quite possible using regexpt's .start() functionality, so long as all the actors involved in scraping a generic page use it.

, , but I may have assumed too much? Is youtube-dl intended to only grab 1 video? Or is youtube-dl not a project that encourages outside contributors? ( I ask the latter due to other PRs )

@jaimeMF
Copy link
Collaborator

jaimeMF commented Jul 16, 2015

I don't think maintaining the order should be a high priority, after all it's the generic extractor. I have this opinion especially because there are so many ways to embed a video (iframes, javascript, meta elements, some random HTML node with specific attributes) that IMHO the simplest solution would be to do something like this:

results = []
for ie in self._downloader._ies:
    if hasattr(ie, '_extract_embed'):
        embeds = ie._extract_embed(webpage, url)
        if embeds:
            results.extend(embeds)

By far the biggest problem is generic's inability to handle more than 1 video

I agree, It would work with my solution (and yours).

It would be good to try _VALID_URL iframes and embeds if no _EMBED_URLs are found because generic is headed to failure at this point, and the embed regexps are not updated. Case in point;

I'm with @dstftw, we shouldn't try to build _VALID_URLs that match embed urls because it makes maintaining them harder.

Or is youtube-dl not a project that encourages outside contributors? ( I ask the latter due to other PRs )

We encourage contributors (we have more that 1000 PR and a big fraction has been accepted), its just that mostly due to lack of time it may take a while until we look into PR (at least that's my case). Anyway, we gained commit access after we submitted PR and helped with some issues, so I don't think that discouraging external contributions would be reasonable.

@atomicdryad
Copy link
Contributor Author

I'm with @dstftw, we shouldn't try to build _VALID_URLs that match embed urls because it makes maintaining them harder.

I agree. However - without fallbacks - this introduces new problems.

  • Embed urls will no longer work as arguments; 'youtube-dl http://youtube.com/embed/blargle' will fail unless /embed/ is in _VALID_URL (defeating the purpose of seperating url matching) or both _VALID_URL and _extract_embed()/_EMBED_URL are checked in ydl.extract_info
  • generic will fail when there is an embed _EMBED_URL does not cover, while checking _VALID_URL would have worked. Yes, _EMBED_URL or _extract_embed should have caught it, but as is there are many cases where re.search()es in generic fail. So perhaps it should do a print_warning if it fails to find embeds, but finds embedded ie_suitables, then downloads them?
  • This will require a massive pr, or multiple prs.

...

How about this?

Generic's url list is populated by:

1 ie._extract_embed(webpage, url): - specialized embed extraction functions (brightcove etc)

2 All generic, standard embeds (iframe / embed tags, etc.) with matching ie._EMBED_URL:

3: Actual generic stuff that isn't covered by IE

IF nothing is found, as a last ditch effort before failure, the same list of iframe/embed/etc urls are checked against (all ies)._VALID_URL. If stuff is found, great, the user gets the video - and a warning about WhateverIE._EMBED_URL requiring an update.

My reasoning behind having either _EMBED_URL or _extract_embed(): The majority of embeds are iframes/embeds/etc with urls both in the code, in the wild, and in how browsers handle them. The code for finding iframe and embed urls should be consistent, resilient to syntax differences, and not terminate after 1 match. As is, it is not. Correcting this will mitigate dev and user headaches. Yes, the check for iframes and embeds does not cover all cases, and should not become a mess of provider-specific additions to it's regex - that is what ie._extract_embed is for.

@dirkf
Copy link
Contributor

dirkf commented Aug 3, 2021

Is it correct for GenericIE to return a URL or playlist from just the first match, as now and in the PR, or should it collect all the matching items?

The order of certain tests is mentioned in comments, but for the rest there is some unexplained priority in the ordering.

If the former, existing, behaviour is wanted, should it have an option to do the latter (say, --generic-extractor-all)?

@dirkf dirkf mentioned this pull request Jan 23, 2023
11 tasks
@dirkf dirkf closed this Aug 1, 2023
@dirkf dirkf added the defunct PR source branch is not accessible label Oct 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defunct PR source branch is not accessible
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants