generic: find a module for iframe/embeds using VALID_URL, eliminate static/redundant checks #6216

atomicdryad · 2015-07-13T18:09:34Z

*: _extract_plugin_embeds() checks for all possible extractors once for every iframe or embed in the page. This function will return a playlist even with multiple extractors (it will handle page with 2 vimeo and 1 dailymotion embeds) and runs after pre-existing extractor-specific checks fail, as a fallback and a replacement for:

*: 18 instances of

# Look for foo
mobj = re.search( r'<iframe[^>]+src=(["\'])(?P<url>copy-of-foo._VALID_URL)\1', webpage)
if mobj is not None:
    return self.url_result(mobj.group('url'))

or a variation thereof have been removed, as _extract_plugin_embeds covers each one.

*: All generic tests that passed before pass, 19 tests are now handed via _extract_plugin_embeds, 5 fail but were already failing.

*: Still a WIP: It's verbose regardless of verbose setting, and can probably replace more static checks. Ideally the checks preceeding it should add to a list instead of returning, incase a page has embeds from multiple sources.

…methods fail

…ble fury and consternation.

…ractor._VALID_URL duplication

dstftw · 2015-07-13T20:13:45Z

As for me, this approach have just introduced unnecessary complexity and ruined the maintainability. And here is an uncomplete list of issues that immediately appear:

This regex is too generic and matches pieces it should not match:

python test/test_download.py TestDownload.test_Generic_44
[generic] episodetwelve: Requesting header
[generic] episodetwelve: Downloading webpage
[generic] episodetwelve: Extracting information
[generic] EMBED [-none-] http://thedailyshow.mtvnimages.com/images/shows/tds/touts/TDS_PodcastWithoutJonStewart_500_ep12.jpg?crop=true
[generic] EMBED [-none-] http://1.images.thedailyshow.com/images/tds/TDS_Placeholder.jpg

These are not embeds therefore should not be even tried.
2. It fails since InfoExtractor.suitable in general is not intended to be used with embedded URLs, i.e. _VALID_URL is not _EMBED_VALID_URL.
3. It doesn't respect the order which may be important.
4. If one will eventually need to customize the embed regex of some extractor - he will end up extracting this embed from this generic mechanism otherwise this universal regex will become just to complex.
5. It fails some tests. I didn't check more, but two of three random test have failed (they pass ok on master):

python test/test_download.py TestDownload.test_Generic_46
python test/test_download.py TestDownload.test_Generic_55

atomicdryad · 2015-07-13T21:43:17Z

1: It's checking <meta content="..." which may have been too greedy but I figured better safe than sorry, as some of the regex's did just that. I'm not sure if all use og:video? At any rate it doesn't match the images as no extractor matches an image in _VALID_URL

2: EMBED_VALID_URL was something I considered for a future update (to take precidence over _VALID_URL), as well as skipping any plugin with an _extract_url.

3: The order of what? It's processing iframes & embeds in the order they appear in the html, which I imagine is important. As is, generic simply gives up after the first static extension scraper finds a match, frequently when a single match is found (see #6054). This is something I'd like to fix, as youtube-dl is incapable of downloading videos because some of the scrapers use re.search vs re.findall, and it's winner takes all for the first extractor match (I had intended for my next pr to correct this).

4: This isn't meant to be a jack of all trades to replace all the extractor scrapers. Originally it was meant as a last attempt for embed src and iframe src, which is are far the most common places video urls lurk. Removing the repeated examples of re.search/findall iframes/embeds was an added bonus in my view, because:

It reduces complexity, as far as code size.
Easier maintenence; 1 regex to maintain, in the extractor's .py, vs said regex + a static string that must be found amongst a huge list of regexp searches.
Better compatibility; Many of the regexp strings broke on single quotes or attributes before src, like mtv's:

r'<iframe src="

which fails on

iframe class='wat' src='

Even without removing the static re calls, as a fallback this will catch what they may miss due to quotes or an extractor that doesn't have lines added to generic

5: Those tests passed due to network failure, I will investigate

…URL"s)

atomicdryad · 2015-07-13T22:24:49Z

Test 55 contains a bogus youtube embed. http://www.rtlnieuws.nl/nieuws/buitenland/aanslagen-kopenhagen

This was concealed by the rtlnl check aborting further processing with

if matches:
    return _playlist_from_matches(matches, ie='RtlNl')

Restoring it to fix the test (I imagine the dl should fail unless -i specified)

http://www.rtlnieuws.nl/nieuws/buitenland/aanslagen-kopenhagen test url

jaimeMF · 2015-07-14T11:13:51Z

I think that we should reduce the code in generic and move the embed extraction to the extractors themselves, by using a staticmethod/classmethod similar to the one used for example for vimeo. With this approach only the extractors that implement it would be tried and the _VALID_URL doesn't need to also match the embed url.

atomicdryad · 2015-07-14T12:55:07Z

I agree that specialized scrapers should be in the extractor's class, hopefully with a common interface - so generic is not filled with foundstuff = FooIE._extract_embed(webpage); if foundstuff return.

It would be good to try _VALID_URL iframes and embeds if no _EMBED_URLs are found because generic is headed to failure at this point, and the embed regexps are not updated. Case in point;

http://youtu.be/.... embed is broken; I had a dl failure today, that I corrected by removing youtube from this pr's extractor blacklist.

By far the biggest problem is generic's inability to handle more than 1 video:

The vimeo._extract_vimeo_url function is broken: it uses re.search and causes GenericIE._real_extract to bail on the first match. Many pages have multiple videos. Some have multiple video providers. I attempted to fix vimeo when the re.search was coded into generic with #5971. The errant code was moved to vimeo.py without change in b407e17 while the PR sat there. I submitted a new PR #6054 which has not been commented on in 2 weeks.

Generic is full of scrapers that cause youtube-dl to ignore all videos except the first. I was hoping to correct this. The next step was to change all of the returns to list.extend, and all the findalls to finditers so the position of each match.start() is used to return a sorted playlist. ie

a page with 4 videos ( vimeo-youtube-youtube-vimeo ) would not return;
first vimeo embed (as is now)
2 vimeo embeds (if vimeo._extract_embed gets fixed)
2 vimeo then 2 youtube embeds (all videos, but with the order scrambled)

...but a playlist following the same order (vimeo-youtube-youtube-vimeo)

This is quite possible using regexpt's .start() functionality, so long as all the actors involved in scraping a generic page use it.

, , but I may have assumed too much? Is youtube-dl intended to only grab 1 video? Or is youtube-dl not a project that encourages outside contributors? ( I ask the latter due to other PRs )

jaimeMF · 2015-07-16T20:34:21Z

I don't think maintaining the order should be a high priority, after all it's the generic extractor. I have this opinion especially because there are so many ways to embed a video (iframes, javascript, meta elements, some random HTML node with specific attributes) that IMHO the simplest solution would be to do something like this:

results = []
for ie in self._downloader._ies:
    if hasattr(ie, '_extract_embed'):
        embeds = ie._extract_embed(webpage, url)
        if embeds:
            results.extend(embeds)

By far the biggest problem is generic's inability to handle more than 1 video

I agree, It would work with my solution (and yours).

It would be good to try _VALID_URL iframes and embeds if no _EMBED_URLs are found because generic is headed to failure at this point, and the embed regexps are not updated. Case in point;

I'm with @dstftw, we shouldn't try to build _VALID_URLs that match embed urls because it makes maintaining them harder.

Or is youtube-dl not a project that encourages outside contributors? ( I ask the latter due to other PRs )

We encourage contributors (we have more that 1000 PR and a big fraction has been accepted), its just that mostly due to lack of time it may take a while until we look into PR (at least that's my case). Anyway, we gained commit access after we submitted PR and helped with some issues, so I don't think that discouraging external contributions would be reasonable.

atomicdryad · 2015-07-26T01:50:53Z

I'm with @dstftw, we shouldn't try to build _VALID_URLs that match embed urls because it makes maintaining them harder.

I agree. However - without fallbacks - this introduces new problems.

Embed urls will no longer work as arguments; 'youtube-dl http://youtube.com/embed/blargle' will fail unless /embed/ is in _VALID_URL (defeating the purpose of seperating url matching) or both _VALID_URL and _extract_embed()/_EMBED_URL are checked in ydl.extract_info
generic will fail when there is an embed _EMBED_URL does not cover, while checking _VALID_URL would have worked. Yes, _EMBED_URL or _extract_embed should have caught it, but as is there are many cases where re.search()es in generic fail. So perhaps it should do a print_warning if it fails to find embeds, but finds embedded ie_suitables, then downloads them?
This will require a massive pr, or multiple prs.

...

How about this?

Generic's url list is populated by:

1 ie._extract_embed(webpage, url): - specialized embed extraction functions (brightcove etc)

2 All generic, standard embeds (iframe / embed tags, etc.) with matching ie._EMBED_URL:

3: Actual generic stuff that isn't covered by IE

IF nothing is found, as a last ditch effort before failure, the same list of iframe/embed/etc urls are checked against (all ies)._VALID_URL. If stuff is found, great, the user gets the video - and a warning about WhateverIE._EMBED_URL requiring an update.

My reasoning behind having either _EMBED_URL or _extract_embed(): The majority of embeds are iframes/embeds/etc with urls both in the code, in the wild, and in how browsers handle them. The code for finding iframe and embed urls should be consistent, resilient to syntax differences, and not terminate after 1 match. As is, it is not. Correcting this will mitigate dev and user headaches. Yes, the check for iframes and embeds does not cover all cases, and should not become a mess of provider-specific additions to it's regex - that is what ie._extract_embed is for.

dirkf · 2021-08-03T17:35:00Z

Is it correct for GenericIE to return a URL or playlist from just the first match, as now and in the PR, or should it collect all the matching items?

The order of certain tests is mentioned in comments, but for the rest there is some unexplained priority in the ordering.

If the former, existing, behaviour is wanted, should it have an option to do the latter (say, --generic-extractor-all)?

atomicdryad added 8 commits July 13, 2015 10:47

generic: dynamically find extractor for iframes/embeds/etc if static …

7cdeaca

…methods fail

< AGSPhoenix> AtomicDryad, your use of 'alot' fills me with unreasona…

e882958

…ble fury and consternation.

handle "//toolazy.for/protocol" urls

899420b

Generic: Eliminate redundant iframe/embed src regexp matching and ext…

4555a88

…ractor._VALID_URL duplication

Minor change to extractor-not-found reporting

f556534

remove soundcloud iframe scrape: handled by _extract_plugin_embeds

7cb0fb6

use self.to_screen but only if --verbose

93f380e

return playlist for multiple videos only

4d42a9d

generic: restore SVTIE._extract_url (it matches all href="SVT._VALID_…

6a058af

…URL"s)

Restoring rtl scraper to re-conceal broken youtube embed in

f578c82

http://www.rtlnieuws.nl/nieuws/buitenland/aanslagen-kopenhagen test url

dstftw force-pushed the master branch from c8f1bae to 7f9134f Compare December 31, 2015 20:38

jaimeMF mentioned this pull request Mar 17, 2016

Add experimenta lazy loading of info extractors #8497

Merged

johnhawkinson mentioned this pull request Apr 9, 2017

As a last resort, why not have GenericIE check every iframe? (vanityfair.com) #12692

Open

dstftw force-pushed the master branch from fa77986 to 0c7a631 Compare June 24, 2017 22:03

dstftw force-pushed the master branch from 4991699 to 1141e91 Compare August 5, 2017 00:41

dstftw force-pushed the master branch from 293617b to af0f742 Compare October 11, 2017 16:48

dstftw force-pushed the master branch from 37318e1 to 65220c3 Compare January 27, 2018 22:48

dstftw force-pushed the master branch from 8d14fa1 to 5399ab3 Compare February 4, 2018 00:55

dstftw force-pushed the master branch from c486aa9 to 5ee7ae5 Compare December 9, 2018 15:38

dstftw force-pushed the master branch from 8cd780c to de0359c Compare January 4, 2019 20:44

dstftw force-pushed the master branch from d99bab0 to e118a87 Compare January 23, 2019 18:40

dstftw force-pushed the master branch from 7b956a1 to 5e26784 Compare September 13, 2020 13:52

dirkf mentioned this pull request Jul 28, 2021

[generic] sort the imports of other extractors #28740

Open

11 tasks

pukkandan mentioned this pull request Jul 8, 2022

Generalized framework for webpage-based extraction yt-dlp/yt-dlp#4307

Merged

13 tasks

dirkf force-pushed the master branch from 01bf89e to 4c6fba3 Compare August 26, 2022 07:51

dirkf mentioned this pull request Jan 23, 2023

[VideoCdn] Add new extractor #31481

Open

11 tasks

dirkf closed this Aug 1, 2023

dirkf added the defunct PR source branch is not accessible label Oct 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

generic: find a module for iframe/embeds using VALID_URL, eliminate static/redundant checks #6216

generic: find a module for iframe/embeds using VALID_URL, eliminate static/redundant checks #6216

atomicdryad commented Jul 13, 2015

dstftw commented Jul 13, 2015

atomicdryad commented Jul 13, 2015

atomicdryad commented Jul 13, 2015

jaimeMF commented Jul 14, 2015

atomicdryad commented Jul 14, 2015

jaimeMF commented Jul 16, 2015

atomicdryad commented Jul 26, 2015

dirkf commented Aug 3, 2021

generic: find a module for iframe/embeds using VALID_URL, eliminate static/redundant checks #6216

generic: find a module for iframe/embeds using VALID_URL, eliminate static/redundant checks #6216

Conversation

atomicdryad commented Jul 13, 2015

dstftw commented Jul 13, 2015

atomicdryad commented Jul 13, 2015

atomicdryad commented Jul 13, 2015

jaimeMF commented Jul 14, 2015

atomicdryad commented Jul 14, 2015

jaimeMF commented Jul 16, 2015

atomicdryad commented Jul 26, 2015

dirkf commented Aug 3, 2021