
[pornhub] Workaround scrape detection #5930

Closed

Cupcake-iOS opened this issue Jun 9, 2015 · 9 comments

@Cupcake-iOS

Hello, I'm trying to download a video from pornhub.com, and it gives me the following error message.

➜  youtube-dl --verbose "http://www.pornhub.com/view_video.php?viewkey=1290284933"
[debug] System config: []
[debug] User config: []
[debug] Command-line args: [u'--verbose', u'http://www.pornhub.com/view_video.php?viewkey=1290284933']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2015.06.04.1
[debug] Python version 2.7.6 - Darwin-14.3.0-x86_64-i386-64bit
[debug] exe versions: none
[debug] Proxy map: {}
[PornHub] 1290284933: Downloading webpage
ERROR: Unable to extract title; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 650, in extract_info
    ie_result = ie.extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 273, in extract
    return self._real_extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/pornhub.py", line 55, in _real_extract
    video_title = self._html_search_regex(r'<h1 [^>]+>([^<]+)', webpage, 'title')
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 564, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 555, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
RegexNotFoundError: Unable to extract title; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
@yan12125
Collaborator

yan12125 commented Jun 9, 2015

Works for me. Could you add the --write-pages option:

youtube-dl --verbose --write-pages "http://www.pornhub.com/view_video.php?viewkey=1290284933"

And upload/paste all *.dump files?

@Cupcake-iOS
Author

Still the same. Please check the following dump file content:

<html><head><script type="text/javascript"><!--
function leastFactor(n) {
 if (isNaN(n) || !isFinite(n)) return NaN;
 if (n==0) return 0;
 if (n%1 || n*n<2) return 1;
 if (n%2==0) return 2;
 if (n%3==0) return 3;
 if (n%5==0) return 5;
 var m=Math.sqrt(n);
 for (var i=7;i<=m;i+=30) {
  if (n%i==0)      return i;
  if (n%(i+4)==0)  return i+4;
  if (n%(i+6)==0)  return i+6;
  if (n%(i+10)==0) return i+10;
  if (n%(i+12)==0) return i+12;
  if (n%(i+16)==0) return i+16;
  if (n%(i+22)==0) return i+22;
  if (n%(i+24)==0) return i+24;
 }
 return n;
}
function go() {
 var p=2012283879083; var s=2097722137; var n;
if ((s >> 9) & 1)/*
else p-=
*/p+=/*
p+= */194539741*
10;/*
else p-=
*/else /*
p+= */p-=96891998*  10;/* 120886108*
*/if ((s >> 9) & 1) p+=/*
else p-=
*/60068856*/*
p+= */10;/*
else p-=
*/else 
p-=/* 120886108*
*/125939562*/*
else p-=
*/10;   if ((s >> 14) & 1)  p+=/* 120886108*
*/116707458*/* 120886108*
*/17;/*
*13;
*/else /*
p+= */p-=/*
else p-=
*/31885004*
15;/*
else p-=
*/if ((s >> 3) & 1)
p+=/*
p+= */158004163*/*
p+= */4;/* 120886108*
*/else /*
else p-=
*/p-=/*
else p-=
*/72068438* 4;  if ((s >> 8) & 1)/*
p+= */p+=
143157909*/* 120886108*
*/11;/*
else p-=
*/else /*
else p-=
*/p-=144627391*/* 120886108*
*/9;/* 120886108*
*/ p-=6212272203;
 n=leastFactor(p);
{ document.cookie="RNKEY="+n+"*"+p/n+":"+s+":711807541:1";
  document.location.reload(true); }
}
//--></script></head>
<body onload="go()">
Loading ...
</body>
</html>

@yan12125
Collaborator

yan12125 commented Jun 9, 2015

Is the content from

curl -v "http://www.pornhub.com/view_video.php?viewkey=1290284933"

the same as the dump files?

@Cupcake-iOS
Author

No. Comparing them with a diff tool, the go() function has some differences.

@yan12125
Collaborator

yan12125 commented Jun 9, 2015

It seems that, from your location, pornhub is blocking download tools such as youtube-dl. I'm afraid there's no simple way to bypass it: parsing this complicated page and passing the correct value back to pornhub would be horrible.

@Cupcake-iOS
Author

Got it and thanks a lot!

@dstftw dstftw reopened this Oct 10, 2015
@dstftw dstftw changed the title Unable to extract title [pornhub] Workaround scrape detection Oct 10, 2015
@Hrxn

Hrxn commented Oct 11, 2015

Does a new WAN interface IP address help?

@auggie5

auggie5 commented Jan 21, 2019

This is really a delay mechanism used when an IP address makes too many requests too quickly. What the function is doing is requiring the client to do an expensive calculation before loading the page, to slow it down.
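
For reference, the "expensive calculation" is just trial-division factoring of a roughly 13-digit number. A rough Python translation of the page's leastFactor() (an illustrative sketch, not code from youtube-dl):

import math

def least_factor(n):
    # smallest factor of n by trial division, mirroring the page's leastFactor()
    if n == 0:
        return 0
    if n % 2 == 0:
        return 2
    if n % 3 == 0:
        return 3
    if n % 5 == 0:
        return 5
    i, limit = 7, int(math.sqrt(n))
    while i <= limit:
        # the offsets skip multiples of 2, 3 and 5, exactly like the JS loop
        for off in (0, 4, 6, 10, 12, 16, 22, 24):
            if n % (i + off) == 0:
                return i + off
        i += 30
    return n  # no factor found below sqrt(n), so n is prime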

This is sometimes triggered by downloading a large playlist with a high percentage of private videos using --ignore-errors.

youtube-dl will try to download the page for a private video, fail immediately because it is private, and go on to the next one right away. When the next and the next are also private, it can end up making many requests in rapid succession and trigger this response. Once you start getting it, you keep getting it for some percentage of videos, and that percentage seems to go up the more videos you try (and fail) to request in rapid succession; that is naturally what happens when downloading a playlist once a high percentage of the responses are this challenge page.

A partial mitigation could be to avoid doing that. If it's possible to identify a video as private from the playlist itself without having to try and fail to download it, there wouldn't be so many requests all at once.

It can also happen when resuming a half-downloaded playlist: a page request is made for every video in the first half of the playlist with no delay between them, because the videos themselves have already been downloaded and are skipped.
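
If simply slowing down the burst is enough to stay under the threshold, throttling with the --sleep-interval option may help; note that it sleeps before each actual download, so it is not guaranteed to cover the rapid-fire requests for private or already-downloaded entries. PLAYLIST_URL below is a placeholder:

youtube-dl --ignore-errors --sleep-interval 5 "PLAYLIST_URL"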

Parsing this would be easy with a JavaScript library, if you're willing to take on that much of a dependency. It's more work without one, but still possible.

What changes in each case is the contents of the go() function. Here is a second example to compare with the first above:

<html><head><script type="text/javascript"><!--
function leastFactor(n) {
if (isNaN(n) || !isFinite(n)) return NaN;
if (typeof phantom !== 'undefined') return 'phantom';
if (typeof module !== 'undefined' && module.exports) return 'node';
if (n==0) return 0;
if (n%1 || n*n<2) return 1;
if (n%2==0) return 2;
if (n%3==0) return 3;
if (n%5==0) return 5;
var m=Math.sqrt(n);
for (var i=7;i<=m;i+=30) {
if (n%i==0) return i;
if (n%(i+4)==0) return i+4;
if (n%(i+6)==0) return i+6;
if (n%(i+10)==0) return i+10;
if (n%(i+12)==0) return i+12;
if (n%(i+16)==0) return i+16;
if (n%(i+22)==0) return i+22;
if (n%(i+24)==0) return i+24;
}
return n;
}
function go() {
var p=2124985838984; var s=1542009973; var n;
if ((s >> 14) & 1)/* 120886108*
*/p+=29515459*
17;/* 120886108*
*/else p-=/*
else p-=
*/41180606*/*
else p-=
*/15;/*
else p-=
*/if ((s >> 1) & 1)p+= 476030721*/*
*13;
*/2;/*
else p-=
*/else
p-=/* 120886108*
*/358135816*/* 120886108*
*/2; if ((s >> 5) & 1)/*
else p-=
*/p+= 39310863*
6;/*
*13;
*/else p-=/*
else p-=
*/79174251*6;/*
p+= */if ((s >> 1) & 1) p+=920457937*
2;/*
else p-=
*/else /* 120886108*
*/p-=
528665489* 2;
if ((s >> 7) & 1) p+=
41707797*/*
p+= */8; else p-=/*
p+= */39753674*8;/* 120886108*
*/ p+=1416771189;
n=leastFactor(p);
{ document.cookie="RNKEY="+n+"*"+p/n+":"+s+":2676614135:1";
document.location.reload(true); }
}
//--></script></head>
<body onload="go()">
Loading ...
</body>
</html>

It's calculating some numbers and constructing a cookie from them.
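
One way to do that from Python, sketched under the assumption that the third-party js2py package is an acceptable dependency and that the challenge page always looks like the dumps above (solve_rnkey is a hypothetical helper, not part of youtube-dl):

import re
import js2py  # third-party JavaScript interpreter: pip install js2py

def solve_rnkey(webpage):
    # pull the inline challenge script out of the page
    script = re.search(
        r'<script[^>]*><!--(.*?)//--></script>', webpage, re.DOTALL).group(1)
    ctx = js2py.EvalJs()
    # fake just enough of `document` for go() to run; reload() becomes a no-op
    ctx.execute('var document = {cookie: "", location: {reload: function(x) {}}};')
    ctx.execute(script)
    ctx.execute('go(); var rnkey = document.cookie;')
    return ctx.rnkey  # the cookie string go() would have set, e.g. "RNKEY=..."

The returned string would then have to be sent back as a Cookie header on the retried request; whether the site accepts it from a non-browser client is a separate question.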

@mjolnir870

This will occur even when just downloading a large playlist. Resolving issue #17571 would help avoid tripping high request thresholds when you use an archive file. Right now, if you download a playlist of 100 files with an archive file, it stores identifiers for those 100 files. If the playlist is later updated to 105 files and you download it again, youtube-dl still downloads all 105 pages even though the archive file should allow it to skip 100 of them. You can trip the delay mechanism very rapidly this way, because the 100 pages already recorded in the archive file are downloaded and discarded within a minute or two.
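
For context, the archive workflow described above is roughly the following, with PLAYLIST_URL as a placeholder:

youtube-dl --download-archive archive.txt --ignore-errors "PLAYLIST_URL"

--download-archive records one ID per downloaded video in archive.txt and skips those IDs on later runs; the point above is that skipping an already-archived entry still costs a page request per video.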
