[TikTok] Support Sigi-type pages, etc #30479

Closed
wants to merge 5 commits into from

Conversation


@dirkf dirkf commented Jan 7, 2022

Please follow the guide below


Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
    Except: this PR subsumes PR fix tiktok when logged in #30224 whose author also affirmed this.
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

TT switched (possibly partially) its framework from NextJS to Sigi, and the persisted state JSON sent in the page changed as a result. Instead of a <script> element with id __NEXT_DATA__, we get one with id sigi_persisted_state and JSON with a slightly different structure.

This PR deals with both types of page format, based on PR #30224 and this patch which gets more metadata.
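A minimal sketch of handling both page formats, not the extractor's actual code. The helper name `extract_state` is hypothetical, the element ids are the ones named in the description above, and the assumption that the Sigi-style script may wrap its JSON in an assignment is taken from the shape of the regex in this PR.

```python
import json
import re

# Matches <script ... id="ID" ...>BODY</script, with either quote style.
_SCRIPT_RE = r'(?s)<script\b[^>]*\bid\s*=\s*(["\'])SCRIPT_ID\1[^>]*>(?P<body>.*?)</script'


def extract_state(html, ids=('__NEXT_DATA__', 'sigi_persisted_state')):
    """Return (script_id, parsed_json) for the first known state element found."""
    for script_id in ids:
        pattern = _SCRIPT_RE.replace('SCRIPT_ID', re.escape(script_id))
        m = re.search(pattern, html)
        if not m:
            continue
        body = m.group('body')
        # Skip any leading assignment such as "window[...] =" (an assumption).
        start = body.find('{')
        if start == -1:
            continue
        # raw_decode parses one JSON value and ignores whatever follows it.
        state, _end = json.JSONDecoder().raw_decode(body, start)
        return script_id, state
    raise ValueError('no known state element found')
```

The caller can branch on the returned id to pick the right path through each structure.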

Also, extraction could fail with a timeout (Error 60 in Windows, SSLError('The read operation timed out',) in Linux) or a connection reset (Error 54 in Windows) due to some weird blocking by whatever fronts TikTok's pages (Akamai, apparently). To download the page for parsing, some cookie has to be sent, and one way to get it is to make a previous request to the site. The extractor fetched https://www.tiktok.com/ before doing anything else. In yt-dlp, the code fetches the webpage itself twice, commenting that you get 403 otherwise. This PR copies that tactic, but instead of fetching the whole page (GET request) it just sends a HEAD request; that way, if a page is actually returned, rather than an error with a Set-Cookie header, its body doesn't have to be downloaded.
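The HEAD-then-GET tactic can be sketched with the standard library alone. This is an illustration under stated assumptions, not the extractor's code: `make_opener`, `prime_cookies` and `fetch_page` are hypothetical names, and the real extractor goes through youtube-dl's own request helpers.

```python
import http.cookiejar
import urllib.error
import urllib.request


def make_opener():
    """Build an opener that stores and resends cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def prime_cookies(opener, url, timeout=10):
    """Send a HEAD request so that any Set-Cookie header lands in the jar."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        opener.open(request, timeout=timeout).close()
    except urllib.error.HTTPError:
        # An error response can still carry Set-Cookie; the cookie processor
        # runs before the error is raised, so the jar is already updated.
        pass


def fetch_page(opener, url, timeout=10):
    """Prime the cookie jar with HEAD, then GET the page with those cookies."""
    prime_cookies(opener, url, timeout)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Only the second request transfers a body, which is the saving over fetching the page twice.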

Probably resolves #28741
Resolves #30251
Resolves #30432
Resolves #30439
Resolves #30445
Resolves #30454
Resolves #30470.

Finally the non-working TikTokUserIE has been resurrected for accessing all the videos of a specific user.

Resolves #30174.

@dirkf dirkf mentioned this pull request Jan 7, 2022
5 tasks
@dirkf dirkf mentioned this pull request Jan 7, 2022
@hessijames79

Hi!
After your patch has worked for several days, I am now encountering new problems (with the "vanilla" youtube-dl as well): #30538

Patrick

Add TikTokVM
Partial fix for TikTokUser
@dirkf dirkf force-pushed the df-wranai-tiktok-patch branch from 99a2b7c to 2f65e20 Compare April 25, 2022 14:25
@afterdelight

when this merge?

@sulyi sulyi mentioned this pull request May 3, 2022
14 tasks
        state = self._parse_json(
            get_element_by_id('SIGI_STATE', html)
            or self._search_regex(
                r'''(?s)<script\s[^>]*?\bid\s*=\s*(?P<q>"|'|\b)sigi-persisted-data(?P=q)[^>]*>[^=]*=\s*(?P<json>{.+?})\s*(?:;[^<]+)?</script''',

can @dirkf review this?


page_props = self._get_SIGI_STATE(user_id, webpage)
user_data = try_get(page_props, lambda x: x['UserModule']['users'], dict)
if user_data:

It should be

        if not user_data:
            raise ExtractorError(...)
        ...

If the extractor returns None, youtube-dl will just silently exit. See yt-dlp/yt-dlp#3776 (comment)
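A runnable sketch of the suggested pattern. `ExtractorError` and `try_get` here are simplified stand-ins for the youtube-dl helpers of the same names so that the example is self-contained; `extract_user`, the error message and the returned dict are placeholders, not the PR's code.

```python
class ExtractorError(Exception):
    """Stand-in for youtube-dl's ExtractorError."""


def try_get(src, getter, expected_type=None):
    """Simplified stand-in for youtube-dl's try_get helper."""
    try:
        value = getter(src)
    except (AttributeError, KeyError, TypeError, IndexError):
        return None
    if expected_type is None or isinstance(value, expected_type):
        return value
    return None


def extract_user(page_props, user_id):
    user_data = try_get(page_props, lambda x: x['UserModule']['users'], dict)
    if not user_data:
        # Fail loudly instead of falling through and returning None.
        raise ExtractorError('Unable to find user data for %s' % user_id)
    # Placeholder result; a real extractor would build a full info dict.
    return {'display_id': user_id, 'users': user_data}
```

With the guard first, every path through the function either returns a dict or raises.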

@dirkf dirkf Jun 14, 2022

Originally there was some fallback code that would run if not user_data. Don't we get an ExtractorError anyway if an IE returns a None info_dict? (No, apparently not!)

Comment on lines +214 to +216
if result:
result['display_id'] = user_id
return result

same here

@dirkf
dirkf commented Jun 14, 2022

As observed in yt-dlp/yt-dlp#3776 (comment) the user pages are currently redirecting to a captcha more or less whatever we do wrt cookies and UAs.

In a browser with JS disabled and UA set to Mozilla/5.0 after clearing cookies for TT, a request to a user page gets the captcha page, and then reloading with the provided cookies opens the desired page. This doesn't happen with the extractor even with a delay between the two fetches.

pukkandan added a commit to yt-dlp/yt-dlp that referenced this pull request Jun 17, 2022
Based on #3624, ytdl-org/youtube-dl#30479

Closes #3551

Authored by dirkf, sulyi, pukkandan
@dirkf dirkf mentioned this pull request Aug 9, 2022
5 tasks
@dirkf dirkf linked an issue Aug 9, 2022 that may be closed by this pull request
5 tasks
@bvoq
bvoq commented Dec 26, 2022

Looks like every issue is about this, when will this get merged?

@dirkf dirkf mentioned this pull request Apr 9, 2023
@dirkf dirkf mentioned this pull request Jul 24, 2023
3 tasks
@OwenMelbz

Do we think this will see the light of day? :D Was hoping to be able to use it for a little fun project!

Thanks

@kashif-umair

I think this is also outdated now. There is no sigi_persisted_state in the returned HTML.

@dirkf dirkf closed this Aug 5, 2023
@dirkf dirkf added the defunct PR source branch is not accessible label Oct 2, 2023