
Add API versioning #162

Open
Benjamin-Loison opened this issue May 4, 2023 · 5 comments
Labels
enhancement (New feature or request), epic (A task that is going to take more than a day to complete), high priority (Issue preventing the user from using the main features correctly)

Comments

@Benjamin-Loison
Owner

Benjamin-Loison commented May 4, 2023

Cf. https://docs.github.com/en/rest/overview/api-versions

Related to CITATION.cff.

Benjamin-Loison added the enhancement, epic and low priority labels on May 4, 2023
@Benjamin-Loison
Owner Author

Benjamin-Loison commented Aug 9, 2023

Could make a new version whose parts are the webpages instead of their features, or could implement #28. Could also consider implementing fields (I may have mentioned this in another issue).

@Benjamin-Loison
Owner Author

Benjamin-Loison commented Sep 14, 2023

Here is an example of a YouTube data structure change providing more data by going from an element to an array; how should we deal with that?
Should we:

  • break the current data structure: the downside is that everyone relying on the data structure will have to rework it
  • keep the current data structure but propose a v2 endpoint (or one versioned by commit id, date, or whatever) with the new data structure; the question is then whether to return the first element of the array or the most important one, where important can be understood as highest replay peak or integrated replay area; the highest replay peak would make more sense than the latter, and also seems better than the first, potentially less interesting, element of the array

Any hybrid, like keeping the current data structure and adding a field with the (remaining) elements of the array, seems unwise.
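
If we go the v2 route while keeping the old structure for v1, a minimal sketch of the down-conversion keeping the highest replay peak could look as follows (the marker shape and the intensityScoreNormalized field name are assumptions about the heatmap structure):

def arrayToLegacyElement(mostReplayedMarkers):
    # Keep the single marker with the highest replay peak, as discussed above;
    # `intensityScoreNormalized` is an assumed field name for the peak height.
    return max(mostReplayedMarkers, key=lambda marker: marker['intensityScoreNormalized'])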

Note that we don't have any control over the YouTube UI data returned, which means any data can be removed at any point in time, so the stability provided by versioning isn't as guaranteed as usual.
The issue with versioning this way is also that, to keep supporting it, we have to convert the new YouTube UI data structure to all still-supported formats, which may be massive work.

In fact, going from an element to an array was somewhat predictable, even if I don't much agree with keeping the Most replayed wording, as it was previously unclear in terms of plurality.

TODO: don't forget to update my Stack Overflow answer accordingly once the necessary changes are done.

Note that I'm in favor of per-endpoint versioning, as we would like to support YouTube UI changes in real time. So how would this look while still making sense in the URL? vVERSION/endpoint, as usually done, doesn't seem to make sense, as the version is specific to a given endpoint, not to all of them, so endpoint?version=VERSION seems to make more sense.
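
A minimal sketch of what such endpoint?version=VERSION dispatch could look like (the converter registry and the converter functions are hypothetical names, not the actual API code):

def convertVideosToV1(latestData):
    # v1 kept a single most-replayed element (the first one here; it could be
    # the highest replay peak, as sketched above).
    markers = latestData['mostReplayed']
    return {'mostReplayed': markers[0] if markers else None}

def convertVideosToV2(latestData):
    # v2 exposes the whole array as-is.
    return latestData

# Versions are per endpoint, hence `endpoint?version=VERSION` rather than a
# global `vVERSION/endpoint` prefix.
converters = {
    ('videos', 1): convertVideosToV1,
    ('videos', 2): convertVideosToV2,
}

def handleRequest(endpoint, parameters, latestData):
    version = int(parameters.get('version', 1))
    converter = converters.get((endpoint, version))
    if converter is None:
        raise ValueError(f'Unsupported version {version} for endpoint {endpoint}')
    return converter(latestData)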

TODO: should verify that yt-dlp, which has this feature, is still working fine.

Note that we could return an array when we are aware of an array and otherwise just the element, but then it's up to the API user to manage both cases, which isn't wanted. Even so, not doing this means the user can't tell, when the array contains a single element, whether it's because we are sure there is no other one (can we even be sure about that?). If we were able to know which YouTube UI we are facing, then we could denote it in our response. Forcing given YouTube UI nodes to have specific YouTube versions could help, but this is an unwanted solution, as it may evolve more quickly than the YouTube UI itself.
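
As a purely hypothetical illustration of denoting the detected UI in our response (all field names here are invented):

def makeResponse(items, detectedUIVariant):
    return {
        # Which YouTube UI variant we believe we parsed, None if unknown.
        'uiVariant': detectedUIVariant,
        # Whether the caller may assume `items` is complete: with an unknown
        # UI variant we cannot promise the array has no further elements.
        'isExhaustive': detectedUIVariant is not None,
        'items': items,
    }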

Should treat this specific issue with versioning, as it's a kind of textbook case, being about the most appreciated feature.

The messages preceding this one show that it's not just about multiple most-replayed segments: YouTube is rolling out a new update on its servers, so this issue is now very important. Assuming YouTube has a linear update process on its servers, having a field to figure out the remote server version would be nice, instead of adapting our parsing on a per-feature basis.

curl -s 'https://www.youtube.com/watch?v=o8NPllzkFhE' > video.json

then open video.json in Firefox. Maybe responseContext/serviceTrackingParams/0/params/1 is what we are looking for, but verifying this would need two machines communicating with different YouTube servers behaving differently, and checking each such JSON value.
Could also just diff both JSONs.
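
A sketch of checking that value programmatically, assuming the watch page still embeds its player response as ytInitialPlayerResponse = {...}; (the regex is an assumption about the current page layout and may need adjusting):

import json
import re

import requests

html = requests.get('https://www.youtube.com/watch?v=o8NPllzkFhE').text
# The JSON payload is embedded on a single line in the watch page HTML.
match = re.search(r'ytInitialPlayerResponse\s*=\s*(\{.+?\});', html)
playerResponse = json.loads(match.group(1))
print(playerResponse['responseContext']['serviceTrackingParams'][0]['params'][1])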

Benjamin-Loison added the high priority label and removed the low priority label on Sep 29, 2023
Benjamin-Loison pinned this issue on Sep 29, 2023
Benjamin-Loison changed the title from "Could add API versioning" to "Add API versioning" on Sep 29, 2023
Benjamin-Loison unpinned this issue on Sep 30, 2023
@Benjamin-Loison
Owner Author

Benjamin-Loison commented Nov 7, 2023

Following this comment, as I once had the licenses returned while all the other times not, the question was whether a given YouTube server (identified by its IP) has a fixed version, such that we can rely on it to return a given data structure. The answer is no. Then a question could be how to always parse the structure correctly so as to be able to version the returned data structure; this could be answered by comparing two differently returned data structures (preferably the whole HTML) to find a data structure version identifier.

Could use this Stack Overflow answer to compare two JSONs as a first step for instance (HTML comparison being a later step). I am looking for a tool like diff -qr for folders, to tell whether a top-level entry is only present in one JSON file, and otherwise which fields differ, etc.
Proceeding with a line-by-line comparison is less user-friendly; in addition, filtering line by line makes the JSON invalidly formatted.
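
For instance, a minimal sketch of such a structural comparison in the spirit of diff -qr, assuming the two payloads have been extracted to plain JSON files:

import json

def diffJSON(a, b, path=''):
    # Report keys only present on one side, then recurse into shared ones,
    # mimicking what `diff -qr` does for folders.
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            if key not in a:
                print(f'Only in second: {path}/{key}')
            elif key not in b:
                print(f'Only in first: {path}/{key}')
            else:
                diffJSON(a[key], b[key], f'{path}/{key}')
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            print(f'Array lengths differ at {path}: {len(a)} != {len(b)}')
        for index, (itemA, itemB) in enumerate(zip(a, b)):
            diffJSON(itemA, itemB, f'{path}/{index}')
    elif a != b:
        print(f'Values differ at {path}')

with open('a.json') as f, open('b.json') as g:
    diffJSON(json.load(f), json.load(g))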

Potential solution: find a cookie like the one previously used to get a channel's Shorts before the feature was vastly deployed on YouTube servers. Could try to find such a __Secure-YEC cookie in the above curl.txt, or try to reverse-engineer the previous channel Shorts cookie thanks to this method. The latter method does not work as is; only base64 -d gives a bit of a result, but it returns a random identifier, so it does not help more...
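
For reference, here is what that decoding attempt looks like (the protobuf interpretation is a guess):

import base64
import urllib.parse

cookie = 'Cgs3d0VyY1paY3pKOCjCnaiqBjIICgJGUhICEgA%3D'
decoded = base64.b64decode(urllib.parse.unquote(cookie))
# Looks like a serialized protobuf: field 1 is an 11-character identifier,
# the remaining fields are opaque (hence "it does not help more").
print(decoded)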

WMG:     __Secure-YEC=Cgs3d0VyY1paY3pKOCjCnaiqBjIICgJGUhICEgA%3D
Not WMG: __Secure-YEC=Cgs5M1Q1WG9ORnExayiRsKqqBjIICgJGUhICEgA%3D (2a00:1450:4007:80b::200e)
Not WMG: __Secure-YEC=CgtpdGhYaUl3VEJtayi4sKqqBjIICgJGUhICEgA%3D (2a00:1450:4007:80b::200e)
clear && curl -s -v -H 'Cookie: __Secure-YEC=CgtuNjFmZlJlR0Qxcyjp3P-aBg==' 'https://www.youtube.com/watch?v=Ehoe35hTbuY' | grep 'WMG'

returns __Secure-YEC=CgtsOUE5YXh2dzlvdyiTsaqqBjIICgJGUhICEgA%3D and not WMG.

While the above is a manual try, having a web browser with the previous data structure and being able to copy the whole cURL request would help.
Should try to loop with cURL to get an idea of how many requests have to be made before managing to get the previous data structure. Then could proceed similarly with a web browser, thanks to Selenium for instance, but I have not yet tested whether modifying /etc/hosts is enough to force the same server, nor how to verify it, as with the successful cURL request.
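
One way to check what an /etc/hosts pin actually resolves to, since both requests and, in principle, Selenium go through the OS resolver:

import socket

# If /etc/hosts pins www.youtube.com to a single address, every line printed
# here should show that address.
for info in socket.getaddrinfo('www.youtube.com', 443, proto=socket.IPPROTO_TCP):
    print(info[4][0])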

import requests

url = 'https://www.youtube.com/watch?v=Ehoe35hTbuY'
requestIndex = 0
while True:
    # `stream = True` keeps the underlying socket accessible, so we can log
    # which YouTube server answered (hacky access to private attributes).
    response = requests.get(url, stream = True)
    ip = response.raw._fp.fp.raw._sock.getpeername()[0]
    print(requestIndex, ip, 'Shape of You' in response.text)
    text = response.text
    # 'SonyATV' only appears in the licenses metadata of the previous UI.
    if 'SonyATV' in text:
        print('Found!')
        index = text.index('SonyATV')
        print(text[index - 200:index + 200])
        break
    requestIndex += 1
0 2a00:1450:4007:80d::200e
1 2a00:1450:4007:80b::200e
2 2a00:1450:4007:80d::200e
3 2a00:1450:4007:80d::200e
4 2a00:1450:4007:80d::200e
5 2a00:1450:4007:80e::200e
6 2a00:1450:4007:80d::200e
7 2a00:1450:4007:80b::200e
8 2a00:1450:4007:80b::200e
9 2a00:1450:4007:80e::200e
10 2a00:1450:4007:80e::200e
11 2a00:1450:4007:80e::200e
12 2a00:1450:4007:80b::200e
13 2a00:1450:4007:80d::200e
14 2a00:1450:4007:80d::200e
15 2a00:1450:4007:80d::200e
16 2a00:1450:4007:80d::200e
17 2a00:1450:4007:80e::200e
18 2a00:1450:4007:80e::200e
19 2a00:1450:4007:80b::200e
Found!
etadata":{"simpleText":"WMG (au nom de East West Records UK Ltd); LatinAutor - Warner Chappell, ASCAP, AMRA, Spirit Music Publishing, LatinAutorPerf, BMI - Broadcast Music Inc., AMRA BR, LatinAutor - SonyATV, LatinAutor - UMPG, CMRRA, UMPI, Sony Music Publishing, BMG Rights Management (US), LLC, ARESA, SOLAR Music Rights Management, Abramus Digital, MINT_BMG, PEDL, UMPG Publishing, UNIAO BRASILEIR
from seleniumwire import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
#options.add_argument('-headless')
# Disable image loading to speed up the page loads.
options.set_preference('permissions.default.image', 2)
browser = webdriver.Firefox(options = options)

requestIndex = 0
while True:
    browser.get('https://www.youtube.com/watch?v=Ehoe35hTbuY')
    browserPageSource = browser.page_source
    print(requestIndex, 'Shape of You' in browserPageSource)
    # As above, 'SonyATV' indicates the previous UI returning the licenses.
    if 'SonyATV' in browserPageSource:
        print('Found!')
        index = browserPageSource.index('SonyATV')
        print(browserPageSource[index - 200:index + 200])
        break
    requestIndex += 1

browser.close()
0 True
...
10 True

Then it gets stuck, and after several tens of executions I have not found the licenses...

Should verify the above with something less likely to appear randomly (especially when making multiple requests), such as SonyATV instead of WMG.

At least requests takes /etc/hosts into account (so we can hope Selenium does too), as I reached 50 iterations with the script below:

import requests

url = 'https://www.youtube.com/watch?v=Ehoe35hTbuY'
requestIndex = 0
while True:
    response = requests.get(url, stream = True)
    # Same hacky peer IP extraction as above, to check that every request hits
    # the server pinned in /etc/hosts.
    ip = response.raw._fp.fp.raw._sock.getpeername()[0]
    print(requestIndex, ip, 'Shape of You' in response.text)
    if ip != '2a00:1450:4007:80c::200e':
        print('Different IP!')
        break
    requestIndex += 1
0 2a00:1450:4007:80c::200e
...
44 2a00:1450:4007:80c::200e
Found!
etadata":{"simpleText":"WMG (au nom de East West Records UK Ltd); LatinAutor - Warner Chappell, ASCAP, AMRA, Spirit Music Publishing, LatinAutorPerf, BMI - Broadcast Music Inc., AMRA BR, LatinAutor - SonyATV, LatinAutor - UMPG, CMRRA, UMPI, Sony Music Publishing, BMG Rights Management (US), LLC, ARESA, SOLAR Music Rights Management, Abramus Digital, MINT_BMG, PEDL, UMPG Publishing, UNIAO BRASILEIR

Reached, by accumulation, 192 iterations with Selenium without finding the licenses... So being able to craft a follow-up cURL request from a working cURL one seems to make the most sense.

Understanding all response fields from the verbose cURL output could help.
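
For instance, a quick way to dump those fields from Python rather than reading the verbose output by hand:

import requests

response = requests.get('https://www.youtube.com/watch?v=Ehoe35hTbuY')
# Every response header, looking for anything identifying the server version.
for name, value in response.headers.items():
    print(f'{name}: {value}')
# Response cookies (notably __Secure-YEC) are of particular interest here.
for cookie in response.cookies:
    print(f'{cookie.name}={cookie.value}')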

Maybe we could also have more luck with the mobile or desktop view of YouTube. No more luck.

Raw notes below:

$ clear && curl -s -v 'https://www.youtube.com/watch?v=Ehoe35hTbuY' > a.json

2a00:1450:4007:80d::200e
Not having WMG.

$ clear && curl -s -v 'https://www.youtube.com/watch?v=Ehoe35hTbuY' > b.json && ./getJSONPathFromKey.py b.json | grep 'WMG'

Unable to reproduce on any of my 7 YouTube operational API instances.

Let us try to force the opposite protocol to the one used by default on the instances proposing it (only 2 instances, which did not enable reproducing).

clear && curl -s -v 'https://www.youtube.com/watch?v=Ehoe35hTbuY' | grep 'WMG'

Returns the same hash multiple times in a row:

$ nslookup www.youtube.com | grep Address | grep -v '127.0.0.' | cut -d' ' -f2 | sort | sha256sum

So let us try all possible addresses exhaustively (see the sketch after the list below).
Note that the above addresses depend on the emitting machine, so try all possible addresses on all the machines I have access to.

142.250.178.142
142.250.179.110
142.250.179.78
142.250.201.174
142.250.75.238
172.217.20.174
172.217.20.206
216.58.213.78
216.58.214.174
216.58.214.78
216.58.215.46
2a00:1450:4007:813::200e
2a00:1450:4007:818::200e
2a00:1450:4007:819::200e
2a00:1450:4007:81a::200e - currently
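
A sketch of probing each address in turn, using curl's --resolve option to force a given address for the hostname (only two of the IPv4 addresses above are shown; the IPv6 ones would need the same treatment):

import subprocess

# --resolve forces curl to use the given address for www.youtube.com:443,
# bypassing DNS, so each server can be probed individually.
addresses = ['142.250.178.142', '142.250.179.110']  # ...and the rest of the list above
for address in addresses:
    html = subprocess.run(
        ['curl', '-s', '--resolve', f'www.youtube.com:443:{address}',
         'https://www.youtube.com/watch?v=Ehoe35hTbuY'],
        capture_output=True, text=True).stdout
    print(address, 'WMG' in html)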

Note that nslookup returns IPv6 addresses even on machines not supporting IPv6.

For an unknown reason this file containing WMG does not seem complete, even though the JSON payload seems complete, as it is parsable.
curl.txt
In fact it is probably incomplete, as grep filters only the stdout lines containing the licenses we are looking for.

After another attempt (with the same YouTube server IP), it does not have WMG anymore...

@Benjamin-Loison
Owner Author

Tried a hand-crafted request:

clear && curl -s -v 'https://www.youtube.com/watch?v=Ehoe35hTbuY' -H 'Cookie: __Secure-BUCKET=CPsB; YSC=skqAukUZG7c; __Secure-YEC=CgtGcWc1SmhiQk9FTSid2qqqBjIICgJGUhICEgA%3D; VISITOR_PRIVACY_METADATA=CgJGUhICEgA%3D; VISITOR_INFO1_LIVE=; CONSENT=PENDING+159' | grep 'SonyATV'

minimized it to:

$ minimizeCURL curl.txt 'SonyATV'
242
Removing headers
Removing URL parameters
Removing cookies
219 still fine
202 still fine
161 still fine
140 still fine
119 still fine
Removing raw data
curl 'https://www.youtube.com/watch?v=Ehoe35hTbuY' -H 'Cookie: __Secure-YEC=CgtGcWc1SmhiQk9FTSid2qqqBjIICgJGUhICEgA%3D'

@Benjamin-Loison
Owner Author

Benjamin-Loison commented Nov 7, 2023

This shows that storing __Secure-YEC today can possibly make the API last tomorrow. So, in addition to the HTML response, we should store all meta-HTML (notably response cookies); see the example in this commit comment. Maybe, if it makes sense, we could write __Secure-YEC in the source code in the future to make the API last automatically.
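
A sketch of that idea: pin the archived cookie when requesting, and store the response cookies alongside the HTML for future pinning (the cookie value is the one from the minimized request above; the file names are arbitrary):

import requests

# Pin the archived __Secure-YEC so the server keeps serving the previous UI.
cookies = {'__Secure-YEC': 'CgtGcWc1SmhiQk9FTSid2qqqBjIICgJGUhICEgA%3D'}
response = requests.get('https://www.youtube.com/watch?v=Ehoe35hTbuY', cookies=cookies)

# Store the HTML and the meta-HTML (response cookies) side by side.
with open('watch.html', 'w') as htmlFile:
    htmlFile.write(response.text)
with open('watch.cookies.txt', 'w') as cookieFile:
    for cookie in response.cookies:
        cookieFile.write(f'{cookie.name}={cookie.value}\n')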

Benjamin-Loison added a commit that referenced this issue on Nov 7, 2023
Note that I do not know how long this patch will keep working, as it seems to rely on a previous YouTube UI.

Long investigation available [here](#162 (comment)).