Doesn't work in production #38
Comments
Mind specifying what's happening? |
I'm using Next.js, Supabase, and Vercel. It seems like requests get blocked when I go to production. I get the error "Transcript is Disabled", but I know that's not true, because it works locally. I also get: "Error: An error occurred in the Server Components render. The specific message is omitted in production builds to avoid leaking sensitive details. A digest property is included in this error instance which may provide additional details about the nature of the error." I am using server components for this. Even when I switch to API routes, it still gets blocked. |
this is my code above |
I saw the other issue, #11, and it seems to be the same issue I'm facing, but I'm raising it again because there seems to be no good solution. If YouTube is blocking sites, wouldn't this library be useless? |
I was encountering this issue on Vercel, too. I believe this issue exists because YouTube is returning different HTML for YouTube video pages depending on where the request is coming from. When I scrape a page like
Option 1
If all you need is the lines of text from the transcript, then the method described in #11 where you use the
Option 2
If you need all of the transcript data, including the start time and duration for every line, the easiest solution is to run your |
I see thank you. The durations are a pretty core part of my app. The commenter of the youtubei.js code said it was also possible to get the durations, so I will check that out first. |
i am just experiencing this issue too in |
@terrytjw No I haven't, I don't think it works. I'm planning to switch to youtubei.js, but it seems like the issue is prevalent over there as well. |
hey @kellenmace , other than Google Cloud Function, can aws lambda work too? |
@Donald646 have you tried the Google Cloud Function approach? |
No, I haven't yet. In another issue I was looking at on youtubei.js, they said it didn't work either, but they weren't fetching transcripts, so I don't know. |
I can confirm it’s not working on AWS Lambda. |
Yea this library is cooked, it's kinda useless if it doesn't work in production |
Working for me with youtubei.js but no clue why 😅 I'm on Supabase Edge Functions. Transcript works, but some other properties appear broken. |
Because YouTube is always changing how it operates and because it enforces sign-in, it is becoming impossible for data scrapers to obtain the data. This is a global problem. You can find similar issues that simply return the YouTube message "Sign in to confirm you are not a bot" if you look at a few different libraries, including ytdl and youtubei.js. |
It's working for me on a regular linux production server |
wait till their system finds out that you're just a bot |
I'm getting the bot login thing too but can still scrape transcripts
|
it's stopped working for me now too. I'm not seeing any error message, it just hangs.... |
Edit: does not work for everything on production, I was not able to get |
It says video unavailable for me. Did this occur to you too? |
I was getting that error only with: |
@Lipcsyy Sorry to follow up, but I noticed that the code here #11 (comment) uses a URL in
Here's the function I'm using to extract the ID from the URL:
const getYouTubeVideoId = (input: string): string => {
  const regExp: RegExp =
    /(?:https?:\/\/)?(?:www\.)?(?:youtube\.com\/(?:.*[?&]v=|(?:v|e(?:mbed)?)\/|shorts\/|live\/)|youtu\.be\/)([a-zA-Z0-9_-]{11})/;
  const match: RegExpMatchArray | null = input.match(regExp);
  return match && match[1] ? match[1] : input;
};
There might be a simpler approach, but I hope this helps! |
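For anyone looking for an alternative to the regex, here is a minimal sketch that leans on the built-in `URL` parser instead. The helper name `extractVideoId` and the list of path prefixes are assumptions made for illustration, not part of youtube-transcript or youtubei.js, and it has only been reasoned through against the common URL shapes mentioned in this thread.

```typescript
// Hypothetical helper (not from any library in this thread): returns the
// 11-character video ID for common YouTube URL shapes, or the input unchanged.
const VIDEO_ID = /^[a-zA-Z0-9_-]{11}$/;

function extractVideoId(input: string): string {
  const trimmed = input.trim();
  if (VIDEO_ID.test(trimmed)) return trimmed; // already a bare ID

  let url: URL;
  try {
    // Accept inputs without a scheme, e.g. "youtu.be/<id>".
    url = new URL(/^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`);
  } catch {
    return trimmed; // not parseable as a URL, hand it back untouched
  }

  const host = url.hostname.replace(/^(www\.|m\.)/, "");
  const segments = url.pathname.split("/").filter(Boolean);
  let candidate: string | null = null;

  if (host === "youtu.be") {
    candidate = segments[0] ?? null;
  } else if (host === "youtube.com") {
    switch (segments[0]) {
      case "watch":
        candidate = url.searchParams.get("v");
        break;
      case "embed":
      case "shorts":
      case "live":
      case "v":
      case "e":
        candidate = segments[1] ?? null;
        break;
    }
  }

  // Fall back to the original input when nothing ID-shaped was found.
  return candidate !== null && VIDEO_ID.test(candidate) ? candidate : trimmed;
}
```

Like the regex version, it falls back to returning the input as-is, so a value that is already a bare ID passes through unchanged.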
I have the same issue with Next.js deployed to Vercel
|
Any solution for that? |
Hey @adnjoo, I implemented the same logic as yours in my Express server, but when hitting the APIs it gives the error InnertubeError: This video is unavailable. What could be the reason? |
By any chance, have you ever come across the error below? When hitting the APIs locally using Postman or the UI, it works fine and I get the transcript, but when the backend is hosted on Vercel or AWS it gives an error. https://www.youtube.com/watch?v=EGkGRs6YhoM |
Youtube has been updated, this script was cooked |
Hii @mrgoonie, The only diff is his code is in next.js and mine on express. |
I think the error in production is due to the same IP address of the hosting server. I also saw this error in the logs of the server:
|
That's exactly the reason. As I mentioned earlier, YouTube has just updated and will block your server's IP address if it crawls YouTube URLs too many times. So I came up with a solution: adding a proxy layer to every fetch. Here is the demo: https://app.digicord.site/youtube/transcript There is a FREE API within the page in case someone needs it. Cheers! |
It's working because your IP address has not been blocked, not working for me because mine is cooked 😄 |
So it's not solved yet 🥲. I thought this was easy, then saw the error in production. |
Hi @tushar453030, try to pass the video ID instead of the whole URL. It worked on my side by following this code.
|
If anyone is still struggling with this: it's due to YouTube banning ISP IPs (which is why it would not work in production). I have made a working solution using OAuth2 that doesn't use proxies and avoids the YouTube ban. You can use it for free on Apify: https://apify.com/invideoiq/video-transcript-scraper. You only pay for Apify usage; however, Apify gives you a free credit of $5, which will give you around 5000 transcripts. |
so you are directly using google apis ? |
@swarajbachu No. There is an OAuth plugin that was created by the yt-dlp developers. It uses the YouTube on TV client because the token is never refreshed on TVs. All I had to do was create a dummy account; it asked me for the password once, and then I saved the token and kept passing it in every request. I believe something similar can be developed here. |
what are you using for the proxy? |
There are plenty of them on the internet, I picked IPRoyal, good enough for me, but if you crawl too many, you will need many proxies to rotate 😅 |
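To make the "many proxies to rotate" idea concrete, here is a minimal round-robin sketch. The function name `createProxyRotator` and the `proxy-*.example` URLs are placeholders invented for illustration. It only decides which proxy URL to use next; wiring that URL into the actual HTTP client is left out.

```typescript
// Hypothetical round-robin rotator: each call returns the next proxy URL in
// the list and wraps around at the end. The URLs passed in are placeholders.
function createProxyRotator(proxyUrls: readonly string[]): () => string {
  if (proxyUrls.length === 0) {
    throw new Error("At least one proxy URL is required");
  }
  let next = 0;
  return () => {
    const url = proxyUrls[next];
    next = (next + 1) % proxyUrls.length; // wrap back to the first proxy
    return url;
  };
}

// Usage: pick a proxy per outgoing request.
const nextProxy = createProxyRotator([
  "http://proxy-a.example:8080",
  "http://proxy-b.example:8080",
]);
```

Round-robin is just the simplest policy. A real setup would probably also want to skip proxies that have started returning bot checks.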
I followed the extractor wiki and grabbed a proof of origin token. Is that the token you're referring to? I'm still getting blocked on my prod server even when I pass the PO using this library. Or which oauth plugin are you referring to with the yt-dlp library? |
Fantastic! Is it possible to limit the request to only get the "content" part and not the "chunks"? It takes a long time to retrieve the transcript for a 3 min video, but perhaps it would go quicker if I could only get "content"? Appreciate it! |
Sadly, both youtube-transcript and digicord's variant (https://app.digicord.site/youtube/transcript) have now stopped working. Does anyone else know of a working API? |
i run all my requests through a residential proxy. unless you are doing an insane amount of requests, quite cheap |
I am trying this, but still getting blocked, what library are you using to fetch? |
your requests must not be going through the proxy
import { setGlobalDispatcher, ProxyAgent } from 'undici';
// Load environment variables
// Set up global proxy if configured
console.log('Setting up global proxy dispatcher:', proxyUrl.replace(/:[^:@]@/, ':**@')); |
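The log line in the snippet above hides the proxy password before printing the URL. Here is a small sketch of that masking step, using the built-in `URL` class rather than a regex. The helper name `maskProxyUrl` and the example proxy address are assumptions for illustration, and this is not the commenter's code.

```typescript
// Hypothetical helper: replaces the password portion of a proxy URL with a
// fixed mask so the URL can be logged without leaking credentials.
function maskProxyUrl(proxyUrl: string): string {
  const url = new URL(proxyUrl);
  if (url.password) {
    url.password = "****";
  }
  return url.toString();
}

// Example with a placeholder proxy address.
console.log("Setting up global proxy dispatcher:", maskProxyUrl("http://user:secret@proxy.example:8080"));
```

With undici installed, the unmasked URL is what would be passed to `setGlobalDispatcher(new ProxyAgent(proxyUrl))`, matching the import shown in the comment. That call is left out of the sketch so it runs without extra dependencies.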
Did you get it to work? |
I was using youtube-transcript and ytdl-core with proxies and it was throwing a 500 Internal Server Error. Got it working for now with innertube! Thanks |
Yes, working with youtubei.js (innertube)! |
Awesome! Is it this one? (https://github.com/haxzie/innerTube.js/) I can't find any method that gets the transcript? |
here this one: #11 (comment) |
This gets blocked in production