Doesn't work in production #38

Open
Donald646 opened this issue Aug 11, 2024 · 51 comments

Comments

@Donald646

This gets blocked in production

@emilthemaker

Mind specifying what's happening?

@Donald646
Author

I'm using Next.js, Supabase, and Vercel.

It seems like when I go to production, it's getting blocked. I get the error Transcript is Disabled, but I know that's not true, because it works locally.

I get the following error:

Error: An error occurred in the Server Components render. The specific message is omitted in production builds to avoid leaking sensitive details. A digest property is included in this error instance which may provide additional details about the nature of the error.

I am using server components for it. Even when I switch to API routes it still gets blocked.

@Donald646
Author

Donald646 commented Aug 11, 2024

import { NextResponse } from 'next/server';
import { YoutubeTranscript } from 'youtube-transcript';

export async function POST(request: Request) {
  const { videoUrl } = await request.json();

  try {
    const transcript = await YoutubeTranscript.fetchTranscript(videoUrl);
    let newTranscript = '';
    for (let i = 0; i < transcript.length; i++) {
      newTranscript += `Timestamp: ${transcript[i].offset}, Text: ${transcript[i].text}\n`;
    }
    return NextResponse.json({ transcript: newTranscript });
  } catch (error) {
    console.error('Error fetching transcript:', error);
    return NextResponse.json(
      { error: 'Failed to fetch video transcript. Please check the URL and try again.' },
      { status: 500 }
    );
  }
}

This is my code, above.

@Donald646
Author

I saw the other issue #11 and it seems to be the same problem I'm facing, but I'm raising it here because there seems to be no good solution there. If YouTube is blocking sites, wouldn't this library be useless?

@kellenmace

I was encountering this issue on Vercel, too. I believe it exists because YouTube returns different HTML for video pages depending on where the request comes from. When I scrape a page like https://www.youtube.com/watch?v=rB9ql0L0cUQ on my local machine, the HTML contains a script tag with a ytInitialPlayerResponse variable, and within that object are the URLs to the caption tracks (this is what the youtube-transcript library relies on to get the transcript URLs). When I scrape that same page in a Serverless Function on Vercel, though, the ytInitialPlayerResponse object does not contain the caption track URLs, so the code that youtube-transcript runs errors out.
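
One way to confirm this yourself is to fetch the watch-page HTML from both environments and check whether the player response actually contains caption tracks. A minimal diagnostic sketch (the string check is my simplification, not the actual parsing youtube-transcript does internally):

```typescript
// Returns true if the watch-page HTML carries caption track data inside
// ytInitialPlayerResponse -- the data youtube-transcript depends on.
// This is a crude string check, not a full parser.
function hasCaptionTracks(html: string): boolean {
  const start = html.indexOf('ytInitialPlayerResponse');
  if (start === -1) return false;
  // Caption tracks live under captions.playerCaptionsTracklistRenderer.
  return html.includes('"captionTracks"', start);
}
```

Logging this value for the same video URL locally and from a Vercel function makes the discrepancy easy to see.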

Option 1

If all you need is the lines of text from the transcript, then the method described in #11 where you use the youtubei.js NPM package to access the transcript may work just fine.

Option 2

If you need all of the transcript data, including the start time and duration for every line, the easiest solution is to run your youtube-transcript code somewhere else. You can create a Google Cloud Function that executes the code and call it from your app, for example. This isn't ideal, since you have to have this one bit of functionality live outside of your Vercel-hosted app, but it works.
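
To make Option 2 concrete, here is a rough sketch of what the relocated function could look like. The handler shape is illustrative (Cloud Functions use an Express-style req/res signature in reality), and the transcript fetcher is passed in as a parameter purely so the sketch stays self-contained; the real function would call YoutubeTranscript.fetchTranscript directly:

```typescript
type TranscriptSegment = { offset: number; text: string };

// Same formatting as the route handler earlier in this thread.
function formatTranscript(segments: TranscriptSegment[]): string {
  return segments
    .map((s) => `Timestamp: ${s.offset}, Text: ${s.text}`)
    .join('\n');
}

// Sketch of the relocated endpoint: fetch the transcript outside Vercel,
// return it as JSON, and call this endpoint from the Vercel-hosted app.
async function handleTranscriptRequest(
  fetchTranscript: (url: string) => Promise<TranscriptSegment[]>,
  videoUrl: string
): Promise<{ status: number; body: unknown }> {
  try {
    const segments = await fetchTranscript(videoUrl);
    return { status: 200, body: { transcript: formatTranscript(segments) } };
  } catch {
    return { status: 500, body: { error: 'Failed to fetch video transcript.' } };
  }
}
```

The Vercel app then just does a fetch() to the Cloud Function's URL instead of scraping YouTube from its own (blocked) IP range.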

@Donald646
Author

I see, thank you. The durations are a pretty core part of my app. The commenter of the youtubei.js code said it was also possible to get the durations, so I will check that out first.

@terrytjw

I'm just experiencing this issue in prod too. It works perfectly fine locally. @Donald646 did you manage to get it working in prod?

@Donald646
Author

@terrytjw No I haven't, I don't think it works. I'm planning to switch to youtubei.js, but it seems like the issue is prevalent over there as well.

@terrytjw

Hey @kellenmace, other than a Google Cloud Function, would AWS Lambda work too?

@terrytjw

@Donald646 have you tried the Google Cloud Function approach?

@Donald646
Author

No, I haven't yet, but in another issue I was looking at on youtubei.js they said it didn't work either; they weren't getting transcripts though, so I don't know.

SuspiciousLookingOwl/youtubei#113

@leandronorcio

leandronorcio commented Aug 12, 2024

I can confirm it’s not working on AWS Lambda.

@Donald646
Author

Yea this library is cooked, it's kinda useless if it doesn't work in production

@emilthemaker

Working for me with youtubei.js but no clue why 😅

I'm on Supabase Edge Functions. Transcript works, but some other properties appear broken.

@metaloozee

Because YouTube is constantly changing how it operates and now enforces sign-in, it's becoming impossible for data scrapers to obtain the data. This is a global problem.

If you check a few different libraries, including ytdl and youtubei.js, you'll find similar issues where YouTube simply returns the message "Sign in to confirm you're not a bot".

@colouredFunk

It's working for me on a regular linux production server

@metaloozee

It's working for me on a regular linux production server

wait till their system finds out that you're just a bot

@colouredFunk

It's stopped working for me now too. I'm not seeing any error message; it just hangs...

@aleksa-codes

aleksa-codes commented Aug 16, 2024

youtubei.js is working for me on production, I switched to it based on the code in that issue: #11 (comment). I managed to get the timestamps as well, if someone is interested I can share the code with the timestamps.

Edit: it does not work for everything in production. I was not able to get basicInfo like title, duration, description, etc., but getting the transcript still works. Again, it works locally and does not on Vercel prod in my case. I found out the hard way after I migrated all my code to use youtubei.js for all of the data I was getting in different ways. I get "Sign in required". Probably related to: LuanRT/YouTube.js#696

@Lipcsyy

Lipcsyy commented Aug 16, 2024

It says video unavailable for me. Did this occur to you too?

@aleksa-codes

aleksa-codes commented Aug 16, 2024

I was getting that error only with enable_safety_mode: true as an option in Innertube.create({}). Remove it or set it to false, if that's the case.

@aleksa-codes

aleksa-codes commented Aug 16, 2024

@Lipcsyy Sorry to follow up, but I noticed that the code here #11 (comment) uses a URL in youtube.getInfo(). However, the YouTube.js documentation doesn't mention using a URL as the target; according to the docs, the function expects the video ID.

Here's the function I'm using to extract the ID from the URL:

const getYouTubeVideoId = (input: string): string => {
  const regExp: RegExp =
    /(?:https?:\/\/)?(?:www\.)?(?:youtube\.com\/(?:.*[?&]v=|(?:v|e(?:mbed)?)\/|shorts\/|live\/)|youtu\.be\/)([a-zA-Z0-9_-]{11})/;
  const match: RegExpMatchArray | null = input.match(regExp);

  return match && match[1] ? match[1] : input;
};

There might be a simpler approach, but I hope this helps!
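
For reference, these are the URL shapes the helper above should normalize. The function is restated so the snippet runs standalone, and the video ID is just the example from earlier in the thread:

```typescript
// Same helper as above, restated so this snippet is self-contained.
const getYouTubeVideoId = (input: string): string => {
  const regExp =
    /(?:https?:\/\/)?(?:www\.)?(?:youtube\.com\/(?:.*[?&]v=|(?:v|e(?:mbed)?)\/|shorts\/|live\/)|youtu\.be\/)([a-zA-Z0-9_-]{11})/;
  const match = input.match(regExp);
  return match && match[1] ? match[1] : input;
};

// Watch, short-link, and shorts URLs all reduce to the same ID,
// and a bare ID passes through unchanged.
getYouTubeVideoId('https://www.youtube.com/watch?v=rB9ql0L0cUQ'); // 'rB9ql0L0cUQ'
getYouTubeVideoId('https://youtu.be/rB9ql0L0cUQ');                // 'rB9ql0L0cUQ'
getYouTubeVideoId('https://www.youtube.com/shorts/rB9ql0L0cUQ');  // 'rB9ql0L0cUQ'
getYouTubeVideoId('rB9ql0L0cUQ');                                 // 'rB9ql0L0cUQ'
```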

@adnjoo

adnjoo commented Aug 20, 2024

I have the same issue with Next.js deployed to Vercel

rn [Error]: [YoutubeTranscript] 🚨 Transcript is disabled on this video (Osj9tv8aOqM)

@tonymanh-dev

Any solution for that?

@tushar453030

Hey @adnjoo

I implemented the same logic as yours in my Express server (screenshot omitted), but when hitting the APIs it gives the error InnertubeError: This video is unavailable.

What could be the reason?

@tushar453030

tushar453030 commented Aug 22, 2024

@adnjoo

By any chance have you ever come across the below error?

When hitting the APIs locally using Postman or the UI it works fine and I get the transcript, but when the backend is hosted on Vercel or AWS it gives this error:

https://www.youtube.com/watch?v=EGkGRs6YhoM
YoutubeTranscriptDisabledError: [YoutubeTranscript] 🚨 Transcript is disabled on this video

(screenshot omitted)

@mrgoonie

YouTube has been updated; this script is cooked.

@tushar453030

tushar453030 commented Aug 23, 2024

Hi @mrgoonie,
Check out this repo and the hosted site shared by @adnjoo. It's working, and the code is the same as mine.
https://github.com/adnjoo/fast-youtube-summary/tree/main

The only difference is that his code is in Next.js and mine is in Express.

@M-YasirGhaffar

I think the error in production is due to the hosting server's shared IP address. I also saw this error in the server logs:

Error: YoutubeTranscriptTooManyRequestError: [YoutubeTranscript] 🚨 YouTube is receiving too many requests from this IP and now requires solving a captcha to continue

@mrgoonie

That's exactly the reason. As I mentioned earlier, YouTube has just updated and will block your server's IP address if it crawls YouTube URLs too many times.

So I came up with a solution: adding a proxy layer to every fetch.

Here is the demo: https://app.digicord.site/youtube/transcript

There is a FREE API within the page in case someone needs it. Cheers!

@mrgoonie

It's working because your IP address has not been blocked, not working for me because mine is cooked 😄

@swarajbachu

So it's not solved yet 🥲. I thought this was easy, then saw the error in production.

@ngocsangyem

Hi @tushar453030, try passing the video ID instead of the whole URL. It worked on my side with the following code.

const init = async () => {
    const youtube = await Innertube.create({
        lang: "en",
        location: "US",
        retrieve_player: false,
    });

    try {
        const info = await youtube.getInfo('axYAW7PuSIM');
        const transcriptData = await info.getTranscript();
        const mappedData = transcriptData.transcript.content.body.initial_segments.map(
            (segment) => segment.snippet.text
        );

        console.log('transcript', mappedData);

    } catch (error) {
        console.error("Error fetching transcript:", error);
        throw error;
    }
}
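
Since several people here also need timings: each entry in initial_segments carries start_ms and end_ms alongside the snippet. The field names below follow what youtubei.js's transcript segments appear to expose, so treat this mapping as a sketch against that assumed shape:

```typescript
// Assumed shape of one youtubei.js transcript segment (times are
// millisecond strings in the library's output -- an assumption here).
type Segment = {
  start_ms: string;
  end_ms: string;
  snippet: { text: string };
};

// Map raw segments to { text, start, duration } with times in seconds.
function withTimings(segments: Segment[]) {
  return segments.map((s) => {
    const start = Number(s.start_ms) / 1000;
    const end = Number(s.end_ms) / 1000;
    return { text: s.snippet.text, start, duration: end - start };
  });
}
```

In the code above this would replace the .map((segment) => segment.snippet.text) line, e.g. withTimings(transcriptData.transcript.content.body.initial_segments).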

@hatemmezlini

If anyone is still struggling with this: it's due to YouTube banning ISP IPs (which is why it doesn't work in production). I have made a working solution using OAuth2 that doesn't use proxies and avoids the YouTube ban. You can use it for free on Apify: https://apify.com/invideoiq/video-transcript-scraper. You only pay for Apify usage; however, Apify gives you a free credit of $5, which will give you around 5,000 transcripts.

@swarajbachu

So you are directly using Google APIs?

@hatemmezlini

@swarajbachu No. There is an OAuth plugin that was created by the yt-dlp developers. It uses the YouTube on TV client because the token is never refreshed on TVs. All I had to do was create a dummy account; it asked me for the password once, and then I saved the token and kept passing it in every request. I believe something similar can be developed here.

@blake41

blake41 commented Nov 23, 2024

What are you using for the proxy?

@mrgoonie

There are plenty of them on the internet. I picked IPRoyal; good enough for me, but if you crawl too much, you will need many proxies to rotate 😅

@blake41

blake41 commented Nov 23, 2024

I followed the extractor wiki and grabbed a proof-of-origin (PO) token. Is that the token you're referring to? I'm still getting blocked on my prod server even when I pass the PO token using this library. Or which OAuth plugin are you referring to in the yt-dlp library?

@petesampras12

petesampras12 commented Dec 2, 2024

Fantastic! Is it possible to limit the request to only get the "content" part and not the "chunks"? It takes a long time to retrieve the transcript for a 3-minute video, but perhaps it would go quicker if I could only get "content"? Appreciate it!

@petesampras12

Sadly, both youtube-transcript and digicord's variant (https://app.digicord.site/youtube/transcript) have now stopped working.

Anyone else know of a working API?

@blake41

blake41 commented Dec 16, 2024

I run all my requests through a residential proxy. Unless you are doing an insane number of requests, it's quite cheap.

@savnani5

I am trying this but still getting blocked. What library are you using to fetch?

@blake41

blake41 commented Dec 19, 2024

Your requests must not be going through the proxy.

import dotenv from 'dotenv';
import { setGlobalDispatcher, ProxyAgent } from 'undici';

// Load environment variables
dotenv.config();

// Set up global proxy if configured
if (process.env.PROXY_HOST && process.env.PROXY_PORT) {
  const proxyUrl = process.env.PROXY_USERNAME && process.env.PROXY_PASSWORD
    ? `http://${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}@${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`
    : `http://${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`;

  // Mask the password when logging the proxy URL
  console.log('Setting up global proxy dispatcher:', proxyUrl.replace(/:[^:@]+@/, ':****@'));
  const dispatcher = new ProxyAgent({ uri: proxyUrl });
  setGlobalDispatcher(dispatcher);
}

@petesampras12

Did you get it to work?

@savnani5

I was using youtube-transcript and ytdl-core with proxies and they were throwing a 500 internal server error. I got it working for now with Innertube! Thanks

@savnani5

savnani5 commented Dec 19, 2024

Yes, working with youtubei.js (innertube)!

@petesampras12

Awesome! Is it this one? (https://github.com/haxzie/innerTube.js/)

I can't find any method that gets the transcript, though.

@savnani5

savnani5 commented Dec 23, 2024

Here, this one: #11 (comment)
