JSDelivr CDN not accessible in China #899

Open
Balearica opened this issue Mar 3, 2024 · 8 comments

@Balearica
Member

Balearica commented Mar 3, 2024

Chinese users are reporting that JSDelivr--the CDN used by default for corePath/langPath/workerPath--does not work in China.

I checked the JSDelivr issues, and the maintainers appear to have given up on supporting China.

We had an ICP license until it was revoked for no stated reason. Getting a new one is basically impossible for us.
jsdelivr/jsdelivr#18407 (comment)

We won't transfer our domain to a Chinese domain registrar. This breaks most options to get an ICP license.
We don't want to create a second domain for Chinese traffic, like cdn.jsdelivr-cn.com. It will completely miss the point of a single unified and global service. Anyone can just mirror jsDelivr to do exactly that. This breaks most other options.
We don't have the money or resources to hire Chinese law firms to establish Chinese corporations to run our free service.
We plan to update our website in the new redesign to note the revoked ICP license.
We are willing to block any content we need to comply with the local law but we don't know how to get the infringing URLs.
jsdelivr/jsdelivr#18407 (comment)

Therefore, the only option would be to add a fallback CDN. I do not want to switch CDNs entirely unless there is an option that will be unequivocally better than JSDelivr for all users, globally. This is because other services that people report currently work in China have previously caused us to receive complaints due to outages (specifically, GitHub Pages and unpkg).

Additionally, it's worth noting that unless any CDN specifically claims to have a relationship with the Chinese government, the fact that it currently works in China does not guarantee it will work in the future. JSDelivr claimed to support China when we started using it.

@ivysrono

ivysrono commented Mar 3, 2024

Converting *.traineddata.gz files to pure JavaScript could solve this issue.

*.js files could be saved and synced anywhere.

@Balearica
Member Author

@ivysrono There are 3 types of files loaded from the CDN by default in the browser (workerPath, langPath, corePath) and 1 type of file loaded from the CDN in Node.js (langPath). For all of these files, the JSDelivr CDN is simply the default value; you do not need to use it. All of these files can be hosted on your own site (for browser) or on the local file system (for Node.js), and workerPath, langPath, and corePath can be changed to point to them. This is explained in the following document.

https://github.com/naptha/tesseract.js/blob/master/docs/local-installation.md
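
As a rough illustration of the self-hosting option, here is a minimal sketch that builds the three options for files served from your own site. The helper name `selfHostedOptions` and the directory names (`tesseract`, `tesseract-core`, `lang-data`) are hypothetical and not taken from the linked document; use whatever layout you actually copied the files into.

```javascript
// ASSUMPTION: the worker script, the tesseract.js-core files, and the
// *.traineddata.gz files were copied into these hypothetical sub-directories
// of your own site. Adjust the names to match your real layout.
function selfHostedOptions(baseUrl) {
  const base = baseUrl.replace(/\/+$/, ''); // drop any trailing slashes
  return {
    workerPath: `${base}/tesseract/worker.min.js`, // a single file
    corePath: `${base}/tesseract-core`,            // a directory
    langPath: `${base}/lang-data`,                 // a directory
  };
}

const options = selfHostedOptions('https://example.com/assets/');
console.log(options);

// In the browser, with Tesseract.js loaded, the object would be passed as:
// const worker = await Tesseract.createWorker('eng', 1, options);
```

With this approach no request goes to a third-party CDN, so availability depends only on your own host.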

@ivysrono

ivysrono commented Mar 3, 2024

Users may use them in browsers without having their own site, for example in a userscript: https://greasyfork.org/scripts/482236/code

@Balearica
Member Author

If you do not want to host these files yourself, you can set workerPath/langPath/corePath to an alternative CDN. For example, you could try unpkg.

I agree that having a default CDN that does not support China is not ideal, and would be open to adding some fallback for the default CDN in the future. However, if individual developers want to be sure that mainland China is supported in their applications, I believe that the existing options in Tesseract.js do allow them to do that.
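
Until a built-in fallback exists, an application could do this itself. The following is only a sketch of one possible approach, not something Tesseract.js provides: `CDN_CANDIDATES`, `canFetch`, and `pickFirstReachable` are hypothetical names, the version tags in the URLs are assumptions, and the probe is a plain HEAD request with a timeout.

```javascript
// Hypothetical candidate list. The version tags are assumptions; pin them to
// the version of Tesseract.js you actually ship.
const CDN_CANDIDATES = [
  {
    name: 'jsdelivr',
    workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v5.0.0/dist/worker.min.js',
    corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0',
  },
  {
    name: 'unpkg',
    workerPath: 'https://unpkg.com/tesseract.js@v5/dist/worker.min.js',
    corePath: 'https://unpkg.com/tesseract.js-core@v5',
  },
];

// Returns true if a HEAD request to `url` succeeds within `timeoutMs`.
async function canFetch(url, timeoutMs = 3000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { method: 'HEAD', signal: controller.signal });
    return res.ok;
  } catch (err) {
    return false; // network error, blocked host, or timeout
  } finally {
    clearTimeout(timer);
  }
}

// Tries the candidates in order and returns the first one whose probe passes,
// or null if none is reachable. The probe is injectable so it can be replaced.
async function pickFirstReachable(candidates, probe = (c) => canFetch(c.workerPath)) {
  for (const candidate of candidates) {
    if (await probe(candidate)) return candidate;
  }
  return null;
}

// Usage in the browser (langPath would be chosen the same way):
// const cdn = await pickFirstReachable(CDN_CANDIDATES);
// const worker = await Tesseract.createWorker('eng', 1, {
//   workerPath: cdn.workerPath,
//   corePath: cdn.corePath,
// });
```

Probing adds a round trip before the worker starts, and a CDN that answers the probe can still fail later, so this reduces the problem rather than eliminating it.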

@ivysrono

ivysrono commented Mar 3, 2024

Does unpkg support langPath?

@Balearica
Member Author

Does unpkg support langPath?

Here is a working example that uses unpkg for all 3 resources: corePath/langPath/workerPath.

  const lang = 'eng';
  const langPath = `https://unpkg.com/@tesseract.js-data/${lang}/4.0.0_best_int`;

  // A worker is created once and used every time a user uploads a new file.
  const worker = await Tesseract.createWorker(lang, 1, {
    corePath: 'https://unpkg.com/tesseract.js-core@v5',
    workerPath: 'https://unpkg.com/tesseract.js@v5/dist/worker.min.js',
    langPath: langPath,
    logger: function(m){console.log(m);}
  });

This loads the LSTM-only data, so it will only work with oem set to 1 (the default). To use the Legacy model, you would replace 4.0.0_best_int with 4.0.0. That data is significantly larger, so do not do that unless you are actually using the Legacy model.
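
The choice between the two data folders can be captured in a small helper. This is a sketch only: `langPathFor` and `UNPKG_DATA_ROOT` are hypothetical names, and the rule it encodes is just the one described above (LSTM-only data when oem is 1, the larger data otherwise).

```javascript
const UNPKG_DATA_ROOT = 'https://unpkg.com/@tesseract.js-data';

// Returns the langPath for `lang`. With oem === 1 (LSTM only, the default)
// the smaller 4.0.0_best_int data is enough; any other oem value gets the
// larger 4.0.0 data, which the Legacy model requires.
function langPathFor(lang, oem = 1) {
  const variant = oem === 1 ? '4.0.0_best_int' : '4.0.0';
  return `${UNPKG_DATA_ROOT}/${lang}/${variant}`;
}

console.log(langPathFor('eng'));    // LSTM-only data
console.log(langPathFor('eng', 0)); // data that also supports the Legacy model
```

Deriving langPath from the same oem value that is passed to createWorker keeps the two from drifting apart.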

@ivysrono

ivysrono commented Mar 3, 2024

Thank you very much, I will try.

@MarketingPip

@Balearica - just wanted to say thank you for saving me some time with that comment. Cheers 👍

3 participants