Support languages like Chinese, Japanese, Thai, etc. #1

Open
saginadir opened this issue May 8, 2018 · 19 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@saginadir

It's a cool library, but I'm worried that it won't slugify everything.

Chinese characters are just deleted.

slugify('你好'); // results in an empty string
@Sigmus

Sigmus commented May 8, 2018

I'm curious, what would be the preferable result in this case?

@saginadir
Author

It is possible to convert Chinese to pinyin, for example:
https://stackoverflow.com/questions/4813086/how-to-convert-chinese-characters-to-pinyin

@caraya

caraya commented Jun 11, 2018

Reading that answer, it appears that there is no single way to slugify Chinese characters. Even when converting them to Pinyin, it would be very hard to provide the correct conversion, as the last answer in the question you linked to points out.

If you have the translations handy, you can add them to your project and then slugify the translation. That would probably be easier than asking slugify to also convert from one language to another. I believe that's outside the scope of what the library was designed to do.

@XieJiSS

XieJiSS commented Feb 13, 2019

Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好
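
The behaviour proposed here can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's API: the function name `slugifyKeepingCjk` is hypothetical, and "CJK" is approximated by the Han, Hiragana, and Katakana Unicode scripts (Hangul, Thai, and Script=Common marks such as the Katakana prolonged sound mark are not covered).

```javascript
// Hypothetical helper (not part of any library): keeps CJK characters as-is
// and slugifies everything else down to lowercase ASCII letters and digits.
const CJK = '\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}';

// Boundaries between an ASCII letter/digit and a CJK character, in both directions.
const asciiThenCjk = new RegExp(`([a-z0-9])(?=[${CJK}])`, 'gu');
const cjkThenAscii = new RegExp(`([${CJK}])(?=[a-z0-9])`, 'gu');
// Anything that is neither an ASCII letter/digit nor a CJK character.
const disallowed = new RegExp(`[^a-z0-9${CJK}]+`, 'gu');

function slugifyKeepingCjk(input) {
  return input
    .normalize('NFKC')              // fold full-width forms into their ASCII equivalents
    .toLowerCase()
    .replace(asciiThenCjk, '$1-')   // hello你好 -> hello-你好
    .replace(cjkThenAscii, '$1-')   // 你好world -> 你好-world
    .replace(disallowed, '-')       // collapse everything else into a separator
    .replace(/^-+|-+$/g, '');       // trim leading/trailing separators
}

console.log(slugifyKeepingCjk('Hello你好')); // hello-你好
```

Browsers and servers percent-encode such characters in the URL path, so whether this is acceptable depends on where the slug ends up.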

@ghost

ghost commented Mar 27, 2019

Wikipedia URLs contain unicode characters in their paths, so I figured that was OK and I was looking for a lib to do the same for my non-English site.

@sindresorhus
Owner

Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好

PR welcome for an opt-in option for it.

@lizhengnacl

mark

@sindresorhus transferred this issue from sindresorhus/slugify on Feb 17, 2020
@sindresorhus added the "enhancement" (New feature or request) and "help wanted" (Extra attention is needed) labels on Feb 17, 2020
@sindresorhus changed the title from "Ignores Chinese" to "Support Chinese" on Feb 17, 2020
@brandonpittman

@sindresorhus Yeah, "Ignores Chinese" is a bad title.

The Japanese get no love either. 残念。。。 (Too bad...)

@sindresorhus
Owner

@brandonpittman I definitely intend to support languages like Chinese, Japanese, Thai, etc., but it's more work and will take some time. Help is always welcome though.

@sindresorhus changed the title from "Support Chinese" to "Support languages like Chinese, Japanese, Thai, etc." on Feb 18, 2020
@sindresorhus
Owner

If anyone wants to work on this, see the feedback given in sindresorhus/slugify#30.

@alfaproject

We are currently using https://www.npmjs.com/package/transliteration but I'd love to use this library instead. Even basic/minimal support for Chinese/Japanese characters would be good enough for what we need.

@xiao99xiao

A little tip about the idea of converting Chinese to Pinyin, like 你好 to Nihao:

Conversion to Pinyin can never be 100% accurate, but in most cases the results are totally fine to use as slugs.

However, if the generated slugs are expected to be unique, then Pinyin is not a good idea, because it's highly likely that completely different Chinese characters get converted to the same Pinyin. For example, several distinct characters would all be converted to Ni, resulting in the same slug.

@saginadir
Author

saginadir commented Oct 24, 2021

A little tip about the idea of converting Chinese to Pinyin, like 你好 to Nihao:

Conversion to Pinyin can never be 100% accurate, but in most cases the results are totally fine to use as slugs.

However, if the generated slugs are expected to be unique, then Pinyin is not a good idea, because it's highly likely that completely different Chinese characters get converted to the same Pinyin. For example, several distinct characters would all be converted to Ni, resulting in the same slug.

As the original author of this issue, this popped up in my email. I read and write basic Chinese.

I have 2 thoughts about this:

  1. Who said a slug has to be unique? Most of the time, slugs are an additional way to represent the text to an ASCII-only system, and it doesn't necessarily have to be reversible back to the UTF-8 format.

  2. I can float some ideas for making unique slugs if they are needed. For example, 你 could be changed into ni3, which is pinyin + tone, so 你好 becomes ni3hao3, still a perfectly valid slug. Another way to make it unique is to use stroke number, for example: 你好 would be ni7hao6. Still not unique enough? How about a mix of the two: ni37hao36. Even with this format I can't guarantee uniqueness, because the same input can come from 2 different sources, but it'll be better than a pure nihao slug.
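
As a rough sketch of the tone and stroke-count encodings described above: the two-entry lookup table below is hand-written sample data for illustration only, and `encodeHan` is a hypothetical name. A usable version would need a full character database covering readings and stroke counts, plus a strategy for characters with more than one reading.

```javascript
// Sample data for illustration only: pinyin syllable, tone number, stroke count.
const SAMPLE_TABLE = new Map([
  ['你', {pinyin: 'ni', tone: 3, strokes: 7}],
  ['好', {pinyin: 'hao', tone: 3, strokes: 6}],
]);

// Hypothetical encoder: appends the tone and/or stroke count after each syllable.
function encodeHan(text, {tone = false, strokes = false} = {}) {
  let out = '';
  for (const ch of text) {
    const entry = SAMPLE_TABLE.get(ch);
    if (!entry) continue; // this sketch simply drops characters it does not know
    out += entry.pinyin + (tone ? entry.tone : '') + (strokes ? entry.strokes : '');
  }
  return out;
}

console.log(encodeHan('你好'));                              // nihao
console.log(encodeHan('你好', {tone: true}));                // ni3hao3
console.log(encodeHan('你好', {strokes: true}));             // ni7hao6
console.log(encodeHan('你好', {tone: true, strokes: true})); // ni37hao36
```

Note that the combined form is ambiguous to decode once stroke counts reach two digits, which does not matter if the slug never needs to be reversed.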

@xiao99xiao

  1. Who said a slug has to be unique?

I didn't. I mean that in most cases they are totally fine to use as slugs, unless uniqueness is required, which depends entirely on the actual use case. The reason I mentioned this is that I noticed the current slugify process for supported languages produces unique slugs, though that might just be an unintended side effect.

Another way to make it unique is to use stroke number

That is a good idea for reducing the chance of collisions.

@saginadir
Author

  1. Who said a slug has to be unique?

I didn't. I mean that in most cases they are totally fine to use as slugs, unless uniqueness is required, which depends entirely on the actual use case. The reason I mentioned this is that I noticed the current slugify process for supported languages produces unique slugs, though that might just be an unintended side effect.

Another way to make it unique is to use stroke number

That is a good idea for reducing the chance of collisions.

What I can say is that I needed slugs for URLs. For example, someone writes a post titled "我的冬季" or something like that. Instead of having a URL with an ID like this: mywebsite.com/post/421321812131, you can make it nicer and better for SEO like this: mywebsite.com/post/wo-de-dong-ji. Uniqueness can be solved by appending the ID: mywebsite.com/post/wo-de-dong-ji-421321812131
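
A minimal sketch of the "slug plus ID" pattern, assuming the slug has already been produced by some transliteration step and that post IDs are purely numeric. The names `postPath` and `idFromPath` and the `/post/` prefix are hypothetical, chosen to match the example URLs above.

```javascript
// Builds a path where the slug is cosmetic and the trailing ID is the real key.
function postPath(slug, id) {
  const suffix = String(id);
  // Fall back to the bare ID if the slug came out empty (e.g. unsupported script).
  return slug ? `/post/${slug}-${suffix}` : `/post/${suffix}`;
}

// Recovers the ID from the end of the path; the slug part is ignored on lookup.
function idFromPath(path) {
  const match = /(\d+)$/.exec(path);
  return match ? match[1] : null;
}

const path = postPath('wo-de-dong-ji', '421321812131');
console.log(path);             // /post/wo-de-dong-ji-421321812131
console.log(idFromPath(path)); // 421321812131
```

Because only the ID is used for lookup, two posts whose titles transliterate to the same slug still get distinct URLs.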

I guess everyone will have a different use case.

I've already started looking into developing a unique solution with strokes and tones. But this will be just for fun and will be a heavy library which most likely won't be front-end friendly.

@nhoizey

nhoizey commented Jan 6, 2022

Can we add other languages like https://en.wikipedia.org/wiki/Tifinagh (for Berber languages) to this issue, or is it only related to Asian languages?

The solution to allow for some untouched unicode ranges (provided in pull request sindresorhus/slugify#30 that was closed) would be enough for my needs, but I understand it can be a bit difficult to use.

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/
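
For illustration, leaving a caller-supplied code point range untouched could look roughly like the sketch below. It is not the API of the closed pull request or of the library: `slugifyPreservingRanges` and its `ranges` parameter are hypothetical names, and everything outside ASCII letters, digits, and the given ranges is simply replaced by a separator.

```javascript
// Hypothetical helper: `ranges` is a list of [firstCodePoint, lastCodePoint] pairs
// whose characters should pass through the slugifier unchanged.
function slugifyPreservingRanges(input, ranges) {
  const keep = ranges
    .map(([from, to]) => `\\u{${from.toString(16)}}-\\u{${to.toString(16)}}`)
    .join('');
  // Matches runs of anything that is not an ASCII letter/digit or in a kept range.
  const disallowed = new RegExp(`[^a-z0-9${keep}]+`, 'gu');
  return input
    .toLowerCase()
    .replace(disallowed, '-')
    .replace(/^-+|-+$/g, ''); // trim leading/trailing separators
}

// Tifinagh block: U+2D30 to U+2D7F.
const TIFINAGH = [0x2D30, 0x2D7F];
const sample = '\u2D30\u2D63 Amazigh!'; // two Tifinagh letters followed by Latin text
console.log(slugifyPreservingRanges(sample, [TIFINAGH])); // the two letters, then "-amazigh"
console.log(slugifyPreservingRanges(sample, []));         // amazigh
```

Named script properties such as `\p{Script=Tifinagh}` would be easier to read than raw ranges, at the cost of requiring callers to know the Unicode script names.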

@RiddMa

RiddMa commented Jan 3, 2024

Hey it's the year of 2024 and I think a bit of extra tech can be used.

I made a GPT for slugify-ing any Chinese text for my blog: https://chat.openai.com/g/g-1jvs433lo-slugifyzhuan-jia

Example: [image not preserved]

I've posted the prompt as a gist here so everyone can reproduce and edit it.

Hope this helps in some way.

@saginadir
Author

Hey it's the year of 2024 and I think a bit of extra tech can be used.

I made a GPT for slugify-ing any Chinese text for my blog: https://chat.openai.com/g/g-1jvs433lo-slugifyzhuan-jia

Example: [image not preserved]

I've posted the prompt as a gist here so everyone can reproduce and edit it.

Hope this helps in some way.

It's an interesting idea indeed :-)

@deltoro05

Can we add other languages like https://en.wikipedia.org/wiki/Tifinagh (for Berber languages) to this issue, or is it only related to Asian languages?

The solution to allow for some untouched unicode ranges (provided in pull request sindresorhus/slugify#30 that was closed) would be enough for my needs, but I understand it can be a bit difficult to use.

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/

This URL has changed to https://symbl.cc/
