Words counting issue for mixed languages #21

liushuping · 2014-04-29T03:07:45Z

Content often mixes languages, for example a paragraph that contains both English and Chinese. A big difference between counting English words and CJK words is that CJK text does not separate words with spaces (in CJK, "word" and "character" are effectively the same concept); the characters are simply adjacent.
For example:
"The quick brown fox jumps over the lazy dog" is counted as 9 words.
The Chinese translation of that sentence is 敏捷的棕毛狐狸从懒狗身上跃过. It should be counted as 14 words, but the actual result is 1.

This causes issues like TryGhost/Ghost#2656 when writing a blog post in mixed languages.
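
To make the mismatch concrete, here is a minimal sketch. It assumes a counter that splits on whitespace only; `naiveWordCount` is a hypothetical name for illustration and is not necessarily how this library counts internally.

```javascript
// Hypothetical whitespace-only counter (an assumption, not the library's code).
function naiveWordCount(text) {
  const trimmed = text.trim();
  // An empty string would otherwise split into [""] and count as 1.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

console.log(naiveWordCount("The quick brown fox jumps over the lazy dog")); // 9
console.log(naiveWordCount("敏捷的棕毛狐狸从懒狗身上跃过")); // 1
```

The Chinese sentence contains no whitespace, so the whole run is treated as a single token.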

RadLikeWhoa · 2014-04-29T11:43:32Z

This is a tough one. I've been reading through TryGhost/Ghost#2656 and cgiffard/Downsize#15 and this stuff just completely blows my mind. I don't speak any non-Latin language, so I will need all the help I can get on this one.

I'll keep up with the two issues mentioned above and I will do my best to figure out a usable solution.

iamzifei · 2014-09-09T02:13:45Z

Hi, not sure whether you have come up with a solution yet, but I've done a small quick fix, at least for Chinese. You can have a look here:

https://github.com/iamzifei/gust/blob/master/assets/countable.js

RadLikeWhoa · 2014-09-15T06:13:09Z

Thanks for the input. It looks quite interesting, but could you please explain how your solution works exactly?

iamzifei · 2014-09-18T07:04:38Z

Hi, it's a little bit tricky and not quite complete, but hopefully it can give some ideas here.

I defined 3 regexes for different Unicode ranges, for Asian languages:
r1 (line 176) is for ideographic characters, such as the ideographic space, i.e. the space between "a　b".
r2 (line 177) is for common CJK Unicode characters.
r3 (line 178) is for Thai characters.

I replace each match of r1, r2, and r3 with one English word; the rest is counted as it is. So 敏捷的棕毛 becomes " {CJK} {CJK} {CJK} {CJK} {CJK} ", which counts as 5 words.
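
The replacement approach described above can be sketched roughly as follows. This is an illustration, not the code from the linked countable.js: the function name `countMixedWords`, the exact Unicode ranges, and the `{THAI}` placeholder are all assumptions, and the ideographic space is treated here as a plain separator rather than as a word.

```javascript
// Hypothetical sketch of the placeholder technique; ranges are assumptions.
const IDEOGRAPHIC_SPACE = /\u3000/g; // stands in for "r1"
const CJK_CHARS = /[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]/g; // stands in for "r2"
const THAI_CHARS = /[\u0E00-\u0E7F]/g; // stands in for "r3"

function countMixedWords(text) {
  const normalized = text
    .replace(IDEOGRAPHIC_SPACE, " ")   // treat the ideographic space as a separator
    .replace(CJK_CHARS, " {CJK} ")     // each CJK character becomes one "word"
    .replace(THAI_CHARS, " {THAI} ");  // each Thai character becomes one "word"
  // Split on whitespace and drop the empty token produced by empty input.
  return normalized.trim().split(/\s+/).filter(Boolean).length;
}

console.log(countMixedWords("敏捷的棕毛")); // 5
console.log(countMixedWords("敏捷的棕毛狐狸从懒狗身上跃过")); // 14
console.log(countMixedWords("Hello 世界 world")); // 4
```

One limitation of this sketch: CJK punctuation such as 。 falls outside the assumed ranges, so it would be counted as a word of its own when it sits between replaced characters.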

Some references:
http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode
http://en.wikipedia.org/wiki/CJK_Unified_Ideographs#CJK_Unified_Ideographs
http://www.unicode.org/charts/PDF/U0E00.pdf

manchumahara · 2016-07-12T07:06:22Z

See how they did this: https://github.com/AAlakkad/jQuery-smsHelper/blob/master/jquery-smshelper.js

Is there any jQuery support?

liushuping mentioned this issue Apr 29, 2014

Word Count Function Doesn't Work in Chinese TryGhost/Ghost#2656

Closed

RadLikeWhoa self-assigned this Oct 20, 2014
