.WordCount not accurate for Japanese pages #1266

RickCogley · 2015-07-11T03:27:03Z

Hello - when I look at .WordCount results against an English page and a Japanese page, the count is correct for English of course, but incorrect for Japanese, returning a very small number.

I assume it is counting words using spaces as the delimiter, but Japanese and other CJK languages are inherently space-less. This also causes trouble for search engines, as an aside.
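To illustrate the assumption above, here is a minimal Go sketch. It uses `strings.Fields` as a stand-in for a whitespace-delimited word count; that choice is an assumption for illustration, not a quote of Hugo's code, and `countWords` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// countWords is a hypothetical helper: it splits on runs of Unicode
// whitespace, a simplified stand-in for a space-delimited word count.
func countWords(s string) int {
	return len(strings.Fields(s))
}

func main() {
	english := "The quick brown fox jumps over the lazy dog"
	japanese := "吾輩は猫である。名前はまだ無い。"

	fmt.Println(countWords(english))  // 9
	fmt.Println(countWords(japanese)) // 1: no whitespace, so one "word"
}
```

Two full Japanese sentences come back as a single word, which matches the "very small number" described above.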

I am wondering if .WordCount could detect the language, and count characters for CJK languages, instead of returning an incorrect number.

Best regards,
Rick

bep · 2015-07-11T10:11:57Z

It can hardly detect the language, but it should be able to use the languageCode. What is the correct way to count words in Japanese?

RickCogley · 2015-07-11T14:16:10Z

Hi @bep, it's not straightforward since there are so many combinations, and no spaces. One would need a really large dictionary file of all possible combinations of kanji characters, and then run that against text to try to guess the number of words. You could not even guess what a "word" was, since sometimes what looks like a four-character combination is actually two two-character combinations.

I think it is just better to stick to counting the double-byte characters and giving a count of those.

Edit: noting also that it's possible to intersperse single-byte English and Japanese, as well, among Japanese characters.

RickCogley · 2015-07-11T14:22:06Z

An interesting aside: today I was working on SEO or social partials, and discovered that the .languageCode var is used for the RSS feed, which in turn uses en-US for English, but ja for Japanese (not ja-JP).

That makes it awkward to reuse as a "locale" value, because:

Facebook og - uses an underscore, like en_US, ja_JP
Schema.org - uses a hyphen, like en-US, ja-JP
(Twitter card uses no locale)

What I ended up doing was to settle on the hyphen version in locale in site and page params, then use the replace function like {{ replace . "-" "_" }} to change to the underscore version, for Facebook og, as needed.
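For readers outside a Hugo template, the same hyphen-to-underscore conversion can be sketched in plain Go. `ogLocale` is a hypothetical helper name, and this is only an illustration of the string replacement, not Hugo's `replace` implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// ogLocale (hypothetical name) converts a hyphenated locale such as
// "ja-JP" into the underscore form used for Facebook og ("ja_JP").
func ogLocale(locale string) string {
	return strings.ReplaceAll(locale, "-", "_")
}

func main() {
	for _, l := range []string{"en-US", "ja-JP"} {
		fmt.Println(l, "->", ogLocale(l))
	}
}
```

A bare language code such as `ja` passes through unchanged, so a site using it would still need to set the region explicitly.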

bep · 2015-07-11T16:06:58Z

@RickCogley one part of me just loves having a language guy like you on the team coming up with problems like these, the other part ...

bep · 2015-07-11T18:26:19Z

Reading what you say, I guess what we need here is to skip the discussion about what a word is -- and export a new method on page: RuneCount.

https://golang.org/pkg/unicode/utf8/#RuneCount
http://blog.golang.org/strings
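A short sketch of what the linked `unicode/utf8` package offers. It counts runes (Unicode code points) rather than bytes or words. `runeCount` is a hypothetical wrapper name used only for this example. Note that in UTF-8 the kana and kanji below occupy three bytes each, so the byte length and the rune count differ.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount (hypothetical name) returns the number of Unicode code
// points in s, regardless of how many bytes each one occupies.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "日本語"
	fmt.Println(len(s))       // 9 (bytes)
	fmt.Println(runeCount(s)) // 3 (runes)

	mixed := "Hugoは速い"
	fmt.Println(runeCount(mixed)) // 7: four ASCII letters plus three Japanese characters
}
```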

RickCogley · 2015-07-12T00:15:37Z

@bep, hehe, "sorry". :-)
RuneCount, yes!, that would work, because we can check for locale then show either WordCount or RuneCount as appropriate.

bep · 2015-07-12T08:49:49Z

@RickCogley just to check: In my head it wouldn't make sense to include whitespace in that count, right?

RickCogley · 2015-07-12T09:55:37Z

@bep, no, I don't think we need whitespace counted.

Japanese text can use a normal ASCII space, and there is also a double-byte (full-width) space. Sometimes we use them in names:

田中 太郎
田中　太郎

Those have a single-byte space and a double-byte space, respectively, between the last and first names.

But this usage is pretty rare.

bep · 2015-07-12T09:58:53Z

Whitespace also includes newlines and tabs etc, so I think it would give a skewed count for small texts with lots of paragraphs. I will keep it as implemented.
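A minimal sketch of a rune count that skips whitespace, as discussed above. It assumes `unicode.IsSpace` is an acceptable definition of whitespace; the thread does not show the code that was actually committed, and `runeCountNoSpace` is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode"
)

// runeCountNoSpace (hypothetical name) counts runes in s, skipping
// anything unicode.IsSpace reports as whitespace: ASCII spaces, tabs,
// newlines, and the full-width space U+3000.
func runeCountNoSpace(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsSpace(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(runeCountNoSpace("田中 太郎"))      // 4 (ASCII space skipped)
	fmt.Println(runeCountNoSpace("田中\u3000太郎")) // 4 (full-width space skipped)
	fmt.Println(runeCountNoSpace("田中\n\t太郎"))   // 4 (newline and tab skipped)
}
```

All three name variants give the same count, so paragraph breaks and indentation do not skew the result for short texts.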

github-actions · 2022-04-14T02:12:39Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

bep added the Bug label Jul 11, 2015

bep closed this as completed in 77c60a3 Jul 12, 2015

bep added a commit that referenced this issue Jul 12, 2015

Optimize RuneCount

3663828

Do not create it unless used. See #1266

tychoish pushed a commit to tychoish/hugo that referenced this issue Aug 13, 2017

Add RuneCount to Page

3707a79

Fixes gohugoio#1266

tychoish pushed a commit to tychoish/hugo that referenced this issue Aug 13, 2017

Optimize RuneCount

3ab9ac1

Do not create it unless used. See gohugoio#1266

jmooring mentioned this issue Jan 2, 2022

Hugo's word count is wrong in the HTML content written in Japanese even if hasCJKLanguage is true. #9335

Closed

github-actions bot added the Outdated label Apr 14, 2022

github-actions bot locked as resolved and limited conversation to collaborators Apr 14, 2022
