.WordCount not accurate for Japanese pages #1266
Comments
It can hardly detect the language ...? but should be able to use the
Hi @bep, it's not straightforward, since there are so many combinations and no spaces. One would need a really large dictionary file of all possible combinations of kanji characters, and then run that against the text to try to guess the number of words. Even then you could not reliably tell what a "word" is, since what looks like a four-character combination is sometimes actually two two-character combinations. I think it is better to stick to counting the double-byte characters and reporting that count. Edit: note also that single-byte English and Japanese characters can be interspersed among the Japanese characters.
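To make the "count the double-byte characters" idea concrete, here is a minimal sketch in Go. It assumes that "double-byte characters" can be approximated by runes in the Han, Hiragana, and Katakana scripts. The function name `countCJK` and the sample string are made up for illustration, and this is not Hugo's implementation.

```go
package main

import (
	"fmt"
	"unicode"
)

// countCJK returns the number of runes in s that belong to the Han,
// Hiragana, or Katakana scripts. Assumption: this approximates the
// "double-byte characters" mentioned above; ASCII letters mixed into
// the text are skipped.
func countCJK(s string) int {
	n := 0
	for _, r := range s {
		if unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana) {
			n++
		}
	}
	return n
}

func main() {
	// Mixed single-byte English and Japanese in one string.
	fmt.Println(countCJK("Hugoは速い")) // prints 3
}
```

Punctuation and symbols that Unicode assigns to the Common script are not counted by this sketch, so a real implementation would need to decide how to treat them.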
An interesting aside: today I was working on SEO or social partials, and discovered that the .languageCode var is used for the RSS feed, which in turn uses en-US for English but ja for Japanese (not ja-JP). That makes it awkward to reuse as a "locale", because:
What I ended up doing was to settle on the hyphen version in
@RickCogley one part of me just loves having a language guy like you on the team coming up with problems like these, the other part ...
Reading what you say, I guess what we need here is to skip the discussion about what a word is -- and export a new method on page: https://golang.org/pkg/unicode/utf8/#RuneCount
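For reference, a small sketch of what the linked `unicode/utf8` rune counting reports compared with the byte length of the same string. The helper name `runeCount` and the sample text are illustrative only, and the name of any method Hugo exposes for this is not assumed here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount is a thin illustrative wrapper around the standard library call.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "日本語"
	fmt.Println(len(s))       // prints 9 (bytes in UTF-8)
	fmt.Println(runeCount(s)) // prints 3 (runes)
}
```

Counting runes sidesteps the question of what a "word" is: every character counts once, whatever its byte width.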
@bep, hehe, "sorry". :-)
@RickCogley just to check: In my head it wouldn't make sense to include whitespace in that count, right?
@bep, no, I don't think we need whitespace counted. Japanese can use a normal ASCII space, and there is also a double-byte space. Sometimes we use them in names: 田中 太郎 Those have a single-byte space and a double-byte space between the last and first names. But this usage is pretty rare.
Whitespace also includes newlines and tabs etc, so I think it would give a skewed count for small texts with lots of paragraphs. I will keep it as implemented.
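A rune count that leaves out whitespace, as discussed above, could look roughly like this. It is a sketch, not the code that was actually committed; `runeCountNoSpace` is a hypothetical name. It relies on `unicode.IsSpace`, which treats the ASCII space, tabs, newlines, and the double-byte ideographic space (U+3000) as whitespace.

```go
package main

import (
	"fmt"
	"unicode"
)

// runeCountNoSpace counts the runes in s, skipping anything that
// unicode.IsSpace reports as whitespace (spaces, tabs, newlines,
// and the ideographic space U+3000).
func runeCountNoSpace(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsSpace(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(runeCountNoSpace("田中 太郎"))      // ASCII space: prints 4
	fmt.Println(runeCountNoSpace("田中\u3000太郎")) // double-byte space: prints 4
	fmt.Println(runeCountNoSpace("a\n\n\tb"))     // newlines and tab: prints 2
}
```

With whitespace excluded, a short text split into many paragraphs is not inflated by its newlines.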
Hello - when I look at .WordCount results against an English page and a Japanese page, the count is correct for English of course, but incorrect for Japanese, returning a very small number.
I assume it is counting words using spaces as the delimiter, but Japanese and other CJK languages are inherently space-less. This also causes trouble for search engines, as an aside.
I am wondering if .WordCount could detect the language, and count characters for CJK languages, instead of returning an incorrect number.
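The effect described above can be reproduced with a small Go sketch. It assumes, as guessed in this report, that words are counted by splitting on whitespace; that is an assumption about the behaviour, not a confirmed description of Hugo's code. The helper `spaceDelimitedWords` and the sample sentences are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceDelimitedWords counts words by splitting on whitespace.
// Assumption: this mirrors the counting behaviour guessed at above.
func spaceDelimitedWords(s string) int {
	return len(strings.Fields(s))
}

func main() {
	fmt.Println(spaceDelimitedWords("This is an English sentence.")) // prints 5
	fmt.Println(spaceDelimitedWords("これは日本語の文章です。"))                 // prints 1
}
```

A whole Japanese sentence without spaces comes back as a single "word", which would explain the very small numbers.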
Best regards,
Rick