word2vec accuracy tests: apparently no real difference from the C version #699

tiandiweizun · 2017-11-29T06:38:13Z

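Notes

Please confirm the following:

I have carefully read the following documents and did not find an answer:
I have searched for my question on Google and with the issue search and did not find an answer.
I understand that the open-source community is a free community brought together by shared interest and bears no responsibility or obligation. I will be polite and thank everyone who helps me.
I enter an x in these brackets to tick the box, confirming the items above.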

Version

The current latest version is: 1.5.2
The version I am using is: 1.5.2

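My question

1. A parameter configuration problem in HanLP's word2vec
2. For the C version, my accuracy is 10% lower than your test result
3. I tested several word2vec implementations and found that their accuracy does not differ much

Regarding 1: in the source code, the cbow and hs parameters are set to true as soon as they appear on the command line, regardless of whether the value is 0 or 1. So when I tested hs=0, HanLP was actually using hs while the C version was not. As in "word2vec原理推导与代码分析", the parameters look the same but the actual training process differs; I don't know whether this is what causes the fairly large accuracy gap. I measured HanLP's accuracy both with hs=1 and with the hs parameter left out entirely.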
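The mismatch described in point 1 comes down to two ways of reading a switch such as -hs 0. The sketch below contrasts them in Python. Both function names are hypothetical and neither is HanLP's actual parser (HanLP itself is written in Java); the second function assumes the convention of the original C tool, where the integer after the flag decides whether the option is on. The False defaults are placeholders, not the real defaults of any implementation.

```python
def parse_flags_presence(argv):
    """Presence-based parsing: a flag is on whenever it appears at all."""
    return {"cbow": "-cbow" in argv, "hs": "-hs" in argv}


def parse_flags_value(argv):
    """Value-based parsing: a flag is on only if the integer after it is non-zero."""
    opts = {"cbow": False, "hs": False}  # placeholder defaults (assumption)
    for name in opts:
        flag = "-" + name
        if flag in argv:
            idx = argv.index(flag)
            if idx + 1 < len(argv):
                opts[name] = int(argv[idx + 1]) != 0
    return opts


argv = "-size 200 -window 8 -negative 25 -hs 0 -cbow 1".split()
print(parse_flags_presence(argv))  # hs ends up enabled despite "-hs 0"
print(parse_flags_value(argv))     # hs stays disabled, as "-hs 0" intends
```

With presence-based parsing, the only way to train without hierarchical softmax is to leave -hs off the command line, which is why the two HanLP runs in this post differ only in whether -hs appears.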
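Regarding 2: for the C version, I trained with the C code and computed accuracy with gensim. I have read the source and also run the C accuracy tool; the two give the same results, so there is no problem there, but gensim is faster and its log is clearer, so I used gensim.
Here are the test results. They are 10% lower than the C result in "Accuracy rate seems to be 10% lower than the original version", and I don't know why: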

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 17:29:30,471 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,375 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,436 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 17:29:46,578 : INFO : capital-common-countries: 77.5% (392/506)
2017-11-28 17:30:15,301 : INFO : capital-world: 45.6% (1626/3564)
2017-11-28 17:30:20,082 : INFO : currency: 19.5% (116/596)
2017-11-28 17:30:38,799 : INFO : city-in-state: 41.2% (959/2330)
2017-11-28 17:30:42,157 : INFO : family: 61.7% (259/420)
2017-11-28 17:30:50,121 : INFO : gram1-adjective-to-adverb: 13.8% (137/992)
2017-11-28 17:30:56,214 : INFO : gram2-opposite: 13.1% (99/756)
2017-11-28 17:31:07,010 : INFO : gram3-comparative: 60.6% (807/1332)
2017-11-28 17:31:14,960 : INFO : gram4-superlative: 25.0% (248/992)
2017-11-28 17:31:23,447 : INFO : gram5-present-participle: 38.6% (408/1056)
2017-11-28 17:31:35,607 : INFO : gram6-nationality-adjective: 77.6% (1181/1521)
2017-11-28 17:31:48,147 : INFO : gram7-past-tense: 34.8% (543/1560)
2017-11-28 17:31:58,815 : INFO : gram8-plural: 49.5% (659/1332)
2017-11-28 17:32:05,812 : INFO : gram9-plural-verbs: 30.8% (268/870)
2017-11-28 17:32:05,812 : INFO : total: 43.2% (7702/17827)
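A gensim accuracy log like the one above can be cross-checked by re-adding the per-section counts and comparing them with the reported total line. The sketch below does this for the 14 section lines of that log, and also splits them into semantic and syntactic groups. The grouping rule (gram* sections are syntactic, the rest semantic) is the usual convention for this question set and is an assumption here; the regular expression and the function name tally are illustrative, not part of gensim.

```python
import re

LOG = """\
2017-11-28 17:29:46,578 : INFO : capital-common-countries: 77.5% (392/506)
2017-11-28 17:30:15,301 : INFO : capital-world: 45.6% (1626/3564)
2017-11-28 17:30:20,082 : INFO : currency: 19.5% (116/596)
2017-11-28 17:30:38,799 : INFO : city-in-state: 41.2% (959/2330)
2017-11-28 17:30:42,157 : INFO : family: 61.7% (259/420)
2017-11-28 17:30:50,121 : INFO : gram1-adjective-to-adverb: 13.8% (137/992)
2017-11-28 17:30:56,214 : INFO : gram2-opposite: 13.1% (99/756)
2017-11-28 17:31:07,010 : INFO : gram3-comparative: 60.6% (807/1332)
2017-11-28 17:31:14,960 : INFO : gram4-superlative: 25.0% (248/992)
2017-11-28 17:31:23,447 : INFO : gram5-present-participle: 38.6% (408/1056)
2017-11-28 17:31:35,607 : INFO : gram6-nationality-adjective: 77.6% (1181/1521)
2017-11-28 17:31:48,147 : INFO : gram7-past-tense: 34.8% (543/1560)
2017-11-28 17:31:58,815 : INFO : gram8-plural: 49.5% (659/1332)
2017-11-28 17:32:05,812 : INFO : gram9-plural-verbs: 30.8% (268/870)
2017-11-28 17:32:05,812 : INFO : total: 43.2% (7702/17827)
"""

# Matches "<section>: <pct>% (<correct>/<total>)" after the INFO marker.
LINE = re.compile(r"INFO : ([\w-]+): [\d.]+% \((\d+)/(\d+)\)")


def tally(log_text):
    """Sum (correct, total) per group and return the reported total line."""
    semantic, syntactic, reported = [0, 0], [0, 0], None
    for name, correct, total in LINE.findall(log_text):
        correct, total = int(correct), int(total)
        if name == "total":
            reported = (correct, total)
            continue
        group = syntactic if name.startswith("gram") else semantic
        group[0] += correct
        group[1] += total
    return tuple(semantic), tuple(syntactic), reported


semantic, syntactic, reported = tally(LOG)
for label, (c, t) in (("semantic", semantic), ("syntactic", syntactic), ("total", reported)):
    print(f"{label}: {100.0 * c / t:.1f}% ({c}/{t})")
```

The same function can be pointed at the other logs in this post to compare the semantic and syntactic breakdown across implementations.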

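Regarding 3: I tested Google's C version, gensim, HanLP and deeplearning4j. All of them were tested with hs=0 except deeplearning4j. The numbers show that accuracy is somewhat worse with hs; my guess is that the dataset is small and too sparse. With hs=0 every implementation reaches roughly 43%; with hs=1, roughly 35%.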

:gensim
model = word2vec.Word2Vec(sentences, size=200, window=8, negative=25, hs=1, sample=0.0001, workers=8, iter=15)
2017-11-29 11:49:46,647 : INFO : loading projection weights from E:/data/word2vec/text8.gensim.word2vec.txt
2017-11-29 11:50:00,520 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.gensim.word2vec.txt
2017-11-29 11:50:00,599 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:50:04,786 : INFO : capital-common-countries: 76.5% (387/506)
2017-11-29 11:50:33,871 : INFO : capital-world: 37.9% (1349/3564)
2017-11-29 11:50:38,687 : INFO : currency: 7.0% (42/596)
2017-11-29 11:50:57,526 : INFO : city-in-state: 40.4% (942/2330)
2017-11-29 11:51:01,313 : INFO : family: 47.4% (199/420)
2017-11-29 11:51:09,776 : INFO : gram1-adjective-to-adverb: 10.8% (107/992)
2017-11-29 11:51:16,038 : INFO : gram2-opposite: 9.0% (68/756)
2017-11-29 11:51:26,976 : INFO : gram3-comparative: 51.4% (685/1332)
2017-11-29 11:51:34,859 : INFO : gram4-superlative: 19.8% (196/992)
2017-11-29 11:51:43,236 : INFO : gram5-present-participle: 25.5% (269/1056)
2017-11-29 11:51:55,519 : INFO : gram6-nationality-adjective: 73.0% (1111/1521)
2017-11-29 11:52:07,953 : INFO : gram7-past-tense: 35.5% (554/1560)
2017-11-29 11:52:18,648 : INFO : gram8-plural: 49.2% (655/1332)
2017-11-29 11:52:25,628 : INFO : gram9-plural-verbs: 21.8% (190/870)
2017-11-29 11:52:25,628 : INFO : total: 37.9% (6754/17827)

model = word2vec.Word2Vec(sentences, size=200, window=8, negative=25, hs=0, sample=0.0001, workers=8, iter=15)
2017-11-29 11:53:14,415 : INFO : loading projection weights from E:/data/word2vec/text8.gensim.word2vec.txt_1
2017-11-29 11:53:27,427 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.gensim.word2vec.txt_1
2017-11-29 11:53:27,505 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:53:31,894 : INFO : capital-common-countries: 72.9% (369/506)
2017-11-29 11:54:01,937 : INFO : capital-world: 51.1% (1822/3564)
2017-11-29 11:54:06,974 : INFO : currency: 18.0% (107/596)
2017-11-29 11:54:26,329 : INFO : city-in-state: 41.5% (966/2330)
2017-11-29 11:54:29,640 : INFO : family: 59.3% (249/420)
2017-11-29 11:54:37,565 : INFO : gram1-adjective-to-adverb: 14.3% (142/992)
2017-11-29 11:54:43,559 : INFO : gram2-opposite: 13.6% (103/756)
2017-11-29 11:54:54,144 : INFO : gram3-comparative: 64.3% (857/1332)
2017-11-29 11:55:02,068 : INFO : gram4-superlative: 23.1% (229/992)
2017-11-29 11:55:10,453 : INFO : gram5-present-participle: 36.0% (380/1056)
2017-11-29 11:55:22,509 : INFO : gram6-nationality-adjective: 73.7% (1121/1521)
2017-11-29 11:55:34,861 : INFO : gram7-past-tense: 34.3% (535/1560)
2017-11-29 11:55:45,290 : INFO : gram8-plural: 49.8% (664/1332)
2017-11-29 11:55:52,154 : INFO : gram9-plural-verbs: 31.5% (274/870)
2017-11-29 11:55:52,155 : INFO : total: 43.9% (7818/17827)

:hanlp
-input E:\data\word2vec\text8 -output E:\data\word2vec\text8.hanlp.word2vec.txt -size 200 -window 8 -negative 25 -hs 0 -cbow 1 -sample 1e-4 -threads 8 -binary 1 -iter 15
2017-11-28 16:53:03,293 : INFO : loading projection weights from E:/data/word2vec/text8.hanlp.word2vec.txt
2017-11-28 16:53:15,493 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.hanlp.word2vec.txt
2017-11-28 16:53:15,553 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:53:19,831 : INFO : capital-common-countries: 69.8% (353/506)
2017-11-28 16:53:49,194 : INFO : capital-world: 30.3% (1079/3564)
2017-11-28 16:53:54,053 : INFO : currency: 4.9% (29/596)
2017-11-28 16:54:12,895 : INFO : city-in-state: 35.7% (831/2330)
2017-11-28 16:54:16,322 : INFO : family: 31.9% (134/420)
2017-11-28 16:54:24,401 : INFO : gram1-adjective-to-adverb: 7.7% (76/992)
2017-11-28 16:54:30,487 : INFO : gram2-opposite: 9.9% (75/756)
2017-11-28 16:54:41,328 : INFO : gram3-comparative: 38.3% (510/1332)
2017-11-28 16:54:49,278 : INFO : gram4-superlative: 13.5% (134/992)
2017-11-28 16:54:58,219 : INFO : gram5-present-participle: 21.6% (228/1056)
2017-11-28 16:55:10,444 : INFO : gram6-nationality-adjective: 72.4% (1101/1521)
2017-11-28 16:55:22,950 : INFO : gram7-past-tense: 28.5% (445/1560)
2017-11-28 16:55:33,730 : INFO : gram8-plural: 45.9% (612/1332)
2017-11-28 16:55:40,694 : INFO : gram9-plural-verbs: 17.1% (149/870)
2017-11-28 16:55:40,696 : INFO : total: 32.3% (5756/17827)

-input E:\data\word2vec\text8 -output E:\data\word2vec\text8.hanlp.word2vec.txt_1 -size 200 -window 8 -negative 25 -cbow 1 -sample 1e-4 -threads 8 -binary 1 -iter 15
2017-11-29 11:15:27,628 : INFO : loading projection weights from E:/data/word2vec/text8.hanlp.word2vec.txt_1
2017-11-29 11:15:42,361 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.hanlp.word2vec.txt_1
2017-11-29 11:15:42,461 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:15:47,365 : INFO : capital-common-countries: 80.0% (405/506)
2017-11-29 11:16:20,013 : INFO : capital-world: 46.2% (1647/3564)
2017-11-29 11:16:25,338 : INFO : currency: 14.4% (86/596)
2017-11-29 11:16:46,128 : INFO : city-in-state: 46.4% (1081/2330)
2017-11-29 11:16:49,861 : INFO : family: 53.1% (223/420)
2017-11-29 11:16:58,723 : INFO : gram1-adjective-to-adverb: 15.7% (156/992)
2017-11-29 11:17:05,424 : INFO : gram2-opposite: 9.9% (75/756)
2017-11-29 11:17:17,216 : INFO : gram3-comparative: 51.1% (680/1332)
2017-11-29 11:17:26,082 : INFO : gram4-superlative: 20.0% (198/992)
2017-11-29 11:17:35,536 : INFO : gram5-present-participle: 29.9% (316/1056)
2017-11-29 11:17:49,177 : INFO : gram6-nationality-adjective: 82.4% (1254/1521)
2017-11-29 11:18:03,059 : INFO : gram7-past-tense: 32.5% (507/1560)
2017-11-29 11:18:15,029 : INFO : gram8-plural: 53.7% (715/1332)
2017-11-29 11:18:22,894 : INFO : gram9-plural-verbs: 26.7% (232/870)
2017-11-29 11:18:22,894 : INFO : total: 42.5% (7575/17827)

:deeplearning4j
Word2Vec vec = new Word2Vec.Builder().layerSize(200).windowSize(8).negativeSample(25).minWordFrequency(5).useHierarchicSoftmax(true).sampling(0.0001).workers(8).iterations(15).epochs(15).iterate(iter)
.elementsLearningAlgorithm("org.deeplearning4j.models.embeddings.learning.impl.elements.CBOW")
.tokenizerFactory(t)
.build();
2017-11-28 16:46:26,894 : INFO : loading projection weights from E:/data/word2vec/text8.deeplearning4j.word2vec.txt
2017-11-28 16:46:39,391 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.deeplearning4j.word2vec.txt
2017-11-28 16:46:39,453 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:46:43,596 : INFO : capital-common-countries: 67.4% (341/506)
2017-11-28 16:47:12,592 : INFO : capital-world: 33.9% (1208/3564)
2017-11-28 16:47:17,515 : INFO : currency: 6.0% (36/596)
2017-11-28 16:47:36,332 : INFO : city-in-state: 36.6% (852/2330)
2017-11-28 16:47:39,834 : INFO : family: 38.3% (161/420)
2017-11-28 16:47:47,898 : INFO : gram1-adjective-to-adverb: 9.0% (89/992)
2017-11-28 16:47:53,953 : INFO : gram2-opposite: 7.0% (53/756)
2017-11-28 16:48:04,632 : INFO : gram3-comparative: 38.7% (515/1332)
2017-11-28 16:48:12,653 : INFO : gram4-superlative: 11.8% (117/992)
2017-11-28 16:48:21,220 : INFO : gram5-present-participle: 23.0% (243/1056)
2017-11-28 16:48:33,519 : INFO : gram6-nationality-adjective: 76.7% (1166/1521)
2017-11-28 16:48:46,165 : INFO : gram7-past-tense: 27.2% (424/1560)
2017-11-28 16:48:56,894 : INFO : gram8-plural: 48.2% (642/1332)
2017-11-28 16:49:03,973 : INFO : gram9-plural-verbs: 19.2% (167/870)
2017-11-28 16:49:03,974 : INFO : total: 33.7% (6014/17827)

:google_c
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 1 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 16:49:29,132 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt
2017-11-28 16:49:41,848 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt
2017-11-28 16:49:41,914 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:49:46,154 : INFO : capital-common-countries: 75.7% (383/506)
2017-11-28 16:50:15,078 : INFO : capital-world: 33.2% (1184/3564)
2017-11-28 16:50:19,993 : INFO : currency: 6.0% (36/596)
2017-11-28 16:50:38,967 : INFO : city-in-state: 36.0% (838/2330)
2017-11-28 16:50:42,348 : INFO : family: 47.4% (199/420)
2017-11-28 16:50:50,315 : INFO : gram1-adjective-to-adverb: 10.6% (105/992)
2017-11-28 16:50:56,355 : INFO : gram2-opposite: 7.8% (59/756)
2017-11-28 16:51:07,065 : INFO : gram3-comparative: 48.3% (644/1332)
2017-11-28 16:51:14,905 : INFO : gram4-superlative: 18.0% (179/992)
2017-11-28 16:51:23,299 : INFO : gram5-present-participle: 29.0% (306/1056)
2017-11-28 16:51:35,345 : INFO : gram6-nationality-adjective: 70.1% (1066/1521)
2017-11-28 16:51:47,733 : INFO : gram7-past-tense: 31.9% (498/1560)
2017-11-28 16:51:58,316 : INFO : gram8-plural: 50.1% (667/1332)
2017-11-28 16:52:05,321 : INFO : gram9-plural-verbs: 20.0% (174/870)
2017-11-28 16:52:05,322 : INFO : total: 35.6% (6338/17827)

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 17:29:30,471 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,375 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,436 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 17:29:46,578 : INFO : capital-common-countries: 77.5% (392/506)
2017-11-28 17:30:15,301 : INFO : capital-world: 45.6% (1626/3564)
2017-11-28 17:30:20,082 : INFO : currency: 19.5% (116/596)
2017-11-28 17:30:38,799 : INFO : city-in-state: 41.2% (959/2330)
2017-11-28 17:30:42,157 : INFO : family: 61.7% (259/420)
2017-11-28 17:30:50,121 : INFO : gram1-adjective-to-adverb: 13.8% (137/992)
2017-11-28 17:30:56,214 : INFO : gram2-opposite: 13.1% (99/756)
2017-11-28 17:31:07,010 : INFO : gram3-comparative: 60.6% (807/1332)
2017-11-28 17:31:14,960 : INFO : gram4-superlative: 25.0% (248/992)
2017-11-28 17:31:23,447 : INFO : gram5-present-participle: 38.6% (408/1056)
2017-11-28 17:31:35,607 : INFO : gram6-nationality-adjective: 77.6% (1181/1521)
2017-11-28 17:31:48,147 : INFO : gram7-past-tense: 34.8% (543/1560)
2017-11-28 17:31:58,815 : INFO : gram8-plural: 49.5% (659/1332)
2017-11-28 17:32:05,812 : INFO : gram9-plural-verbs: 30.8% (268/870)
2017-11-28 17:32:05,812 : INFO : total: 43.2% (7702/17827)

hankcs · 2017-12-02T04:08:30Z

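Thanks for the feedback; these tests are very valuable.

The incompatible argument parsing has been fixed; please see the commit referenced in this issue. Further feedback is welcome.
Regarding "for the C version, accuracy is 10% lower than your test result": this is because the original compute-accuracy takes a threshold parameter, and the results I reported used threshold=30000. See: Accuracy rate seems to be 10% lower than the original version kojisekig/word2vec-lucene#21. I tried it without the threshold, and accuracy is indeed only 41%: Total accuracy: 41.00 % Semantic accuracy: 38.86 % Syntactic accuracy: 42.52 %
Your tests without a threshold found HanLP's accuracy to be about the same as the original, which agrees with my verification above. That is good news. After all, in normal use probably nobody would want to limit the vocabulary to 30,000 words.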
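To illustrate why a vocabulary threshold can move the reported accuracy so much, here is a toy Python sketch of threshold-restricted analogy evaluation. It is a simplification under stated assumptions, not the code of compute-accuracy or gensim: the vectors are made up, they are not length-normalized before the offset is taken, and the function name analogy_accuracy is hypothetical. It keeps the two effects that matter: questions containing a word outside the top-N vocabulary are skipped rather than counted as wrong, and the answer is searched only among the top-N words.

```python
import math

# Toy vocabulary in descending frequency order, with made-up 2-d vectors.
VOCAB = [
    ("person", (1.0, 1.0)),
    ("man", (1.0, 0.0)),
    ("woman", (0.0, 1.0)),
    ("king", (2.0, 1.0)),
    ("queen", (1.0, 3.0)),
    ("emperor", (3.0, 2.0)),
    ("empress", (1.0, 2.0)),
]

# Each question reads "a is to b as c is to expected".
QUESTIONS = [
    ("man", "king", "woman", "queen"),
    ("man", "woman", "emperor", "empress"),
]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy_accuracy(vocab, questions, threshold=None):
    """Return (correct, answered) using only the `threshold` most frequent words."""
    vectors = dict(vocab[:threshold])  # threshold=None keeps the whole vocabulary
    correct = answered = 0
    for a, b, c, expected in questions:
        if any(w not in vectors for w in (a, b, c, expected)):
            continue  # out-of-vocabulary questions are skipped, not failed
        target = tuple(
            vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
        )
        # The three question words are never allowed as the answer.
        candidates = [w for w in vectors if w not in (a, b, c)]
        predicted = max(candidates, key=lambda w: cosine(target, vectors[w]))
        answered += 1
        correct += predicted == expected
    return correct, answered


print(analogy_accuracy(VOCAB, QUESTIONS))               # (1, 2): rare "empress" steals the first answer
print(analogy_accuracy(VOCAB, QUESTIONS, threshold=5))  # (1, 1): second question skipped, first now correct
```

In this toy setup the score goes from 1 of 2 to 1 of 1 once the rare words are cut off, which is the same direction as the gap between the thresholded and unthresholded numbers discussed in this thread.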

tiandiweizun · 2017-12-04T06:46:16Z

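I ran the tests again with gensim's default of 30000. Because the C version of word2vec puts "</s>" first in the vocabulary by default, I also tested with 30001 and found no difference at all. The results are below.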

	hs=1	hs=0
hanlp	40.80%	51.90%
google_c	44.50%	52.70%

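My 52.7% (google_c, hs=0) is close to your 53.32 % (google_c, hs=0), and my 40.80% (hanlp, hs=1) is close to your 41.03 % (kojisekig/word2vec-lucene, hs=0), so I guessed the gap came from a difference in parameter configuration. I went straight to his source code, and it confirmed my guess.

Conclusion: the gap is caused by the Java configuration module not being fully consistent with the C version.

Afterthought: the conclusion is this simple, yet I read through several codebases and ran countless tests. I had actually noticed the problem early on, but out of admiration for my queen I did not think it through.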

hankcs added a commit that referenced this issue Dec 2, 2017

Make word2vec command-line argument parsing compatible with the original: #699

7d63ab4

hankcs added the question label Dec 2, 2017

tiandiweizun closed this as completed Dec 4, 2017

TylunasLi pushed a commit to TylunasLi/HanLP that referenced this issue Dec 30, 2017

Make word2vec command-line argument parsing compatible with the original: hankcs#699

5ed83a5
