word2vector accuracy test: results seem to be no different from C #699

Closed
1 task done
tiandiweizun opened this issue Nov 29, 2017 · 2 comments

tiandiweizun commented Nov 29, 2017

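Notes

Please confirm the following:

  • I have carefully read the following documents and found no answer:
  • I have searched for my question using Google and the issue search function, and found no answer.
  • I understand that the open-source community is a free community brought together by shared interest and bears no responsibility or obligation. I will be polite and thank everyone who helps me.
  • I type x in these brackets to tick the box, confirming the items above.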

Version

The current latest version is: 1.5.2
The version I am using is: 1.5.2

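My question

  1. A parameter configuration problem in HanLP's word2vector
  2. For the C version, my accuracy is 10% lower than your test results
  3. I tested several word2vector implementations and found that accuracy does not differ much
  • Regarding 1: in the source code, the cbow and hs options are set to true as soon as they appear on the command line, regardless of whether the value is 0 or 1. So when I tested hs=0, HanLP actually used hs while the C version did not. In 《word2vec原理推导与代码分析》 the parameters are the same but the actual training process differs; I don't know whether this is what causes the large accuracy gap. I tested HanLP's accuracy both with hs=1 and with the hs parameter omitted.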

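  • Regarding 2: for the C version, I trained with the C code and computed accuracy with gensim. I have read the source and also run the C version's accuracy tool; the two give the same results, so there is no problem there. gensim is faster and its log is clearer, so I used gensim.
    Here are the results. They are 10% lower than the C figures in "Accuracy rate seems to be 10% lower than the original version", and I don't know why.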

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 17:29:30,471 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,375 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,436 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 17:29:46,578 : INFO : capital-common-countries: 77.5% (392/506)
2017-11-28 17:30:15,301 : INFO : capital-world: 45.6% (1626/3564)
2017-11-28 17:30:20,082 : INFO : currency: 19.5% (116/596)
2017-11-28 17:30:38,799 : INFO : city-in-state: 41.2% (959/2330)
2017-11-28 17:30:42,157 : INFO : family: 61.7% (259/420)
2017-11-28 17:30:50,121 : INFO : gram1-adjective-to-adverb: 13.8% (137/992)
2017-11-28 17:30:56,214 : INFO : gram2-opposite: 13.1% (99/756)
2017-11-28 17:31:07,010 : INFO : gram3-comparative: 60.6% (807/1332)
2017-11-28 17:31:14,960 : INFO : gram4-superlative: 25.0% (248/992)
2017-11-28 17:31:23,447 : INFO : gram5-present-participle: 38.6% (408/1056)
2017-11-28 17:31:35,607 : INFO : gram6-nationality-adjective: 77.6% (1181/1521)
2017-11-28 17:31:48,147 : INFO : gram7-past-tense: 34.8% (543/1560)
2017-11-28 17:31:58,815 : INFO : gram8-plural: 49.5% (659/1332)
2017-11-28 17:32:05,812 : INFO : gram9-plural-verbs: 30.8% (268/870)
2017-11-28 17:32:05,812 : INFO : total: 43.2% (7702/17827)

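  • Regarding 3: I tested Google's C version, gensim, HanLP, and deeplearning4j. All were tested with both hs settings except deeplearning4j, for which hs=0 was not tested. The results show that accuracy is somewhat worse with hs; my guess is that the dataset is small and too sparse. With hs=0 every implementation scores about 43%; with hs=1, about 35%.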

:gensim
model = word2vec.Word2Vec(sentences, size=200, window=8, negative=25, hs=1, sample=0.0001, workers=8, iter=15)
2017-11-29 11:49:46,647 : INFO : loading projection weights from E:/data/word2vec/text8.gensim.word2vec.txt
2017-11-29 11:50:00,520 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.gensim.word2vec.txt
2017-11-29 11:50:00,599 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:50:04,786 : INFO : capital-common-countries: 76.5% (387/506)
2017-11-29 11:50:33,871 : INFO : capital-world: 37.9% (1349/3564)
2017-11-29 11:50:38,687 : INFO : currency: 7.0% (42/596)
2017-11-29 11:50:57,526 : INFO : city-in-state: 40.4% (942/2330)
2017-11-29 11:51:01,313 : INFO : family: 47.4% (199/420)
2017-11-29 11:51:09,776 : INFO : gram1-adjective-to-adverb: 10.8% (107/992)
2017-11-29 11:51:16,038 : INFO : gram2-opposite: 9.0% (68/756)
2017-11-29 11:51:26,976 : INFO : gram3-comparative: 51.4% (685/1332)
2017-11-29 11:51:34,859 : INFO : gram4-superlative: 19.8% (196/992)
2017-11-29 11:51:43,236 : INFO : gram5-present-participle: 25.5% (269/1056)
2017-11-29 11:51:55,519 : INFO : gram6-nationality-adjective: 73.0% (1111/1521)
2017-11-29 11:52:07,953 : INFO : gram7-past-tense: 35.5% (554/1560)
2017-11-29 11:52:18,648 : INFO : gram8-plural: 49.2% (655/1332)
2017-11-29 11:52:25,628 : INFO : gram9-plural-verbs: 21.8% (190/870)
2017-11-29 11:52:25,628 : INFO : total: 37.9% (6754/17827)

model = word2vec.Word2Vec(sentences, size=200, window=8, negative=25, hs=0, sample=0.0001, workers=8, iter=15)
2017-11-29 11:53:14,415 : INFO : loading projection weights from E:/data/word2vec/text8.gensim.word2vec.txt_1
2017-11-29 11:53:27,427 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.gensim.word2vec.txt_1
2017-11-29 11:53:27,505 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:53:31,894 : INFO : capital-common-countries: 72.9% (369/506)
2017-11-29 11:54:01,937 : INFO : capital-world: 51.1% (1822/3564)
2017-11-29 11:54:06,974 : INFO : currency: 18.0% (107/596)
2017-11-29 11:54:26,329 : INFO : city-in-state: 41.5% (966/2330)
2017-11-29 11:54:29,640 : INFO : family: 59.3% (249/420)
2017-11-29 11:54:37,565 : INFO : gram1-adjective-to-adverb: 14.3% (142/992)
2017-11-29 11:54:43,559 : INFO : gram2-opposite: 13.6% (103/756)
2017-11-29 11:54:54,144 : INFO : gram3-comparative: 64.3% (857/1332)
2017-11-29 11:55:02,068 : INFO : gram4-superlative: 23.1% (229/992)
2017-11-29 11:55:10,453 : INFO : gram5-present-participle: 36.0% (380/1056)
2017-11-29 11:55:22,509 : INFO : gram6-nationality-adjective: 73.7% (1121/1521)
2017-11-29 11:55:34,861 : INFO : gram7-past-tense: 34.3% (535/1560)
2017-11-29 11:55:45,290 : INFO : gram8-plural: 49.8% (664/1332)
2017-11-29 11:55:52,154 : INFO : gram9-plural-verbs: 31.5% (274/870)
2017-11-29 11:55:52,155 : INFO : total: 43.9% (7818/17827)

:hanlp
-input E:\data\word2vec\text8 -output E:\data\word2vec\text8.hanlp.word2vec.txt -size 200 -window 8 -negative 25 -hs 0 -cbow 1 -sample 1e-4 -threads 8 -binary 1 -iter 15
2017-11-28 16:53:03,293 : INFO : loading projection weights from E:/data/word2vec/text8.hanlp.word2vec.txt
2017-11-28 16:53:15,493 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.hanlp.word2vec.txt
2017-11-28 16:53:15,553 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:53:19,831 : INFO : capital-common-countries: 69.8% (353/506)
2017-11-28 16:53:49,194 : INFO : capital-world: 30.3% (1079/3564)
2017-11-28 16:53:54,053 : INFO : currency: 4.9% (29/596)
2017-11-28 16:54:12,895 : INFO : city-in-state: 35.7% (831/2330)
2017-11-28 16:54:16,322 : INFO : family: 31.9% (134/420)
2017-11-28 16:54:24,401 : INFO : gram1-adjective-to-adverb: 7.7% (76/992)
2017-11-28 16:54:30,487 : INFO : gram2-opposite: 9.9% (75/756)
2017-11-28 16:54:41,328 : INFO : gram3-comparative: 38.3% (510/1332)
2017-11-28 16:54:49,278 : INFO : gram4-superlative: 13.5% (134/992)
2017-11-28 16:54:58,219 : INFO : gram5-present-participle: 21.6% (228/1056)
2017-11-28 16:55:10,444 : INFO : gram6-nationality-adjective: 72.4% (1101/1521)
2017-11-28 16:55:22,950 : INFO : gram7-past-tense: 28.5% (445/1560)
2017-11-28 16:55:33,730 : INFO : gram8-plural: 45.9% (612/1332)
2017-11-28 16:55:40,694 : INFO : gram9-plural-verbs: 17.1% (149/870)
2017-11-28 16:55:40,696 : INFO : total: 32.3% (5756/17827)

-input E:\data\word2vec\text8 -output E:\data\word2vec\text8.hanlp.word2vec.txt_1 -size 200 -window 8 -negative 25 -cbow 1 -sample 1e-4 -threads 8 -binary 1 -iter 15
2017-11-29 11:15:27,628 : INFO : loading projection weights from E:/data/word2vec/text8.hanlp.word2vec.txt_1
2017-11-29 11:15:42,361 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.hanlp.word2vec.txt_1
2017-11-29 11:15:42,461 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:15:47,365 : INFO : capital-common-countries: 80.0% (405/506)
2017-11-29 11:16:20,013 : INFO : capital-world: 46.2% (1647/3564)
2017-11-29 11:16:25,338 : INFO : currency: 14.4% (86/596)
2017-11-29 11:16:46,128 : INFO : city-in-state: 46.4% (1081/2330)
2017-11-29 11:16:49,861 : INFO : family: 53.1% (223/420)
2017-11-29 11:16:58,723 : INFO : gram1-adjective-to-adverb: 15.7% (156/992)
2017-11-29 11:17:05,424 : INFO : gram2-opposite: 9.9% (75/756)
2017-11-29 11:17:17,216 : INFO : gram3-comparative: 51.1% (680/1332)
2017-11-29 11:17:26,082 : INFO : gram4-superlative: 20.0% (198/992)
2017-11-29 11:17:35,536 : INFO : gram5-present-participle: 29.9% (316/1056)
2017-11-29 11:17:49,177 : INFO : gram6-nationality-adjective: 82.4% (1254/1521)
2017-11-29 11:18:03,059 : INFO : gram7-past-tense: 32.5% (507/1560)
2017-11-29 11:18:15,029 : INFO : gram8-plural: 53.7% (715/1332)
2017-11-29 11:18:22,894 : INFO : gram9-plural-verbs: 26.7% (232/870)
2017-11-29 11:18:22,894 : INFO : total: 42.5% (7575/17827)

:deeplearning4j
Word2Vec vec = new Word2Vec.Builder().layerSize(200).windowSize(8).negativeSample(25).minWordFrequency(5).useHierarchicSoftmax(true).sampling(0.0001).workers(8).iterations(15).epochs(15).iterate(iter)
.elementsLearningAlgorithm("org.deeplearning4j.models.embeddings.learning.impl.elements.CBOW")
.tokenizerFactory(t)
.build();
2017-11-28 16:46:26,894 : INFO : loading projection weights from E:/data/word2vec/text8.deeplearning4j.word2vec.txt
2017-11-28 16:46:39,391 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.deeplearning4j.word2vec.txt
2017-11-28 16:46:39,453 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:46:43,596 : INFO : capital-common-countries: 67.4% (341/506)
2017-11-28 16:47:12,592 : INFO : capital-world: 33.9% (1208/3564)
2017-11-28 16:47:17,515 : INFO : currency: 6.0% (36/596)
2017-11-28 16:47:36,332 : INFO : city-in-state: 36.6% (852/2330)
2017-11-28 16:47:39,834 : INFO : family: 38.3% (161/420)
2017-11-28 16:47:47,898 : INFO : gram1-adjective-to-adverb: 9.0% (89/992)
2017-11-28 16:47:53,953 : INFO : gram2-opposite: 7.0% (53/756)
2017-11-28 16:48:04,632 : INFO : gram3-comparative: 38.7% (515/1332)
2017-11-28 16:48:12,653 : INFO : gram4-superlative: 11.8% (117/992)
2017-11-28 16:48:21,220 : INFO : gram5-present-participle: 23.0% (243/1056)
2017-11-28 16:48:33,519 : INFO : gram6-nationality-adjective: 76.7% (1166/1521)
2017-11-28 16:48:46,165 : INFO : gram7-past-tense: 27.2% (424/1560)
2017-11-28 16:48:56,894 : INFO : gram8-plural: 48.2% (642/1332)
2017-11-28 16:49:03,973 : INFO : gram9-plural-verbs: 19.2% (167/870)
2017-11-28 16:49:03,974 : INFO : total: 33.7% (6014/17827)

:google_c
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 1 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 16:49:29,132 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt
2017-11-28 16:49:41,848 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt
2017-11-28 16:49:41,914 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:49:46,154 : INFO : capital-common-countries: 75.7% (383/506)
2017-11-28 16:50:15,078 : INFO : capital-world: 33.2% (1184/3564)
2017-11-28 16:50:19,993 : INFO : currency: 6.0% (36/596)
2017-11-28 16:50:38,967 : INFO : city-in-state: 36.0% (838/2330)
2017-11-28 16:50:42,348 : INFO : family: 47.4% (199/420)
2017-11-28 16:50:50,315 : INFO : gram1-adjective-to-adverb: 10.6% (105/992)
2017-11-28 16:50:56,355 : INFO : gram2-opposite: 7.8% (59/756)
2017-11-28 16:51:07,065 : INFO : gram3-comparative: 48.3% (644/1332)
2017-11-28 16:51:14,905 : INFO : gram4-superlative: 18.0% (179/992)
2017-11-28 16:51:23,299 : INFO : gram5-present-participle: 29.0% (306/1056)
2017-11-28 16:51:35,345 : INFO : gram6-nationality-adjective: 70.1% (1066/1521)
2017-11-28 16:51:47,733 : INFO : gram7-past-tense: 31.9% (498/1560)
2017-11-28 16:51:58,316 : INFO : gram8-plural: 50.1% (667/1332)
2017-11-28 16:52:05,321 : INFO : gram9-plural-verbs: 20.0% (174/870)
2017-11-28 16:52:05,322 : INFO : total: 35.6% (6338/17827)

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 17:29:30,471 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,375 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,436 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 17:29:46,578 : INFO : capital-common-countries: 77.5% (392/506)
2017-11-28 17:30:15,301 : INFO : capital-world: 45.6% (1626/3564)
2017-11-28 17:30:20,082 : INFO : currency: 19.5% (116/596)
2017-11-28 17:30:38,799 : INFO : city-in-state: 41.2% (959/2330)
2017-11-28 17:30:42,157 : INFO : family: 61.7% (259/420)
2017-11-28 17:30:50,121 : INFO : gram1-adjective-to-adverb: 13.8% (137/992)
2017-11-28 17:30:56,214 : INFO : gram2-opposite: 13.1% (99/756)
2017-11-28 17:31:07,010 : INFO : gram3-comparative: 60.6% (807/1332)
2017-11-28 17:31:14,960 : INFO : gram4-superlative: 25.0% (248/992)
2017-11-28 17:31:23,447 : INFO : gram5-present-participle: 38.6% (408/1056)
2017-11-28 17:31:35,607 : INFO : gram6-nationality-adjective: 77.6% (1181/1521)
2017-11-28 17:31:48,147 : INFO : gram7-past-tense: 34.8% (543/1560)
2017-11-28 17:31:58,815 : INFO : gram8-plural: 49.5% (659/1332)
2017-11-28 17:32:05,812 : INFO : gram9-plural-verbs: 30.8% (268/870)
2017-11-28 17:32:05,812 : INFO : total: 43.2% (7702/17827)
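
The per-section lines in the gensim logs above can be folded into semantic and syntactic subtotals, which makes the runs easier to compare side by side. The following is a minimal sketch and not part of gensim or HanLP: the `summarize` helper and the rule that every section whose name starts with `gram` counts as syntactic are assumptions made for illustration. The sample lines are copied from the google_c hs=0 log.

```python
import re

# Matches lines such as:
#   "... : INFO : family: 61.7% (259/420)"
SECTION_LINE = re.compile(
    r"INFO : (?P<section>[\w-]+): [\d.]+% \((?P<correct>\d+)/(?P<total>\d+)\)"
)


def summarize(log_text):
    """Aggregate per-section counts into semantic and syntactic subtotals.

    Assumption: sections named "gram..." are syntactic, all others semantic.
    Returns {kind: (correct, total, percent)}.
    """
    sums = {"semantic": [0, 0], "syntactic": [0, 0]}
    for match in SECTION_LINE.finditer(log_text):
        section = match.group("section")
        if section == "total":
            continue  # the log's own grand total is not a section
        kind = "syntactic" if section.startswith("gram") else "semantic"
        sums[kind][0] += int(match.group("correct"))
        sums[kind][1] += int(match.group("total"))
    return {
        kind: (correct, total, 100.0 * correct / total if total else 0.0)
        for kind, (correct, total) in sums.items()
    }


# A few lines copied from the google_c hs=0 log in this comment.
SAMPLE_LOG = """\
2017-11-28 17:29:46,578 : INFO : capital-common-countries: 77.5% (392/506)
2017-11-28 17:30:42,157 : INFO : family: 61.7% (259/420)
2017-11-28 17:31:07,010 : INFO : gram3-comparative: 60.6% (807/1332)
2017-11-28 17:32:05,812 : INFO : total: 43.2% (7702/17827)
"""

for kind, (correct, total, percent) in summarize(SAMPLE_LOG).items():
    print(f"{kind}: {percent:.1f}% ({correct}/{total})")
```

Feeding a complete log through the same helper gives one semantic and one syntactic figure per run.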

hankcs (Owner) commented Dec 2, 2017

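Thanks for the feedback; this is a very valuable test.

  1. The incompatible argument parsing has been fixed; please see the commit referenced above. Further feedback is welcome.
  2. Regarding "for the C version, accuracy is 10% lower than your test results": this is because the original compute-accuracy takes a threshold argument, and the results I reported used threshold=30000. See: Accuracy rate seems to be 10% lower than the original version kojisekig/word2vec-lucene#21. I tried it without the threshold, and accuracy is indeed only 41%: Total accuracy: 41.00 % Semantic accuracy: 38.86 % Syntactic accuracy: 42.52 %
  3. Your tests without a threshold show that HanLP's accuracy is about the same as the original's, which matches my verification above. That is good news. After all, in normal use probably nobody wants to restrict the vocabulary to 30,000 words.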
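
Why a vocabulary threshold raises the reported accuracy can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the code of compute-accuracy or gensim: the function `analogy_accuracy`, the five-dimensional vectors, and the eight-word vocabulary are all invented for illustration. The sketch assumes the vocabulary is sorted by descending frequency, keeps only the first `threshold` words, and skips any question that contains a word outside that set.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy_accuracy(vocab, vectors, questions, threshold=0):
    """Answer "a is to b as c is to ?" by the nearest word to b - a + c.

    vocab is assumed to be ordered by descending frequency.
    threshold > 0 keeps only the first `threshold` words; questions that
    use any other word are skipped rather than counted as wrong.
    Returns (correct, answered).
    """
    words = vocab[:threshold] if threshold > 0 else list(vocab)
    allowed = set(words)
    unit = {w: normalize(vectors[w]) for w in words}
    correct = answered = 0
    for a, b, c, expected in questions:
        if not {a, b, c, expected} <= allowed:
            continue  # question skipped: out-of-vocabulary word
        target = [y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])]
        candidates = [w for w in words if w not in (a, b, c)]
        best = max(
            candidates,
            key=lambda w: sum(p * q for p, q in zip(unit[w], target)),
        )
        answered += 1
        correct += best == expected
    return correct, answered


# Toy data: four frequent, well-trained words and four rare words,
# one of which ("norway") has a deliberately poor vector.
VOCAB = ["man", "woman", "king", "queen", "berlin", "germany", "oslo", "norway"]
VECTORS = {
    "man": [1, 0, 0, 0, 0],
    "woman": [0, 1, 0, 0, 0],
    "king": [1, 0, 1, 0, 0],
    "queen": [0, 1, 1, 0, 0],
    "berlin": [0, 0, 0, 1, 0],
    "germany": [0, 0, 0, 0, 1],
    "oslo": [0, 0, 0, 1, 0.2],
    "norway": [0, 0, 0, -1, -1],
}
QUESTIONS = [
    ("man", "woman", "king", "queen"),
    ("berlin", "germany", "oslo", "norway"),
]

print(analogy_accuracy(VOCAB, VECTORS, QUESTIONS))               # all words
print(analogy_accuracy(VOCAB, VECTORS, QUESTIONS, threshold=4))  # top 4 only
```

In this toy setup the rare-word question is answered incorrectly when the full vocabulary is used and is dropped entirely when the threshold is 4, so the same vectors score higher under the threshold.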

hankcs added the question label Dec 2, 2017

tiandiweizun (Author) commented Dec 4, 2017

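I tested again with the threshold of 30000 that gensim uses by default. Because the first entry in the vocabulary of the word2vec C version is "</s>", I also tested with 30001 and found no difference at all in the results. Here are the results.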

            hs=1      hs=0
hanlp       40.80%    51.90%
google_c    44.50%    52.70%

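My 52.7% (google_c, hs=0) is close to your 53.32 % (google_c, hs=0), and my 40.80% (hanlp, hs=1) is close to your 41.03 % (kojisekig/word2vec-lucene, hs=0), so I guessed the gap came from differing parameter configuration. I went straight to his source code, and it confirmed my guess.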

Conclusion: the Java argument-configuration code is not fully consistent with the C version, and that caused the difference.
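
The mismatch reported in this thread, where a flag such as `-hs` is treated as enabled whenever it appears, even as `-hs 0`, can be sketched as follows. This is an illustration only and not HanLP's or word2vec's actual code: the function names `parse_by_presence` and `parse_by_value` are hypothetical, and the defaults (cbow on, hs off) are an assumption.

```python
BOOLEAN_FLAGS = ("cbow", "hs")
# Assumed defaults for illustration: cbow enabled, hs disabled.
DEFAULTS = {"cbow": True, "hs": False}


def parse_by_presence(argv):
    """Presence-based parsing: any occurrence of the flag turns it on,
    so "-hs 0" still enables hierarchical softmax."""
    opts = dict(DEFAULTS)
    for name in BOOLEAN_FLAGS:
        if "-" + name in argv:
            opts[name] = True
    return opts


def parse_by_value(argv):
    """Value-based parsing: the token after the flag decides,
    so "-hs 0" disables it and "-hs 1" enables it."""
    opts = dict(DEFAULTS)
    for flag, value in zip(argv, argv[1:]):
        name = flag.lstrip("-")
        if flag.startswith("-") and name in BOOLEAN_FLAGS:
            opts[name] = int(value) != 0
    return opts


argv = "-size 200 -window 8 -negative 25 -hs 0 -cbow 1 -iter 15".split()
print(parse_by_presence(argv))  # hs ends up enabled
print(parse_by_value(argv))     # hs ends up disabled
```

With presence-based parsing, omitting `-hs` is the only way to train without hierarchical softmax, which matches the HanLP runs above where the run with `-hs 0` scored like an hs=1 run.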

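Reflection: the conclusion is so simple, yet I read through several codebases and ran countless tests. I had noticed the problem early on, but out of admiration for my queen I did not give it much thought.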
