Speed Comparison
Koichi Akabe edited this page Jun 11, 2022 · 5 revisions
This wiki page compares the analysis speed of python-vaporetto with that of other tokenizers and morphological analyzers.
We compared the following software:
- Mykytea-python (v0.1.7)
- python-vaporetto (v0.1.1)
- mecab-python3 (v1.0.5)
- SudachiPy (v0.6.3)
For python-vaporetto and Mykytea-python, we used the compact SVM model based on BCCWJ and UniDic, downloaded from the KyTea Models page. For mecab-python3, we used unidic 1.1.0. For SudachiPy, we used SudachiDict-core 20220519, which is based on UniDic, and ran it in both the "a" and "c" modes.
We tokenized I Am a Cat (by Soseki Natsume), which is available at Aozora Bunko, and measured the elapsed time of three tasks: counting tokens, concatenating all token surfaces, and directly generating tokenized strings.
The machine used for the measurements had the following specifications:
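The three measured tasks can be illustrated with a minimal timing sketch. This is not the benchmark code used for this page: `whitespace_tokenize` is a hypothetical stand-in tokenizer so that the example runs without any of the compared libraries, and the number of repetitions and the use of the sample standard deviation are assumptions. With a real library, the "To String" task would normally be a single library call rather than the join shown here.

```python
import statistics
import time


def whitespace_tokenize(text):
    """Hypothetical stand-in tokenizer: splits on whitespace.

    A real run would wrap one of the compared libraries instead.
    """
    return text.split()


def count_tokens(tokenize, lines):
    """Task 1: count the tokens without reading their surfaces."""
    return sum(len(tokenize(line)) for line in lines)


def concat_surfaces(tokenize, lines):
    """Task 2: read every token surface and concatenate them all."""
    return "".join(surface for line in lines for surface in tokenize(line))


def to_string(tokenize, lines):
    """Task 3: produce one space-separated tokenized string per line."""
    return [" ".join(tokenize(line)) for line in lines]


def measure(task, tokenize, lines, repeats=10):
    """Return (mean, sample standard deviation) of elapsed time in ms.

    The repeat count is an assumption, not a value taken from this page.
    """
    elapsed_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        task(tokenize, lines)
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(elapsed_ms), statistics.stdev(elapsed_ms)


if __name__ == "__main__":
    # Placeholder input; the page itself uses the text of "I Am a Cat".
    lines = ["I am a cat .", "As yet I have no name ."] * 1000
    tasks = [
        ("Counting", count_tokens),
        ("Concatenating", concat_surfaces),
        ("To String", to_string),
    ]
    for name, task in tasks:
        mean, std = measure(task, whitespace_tokenize, lines)
        print(f"{name}: {mean:.3f} ms (STD {std:.3f})")
```

Passing the tokenizer as a function keeps the three tasks identical across tools, so only the tokenizer call differs between measurements.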
- CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
- Memory: 64GiB
- OS: CentOS Linux release 7.5.1804 (Core)
The benchmark code can be found here.
Tool Name | Counting [ms] | STD | Concatenating [ms] | STD | To String [ms] | STD |
---|---|---|---|---|---|---|
Mykytea-python | 883. | 3. | 2,227. | 105. | ---- | |
python-vaporetto | 177. | 0. | 424. | 3. | 166. | 0. |
mecab-python3 | ---- | ---- | 207. | 7. | | |
SudachiPy (a) | 635. | 2. | 1,097. | 44. | 788. | 1. |
SudachiPy (c) | 614. | 1. | 1,044. | 3. | 773. | 3. |
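
To read the table as relative speeds, each mean can be divided by the python-vaporetto mean for the same task. For example, Mykytea-python's counting time of 883 ms against python-vaporetto's 177 ms is roughly a fivefold difference. The sketch below does this for the rows that have a value in the same column as python-vaporetto; the `times_ms` dictionary and the `speedup` helper are illustrative names, and mecab-python3 is left out of the sketch.

```python
# Mean elapsed times in milliseconds, copied from the table on this page.
times_ms = {
    "Mykytea-python": {"Counting": 883.0, "Concatenating": 2227.0},
    "python-vaporetto": {
        "Counting": 177.0,
        "Concatenating": 424.0,
        "To String": 166.0,
    },
    "SudachiPy (a)": {
        "Counting": 635.0,
        "Concatenating": 1097.0,
        "To String": 788.0,
    },
    "SudachiPy (c)": {
        "Counting": 614.0,
        "Concatenating": 1044.0,
        "To String": 773.0,
    },
}


def speedup(tool, task, baseline="python-vaporetto"):
    """Return how many times longer `tool` takes than `baseline` on `task`."""
    return times_ms[tool][task] / times_ms[baseline][task]


if __name__ == "__main__":
    for tool, tasks in times_ms.items():
        for task in tasks:
            print(f"{tool:17s} {task:14s} {speedup(tool, task):.2f}x")
```

These ratios use the means only; the STD columns are not taken into account.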