Speed Comparison
Koichi Akabe edited this page Jun 9, 2022 · 5 revisions
This page compares the analysis speed of python-vaporetto with that of other tokenizers and morphological analyzers.
We compared the following software:
- Mykytea-python (v0.1.7)
- python-vaporetto (v0.1.0)
- SudachiPy (v0.6.3)
For python-vaporetto and Mykytea-python, we used the compact SVM model based on BCCWJ and UniDic, downloaded from the KyTea Models page. For SudachiPy, we used SudachiDict-core 20220519, which is based on UniDic, and ran it in both "a" and "c" modes.
We tokenized I Am a Cat by Soseki Natsume, which is available at Aozora Bunko, and measured the elapsed time of two tasks, counting tokens and concatenating all surfaces, 5 times for each tool.
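The measurement described above can be sketched roughly as follows. This is not the actual benchmark code: the function names (`count_tokens`, `concat_surfaces`, `measure`) are hypothetical, and `whitespace_tokenize` is a stand-in for a real tokenizer so that the sketch runs without any of the compared libraries installed. It also assumes that each tokenizer can be wrapped as a function returning a list of surface strings, and that STD means the sample standard deviation over the runs.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple

# Hypothetical type alias: any function mapping a line to its token surfaces.
Tokenize = Callable[[str], List[str]]


def count_tokens(tokenize: Tokenize, lines: Sequence[str]) -> int:
    """Task 1: count all tokens in the text."""
    return sum(len(tokenize(line)) for line in lines)


def concat_surfaces(tokenize: Tokenize, lines: Sequence[str]) -> str:
    """Task 2: concatenate the surfaces of all tokens."""
    return "".join("".join(tokenize(line)) for line in lines)


def measure(task, tokenize: Tokenize, lines: Sequence[str], runs: int = 5) -> Tuple[float, float]:
    """Run `task` `runs` times; return (mean, sample std) of elapsed time in ms."""
    elapsed_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        task(tokenize, lines)
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(elapsed_ms), statistics.stdev(elapsed_ms)


def whitespace_tokenize(line: str) -> List[str]:
    """Stand-in tokenizer (assumption); replace with a wrapper around a real one."""
    return line.split()


if __name__ == "__main__":
    sample = ["this is a stand-in text", "replace it with the real corpus"]
    for name, task in [("Counting", count_tokens), ("Concatenating", concat_surfaces)]:
        mean_ms, std_ms = measure(task, whitespace_tokenize, sample)
        print(f"{name}: {mean_ms:.3f} ms (STD {std_ms:.3f})")
```

To time a real tool, `whitespace_tokenize` would be swapped for a small wrapper that calls that tool's tokenizer and returns the surfaces as a list.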
The machine used for the benchmark has the following specifications:
- CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
- Memory: 64GiB
- OS: CentOS Linux release 7.5.1804 (Core)
The benchmark code can be found here.
Tool Name | Counting [ms] | STD [ms] | Concatenating [ms] | STD [ms] |
---|---|---|---|---|
Mykytea-python | 872 | 10 | 2,062 | 23 |
python-vaporetto | 240 | 1 | 436 | 2 |
SudachiPy (a) | 622 | 2 | 1,041 | 12 |
SudachiPy (c) | 600 | 2 | 1,012 | 3 |
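To read the table as relative speed, the times can be divided by the python-vaporetto times. The snippet below is only an illustrative calculation on the figures reported above; the variable and function names are hypothetical.

```python
# Elapsed times in ms copied from the table: (counting, concatenating).
times_ms = {
    "Mykytea-python": (872.0, 2062.0),
    "python-vaporetto": (240.0, 436.0),
    "SudachiPy (a)": (622.0, 1041.0),
    "SudachiPy (c)": (600.0, 1012.0),
}


def relative_to(baseline: str, table: dict) -> dict:
    """Return each tool's time divided by the baseline tool's time, per task."""
    base_count, base_concat = table[baseline]
    return {
        name: (count / base_count, concat / base_concat)
        for name, (count, concat) in table.items()
    }


ratios = relative_to("python-vaporetto", times_ms)
for name, (count_ratio, concat_ratio) in ratios.items():
    print(f"{name}: counting x{count_ratio:.2f}, concatenating x{concat_ratio:.2f}")
```

A ratio above 1 means the tool took that many times longer than python-vaporetto on the same task.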