Speed Comparison

Koichi Akabe edited this page Jun 9, 2022 · 5 revisions

This page compares the analysis speed of python-vaporetto with that of other tokenizers and morphological analyzers.

Experimental Setup

We compared the following software: python-vaporetto, Mykytea-python, and SudachiPy.

For python-vaporetto and Mykytea-python, we used the compact SVM model based on BCCWJ and UniDic, downloaded from the KyTea Models page. For SudachiPy, we used SudachiDict-core 20220519, which is based on UniDic, and ran it in both the "a" and "c" split modes.

We tokenized I Am a Cat by Soseki Natsume, which is available at Aozora Bunko. For each tool, we measured the elapsed time of two tasks, counting tokens and concatenating all token surfaces, and ran each measurement 5 times.
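The two measured tasks can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark code: the helper names (`time_runs`, `count_tokens`, `concat_surfaces`), the line-by-line processing, and the reporting of mean and sample standard deviation are assumptions, and `whitespace_tokenize` is a hypothetical stand-in for a real tokenizer that returns a list of surface strings.

```python
import statistics
import time


def time_runs(task, runs=5):
    """Run `task` `runs` times; return (mean, sample standard deviation) in ms."""
    elapsed_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(elapsed_ms), statistics.stdev(elapsed_ms)


def count_tokens(tokenize, lines):
    """Counting task: total number of tokens over all lines."""
    return sum(len(tokenize(line)) for line in lines)


def concat_surfaces(tokenize, lines):
    """Concatenating task: join every token surface into one string."""
    return "".join(surface for line in lines for surface in tokenize(line))


def whitespace_tokenize(line):
    """Hypothetical stand-in tokenizer: splits on whitespace, returns surfaces."""
    return line.split()


# Tiny pre-segmented sample (an assumption); the benchmark used the full novel.
sample_lines = ["吾輩 は 猫 で ある", "名前 は まだ 無い"]

count_mean, count_std = time_runs(
    lambda: count_tokens(whitespace_tokenize, sample_lines)
)
concat_mean, concat_std = time_runs(
    lambda: concat_surfaces(whitespace_tokenize, sample_lines)
)
print(f"Counting: {count_mean:.3f} ms (STD {count_std:.3f})")
print(f"Concatenating: {concat_mean:.3f} ms (STD {concat_std:.3f})")
```

To time a real tool with this sketch, `whitespace_tokenize` would be replaced by a small wrapper that calls the tool and returns its token surfaces as a list of strings.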

The benchmark machine had the following specifications:

  • CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
  • Memory: 64GiB
  • OS: CentOS Linux release 7.5.1804 (Core)

The benchmark code can be found here.

Results

| Tool Name | Counting [ms] | STD [ms] | Concatenating [ms] | STD [ms] |
|---|---|---|---|---|
| Mykytea-python | 872 | 10 | 2,062 | 23 |
| python-vaporetto | 240 | 1 | 436 | 2 |
| SudachiPy (a) | 622 | 2 | 1,041 | 12 |
| SudachiPy (c) | 600 | 2 | 1,012 | 3 |
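To read the results as relative speeds, the times can be divided by the python-vaporetto times. The sketch below does this arithmetic on the figures copied from the results table; the dictionary layout and variable names are illustrative assumptions.

```python
# Times in ms, copied from the results table.
results = {
    "Mykytea-python": {"counting": 872.0, "concatenating": 2062.0},
    "python-vaporetto": {"counting": 240.0, "concatenating": 436.0},
    "SudachiPy (a)": {"counting": 622.0, "concatenating": 1041.0},
    "SudachiPy (c)": {"counting": 600.0, "concatenating": 1012.0},
}

# Use python-vaporetto as the baseline: a ratio of 2.0 means the tool
# took twice as long as python-vaporetto on that task.
baseline = results["python-vaporetto"]
ratios = {
    tool: {task: elapsed / baseline[task] for task, elapsed in times.items()}
    for tool, times in results.items()
}

for tool, ratio in ratios.items():
    print(
        f"{tool}: counting {ratio['counting']:.2f}x, "
        f"concatenating {ratio['concatenating']:.2f}x"
    )
```

By these figures, the other tools took roughly 2.3 to 4.7 times as long as python-vaporetto, depending on the tool and the task.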