Implementations of Norvig's spelling corrector in various languages.
- Python (minor modification of Norvig's original program)
- C++14 (common header)
- Haskell
- C
hat-trie
is the fast trie implementation by dcjones.
The hat-trie
library is included as a submodule. Doing
git submodule init
git submodule update
bash -c 'cd hat-trie && autoreconf -i && ./configure && make'
will build it.
The Haskell programs depend on the bytestring
, bytestring-trie
and unordered-containers
packages.
The Java program requires Java 8 and Maven.
Once you have all the dependencies, running make
at the top level will build all the programs and place them in bin/
.
Each program can be run as
norvig_xx [training data set]
where [training data set]
is a plain text file from which the program will learn word frequencies. The programs expect to be given one word per line on standard input and print
word, correction
pairs on standard output.
The folder data/
has a training file train.txt
and a test file test.txt
. The first is exactly Norvig's big.txt
. The second is Norvig's test set but with multiword strings removed.
On Linux,running make benchmark
creates a file benchmarks/all.md
containing performance results.
The different implementations use different algorithms and code organization, so the benchmarks are not in any sense a comparison of the different languages.
This is what I get on my setup.
Version | Time (s) | Memory use (Max resident,M) |
Lines of code |
---|---|---|---|
Python | 12.6 | 75.5 | 31 |
C++ (unordered_map) | 6.0 | 5.6 | 165 |
C++ (hat-trie) | 3.5 | 3.6 | 166 |
Haskell (HashMap) | 11.5 | 81.3 | 60 |
Haskell (Trie) | 22.3 | 157.0 | 61 |
C (dynamic programming) | 0.3 | 162.7 | 225 |
Racket (v.6.1.1) | 16.2 | 277.7 | 73 (with tests) |
Java | 10.8 | 363.2 | 100 |