TL;DR: I had to find a way to efficiently search for multiple possible strings in a large text (specifically, keywords inside amongst product titles and descriptions).
So the mathematical challenge is an efficient way to address O(n + m)
where n
is the total length of keywords and m
is the length of the text.
Theoretically the Aho-Corasick string matching algorithm should one of (or the?) most efficient way to do this, so I spun up this very superficial benchmark to see what to expect in PHP - and thereby also implied how well the Wikimedia implementation of Aho-Corasick performs.
If you're curious, then these were the results I got by running the test on my Macbook Pro M2 Pro while having several demanding programs running meanwhile the tests ran (like my IntelliJ IDE. If you know, you know).
Both text libraries were parsed using the same keyword library.
The book is ~1.2 MB large (or in other words: 1,276,290 characters):
Average execution time: 157.38 milliseconds
Average execution time: 11,194.72 milliseconds
Average execution time: 136.86 milliseconds
Average execution time: 135.85 milliseconds
Average execution time: 122.75 milliseconds
This product text is 6,180 characters long:
Average execution time: 1.68 milliseconds
Average execution time: 1.30 milliseconds
Average execution time: 1.60 milliseconds
Average execution time: 1.64 milliseconds
Average execution time: 0.78 milliseconds
- You need to have PHP installed on the machine that is running the tests.
- You need Composer installed on the machine that fetch Wikimedia's implementation of Aho-Corasick.
- composer install
- run
php index.php mobydick 100
- The first argument is text to search through. You can find possible values in the TextLibrary.
- The second argument is the number of iterations you'd like to run to get a more accurate average execution time.
You're more than welcome to extend the repository and add your own text to the TextLibrary.
You'd probably also want to add new keywords to the KeywordLibrary.
$ php index.php aho-corasick-position mobydick 100
#### Strategy: aho-corasick-position
```
Memory usage: 2,048.00 KB
Average execution time: 1.68 milliseconds
```
The actual Aho-Corasick implementation is done by Wikimedia's implementation of Aho-Corasick.
The Moby Dick book is made available by Project Gutenberg.