A Julia port of SymSpell, an extremely fast spelling correction and fuzzy search algorithm.
using SymSpellChecker
d = SymSpell()
push!(d, "hello")
push!(d, "world")
d["wrold"] # ["world"]
Dictionaries can be created as follows:
using SymSpellChecker
# Loading from file
d = SymSpell("assets/frequency_dictionary_en_30_000.txt")
# Manual update
d = SymSpell()
push!(d, "hello", 100)
push!(d, "world", 50)
The third argument of `push!` is the word frequency, which `lookup` later uses to sort results from highest frequency to lowest.
The `SymSpell` constructor takes the following arguments:
- `max_dictionary_edit_distance`: maximum allowed search distance. Higher values require considerably more memory. The default is 2.
- `prefix_length`: length of the prefix used to generate candidates. Higher values mean higher memory use but shorter search times. The default is 5.
- `count_threshold`: words with a frequency below this threshold do not appear in search results.
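To give a feel for why a larger edit distance or a longer prefix costs more memory, here is a minimal sketch of the delete-variant generation that SymSpell-style indexes are built on. It is an illustration only, not the package's internal code: the function names `deletes` and `prefix_deletes` are hypothetical, and the package may organise this step differently.

```julia
# Illustrative sketch only; `deletes` and `prefix_deletes` are hypothetical names,
# not part of SymSpellChecker.jl.

# Collect every string obtainable from `word` by deleting up to
# `max_distance` characters (the word itself included).
function deletes(word::AbstractString, max_distance::Integer)
    variants = Set{String}([String(word)])
    frontier = Set{String}([String(word)])
    for _ in 1:max_distance
        next_frontier = Set{String}()
        for w in frontier
            chars = collect(w)
            for i in eachindex(chars)
                # Drop the i-th character and keep the rest.
                push!(next_frontier, String(deleteat!(copy(chars), i)))
            end
        end
        union!(variants, next_frontier)
        frontier = next_frontier
    end
    return variants
end

# Expand only the first `prefix_length` characters of the word.
function prefix_deletes(word::AbstractString, max_distance::Integer, prefix_length::Integer)
    return deletes(first(word, prefix_length), max_distance)
end

println(length(deletes("world", 1)))            # variants at distance <= 1
println(length(deletes("world", 2)))            # variants at distance <= 2
println(length(prefix_deletes("world", 2, 3)))  # same distance, shorter prefix
```

In this sketch the number of stored variants grows quickly with the distance and shrinks when only a short prefix is expanded, which matches the trade-off described for `max_dictionary_edit_distance` and `prefix_length`.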
Words can be looked up as follows:
lookup(d, "wrold") # [SuggestItem("world", 1, 50)]
Here `1` is the Damerau-Levenshtein distance between `world` and `wrold`, and `50` is the frequency of the word in the current dictionary.
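For readers unfamiliar with the metric, the following is a small self-contained sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). The function name `osa_distance` is hypothetical, and the choice of the restricted variant is an assumption; the distance routine the package actually uses may differ.

```julia
# Illustrative sketch only; `osa_distance` is a hypothetical name, and the
# restricted (optimal string alignment) variant is an assumption.
function osa_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    # d[i + 1, j + 1] is the distance between the first i characters of `s`
    # and the first j characters of `t`.
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m
    d[1, :] .= 0:n
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        d[i + 1, j + 1] = min(d[i, j + 1] + 1,   # deletion
                              d[i + 1, j] + 1,   # insertion
                              d[i, j] + cost)    # substitution or match
        if i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j]
            # Swap of two adjacent characters counts as a single edit.
            d[i + 1, j + 1] = min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
        end
    end
    return d[m + 1, n + 1]
end

println(osa_distance("world", "wrold"))
```

Swapping the adjacent letters `o` and `r` counts as one edit here, which is why `wrold` is at distance 1 from `world` rather than 2.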
To extract only the words from a `lookup` result:
term.(lookup(d, "wrold")) # ["world"]
A more convenient form of `lookup` is also available:
d["wrold"] # ["world"]
Search arguments can either be passed to the `lookup` function or set globally with `set_options!(d::SymSpell; kwargs...)`.
set_options!(d, include_unknown = true, verbosity = "closest")
d["wrold"] # ["wrold", "world"]
# this is equivalent to
term.(lookup(d, "wrold", include_unknown = true, verbosity = "closest"))
The following arguments are supported:
- `include_unknown`: whether to include the original word in the results when it meets the search criteria.
- `ignore_token`: ignore words in the lookup that contain the given token string or match the given regular expression.
- `transfer_casing`: when set to `true`, results try to mimic the casing of the original word, for example `d["Wrold"]` returns `["World"]`.
- `max_edit_distance`: maximum allowed distance for the search. Defaults to `max_dictionary_edit_distance`.
- `verbosity`: selects the type of search result. Three levels of verbosity exist:
  - `"top"`: only a single suggestion is returned, the one with the lowest distance and highest frequency.
  - `"closest"`: all words with the lowest distance are returned.
  - `"all"`: all words within the given `max_edit_distance` are returned.
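The three verbosity levels can be pictured as different filters over one sorted candidate list. The sketch below is a standalone illustration under stated assumptions: the `Suggestion` struct is a hypothetical stand-in for the package's `SuggestItem`, `select_suggestions` is not a package function, and the candidate values are made up by hand.

```julia
# Illustrative sketch only; `Suggestion` and `select_suggestions` are
# hypothetical and are not part of SymSpellChecker.jl.
struct Suggestion
    term::String
    distance::Int
    count::Int
end

function select_suggestions(items::Vector{Suggestion}, verbosity::AbstractString)
    # Lowest distance first; ties are broken by highest frequency.
    sorted = sort(items; by = s -> (s.distance, -s.count))
    isempty(sorted) && return sorted
    if verbosity == "top"
        return sorted[1:1]
    elseif verbosity == "closest"
        best = sorted[1].distance
        return filter(s -> s.distance == best, sorted)
    elseif verbosity == "all"
        return sorted
    else
        throw(ArgumentError("unknown verbosity: $verbosity"))
    end
end

# Hand-written candidates, assumed to be already within max_edit_distance.
candidates = [
    Suggestion("world", 1, 50),
    Suggestion("word", 2, 80),
    Suggestion("wold", 1, 10),
]

for level in ("top", "closest", "all")
    println(level, " => ", [s.term for s in select_suggestions(candidates, level)])
end
```

Sorting once and then filtering keeps the three levels consistent with each other: `"top"` is always the first element of `"closest"`, which in turn is a prefix of `"all"`.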
The SymSpellChecker.jl package is licensed under the MIT License. This package is based on SymSpell and its Python adaptation. Some parts of the code are based on StringDistances.jl.