This plugin is based on the GreekStemmer that is included in Apache Lucene.
Lucene's GreekStemmer follows the thesis Development of a Stemmer for the Greek Language by Georgios Ntaias. The thesis identifies 166 suffixes in the Greek language. However, the stemmer captures only 158 of them, because adding the remaining suffixes would reduce its precision on the word sets used for its evaluation.
Excluding these suffixes, however, does not work well on our word set, which consists of more than 120,000 words. For our needs, we modified the implementation of Lucene's GreekStemmer to include eight more suffixes, which improve the quality of our search results. Four of these new suffixes are not among the 166 suffixes of the thesis of Georgios Ntaias. These are:
-ιο, -ιοσ, -εασ, -εα
The remaining four suffixes belong to the set of eight suffixes that the original GreekStemmer intentionally does not capture. They reflect different forms of the words that end with the first three of the above suffixes:
-ιασ, -ιεσ, -ιοι, -ιουσ
Examples:
Word | GreekStemmer | SkroutzGreekStemmer |
---|---|---|
κριτηριο (singular) | κριτηρι | κριτηρ |
κριτηρια (plural) | κριτηρ | κριτηρ |
προβολεας (singular) | προβολε | προβολ |
προβολεις (plural) | προβολ | προβολ |
αμινοξυ (singular) | αμινοξ | αμινοξ |
αμινοξεα (plural) | αμινοξε | αμινοξ |
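The effect of the extra suffixes in the table above can be pictured with a toy suffix stripper. This is only a sketch, not the plugin's algorithm: the real stemmer applies many more rules, and the function name `strip_extra_suffix` and the `min_stem_len` guard are assumptions made for illustration. Input is assumed to be already lowercased, without diacritics, and with final sigma folded to σ.

```python
# Toy illustration only: the real SkroutzGreekStemmer applies many more rules.
# The eight extra suffixes named in the text, with final sigma folded to σ.
EXTRA_SUFFIXES = ["ιο", "ιοσ", "εασ", "εα", "ιασ", "ιεσ", "ιοι", "ιουσ"]


def strip_extra_suffix(word: str, min_stem_len: int = 3) -> str:
    """Strip the longest matching extra suffix from an already normalized word.

    min_stem_len is a hypothetical guard that prevents very short stems.
    """
    # Try longer suffixes first so that "ιοσ" wins over "ιο" when both could apply.
    for suffix in sorted(EXTRA_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ["κριτηριο", "προβολεασ", "αμινοξεα"]:
        print(word, "->", strip_extra_suffix(word))
```

For these three inputs the sketch yields the same stems as the SkroutzGreekStemmer column of the table; words outside the eight suffixes are returned unchanged, which the real stemmer would not do.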
The stemmer can be combined with the keyword-marker and stemmer-override Elasticsearch filters to support stemming exceptions (see also the greek_exceptions.txt sample stemmer-override configuration file).
As of version 5.4.2.6, there is no built-in support for stemming exceptions.
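A minimal sketch of how such a filter chain might be assembled is shown here. It builds the index settings as a Python dictionary and places Elasticsearch's `keyword_marker` and `stemmer_override` filters before the stemmer, so that exceptions are marked before stemming runs. The filter names `greek_protected` and `greek_overrides`, the helper `build_settings`, and the example word and rule are assumptions made for illustration; they are not taken from greek_exceptions.txt.

```python
import json


def build_settings(protected_words, override_rules):
    """Return index settings in which the exception filters run before the stemmer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "stem_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Order matters: exceptions must be marked before stemming.
                        "filter": [
                            "lower_greek",
                            "greek_protected",
                            "greek_overrides",
                            "stem_greek",
                        ],
                    }
                },
                "filter": {
                    "lower_greek": {"type": "lowercase", "language": "greek"},
                    # Words listed here are passed through unstemmed.
                    "greek_protected": {
                        "type": "keyword_marker",
                        "keywords": list(protected_words),
                    },
                    # Rules of the form "token => stem" replace the stemmer's output.
                    "greek_overrides": {
                        "type": "stemmer_override",
                        "rules": list(override_rules),
                    },
                    "stem_greek": {"type": "skroutz_stem_greek"},
                },
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical exceptions, written in the normalized form the filters will see.
    settings = build_settings(["αμινοξυ"], ["κριτηρια => κριτηριο"])
    print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The printed JSON can be used as the body of the index creation request. Because both exception filters come after `lower_greek`, their entries should be written lowercased and without diacritics.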
To list all plugins in the current installation:
sudo bin/elasticsearch-plugin list
In order to install the latest version of the plugin, simply run:
sudo bin/elasticsearch-plugin install gr.skroutz:elasticsearch-skroutz-greekstemmer:7.7.0.1
To install version 5.4.2.6 run:
sudo bin/elasticsearch-plugin install gr.skroutz:elasticsearch-skroutz-greekstemmer:5.4.2.6
To install version 2.4.4.1 of the plugin, run:
sudo bin/plugin install skroutz/elasticsearch-skroutz-greekstemmer/2.4.4.1
In order to install versions prior to 0.0.12, simply run:
sudo bin/plugin -install skroutz/elasticsearch-skroutz-greekstemmer/0.0.1
To remove a plugin (5.x.x/7.x.x):
sudo bin/elasticsearch-plugin remove <plugin_name>
SkroutzGreekStemmer Plugin | Elasticsearch | Branch |
---|---|---|
7.7.0.1 | 7.7.0 | 7.7.0 |
5.4.2.6 | 5.4.2 | 5.4.2 |
5.4.0.1 | 5.4.0 | 5.4.0 |
2.4.4.1 | 2.4.4 | 2.4.4 |
0.0.12 (<=) | 1.5.0 | 1.5.0 |
# Create index
$ curl -XPUT 'http://localhost:9200/test_stemmer' -H 'Content-Type: application/json' -d '{
  "settings":{
    "analysis":{
      "analyzer":{
        "stem_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter": ["lower_greek", "stem_greek"]
        }
      },
      "filter": {
        "lower_greek": {
          "type":"lowercase",
          "language":"greek"
        },
        "stem_greek": {
          "type":"skroutz_stem_greek"
        }
      }
    }
  }
}'
{"acknowledged":true}
# Test analyzer
$ curl -XGET 'http://localhost:9200/test_stemmer/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer": "stem_analyzer", "text": "κουρευτικές μηχανές"}'
{
"tokens" : [ {
"token" : "κουρευτ",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "μηχαν",
"start_offset" : 12,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
$ curl -XGET 'http://localhost:9200/test_stemmer/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer": "stem_analyzer", "text": "κουρευτική μηχανή"}'
{
"tokens" : [ {
"token" : "κουρευτ",
"start_offset" : 0,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "μηχαν",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
# Delete test index
$ curl -XDELETE 'http://localhost:9200/test_stemmer'
{"ok":true,"acknowledged":true}
index:
  analysis:
    filter:
      stem_greek:
        type: skroutz_stem_greek
Input is expected to be casefolded for Greek (including folding of final sigma to sigma), and with diacritics removed. This can be achieved with GreekLowerCaseFilter.
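To make the expected input form concrete, the following sketch approximates that normalization in plain Python. It is not Lucene's GreekLowerCaseFilter, and the function name `normalize_greek` is an assumption made for illustration; inside Elasticsearch the `lowercase` filter with `"language": "greek"` does this work.

```python
import unicodedata


def normalize_greek(text: str) -> str:
    """Approximate the form the stemmer expects: lowercase, no diacritics, ς folded to σ."""
    # Decompose so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks (tonos, dialytika and similar).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose and fold final sigma (ς) to regular sigma (σ).
    return unicodedata.normalize("NFC", stripped).replace("ς", "σ")


if __name__ == "__main__":
    print(normalize_greek("Κουρευτικές Μηχανές"))
```

A helper like this can be handy when preparing stemmer-override rules or test fixtures offline, since those entries need to match the tokens as the stemmer sees them.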
For stemming issues: here