We are inconsistent in how we expect apostrophes to be handled by code using Snowball stemmers, which means that tokenisation before stemming currently needs to encode knowledge of the stemmer to be used. This is explicitly noted in the docs, but that doesn't make it any less unhelpful:
What is a word? For indexing purposes, a word in a European language is a sequence of letters bounded by non-letters. But in English, an internal apostrophe does not split a word, although it is not classed as a letter. The treatment of these word boundary characters affects the stemmer. For example, the Kraaij Pohlmann stemmer for Dutch (Kraaij, 1994, 1995) removes hyphen and treats apostrophe as part of the alphabet (so 's, 'tje and 'je are three of their endings). The Dutch stemmer presented here assumes hyphen and apostrophe have already been removed from the word to be stemmed.
Rather contradicting the text quoted above, the English stemmer expects apostrophes to be treated as a letter: "the English stemmer treats apostrophe as a letter" (https://snowballstem.org/texts/apostrophe.html).
Catalan includes suffixes containing apostrophe.
Irish includes prefixes containing apostrophe.
As above, the Kraaij Pohlmann stemmer also expects apostrophes to be treated as a word character.
But the "Dutch" stemmer doesn't.
The French stemmer doesn't either (and so doesn't expect l' and d' prefixes to be present on input).
If it's feasible, I think it would be more helpful for all the stemmers to cope with apostrophe being treated as a word character. If there's a reason why we can't, then we should provide some sort of metadata (e.g. an "apostrophe_is_word_character" flag that can be queried on each stemmer) so that code using the stemmers can automatically configure its tokenisation stage.
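To make the metadata idea concrete, here is a rough sketch of how calling code might use such a flag. Nothing here is an existing Snowball API: the `APOSTROPHE_IS_WORD_CHARACTER` table and `make_tokeniser` are hypothetical names, the table values just restate the behaviour listed above, and only the ASCII apostrophe is considered.

```python
import re

# Hypothetical metadata: Snowball does not expose this today.
# Values restate the per-stemmer behaviour described in this issue.
APOSTROPHE_IS_WORD_CHARACTER = {
    "english": True,
    "catalan": True,
    "irish": True,
    "kraaij_pohlmann": True,
    "dutch": False,
    "french": False,
}

# One or more Unicode letters (no digits, no underscore).
_LETTERS = r"[^\W\d_]+"


def make_tokeniser(language):
    """Return a function that splits text into words for the given stemmer."""
    if APOSTROPHE_IS_WORD_CHARACTER[language]:
        # Keep apostrophes that sit between letters as part of the word.
        pattern = re.compile(_LETTERS + r"(?:'" + _LETTERS + r")*")
    else:
        # Treat apostrophe as a word boundary like any other non-letter.
        pattern = re.compile(_LETTERS)
    return pattern.findall


english_words = make_tokeniser("english")("the dog's bone")
french_words = make_tokeniser("french")("l'homme")
```

With this sketch, English input keeps `dog's` as a single token, while French input is split at the apostrophe so the stemmer never sees the `l'` prefix attached to the word. Splitting is only one possible policy for the `False` case; a real tokeniser might instead drop the elided article entirely.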