Turkish Stemmer has problems #176
Comments
Note that while the stem is often a word itself, this is not always the case: it is not a requirement for text search systems, which are the intended field of use of Snowball. So "odu" being meaningless is not a problem in itself. If other forms of the word "odun" don't also stem to "odu", that's a problem. If unrelated words also stem to "odu", that's a (probably worse) problem.
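The two checks described above can be sketched as follows. This is only an illustration and not part of Snowball: `REPORTED_STEMS` copies the outputs quoted in this issue rather than running the real stemmer, and the helper names and the grouping of words into related forms are assumptions made for the example.

```python
from typing import Callable, Iterable, Set

# Stemmer outputs as reported in this issue (not re-run against Snowball).
REPORTED_STEMS = {
    "odun": "odu",
    "oda": "o",
    "adam": "ada",
    "adamlar": "adam",
    "odam": "oda",
}


def reported_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: look up the reported output."""
    return REPORTED_STEMS[word]


def stems_of(words: Iterable[str], stem: Callable[[str], str]) -> Set[str]:
    """Collect the distinct stems produced for a group of words."""
    return {stem(w) for w in words}


def is_split(related_forms: Iterable[str], stem: Callable[[str], str]) -> bool:
    """True if forms of the same word receive different stems."""
    return len(stems_of(related_forms, stem)) > 1


def is_conflated(a: str, b: str, stem: Callable[[str], str]) -> bool:
    """True if two words assumed to be unrelated receive the same stem."""
    return stem(a) == stem(b)


# Assumption: "adam"/"adamlar" are forms of one word, as are "oda"/"odam".
print(is_split(["adam", "adamlar"], reported_stem))  # True
print(is_split(["oda", "odam"], reported_stem))      # True
# Assumption: "odun" and "oda" are unrelated words.
print(is_conflated("odun", "oda", reported_stem))    # False
```

To test a real stemmer, pass its stemming function in place of `reported_stem`.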
I looked into the
I didn't see any for this case, but the stemmer currently produces some very short stems (a single character in some cases), which conflates unrelated words. This is effectively a form of overstemming, and it is the worse problem because it leads to incorrectly matching irrelevant documents rather than possibly missing some relevant ones. I've written up both issues in more detail on the mailing list, in the hope that someone with more knowledge of Turkish than me is up to the job of helping sort it out (many more people read the list than are likely to see a discussion here): https://lists.tartarus.org/pipermail/snowball-discuss/2023-August/001755.html #171 reported
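One rough way to spot candidates for this kind of overstemming is to flag words whose stem is very short. The sketch below uses only the outputs quoted in this issue. The function name and the `min_length` threshold are assumptions chosen for illustration, not anything Snowball provides.

```python
from typing import Dict

# Stemmer outputs as reported in this issue (not re-run against Snowball).
REPORTED_STEMS = {
    "odun": "odu",
    "oda": "o",
    "adam": "ada",
    "adamlar": "adam",
    "odam": "oda",
}


def short_stems(stems: Dict[str, str], min_length: int = 2) -> Dict[str, str]:
    """Return the word -> stem pairs whose stem is shorter than min_length."""
    return {word: stem for word, stem in stems.items() if len(stem) < min_length}


print(short_stems(REPORTED_STEMS))  # {'oda': 'o'}
```

A short stem is only a hint. Whether it actually conflates unrelated words still needs checking by someone who knows the language.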
odun --> odu (meaningless)
oda --> o (oda means room or you too, stemmer chooses you)
adam --> ada (adam means man or my island, stemmer chooses my island)
adamlar --> adam
odam --> oda
One should perhaps distinguish them somehow.