Extension for another Turkic related language #193
Comments
What's the language?

CONTRIBUTING.rst documents the process, but doesn't currently talk about difficulty.

The hardest part is probably coming up with an algorithm. If there's a suitable existing algorithm in an academic paper, that may be a good starting point, as someone has already devised and evaluated the algorithm for you. If there's a widely used stemmer implementation in another programming language, licensed such that you can study the source and reimplement it in Snowball, you could start there. Or, if the language is similar to Turkish, you could start from the existing Turkish stemmer.

Implementing the algorithm in Snowball is not usually too difficult, though if you're implementing a pre-existing algorithm then little details of an existing implementation can sometimes prove awkward to reproduce exactly in Snowball. We can probably help there.

Documenting the new algorithm and integrating it into Snowball should be fairly easy.
It occurs to me that if your language is similar to Turkish and you're also familiar with Turkish, then helping us resolve the problems with the Turkish stemmer and then adapting the revised Turkish Snowball stemmer could work. The key problem with the Turkish stemmer is that it can produce very short stems - e.g. see the example of all the words which stem to
I have seen the problems with the Turkic... I wouldn't know how to solve them at first glance. oda is a word in its own right, but o + da is also a word with a suffix.
Indeed, but such cases occur in other languages too - e.g. in English.

Stemmers inevitably do an imperfect job - the key thing is really that they can improve retrieval results despite this. Overstemming is generally more problematic than understemming, because conflating unrelated words is usually worse than missing a potential result.

For Turkish, I think the biggest problem is the one and two character stems, as these result in a lot of conflation of unrelated words. Adding an R1/R2 based approach would probably address this, as that approach has proved successful in many other languages (https://snowballstem.org/texts/r1r2.html).
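To make the R1/R2 idea concrete, here is a rough Python sketch. It is not Snowball code and not how the current Turkish stemmer behaves. It assumes the usual definition from the page linked above (R1 is the region after the first non-vowel following a vowel, and R2 is the same rule applied inside R1). The vowel set, the helper names and the choice of R1 as the guard are illustrative assumptions only.

```python
# Assumed vowel inventory for Turkish; adjust for the target language.
TURKISH_VOWELS = set("aeıioöuü")


def _region_start(word, start, vowels):
    """Index just after the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def r1_r2(word, vowels=TURKISH_VOWELS):
    """Return the start offsets of R1 and R2 within `word`."""
    r1 = _region_start(word, 0, vowels)
    r2 = _region_start(word, r1, vowels)
    return r1, r2


def strip_suffix_in_r1(word, suffix, vowels=TURKISH_VOWELS):
    """Remove `suffix` only if it lies entirely inside R1 (hypothetical guard)."""
    r1, _ = r1_r2(word, vowels)
    if word.endswith(suffix) and len(word) - len(suffix) >= r1:
        return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print(r1_r2("beautiful", set("aeiouy")))
    print(strip_suffix_in_r1("oda", "da"))
    print(strip_suffix_in_r1("evde", "de"))
```

With this guard, "da" is left alone in "oda" because removing it would reach outside R1, while "de" is still removed from "evde". Whether R1, R2 or some other minimum-length condition is the right choice for a given Turkic language would need evaluating against a real vocabulary.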
Hello developers,
How difficult is it to extend your library to another Turkic language? Where should I start?
I appreciate any advice!