Extension for another Turkic related language #193

Closed
cherepanovic opened this issue Feb 8, 2024 · 4 comments

Comments

@cherepanovic

Hello developers,

how difficult is it to extend your library to another Turkic language? Where should I start?

I appreciate any advice!

@ojwb
Member

ojwb commented Feb 27, 2024

What's the language?

CONTRIBUTING.rst documents the process, but doesn't currently talk about difficulty.

The hardest part is probably coming up with an algorithm. If there's a suitable existing algorithm in an academic paper, that may be a good starting point, as someone has devised the algorithm and evaluated it for you. If there's a widely used stemmer implementation in another programming language, licensed such that you can study the source and reimplement it in Snowball, you could start there. Or if the language is similar to Turkish you could start from turkish.sbl, though I should warn you that the current Turkish algorithm has unresolved problems (see #176).

Implementing the algorithm in Snowball is not usually too difficult, though if you're implementing a pre-existing algorithm then sometimes little details of an existing implementation can prove awkward to implement exactly in Snowball. We can probably help there.

Documenting the new algorithm and integrating it into Snowball should be fairly easy.

@ojwb
Member

ojwb commented Feb 27, 2024

Or if the language is similar to Turkish you could start from turkish.sbl, though I should warn you that the current Turkish algorithm has unresolved problems (see #176).

It occurs to me that if your language is similar to Turkish and you're also familiar with Turkish then helping us resolve these problems and then adapting the revised Turkish Snowball stemmer could work.

The key problem with the Turkish stemmer is that it can produce very short stems - e.g. see the example of all the words which stem to a (https://lists.tartarus.org/pipermail/snowball-discuss/2023-August/001755.html). Martin also wondered if it was overly complex, though Turkish has a lot of suffixes compared to many of the languages we have stemmers for, so that complexity may be justified.
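To get a feel for how widespread the short-stem problem is in a given vocabulary, one option is to group words by their stem and list only the very short stems that several words collapse onto. The following is a rough sketch, not part of Snowball: `short_stem_groups` and `toy_stem` are hypothetical names, and `toy_stem` is a deliberately naive stand-in (it just strips a trailing "da") rather than the real Turkish algorithm. Any callable that maps a word to its stem could be passed in instead.

```python
from collections import defaultdict


def short_stem_groups(words, stem, max_len=2):
    """Group words by stem and keep only short stems shared by several words.

    `stem` is any callable mapping a word to its stem (an assumption here;
    plug in whichever stemmer you want to audit).
    """
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {
        s: sorted(ws)
        for s, ws in groups.items()
        if len(s) <= max_len and len(ws) > 1
    }


def toy_stem(word):
    """Hypothetical toy stemmer: strips a trailing "da" with no length guard
    beyond requiring a non-empty stem. NOT the Snowball Turkish stemmer."""
    if word.endswith("da") and len(word) > 2:
        return word[:-2]
    return word


if __name__ == "__main__":
    sample = ["o", "oda", "ada", "odada"]
    # Reports stems of at most two characters that more than one word maps to.
    print(short_stem_groups(sample, toy_stem))
```

With the toy stemmer, "oda" and "o" end up under the same one-character stem, which is the kind of conflation described above.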

@cherepanovic
Author

I have seen the problems with the Turkish stemmer... At first glance, I wouldn't know how to solve them.

oda - is a word as well as o + da is a word with a suffix

@ojwb
Member

ojwb commented Feb 27, 2024

oda - is a word as well as o + da is a word with a suffix

Indeed, but such cases occur in other languages too - e.g. in English routing is a form of both the verbs route and rout (https://en.wiktionary.org/wiki/routing).

Stemmers inevitably do an imperfect job - the key thing is really that they can improve retrieval results despite this. Overstemming is generally more problematic than understemming, because conflating unrelated words is usually worse than missing a potential result.

For Turkish, I think the biggest problem is the one- and two-character stems, as these result in a lot of conflation of unrelated words. Adding an R1/R2-based approach would probably address this, as that approach has proved successful in many other languages (https://snowballstem.org/texts/r1r2.html).
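As a rough illustration of the R1/R2 idea (in Python rather than Snowball, and not taken from turkish.sbl): R1 is the region after the first non-vowel that follows a vowel, or the empty region at the end of the word if there is no such non-vowel, and R2 is the same definition applied again inside R1. A suffix is then only removed if it lies entirely within the region. The function names and the choice of "da" as the suffix are assumptions made for this sketch, and the vowel set is simply the eight Turkish vowels.

```python
TURKISH_VOWELS = set("aeıioöuü")


def region_start(word, vowels=TURKISH_VOWELS, start=0):
    """Index just after the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def regions(word, vowels=TURKISH_VOWELS):
    """Return the (R1, R2) substrings of `word`."""
    p1 = region_start(word, vowels)
    p2 = region_start(word, vowels, start=p1)
    return word[p1:], word[p2:]


def strip_if_in_r1(word, suffix, vowels=TURKISH_VOWELS):
    """Remove `suffix` only if it lies entirely inside R1 (illustrative rule)."""
    p1 = region_start(word, vowels)
    if word.endswith(suffix) and len(word) - len(suffix) >= p1:
        return word[: len(word) - len(suffix)]
    return word


if __name__ == "__main__":
    # The English example from the R1/R2 page, using English vowels.
    print(regions("beautiful", set("aeiouy")))
    # "da" is not inside R1 of "oda", so the word is left alone.
    print(strip_if_in_r1("oda", "da"))
    # In "odada" the final "da" is inside R1, so it is removed.
    print(strip_if_in_r1("odada", "da"))
```

Under this rule "oda" is no longer reduced to a one-character stem, while "odada" still loses its final "da". Whether R1 or R2 is the right guard for each Turkish suffix class would need to be evaluated against real vocabulary.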
