feat: Adds German compound words decomposition with new segmenter #303
Conversation
I assume this could be a fairly expensive algorithm, because every possible word length is checked against the dictionary? Not sure if there is a better solution, but at least it's a first version for compound words :)
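The cost concern can be made concrete with a small sketch. This is not the PR's actual code, just a hypothetical std-only illustration of dictionary-based compound splitting: every prefix length of the word is probed against a `HashSet`, recursing on the remainder, which in the worst case touches O(n²) substrings.

```rust
use std::collections::HashSet;

/// Hypothetical sketch (not the PR's implementation): split a compound word
/// by trying every prefix length against the dictionary and recursing on the
/// remainder. Longer prefixes are tried first. Worst case this probes O(n^2)
/// substrings, which is why checking "all word lengths against the dict" can
/// get expensive for long words.
fn split_compound<'a>(word: &'a str, dict: &HashSet<&str>, min_len: usize) -> Option<Vec<&'a str>> {
    if word.is_empty() {
        return Some(Vec::new());
    }
    for end in (min_len..=word.len()).rev() {
        // Only split on valid UTF-8 char boundaries (umlauts are multi-byte).
        if !word.is_char_boundary(end) {
            continue;
        }
        let (head, tail) = word.split_at(end);
        if dict.contains(head) {
            if let Some(mut rest) = split_compound(tail, dict, min_len) {
                rest.insert(0, head);
                return Some(rest);
            }
        }
    }
    None
}

fn main() {
    let dict: HashSet<&str> = ["donau", "dampf", "schiff", "fahrt"].iter().copied().collect();
    let parts = split_compound("donaudampfschifffahrt", &dict, 3);
    // → Some(["donau", "dampf", "schiff", "fahrt"])
    assert_eq!(parts, Some(vec!["donau", "dampf", "schiff", "fahrt"]));
    println!("{:?}", parts);
}
```

The `min_len` parameter (a stand-in for the "min lemma length" mentioned later in the thread) prevents degenerate splits into very short fragments.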
Also, another open question: can we even use the dictionary? The original author has it under the GNU GPL.
@curquiza @ManyTheFish fixed the fmt and clippy issues, please rerun the CI.
Hello @luflow,
Could you add a feature flag on your implementation as I suggested, please? Then add it as a default feature in the Cargo.toml file.
In terms of implementation, you chose to rely on a HashSet to split your words, but I don't think that's the best approach.
I highly suggest using an FstSegmenter like in the Thai tokenizer; it's a bit more complex to build, but far more efficient in both time and space. Alternatively, you could use an AhoCorasick automaton with the LeftmostLongest match kind.
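For readers unfamiliar with the suggested match kind: leftmost-longest means that at each position the longest dictionary word starting there wins. The following is a naive, std-only illustration of that semantics; the real `aho-corasick` crate's `MatchKind::LeftmostLongest` automaton produces the same matches in a single linear pass rather than this O(n·m) scan.

```rust
/// Hand-rolled illustration of leftmost-longest dictionary matching (what
/// `aho_corasick::MatchKind::LeftmostLongest` yields, minus the automaton):
/// at each position, emit the longest pattern that starts there, otherwise
/// skip one character. Naive O(n * m) scan for illustration only.
fn leftmost_longest<'a>(text: &'a str, patterns: &[&str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < text.len() {
        // Among all patterns starting at position i, pick the longest one.
        let best = patterns
            .iter()
            .filter(|p| text[i..].starts_with(**p))
            .max_by_key(|p| p.len());
        match best {
            Some(p) => {
                out.push(&text[i..i + p.len()]);
                i += p.len();
            }
            None => {
                // Nothing matches here: advance one char (UTF-8 safe).
                i += text[i..].chars().next().map_or(1, |c| c.len_utf8());
            }
        }
    }
    out
}

fn main() {
    // "schiff" wins over the shorter "schi" at the same start position.
    let parts = leftmost_longest("dampfschiffxfahrt", &["dampf", "schi", "schiff", "fahrt"]);
    // → ["dampf", "schiff", "fahrt"]
    assert_eq!(parts, vec!["dampf", "schiff", "fahrt"]);
    println!("{:?}", parts);
}
```

This is why the longest-match preference matters for German compounds: short dictionary entries would otherwise greedily eat prefixes of longer lemmas.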
Sorry for the delays!
Let me know if you have a question
Hi @ManyTheFish! Do you have any instructions on how to build the fst file? I could not find any material online, especially because "FST" is also used in other contexts (like R) for something totally different 🤣 Otherwise, the leftmost-longest match functionality would also work with a plain word dictionary, if I understand it correctly?
You can use the fst-bin CLI to build your dictionary from a source file. 😄
Yes, you can build it from an iterator over str, so it's convenient.
…and min lemma length definition
@ManyTheFish I extended the implementation; the dictionary is now also transformed into an FST file. Let me know what you think :)
@ManyTheFish did you find time yet to look over the changes? Do you need anything else from my side? :)
Hello @luflow,
Sorry for the delay, LGTM!
bors merge
303: feat: Adds German compound words decomposition with new segmenter r=ManyTheFish a=luflow

# Pull Request

## What does this PR do?
- Adds a first version of decomposition for German compound words based on a dictionary (based on https://github.com/uschindler/german-decompounder/)
- Adds a benchmark with German sentences

## PR checklist
Please check if your PR fulfills the following requirements:
- [X] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [X] Have you read the contributing guidelines?
- [X] Have you made sure that the title is accurate and descriptive of the changes?

Co-authored-by: Florian Ludwig <[email protected]>
Build failed:
Co-authored-by: Many the fish <[email protected]>
@ManyTheFish ok, applied the suggestion :)
Hello @luflow, the tests and clippy are not happy; could you make sure they pass on your machine, please? I'll merge as soon as the tests pass 😃
@ManyTheFish done 👍🏻
Nice!
Thank you for the contribution!
bors merge
Build succeeded: