
feat: Adds German compound words decomposition with new segmenter #303

Merged · 16 commits · Sep 10, 2024

Conversation

@luflow (Contributor) commented Aug 9, 2024

Pull Request

What does this PR do?

  • Adds a first version of decomposition for German compound words based on a dictionary (based on https://github.com/uschindler/german-decompounder/)
  • Adds a benchmark with German sentences

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

@luflow (Contributor Author) commented Aug 9, 2024

I assume this could be a fairly expensive algorithm, since every possible word length is checked against the dictionary?

Not sure if there is a better solution, but at least it's a first version for compound words :)
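The approach described above ("all word lengths are checked against the dict") can be sketched roughly like this. This is a stdlib-only illustration, not the PR's actual code, and the mini-dictionary is a hypothetical example; the point is that every prefix length at every position is probed against the set, with backtracking when the remainder cannot be decomposed:

```rust
use std::collections::HashSet;

/// Naive decomposition sketch: try every prefix length against the
/// dictionary (longest first) and recurse on the remainder,
/// backtracking when the rest cannot be decomposed.
fn decompose<'a>(word: &'a str, dict: &HashSet<&str>) -> Option<Vec<&'a str>> {
    if word.is_empty() {
        return Some(Vec::new());
    }
    for end in (1..=word.len()).rev() {
        // Skip byte offsets that fall inside a multi-byte character (ä, ö, ü, ß).
        if !word.is_char_boundary(end) {
            continue;
        }
        if dict.contains(&word[..end]) {
            if let Some(mut rest) = decompose(&word[end..], dict) {
                let mut parts = vec![&word[..end]];
                parts.append(&mut rest);
                return Some(parts);
            }
        }
    }
    None
}

fn main() {
    // Hypothetical mini-dictionary, not the PR's real word list.
    let dict: HashSet<&str> = ["hunde", "hütte"].into_iter().collect();
    println!("{:?}", decompose("hundehütte", &dict));
}
```

This makes the cost concern concrete: each position probes up to the full remaining length of the word, and a failed branch is retried with shorter prefixes.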

@luflow (Contributor Author) commented Aug 10, 2024

Another open question: can we even use the dictionary at all?

The original author has it under the GNU GPL:
https://github.com/uschindler/german-decompounder/blob/master/NOTICE.txt

@luflow (Contributor Author) commented Aug 12, 2024

@curquiza @ManyTheFish I fixed the fmt and clippy issues, please rerun the CI.

@ManyTheFish (Member) left a comment

Hello @luflow,

Could you add a feature flag on your implementation as I suggested, please? Then add it as a default feature in the Cargo.toml file.

In terms of implementation, you chose to rely on a HashSet to split the words, but I don't think that's the best approach.
I highly suggest using an FstSegmenter like in the Thai tokenizer; it's a bit more complex to build but far more efficient in both time and space. Alternatively, you could use an AhoCorasick automaton with the LeftmostLongest match kind.

Sorry for the delays!
Let me know if you have a question
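The leftmost-longest matching suggested above can be illustrated with a stdlib-only sketch. The real implementation would use the aho-corasick crate (with MatchKind::LeftmostLongest) or an fst set; here a plain HashSet stands in to show the match semantics, and the dictionary words are just examples:

```rust
use std::collections::HashSet;

/// Greedy leftmost-longest segmentation sketch: at each position, take
/// the longest dictionary word starting there; if nothing matches, emit
/// the remainder unsegmented.
fn leftmost_longest<'a>(word: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < word.len() {
        // Probe candidate end offsets from longest to shortest.
        let mut matched = None;
        for end in (start + 1..=word.len()).rev() {
            if word.is_char_boundary(end) && dict.contains(&word[start..end]) {
                matched = Some(end);
                break;
            }
        }
        match matched {
            Some(end) => {
                out.push(&word[start..end]);
                start = end;
            }
            None => {
                // No dictionary word starts here: emit the rest as-is.
                out.push(&word[start..]);
                break;
            }
        }
    }
    out
}

fn main() {
    // Hypothetical mini-dictionary, not the PR's real word list.
    let dict: HashSet<&str> = ["donau", "dampf", "schiff", "fahrt"].into_iter().collect();
    println!("{:?}", leftmost_longest("donaudampfschifffahrt", &dict));
}
```

Unlike the backtracking approach, this greedy pass visits each position at most once; an automaton (AhoCorasick or an FST) additionally avoids re-hashing every candidate substring.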

Review threads on charabia/src/segmenter/mod.rs (resolved)
@luflow (Contributor Author) commented Aug 27, 2024

Hi @ManyTheFish!

Do you have any instructions for building the fst file? I could not find any material online, especially because "FST" is also used in other contexts (like R) for something totally different 🤣

Otherwise, the LeftmostLongest matching also works with a plain word dictionary, if I understand it correctly?

@ManyTheFish (Member) replied:

> Do you have any instructions for building the fst file? I could not find any material online, especially because "FST" is also used in other contexts (like R) for something totally different 🤣

You can use the CLI fst-bin to build your dictionary from a source file. 😄

> Otherwise, the LeftmostLongest matching also works with a plain word dictionary, if I understand it correctly?

Yes, you can build it from an iterator over str, so it's convenient.

@luflow (Contributor Author) commented Aug 28, 2024

@ManyTheFish I extended the FstSegmenter with two options: a minimum lemma length, and a way to prevent the segmenter from emitting single letters. That keeps my dictionary even smaller and may also be useful for other languages later?

The dictionary is now also transformed into an FST file.

Let me know what you think :)
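The minimum-length idea described above can be sketched as a post-processing pass. This is a hypothetical illustration, not the PR's actual FstSegmenter option: fragments below the threshold (such as the German linking "s") are glued back onto the previous segment instead of being emitted on their own:

```rust
/// Merge segments shorter than `min_len` (counted in chars, not bytes)
/// into their predecessor, so the output never contains stray letters.
fn enforce_min_len(segments: Vec<&str>, min_len: usize) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for seg in segments {
        if seg.chars().count() < min_len {
            if let Some(last) = out.last_mut() {
                last.push_str(seg); // glue the short fragment onto the previous segment
            } else {
                out.push(seg.to_string()); // nothing before it: keep as-is
            }
        } else {
            out.push(seg.to_string());
        }
    }
    out
}

fn main() {
    // "s" is a typical German linking element (Fugen-s).
    let segs = vec!["verkehr", "s", "mittel"];
    println!("{:?}", enforce_min_len(segs, 2));
}
```

Merging backward (onto the preceding word) is a natural choice for German, where linking elements attach to the word before them; whether the PR handles it this way is not shown in this thread.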

@luflow (Contributor Author) commented Sep 7, 2024

@ManyTheFish did you find time yet to look over the changes? Do you need anything else from my side? :)

ManyTheFish previously approved these changes Sep 9, 2024

@ManyTheFish (Member) left a comment

Hello @luflow,
sorry for the delay, LGTM!

bors merge

meili-bors bot added a commit that referenced this pull request Sep 9, 2024
303: feat: Adds German compound words decomposition with new segmenter r=ManyTheFish a=luflow

# Pull Request

## What does this PR do?
- Adds first version of decomposition for german compound words based on a dictionary (based on https://github.com/uschindler/german-decompounder/)
- Adds benchmark with german sentences

## PR checklist
Please check if your PR fulfills the following requirements:
- [X] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [X] Have you read the contributing guidelines?
- [X] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: Florian Ludwig <[email protected]>
Co-authored-by: Florian Ludwig <[email protected]>

meili-bors bot commented Sep 9, 2024

Build failed:

@luflow (Contributor Author) commented Sep 9, 2024

@ManyTheFish ok, applied the suggestion :)

@ManyTheFish (Member) replied:

Hello @luflow,

the tests and clippy are not happy. Could you ensure that:

  • cargo clippy
  • cargo test

both pass on your machine, please?

I'll merge as soon as the tests pass 😃

@luflow (Contributor Author) commented Sep 9, 2024

@ManyTheFish done 👍🏻

@ManyTheFish (Member) left a comment

Nice!

Thank you for the contribution!

bors merge


meili-bors bot commented Sep 10, 2024

Build succeeded:

@meili-bors meili-bors bot merged commit 38b8529 into meilisearch:main Sep 10, 2024
4 checks passed
@luflow luflow deleted the feature/german-compound-words branch September 10, 2024 20:54