
feat: Adds German compound words decomposition with new segmenter #303

Merged · 16 commits · Sep 10, 2024

Conversation

@luflow (Contributor) commented Aug 9, 2024

Pull Request

What does this PR do?

  • Adds a first version of decomposition for German compound words based on a dictionary (based on https://github.com/uschindler/german-decompounder/)
  • Adds a benchmark with German sentences

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

@luflow (Contributor Author) commented Aug 9, 2024

I assume this could be a fairly expensive algorithm, since every possible word length is checked against the dictionary?

Not sure if there is a better solution, but at least it's a first version for compound words :)
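The approach described above ("all word lengths are checked against the dict") can be sketched roughly like this. This is a stdlib-only illustration, not the PR's actual code, and the mini-dictionary is a hypothetical example; the point is that every prefix length at every position is probed against the set, with backtracking when the remainder cannot be decomposed:

```rust
use std::collections::HashSet;

/// Naive decomposition sketch: try every prefix length against the
/// dictionary (longest first) and recurse on the remainder,
/// backtracking when the rest cannot be decomposed.
fn decompose<'a>(word: &'a str, dict: &HashSet<&str>) -> Option<Vec<&'a str>> {
    if word.is_empty() {
        return Some(Vec::new());
    }
    for end in (1..=word.len()).rev() {
        // Skip byte offsets that fall inside a multi-byte character (ä, ö, ü, ß).
        if !word.is_char_boundary(end) {
            continue;
        }
        if dict.contains(&word[..end]) {
            if let Some(mut rest) = decompose(&word[end..], dict) {
                let mut parts = vec![&word[..end]];
                parts.append(&mut rest);
                return Some(parts);
            }
        }
    }
    None
}

fn main() {
    // Hypothetical mini-dictionary, not the PR's real word list.
    let dict: HashSet<&str> = ["hunde", "hütte"].into_iter().collect();
    println!("{:?}", decompose("hundehütte", &dict));
}
```

This makes the cost concern concrete: each position probes up to the full remaining length of the word, and a failed branch is retried with shorter prefixes.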

@luflow (Contributor Author) commented Aug 10, 2024

Another open question: can we even use the dictionary at all?

The original author has it under the GNU GPL:
https://github.com/uschindler/german-decompounder/blob/master/NOTICE.txt

@luflow (Contributor Author) commented Aug 12, 2024

@curquiza @ManyTheFish I fixed the fmt and clippy issues, please rerun the CI.

@ManyTheFish (Member) left a comment

Hello @luflow,

Could you add a feature flag on your implementation as I suggested, please? Then add it as a default feature in the Cargo.toml file.

In terms of implementation, you chose to rely on a HashSet to split the words, but I don't think that's the best approach.
I highly suggest using an FstSegmenter like in the Thai tokenizer; it's a bit more complex to build but far more efficient in both time and space. Alternatively, you could use an AhoCorasick automaton with the LeftmostLongest match kind.

Sorry for the delays!
Let me know if you have a question
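The leftmost-longest matching suggested above can be illustrated with a stdlib-only sketch. The real implementation would use the aho-corasick crate (with MatchKind::LeftmostLongest) or an fst set; here a plain HashSet stands in to show the match semantics, and the dictionary words are just examples:

```rust
use std::collections::HashSet;

/// Greedy leftmost-longest segmentation sketch: at each position, take
/// the longest dictionary word starting there; if nothing matches, emit
/// the remainder unsegmented.
fn leftmost_longest<'a>(word: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < word.len() {
        // Probe candidate end offsets from longest to shortest.
        let mut matched = None;
        for end in (start + 1..=word.len()).rev() {
            if word.is_char_boundary(end) && dict.contains(&word[start..end]) {
                matched = Some(end);
                break;
            }
        }
        match matched {
            Some(end) => {
                out.push(&word[start..end]);
                start = end;
            }
            None => {
                // No dictionary word starts here: emit the rest as-is.
                out.push(&word[start..]);
                break;
            }
        }
    }
    out
}

fn main() {
    // Hypothetical mini-dictionary, not the PR's real word list.
    let dict: HashSet<&str> = ["donau", "dampf", "schiff", "fahrt"].into_iter().collect();
    println!("{:?}", leftmost_longest("donaudampfschifffahrt", &dict));
}
```

Unlike the backtracking approach, this greedy pass visits each position at most once; an automaton (AhoCorasick or an FST) additionally avoids re-hashing every candidate substring.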

Review threads on charabia/src/segmenter/mod.rs (resolved)
@luflow (Contributor Author) commented Aug 27, 2024

Hi @ManyTheFish!

Do you have any instructions for building the fst file? I could not find any material online, especially because "FST" is also used in other contexts (like R) for something totally different 🤣

Otherwise, the LeftmostLongest matching also works with a plain word dictionary, if I understand it correctly?

@ManyTheFish (Member) replied:

> Do you have any instructions for building the fst file? I could not find any material online, especially because "FST" is also used in other contexts (like R) for something totally different 🤣

You can use the CLI fst-bin to build your dictionary from a source file. 😄

> Otherwise, the LeftmostLongest matching also works with a plain word dictionary, if I understand it correctly?

Yes, you can build it from an iterator over str, so it's convenient.

@luflow (Contributor Author) commented Aug 28, 2024

@ManyTheFish I extended the FstSegmenter with two options: a minimum lemma length, and a way to prevent the segmenter from emitting single letters. That keeps my dictionary even smaller and may also be useful for other languages later?

The dictionary is now also transformed into an FST file.

Let me know what you think :)
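The minimum-length idea described above can be sketched as a post-processing pass. This is a hypothetical illustration, not the PR's actual FstSegmenter option: fragments below the threshold (such as the German linking "s") are glued back onto the previous segment instead of being emitted on their own:

```rust
/// Merge segments shorter than `min_len` (counted in chars, not bytes)
/// into their predecessor, so the output never contains stray letters.
fn enforce_min_len(segments: Vec<&str>, min_len: usize) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for seg in segments {
        if seg.chars().count() < min_len {
            if let Some(last) = out.last_mut() {
                last.push_str(seg); // glue the short fragment onto the previous segment
            } else {
                out.push(seg.to_string()); // nothing before it: keep as-is
            }
        } else {
            out.push(seg.to_string());
        }
    }
    out
}

fn main() {
    // "s" is a typical German linking element (Fugen-s).
    let segs = vec!["verkehr", "s", "mittel"];
    println!("{:?}", enforce_min_len(segs, 2));
}
```

Merging backward (onto the preceding word) is a natural choice for German, where linking elements attach to the word before them; whether the PR handles it this way is not shown in this thread.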

@luflow (Contributor Author) commented Sep 7, 2024

@ManyTheFish did you find time yet to look over the changes? Do you need anything else from my side? :)

ManyTheFish previously approved these changes Sep 9, 2024

@ManyTheFish (Member) left a comment

Hello @luflow,
sorry for the delay, LGTM!

bors merge

meili-bors bot added a commit that referenced this pull request Sep 9, 2024
303: feat: Adds German compound words decomposition with new segmenter r=ManyTheFish a=luflow

# Pull Request

## What does this PR do?
- Adds first version of decomposition for german compound words based on a dictionary (based on https://github.com/uschindler/german-decompounder/)
- Adds benchmark with german sentences

## PR checklist
Please check if your PR fulfills the following requirements:
- [X] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [X] Have you read the contributing guidelines?
- [X] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: Florian Ludwig <[email protected]>
Co-authored-by: Florian Ludwig <[email protected]>

meili-bors bot commented Sep 9, 2024

Build failed:

@luflow (Contributor Author) commented Sep 9, 2024

@ManyTheFish ok, applied the suggestion :)

@ManyTheFish (Member) replied:

Hello @luflow,

the tests and clippy are not happy. Could you ensure that:

  • cargo clippy
  • cargo test

both pass on your machine, please?

I'll merge as soon as the tests pass 😃

@luflow (Contributor Author) commented Sep 9, 2024

@ManyTheFish done 👍🏻

@ManyTheFish (Member) left a comment

Nice!

Thank you for the contribution!

bors merge


meili-bors bot commented Sep 10, 2024

Build succeeded:

@meili-bors meili-bors bot merged commit 38b8529 into meilisearch:main Sep 10, 2024
4 checks passed
@luflow luflow deleted the feature/german-compound-words branch September 10, 2024 20:54