Adding regex replacement feature #202

raivisdejus · 2023-10-25T12:11:08Z

Adds another replacement option that can process regexes. This can be used to split longer sentences into smaller chunks.

In my tests for Latvian, this can yield ~25% more sentences in the final output.

MichaelKohler · 2023-10-28T13:24:09Z

Thanks for submitting this! I will have a closer look at this PR tomorrow.

MichaelKohler

Thanks for submitting this PR! I have left a few comments, but nothing major. Please ping me once ready for another round of review and we can get this merged as soon as possible :)

README.md

MichaelKohler · 2023-10-31T21:43:17Z

+```
+
+This will find words that glue two sentences and will add a space to un-glue them. 
+And will split a long sentence in two smaller.
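
As a rough illustration of the un-gluing described in the README text above, here is a minimal std-only sketch. It is not the PR's implementation: the function name `unglue_sentences` is hypothetical, and a plain character scan stands in for the regex so the snippet runs without the `regex` crate. It treats a period that follows a lowercase letter and precedes an uppercase letter as a glued sentence boundary.

```rust
/// Hypothetical helper: inserts a space after a period that sits directly
/// between a lowercase letter and an uppercase letter ("rained.Then").
/// A character scan is used here as a stand-in for a regex replacement.
fn unglue_sentences(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len() + 8);
    for (i, &c) in chars.iter().enumerate() {
        out.push(c);
        // A glued boundary: lowercase letter, '.', uppercase letter.
        let glued = c == '.'
            && i > 0
            && chars[i - 1].is_lowercase()
            && chars.get(i + 1).map_or(false, |next| next.is_uppercase());
        if glued {
            out.push(' ');
        }
    }
    out
}
```

Note that a heuristic like this also splits abbreviations such as "Dr.Smith", so any real rule would likely need per-language tuning.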


I think this is a good example and easily understandable, thanks for this thorough documentation. In the context of Wikipedia extracts, more sentences might actually mean less content: a sentence that fulfills all rule requirements might get split into two, and then only one of the halves gets picked. Of course this heavily depends on how many potential sentences a given article has. In many cases (such as yours), this might be beneficial, but it doesn't always have to be. It might be worth writing a short explanation of that here as well.

HarikalarKutusu · 2023-11-01

I think this is where it is most worthwhile: when the article does not have enough sentences to select from (<3) because of the rules, especially max_words and/or max_characters. At that point, this algorithm can kick in and try to produce split sentences.

There is no way for us to know whether pre-split or post-split produces more "valuable" sentences. But many of the "sub-sentences" might be simple introductory wording etc.

MichaelKohler · 2023-11-01

I agree, and that's exactly why I would prefer a short sentence explaining that, so people don't just blindly copy. If we have an indication that it works in all corpora, then we could also just do it by default.

raivisdejus · 2024-01-13

Added a note on sentence splitting.
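
The fallback idea raised in this thread (only split when too few sentences pass the rules) could look roughly like the sketch below. This is an assumption-laden illustration, not code from the PR: `passes_rules`, `split_sentence`, and `select_with_fallback` are hypothetical names, the rule check only models `max_words`, and splitting on `"; "` stands in for a regex-based split.

```rust
/// Hypothetical stand-in for the extractor's rule check: a word limit only.
fn passes_rules(sentence: &str, max_words: usize) -> bool {
    let words = sentence.split_whitespace().count();
    words > 0 && words <= max_words
}

/// Hypothetical splitter standing in for a regex replacement: splits on "; ".
fn split_sentence(sentence: &str) -> Vec<String> {
    sentence
        .split("; ")
        .map(|part| part.trim().to_string())
        .filter(|part| !part.is_empty())
        .collect()
}

/// Keeps sentences that pass the rules unchanged. Only if fewer than
/// `min_sentences` pass are the rejected sentences split and re-checked.
fn select_with_fallback(
    sentences: &[&str],
    max_words: usize,
    min_sentences: usize,
) -> Vec<String> {
    let mut accepted: Vec<String> = sentences
        .iter()
        .copied()
        .filter(|s| passes_rules(s, max_words))
        .map(String::from)
        .collect();
    if accepted.len() >= min_sentences {
        return accepted;
    }
    for sentence in sentences.iter().copied().filter(|s| !passes_rules(s, max_words)) {
        for part in split_sentence(sentence) {
            if passes_rules(&part, max_words) {
                accepted.push(part);
            }
        }
    }
    accepted
}
```

With this shape, a sentence that already passes the rules is never split, which avoids the "more sentences, less content" case discussed above. Whether the resulting fragments are worth keeping still has to be judged per language.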

src/replacer.rs

MichaelKohler · 2023-10-31T22:00:26Z

@@ -28,6 +28,19 @@ pub fn replace_strings(rules: &Rules, raw: &str) -> String {
        }
    }

+    // regex replacements
+    for regex_replacement in rules.regex_replacement_list.iter() {
+        if Value::as_array(regex_replacement).unwrap().len() == 3 {


This made me wonder if this implementation should go further than just with 3 values. Initially I thought such a regex implementation would only take two arguments and basically work like the replace_all function. But thinking about it, I can absolutely see why 3 arguments can be even more helpful, though many use cases could also be covered by named capture groups (but not all!).

Would you be interested in implementing a second form of this that accepts two arguments and replaces every matched occurrence with that string? This of course could be done outside this PR as a follow-up.
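
A minimal sketch of how such a two-argument form might be dispatched, under stated assumptions: the function name `apply_two_argument_rules` and the `Vec<String>` rule shape are hypothetical, and literal matching via `str::replace` stands in for regex matching so the snippet needs no external crate. Entries of any other length are skipped here rather than unwrapped, and the three-value form is left to the PR itself.

```rust
/// Hypothetical sketch: applies `[pattern, replacement]` entries in order.
/// Literal `str::replace` stands in for a regex "replace all" call.
fn apply_two_argument_rules(rules: &[Vec<String>], raw: &str) -> String {
    let mut result = raw.to_string();
    for rule in rules {
        // Only two-element entries are handled; other lengths are ignored.
        if let [pattern, replacement] = rule.as_slice() {
            if pattern.is_empty() {
                // An empty pattern would match between every character.
                continue;
            }
            result = result.replace(pattern.as_str(), replacement);
        }
    }
    result
}
```

Matching on the slice shape keeps malformed entries from panicking, which is one possible alternative to calling `unwrap()` on each rule.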


raivisdejus force-pushed the add-regex-replacement-list branch 2 times, most recently from 0addfd9 to 350d7ff Compare October 25, 2023 12:14

Adding regex replacement feature

ac5a79f

raivisdejus force-pushed the add-regex-replacement-list branch from 350d7ff to ac5a79f Compare October 25, 2023 12:16

MichaelKohler requested changes Oct 31, 2023


raivisdejus and others added 3 commits January 13, 2024 08:43

Update README.md

eb1b4f6

Co-authored-by: Michael Kohler <[email protected]>

Update src/replacer.rs

ad93f1f

Co-authored-by: Michael Kohler <[email protected]>

Adding note on sentence splitting

303464a


HarikalarKutusu Nov 1, 2023

MichaelKohler Nov 1, 2023

raivisdejus Jan 13, 2024
