Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding regex replacement feature #202

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

raivisdejus
Copy link

Adding another replacement option, that can process regexs. This can be used to split longer sentences into smaller chunks.

In my tests for Latvian, this can yield ~25% more sentences in the final output.

@raivisdejus raivisdejus force-pushed the add-regex-replacement-list branch 2 times, most recently from 0addfd9 to 350d7ff Compare October 25, 2023 12:14
@raivisdejus raivisdejus force-pushed the add-regex-replacement-list branch from 350d7ff to ac5a79f Compare October 25, 2023 12:16
@MichaelKohler
Copy link
Member

Thanks for submitting this! I will have a closer look at this PR tomorrow.

Copy link
Member

@MichaelKohler MichaelKohler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR! I have left a few comments, but nothing major. Please ping me once ready for another round of review and we can get this merged as soon as possible :)

README.md Outdated Show resolved Hide resolved
README.md Outdated
```

This will find words that glue two sentences and will add a space to un-glue them.
And will split a long sentence in two smaller.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good example and easily understandable, thanks for this thorough documentation. In the context of Wikipedia extracts, more sentences might actually mean less content, as a sentence might be fulfilling all rule requirements, but then gets split into two. And then only one of them gets picked. Of course this heavily depends on how many potential sentences a given article has. In many cases (such as yours), this might be beneficial, but it doesn't always have to be. Might be worth it to write a short explanation here for that as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is where it is most worthwhile: If the article does not have enough sentences to select from (<3) because of the rules, especially max_words and/or max_characters. At that time, this algorithm can kick in and try to produce split sentences.

There is no way for us to know if pre-split or post-split can produce more "valuable" sentences. But many of the "sub-sentences" might be simple introductory wording etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and that's exactly why I would prefer a short sentence explaining that, so people don't just blindly copy. If we have an indication that it works in all corpuses, then we could also just do it by default.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note on sentence splitting.

src/replacer.rs Outdated Show resolved Hide resolved
@@ -28,6 +28,19 @@ pub fn replace_strings(rules: &Rules, raw: &str) -> String {
}
}

// regex replacements
for regex_replacement in rules.regex_replacement_list.iter() {
if Value::as_array(regex_replacement).unwrap().len() == 3 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made me wonder if this implementation should go further than just with 3 values. Initially I thought such a regex implementation would only take two arguments and basically work like the replace_all function. But thinking about it, I can absolutely see why 3 arguments can be even more helpful, though many use cases could also be covered by named capture groups (but not all!).

Would you be interested in implementing a second form of this that accepts two arguments and replaces every matched occurrence with that string? This of course could be done outside this PR as a follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants