Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opusfilter backing #43

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Opusfilter backing #43

wants to merge 1 commit into from

Conversation

jelmervdl
Copy link
Collaborator

Experimental branch that uses OpusFilter for all of the tools and processing.

Ideally we'd add compatibility for our external filters (the .json files) as well because I like that extensibility a lot. But OpusFilter already comes with some useful ones. And implementing our own filters in Python is also doable.

I'm also going to use this Pull Request as a little notepad for things I find in OpusFilter that I need to write down somewhere so I can have someone else look at whether it makes sense.

Notes on OpusFilter

Because this is just from reading the source, not from actually trying it. So I might be wrong.

RegExpPreprocessor

Does the RegExpPreprocessor work? It seems to do double compilation of lang_patterns:
https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/preprocessors.py#L93-L98

Filter pipeline implementation

FilterABC has a filter base implementation that’s pretty naievely calling self.score with a single pair:
https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/__init__.py#L50-L54

It looks as if that naïve implementation is called in the pipeline:
https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/pipeline.py#L94-L98

… which all in all feels wrong given how much attention is given to do proper chunking in the steps before it, and all of the filter implementations being generators. None of the actual filters make use of batching, but I’d say that would be a useful thing once you’d add filters like LASER.

Separation of preprocessors and filters and intermediate output files

It is useful that OpusFilter can read file formats as part of processing steps, but the downside is that each step has to name input and output files. When mixing processing and filtering steps, this forces you to write intermediate data to disk. Maybe empty-train should have a more strict distinction between filtering and preprocessing. But from a user perspective… is that what you’d want? Say I’d like to filter out the obvious trash first, then preprocess the remainder to be as good as possible, and then use the expensive filters to filter out the lower quality stuff.

@jelmervdl jelmervdl linked an issue Oct 18, 2022 that may be closed by this pull request
@svirpioj
Copy link

svirpioj commented Oct 26, 2022

Hello Jelmer,

Does the RegExpPreprocessor work? It seems to do double compilation of lang_patterns

Indeed there was extra code there - it didn't affect anything, just compiled the same patterns twice, so no-one had noticed before. The extra lines are now removed in the develop branch.

all in all feels wrong given how much attention is given to do proper chunking in the steps before it, and all of the filter implementations being generators. None of the actual filters make use of batching, but I’d say that would be a useful thing once you’d add filters like LASER.

The exact benefit of filters being generators is that you can write the pipeline filter as simply and efficiently as that.

Unfortunately scoring and filterfalse do not work as simply as that, so they require chunking in order that not all outputs need to be stored into the memory at once.

And yes, another complication are the filters that require or benefit from batching. For example, WordAlignFilter does chunking internally, as eflomal needs input in files, and SentenceEmbeddingFilter - based on LASER - does it already for scoring. In fact, it definitely should use chunking also for the filter method, I'll need to fix that.

@jelmervdl jelmervdl removed a link to an issue Nov 9, 2022
@jelmervdl jelmervdl linked an issue Nov 9, 2022 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Opusfilters integration
2 participants