Experimental branch that uses OpusFilter for all of the tools and processing.
Ideally we'd also add compatibility for our external filters (the .json files), because I like that extensibility a lot. But OpusFilter already comes with some useful filters, and implementing our own in Python is also doable.
I'm also going to use this Pull Request as a little notepad for things I find in OpusFilter that I need to write down somewhere so I can have someone else look at whether it makes sense.
Notes on OpusFilter
These notes come from reading the source, not from actually trying it, so I might be wrong.
RegExpPreprocessor
Does the RegExpPreprocessor work? It seems to do double compilation of lang_patterns:
https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/preprocessors.py#L93-L98
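For reference, a small sketch of what Python's own `re` module does when it is handed an already compiled pattern. This is not the OpusFilter code (I have not run that), and `compile_twice` is a hypothetical helper for illustration only. If the double compilation is just `re.compile` applied to a compiled pattern without flags, it is harmless; with flags it raises.

```python
import re


def compile_twice(pattern, flags=0):
    """Hypothetical helper: compile a pattern, then compile the result again."""
    first = re.compile(pattern)
    # Passing a compiled pattern back into re.compile returns the same object
    # when flags == 0, and raises ValueError when flags are given.
    second = re.compile(first, flags)
    return first, second


first, second = compile_twice(r"\s+")
print(second is first)

try:
    compile_twice(r"\s+", re.IGNORECASE)
except ValueError as err:
    print("second compile failed:", err)
```

So whether the linked lines are actually broken depends on what exactly gets passed to the second compile, which is worth checking by running it.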
Filter pipeline implementation
FilterABC has a filter base implementation that pretty naively calls self.score with a single pair: https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/__init__.py#L50-L54
It looks as if that naïve implementation is called in the pipeline:
https://github.com/Helsinki-NLP/OpusFilter/blob/9f6636960a21a673f80308e8bd36216cdb144caa/opusfilter/pipeline.py#L94-L98
… which all in all feels wrong, given how much attention the preceding steps pay to proper chunking, and given that all of the filter implementations are generators. None of the actual filters make use of batching, but I’d say batching would become useful once you add filters like LASER.
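To make the concern concrete, here is a minimal sketch of the two call patterns. `ToyFilter`, `filter_naive` and `filter_batched` are hypothetical names of my own, not OpusFilter classes or methods; only the general shape (a `score` generator over pairs plus an `accept` check) follows what the notes above describe.

```python
from itertools import islice


class ToyFilter:
    """Hypothetical stand-in for a filter with a score generator and accept check."""

    def __init__(self, min_length=3):
        self.min_length = min_length
        self.score_calls = 0  # counts how often score() is actually started

    def score(self, pairs):
        # A real filter (e.g. an embedding model) would pay setup or inference
        # overhead per call, which is why the number of calls matters.
        self.score_calls += 1
        for src, tgt in pairs:
            yield min(len(src.split()), len(tgt.split()))

    def accept(self, score):
        return score >= self.min_length

    def filter_naive(self, pairs):
        # One score() call per pair: any batching inside score() is lost.
        for pair in pairs:
            if self.accept(next(self.score([pair]))):
                yield pair

    def filter_batched(self, pairs, batch_size=2):
        # One score() call per batch of pairs.
        iterator = iter(pairs)
        while True:
            batch = list(islice(iterator, batch_size))
            if not batch:
                return
            for pair, score in zip(batch, self.score(batch)):
                if self.accept(score):
                    yield pair


pairs = [("a b c", "x y z"), ("a", "x"), ("a b c d", "w x y z"), ("a b", "x y z")]

naive = ToyFilter()
kept_naive = list(naive.filter_naive(pairs))

batched = ToyFilter()
kept_batched = list(batched.filter_batched(pairs, batch_size=2))

print(kept_naive == kept_batched, naive.score_calls, batched.score_calls)
```

Both variants keep the same pairs, but the batched one starts `score` once per batch instead of once per pair. For cheap filters that hardly matters; for model-based filters it probably would.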
Separation of preprocessors and filters and intermediate output files
It is useful that OpusFilter can read file formats as part of processing steps, but the downside is that each step has to name its input and output files. When mixing processing and filtering steps, this forces you to write intermediate data to disk. Maybe empty-train should have a stricter distinction between filtering and preprocessing. But from a user perspective… is that what you’d want? Say I’d like to filter out the obvious trash first, then preprocess the remainder to be as good as possible, and then use the expensive filters to filter out the lower quality stuff.
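As a rough sketch of that last scenario without intermediate files: if every step is a generator over sentence pairs, filtering and preprocessing steps can be mixed freely and streamed in memory. All function names here are hypothetical, and the "expensive" filter is just a word-count ratio standing in for something costly.

```python
import re


def drop_empty(pairs):
    """Cheap filter: drop pairs where either side is empty or whitespace."""
    for src, tgt in pairs:
        if src.strip() and tgt.strip():
            yield src, tgt


def normalize_whitespace(pairs):
    """Preprocessor: collapse runs of whitespace and trim both sides."""
    for src, tgt in pairs:
        yield re.sub(r"\s+", " ", src).strip(), re.sub(r"\s+", " ", tgt).strip()


def length_ratio_filter(pairs, max_ratio=2.0):
    """Stand-in for an expensive filter: keep pairs with similar word counts."""
    for src, tgt in pairs:
        a, b = len(src.split()), len(tgt.split())
        if max(a, b) / min(a, b) <= max_ratio:
            yield src, tgt


def run_steps(pairs, steps):
    """Chain the steps lazily; nothing is written to disk in between."""
    for step in steps:
        pairs = step(pairs)
    return pairs


pairs = [
    ("Hello   world", "Hallo  wereld"),
    ("", "leeg"),
    ("a", "een heel erg lange zin hier"),
]
steps = [drop_empty, normalize_whitespace, length_ratio_filter]
print(list(run_steps(pairs, steps)))
```

With this shape, the distinction between a filter and a preprocessor only matters inside each step, not in how the steps are wired together. Whether that fits OpusFilter's file-based step model is exactly the open question.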