Automatically derive filters based on a clean sample provided by the user. #148
Comments
@marco-c Thanks for pointing me to this! This is slightly different in the rule construction process as described but would achieve a similar effect!
@PinzhenChen an idea we have been thinking about with @miau1 would be:
Oh I wasn't aware that you are already working on this. It reads very similar to my initial idea.
Oh, we are not working on it yet, it was just an idea for now!
In practice I would have large, noisy training data and a small sample of clean data that is representative of the downstream task (e.g. WMT validation sets).
It is still difficult for me to decide on the values for the filters. For example, should I choose a source_word_ratio of 0.4 or 0.5, especially if I do not speak both languages? There are many filters and values to search over. The choice is largely empirical, and it is also hard to attribute the final system's BLEU/COMET to a specific value change.
If I provide a small clean dataset that is sufficiently representative of the test domain, can the tool automatically derive some rules/values for me? Maybe the tool should search for and return the filter values that are "extreme" enough yet do not lead to the provided clean data being filtered out?
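To make the request more concrete, here is a rough sketch of what I have in mind. It is not based on the tool's actual API: `length_ratio` is a stand-in score I made up for illustration, and `derive_max_threshold` and `keep_fraction` are hypothetical names. The idea is to score every pair in the clean sample and return the tightest upper bound that still keeps a chosen fraction of it.

```python
import math

def length_ratio(src: str, tgt: str) -> float:
    """Hypothetical score: longer word count divided by shorter one (>= 1.0)."""
    a, b = len(src.split()), len(tgt.split())
    if min(a, b) == 0:
        return math.inf  # an empty side is treated as maximally bad
    return max(a, b) / min(a, b)

def derive_max_threshold(scores, keep_fraction=1.0):
    """Return the smallest upper bound that keeps at least
    keep_fraction of the given clean-sample scores."""
    if not scores:
        raise ValueError("need at least one clean score")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ordered = sorted(scores)
    k = math.ceil(keep_fraction * len(ordered))  # number of pairs to keep
    return ordered[k - 1]

# Toy clean sample (assumed to be representative of the test domain).
clean_pairs = [
    ("a b c d", "w x y z"),  # ratio 1.0
    ("a b c", "x y"),        # ratio 1.5
    ("a b c d", "x y"),      # ratio 2.0
    ("a", "x y z"),          # ratio 3.0
]
clean_scores = [length_ratio(s, t) for s, t in clean_pairs]

# Tolerate losing 25% of the clean sample in exchange for a stricter filter.
threshold = derive_max_threshold(clean_scores, keep_fraction=0.75)

# Apply the derived threshold to the noisy data.
noisy_pairs = [("a b", "x y"), ("a", "v w x y z")]
kept = [p for p in noisy_pairs if length_ratio(*p) <= threshold]
```

With `keep_fraction=1.0` nothing from the clean sample is filtered out, which matches the strict version of my question. A value slightly below 1.0 would probably be more robust if the clean sample itself contains a few outliers. The same procedure could be repeated per filter, with lower bounds handled symmetrically.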