num_mismatch discards some useful entries #132
Comments
I guess we should update the filter to maybe strip punctuation?
Is there a possibility to have language-specific rules such as
It is, but I'd rather avoid such rules where possible. I don't think there's a need to be so exact. This is more an issue with how the dash is interpreted (as a minus sign for the latter number) than anything.
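To make the dash problem concrete, here is a minimal sketch of a number comparison that treats a dash as a minus sign only when it does not directly follow a letter or digit. It is not the project's actual `num_mismatch` code: the names `NUMBER`, `extract_numbers` and `numbers_match` are hypothetical, and dropping `.` and `,` inside numbers is a deliberately loose assumption, so `1.5` and `15` compare equal.

```python
import re
from collections import Counter

# Hypothetical pattern, not the project's own regex.
# The "-" is consumed as a sign only if the character before it is not a
# word character, so "10-20" and "COVID-19" do not produce negative numbers.
NUMBER = re.compile(r"(?:(?<!\w)-)?\d+(?:[.,]\d+)*")


def extract_numbers(text: str) -> Counter:
    """Return a multiset of the numbers found in text.

    Separators inside a number ("." and ",") are dropped, which is a loose
    assumption that avoids locale-specific rules at the cost of precision.
    """
    return Counter(re.sub(r"[.,]", "", match) for match in NUMBER.findall(text))


def numbers_match(src: str, trg: str) -> bool:
    """True when both sides contain the same numbers, ignoring order."""
    return extract_numbers(src) == extract_numbers(trg)


if __name__ == "__main__":
    print(numbers_match("See pages 10-20.", "Siehe Seiten 10 bis 20."))
    print(numbers_match("It was -5 outside.", "Es war 5 draussen."))
```

With this pattern, sentence punctuation next to a number (`2020,` or `3%)`) is ignored because a separator only counts when a digit follows it. A dash preceded by a space, as in `3 -4`, is still read as a minus sign, and Unicode dashes are not handled at all.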
So running the lines above through
Lines 1 and 5 should not be discarded according to this.
Line 2 is confused by the space and I'm unsure how to deal with that. I could add a
Line 3 is because it doesn't know that
Line 4 suffers from the space before the dash again, but at least that's consistent on both sides. It also doesn't know about
Numeric-aware embeddings as an extension to LASER/LaBSE? In addition, we can train
I'm checking out this rule, and found some entries that were discarded which seemed valid to me. Mostly punctuation seems to be getting in the way.