If we are going to use FastText, we should lowercase the text before language identification. At least with the official lid.176 model, uppercased text completely messes up identification for mid/low-resource languages: they are always identified as the highest-resource language of the script (Russian for Cyrillic, English/Spanish/French for Latin).
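To make the proposal concrete, here is a minimal Python sketch of lowercasing right before prediction. It is only an illustration and not how warc2text does it: `identify_language` and the `Predictor` type are hypothetical names, and the predictor is passed in as a callable so the snippet does not depend on any particular model file.

```python
from typing import Callable, Sequence, Tuple

# A predictor takes one line of text and returns (labels, probabilities),
# which is the shape returned by fastText's Python `model.predict(text, k=1)`.
Predictor = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def identify_language(text: str, predict: Predictor) -> Tuple[str, float]:
    """Lowercase the text, then return the top label and its score."""
    # Lowercase first; also collapse whitespace, because fastText's Python
    # predict() refuses input that contains newlines.
    normalized = " ".join(text.lower().split())
    labels, probs = predict(normalized)
    # fastText labels look like "__label__en"; strip the prefix.
    return labels[0].replace("__label__", "", 1), float(probs[0])


# Hypothetical usage with the fasttext Python package (the path is a placeholder):
#   import fasttext
#   model = fasttext.load_model("path/to/lid-model.bin")
#   identify_language("ALL UPPER CASE TEXT", lambda t: model.predict(t, k=1))
```

One caveat: Python's `str.lower()` is locale-independent, so languages with special casing rules (the Turkish dotted/dotless I, for example) may need extra care.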
Ideally, if this is the case, lowercasing would be part of the model rather than an option inside warc2text, since it would be really hard to keep track of which models benefit from it and which don't.
On the other hand, I can also understand that the web is kind of garbage and there's a lot of ALL UPPER CASE text out there that's not in the training data and therefore doesn't match any n-grams in the model.
Maybe we should train a model on explicitly all-lowercase text and see whether that degrades performance a lot. If it doesn't, we could indeed just always classify on lowercased text.
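For the retraining experiment, the training data would need the same normalization. A rough sketch, assuming the data is in fastText's supervised format (one example per line, with labels prefixed by `__label__`); `lowercase_training_line` is a hypothetical helper, not something that exists in fastText or warc2text:

```python
def lowercase_training_line(line: str, label_prefix: str = "__label__") -> str:
    """Lowercase the text of one training line but leave label tokens untouched."""
    tokens = line.split()
    return " ".join(
        tok if tok.startswith(label_prefix) else tok.lower()
        for tok in tokens
    )
```

Labels are left as they are so that the label set stays identical between the cased and the lowercased model, which keeps the two runs directly comparable.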