IllegalStateException thrown for unusual case #24

RichardInnocent · 2020-02-04T14:12:10Z

I'm able to configure the LanguageDetector as follows:

LanguageDetector languageDetector =
    LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.UNKNOWN)
                           .build();

When I try to compute the language probabilities for the content 그 가격으로는 최상, the following exception is thrown:

Exception in thread "main" java.lang.IllegalStateException: inputStream must not be null
	at com.github.pemistahl.lingua.api.LanguageDetector.loadLanguageModel$lingua(LanguageDetector.kt:346)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:353)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:72)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
	at com.github.pemistahl.lingua.api.LanguageDetector.lookUpNgramProbability$lingua(LanguageDetector.kt:336)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeSumOfNgramProbabilities$lingua(LanguageDetector.kt:312)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageProbabilities$lingua(LanguageDetector.kt:299)
	at com.github.pemistahl.lingua.api.LanguageDetector.addNgramProbabilities$lingua(LanguageDetector.kt:164)
	at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:116)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$3(LanguageDetectionTimeAnalysis.java:83)
	at java.util.ArrayList.forEach(ArrayList.java:1257)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$2(LanguageDetectionTimeAnalysis.java:81)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$TimedEvent.time(LanguageDetectionTimeAnalysis.java:107)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$1(LanguageDetectionTimeAnalysis.java:89)

This exception is not thrown for other clearly non-English content (e.g. 여보세요). Changing Language.UNKNOWN to Language.GERMAN also makes the exception go away.

If Language.UNKNOWN is not meant to be included in the fromLanguages collection, a suitable exception should be thrown to indicate this.

As a side note, I include Language.ENGLISH and Language.UNKNOWN because I only care whether or not the language is English, so I would prefer to keep the ability to include Language.UNKNOWN.

krzysztofcybulski · 2020-02-04T14:34:04Z

You can hack the API by passing the same language twice:
LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.ENGLISH)

RichardInnocent · 2020-02-04T14:44:32Z

Thanks @krzysztofpcy, that's a great workaround for now. I still think this issue should be resolved in an upcoming version.

pemistahl · 2020-02-04T19:33:33Z

It never ceases to amaze me how creative people become in stretching a tool to use cases that were never intended to be supported. :-)

Language.UNKNOWN is not meant to be used as input for the method. It serves only as a return value. Your exception is thrown because the library tries to load a language model for Language.UNKNOWN from disk into memory, and of course no such model exists. In the cases where you didn't get the exception, the rule-based engine could successfully determine the language, so loading the language models was not necessary.
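To illustrate the failure mode described above, here is a minimal sketch of how a lookup for a missing classpath resource ends in the reported message. It is not Lingua's actual code: the class name `MissingModelDemo`, the method `openModel`, and the resource path are hypothetical, and the only assumption carried over from the stack trace is the exception type and message.

```java
import java.io.InputStream;

final class MissingModelDemo {
    // Hypothetical resource path; no such file is expected on the classpath.
    static final String MISSING_MODEL = "/language-models/unknown/unigrams.json";

    /** Opens a classpath resource and fails fast when it is absent. */
    static InputStream openModel(String path) {
        // getResourceAsStream returns null (it does not throw) when nothing is found.
        InputStream in = MissingModelDemo.class.getResourceAsStream(path);
        if (in == null) {
            throw new IllegalStateException("inputStream must not be null");
        }
        return in;
    }

    public static void main(String[] args) {
        try {
            openModel(MISSING_MODEL);
        } catch (IllegalStateException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

Because the stack trace shows the models being loaded lazily, a check like this only fires once a model is actually needed, which would explain why inputs resolved by the rule-based engine never reach it.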

If you just want to determine whether some text is English or not and you cannot reliably exclude any other language in your data set, then please use LanguageDetectorBuilder.fromAllBuiltInLanguages() and discard every text for which the detector does not return Language.ENGLISH. If I find the time to implement some kind of confidence scoring, this use case can perhaps be handled more easily.

In any case, I will change the API so that an exception is thrown whenever Language.UNKNOWN is passed as an input language. Thanks for letting me know about this, @RichardInnocent.
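A rough sketch of what such an input check could look like follows. It is an illustration only and not the change that was committed: `DetectorConfig`, its `fromLanguages` method, the reduced `Language` enum, the choice of `IllegalArgumentException`, and the message text are all hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

final class DetectorConfig {
    // Stand-in for the library's Language enum, reduced to a few members.
    enum Language { ENGLISH, GERMAN, KOREAN, UNKNOWN }

    /** Collects the requested languages, rejecting UNKNOWN up front. */
    static Set<Language> fromLanguages(Language... languages) {
        EnumSet<Language> selected = EnumSet.noneOf(Language.class);
        selected.addAll(Arrays.asList(languages));
        if (selected.contains(Language.UNKNOWN)) {
            // Fail at configuration time instead of later, during model loading.
            throw new IllegalArgumentException(
                "Language.UNKNOWN is a return value only and cannot be used as an input language");
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(fromLanguages(Language.ENGLISH, Language.GERMAN));
        try {
            fromLanguages(Language.ENGLISH, Language.UNKNOWN);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Validating when the detector is configured means the caller sees the problem immediately, rather than only for the inputs that happen to need a statistical model.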

RichardInnocent · 2020-02-04T22:10:14Z

Thanks for your response and advice with my use case - much appreciated.

I know you've already tagged it for the next release, but the confidence scoring issue would be really useful to me too, as it would allow me to avoid the overhead of including all other languages, so I look forward to it.

pemistahl added the bug label Feb 4, 2020

pemistahl added this to the Lingua 0.6.1 milestone Feb 4, 2020

pemistahl added a commit that referenced this issue Feb 6, 2020

Fix misuse of Language.UNKNOWN in public api (#24)

cc6c9ce

pemistahl closed this as completed Feb 6, 2020
