Update docs, plots and release notes

pemistahl · Dec 30, 2022 · e707057
1 parent 3b9b57c
Showing 17 changed files with 92 additions and 103 deletions.
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@
 [![supported languages](https://img.shields.io/badge/supported%20languages-75-green.svg)](#3-which-languages-are-supported)
 [![docs](https://img.shields.io/badge/docs-API-yellowgreen)](https://pemistahl.github.io/lingua-py)
 ![supported Python versions](https://img.shields.io/badge/Python-%3E%3D%203.8-blue)
-[![pypi](https://img.shields.io/badge/PYPI-v1.2.1-blue)](https://pypi.org/project/lingua-language-detector)
+[![pypi](https://img.shields.io/badge/PYPI-v1.3.0-blue)](https://pypi.org/project/lingua-language-detector)
 [![license](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
 </div>
 
@@ -3149,7 +3149,7 @@ each possible language have to satisfy. It can be stated in the following way:
 >>> from lingua import Language, LanguageDetectorBuilder
 >>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
 >>> detector = LanguageDetectorBuilder.from_languages(*languages)\
-.with_minimum_relative_distance(0.7)\
+.with_minimum_relative_distance(0.9)\
 .build()
 >>> print(detector.detect_language_of("languages are awesome"))
 None
@@ -3175,25 +3175,19 @@ to the most likely one? These questions can be answered as well:
 >>> confidence_values = detector.compute_language_confidence_values("languages are awesome")
 >>> for language, value in confidence_values:
 ...     print(f"{language.name}: {value:.2f}")
-ENGLISH: 0.99
-FRENCH: 0.32
-GERMAN: 0.15
+ENGLISH: 0.93
+FRENCH: 0.04
+GERMAN: 0.02
 SPANISH: 0.01
 ```
 
 In the example above, a list is returned containing those languages which the
 calling instance of LanguageDetector has been built from, sorted by
-their confidence value in descending order. The values that the detector
-computes are part of a **relative** confidence metric, not of an absolute one.
-Each value is a number between 0.0 and 1.0.
-
+their confidence value in descending order. Each value is a probability between
+0.0 and 1.0. The probabilities of all languages will sum to 1.0.
 If the language is unambiguously identified by the rule engine, the value 1.0
 will always be returned for this language. The other languages will receive a
-value of 0.0. If the statistics engine is additionally needed, the most likely
-language will be returned with value 0.99 and the least likely language will
-be returned with value 0.01. All other languages get values assigned between
-0.01 and 0.99, denoting how less likely those languages are in comparison to
-the most likely language.
+value of 0.0.
 
 There is also a method for returning the confidence value for one specific
 language only:
@@ -3204,14 +3198,13 @@ language only:
 >>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
 >>> confidence_value = detector.compute_language_confidence("languages are awesome", Language.FRENCH)
 >>> print(f"{confidence_value:.2f}")
-0.32
+0.04
 ```
 
 The value that this method computes is a number between 0.0 and 1.0. If the
 language is unambiguously identified by the rule engine, the value 1.0 will
 always be returned. If the given language is not supported by this detector
-instance, the value 0.0 will always be returned. Otherwise, a value between
-0.01 and 0.99 will be returned.
+instance, the value 0.0 will always be returned.
 
 ### 9.4 Eager loading versus lazy loading
 
@@ -3281,7 +3274,7 @@ ENGLISH: 'A little bit is better than nothing.'
 ```
 
 In the example above, a list of
-[`DetectionResult`](https://github.com/pemistahl/lingua-py/blob/main/lingua/detector.py#L144)
+[`DetectionResult`](https://github.com/pemistahl/lingua-py/blob/main/lingua/detector.py#L148)
 is returned. Each entry in the list describes a contiguous single-language text section,
 providing start and end indices of the respective substring.
 
@@ -3317,9 +3310,9 @@ LanguageDetectorBuilder.from_iso_codes_639_1(IsoCode639_1.EN, IsoCode639_1.DE)
 LanguageDetectorBuilder.from_iso_codes_639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)
 ```
 
-## 10. What's next for version 1.3.0?
+## 10. What's next for version 1.4.0?
 
-Take a look at the [planned issues](https://github.com/pemistahl/lingua-py/milestone/3).
+Take a look at the [planned issues](https://github.com/pemistahl/lingua-py/milestone/4).
 
 ## 11. Contributions
 

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -1,3 +1,12 @@
+## Lingua 1.3.0 (released on 30 Dec 2022)
+
+### Improvements
+
+- The min-max normalization method for the confidence values has been
+  replaced with applying the softmax function. This gives more realistic
+  probabilities. Big thanks to @Alex-Kopylov for proposing and implementing
+  this change. (#99)
+
 ## Lingua 1.2.1 (released on 27 Dec 2022)
 
 ### Bug Fixes
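
The 1.3.0 note above describes replacing min-max normalization with the softmax function. As a rough illustration of why that changes the reported values, here is a minimal sketch, not Lingua's actual implementation: the scores in `raw_scores` are made up, the helper names `min_max_normalize` and `softmax` are hypothetical, and treating the raw scores as summed log-probabilities is an assumption. The min-max helper is the plain textbook version and does not reproduce the 0.01 to 0.99 range mentioned in the old docs.

```python
import math

# Hypothetical raw scores (assumed to be summed log-probabilities).
# These numbers are invented for illustration and are not Lingua output.
raw_scores = {"ENGLISH": -10.0, "FRENCH": -13.0, "GERMAN": -14.0, "SPANISH": -15.0}


def min_max_normalize(scores):
    """Rescale linearly so the best score maps to 1.0 and the worst to 0.0."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        # All scores equal: avoid division by zero.
        return {name: 1.0 for name in scores}
    return {name: (s - lowest) / (highest - lowest) for name, s in scores.items()}


def softmax(scores):
    """Exponentiate and divide by the total so the values sum to 1.0."""
    highest = max(scores.values())
    # Subtracting the maximum keeps math.exp from underflowing to 0.0
    # for very negative scores; it does not change the result.
    exps = {name: math.exp(s - highest) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


for label, normalize in (("min-max", min_max_normalize), ("softmax", softmax)):
    values = normalize(raw_scores)
    print(label, {name: round(v, 2) for name, v in values.items()})
```

With min-max, the values only rank the languages relative to the best and worst candidate, so they do not add up to anything meaningful. With softmax, they form a distribution over the configured languages, which matches the updated wording that "the probabilities of all languages will sum to 1.0".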

diff --git a/docs/lingua.html b/docs/lingua.html
@@ -29,10 +29,6 @@ <h2>Contents</h2>
   <li><a href="#5-why-is-it-better-than-other-libraries">5. Why is it better than other libraries?</a></li>
   <li><a href="#6-how-to-add-it-to-your-project">6. How to add it to your project?</a></li>
   <li><a href="#7-how-to-use">7. How to use?</a></li>
-  <li><a href="#74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy loading</a></li>
-  <li><a href="#75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode versus high accuracy mode</a></li>
-  <li><a href="#76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detection of multiple languages in mixed-language texts</a></li>
-  <li><a href="#77-methods-to-build-the-languagedetector">7.7 Methods to build the LanguageDetector</a></li>
 </ul>
 
 
@@ -1184,7 +1180,7 @@ <h3 id="72-minimum-relative-distance">7.2 Minimum relative distance</h3>
 <div class="pdoc-code codehilite">
 <pre><span></span><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">lingua</span> <span class="kn">import</span> <span class="n">Language</span><span class="p">,</span> <span class="n">LanguageDetectorBuilder</span>
 <span class="o">&gt;&gt;&gt;</span> <span class="n">languages</span> <span class="o">=</span> <span class="p">[</span><span class="n"><a href="#Language.ENGLISH">Language.ENGLISH</a></span><span class="p">,</span> <span class="n"><a href="#Language.FRENCH">Language.FRENCH</a></span><span class="p">,</span> <span class="n"><a href="#Language.GERMAN">Language.GERMAN</a></span><span class="p">,</span> <span class="n"><a href="#Language.SPANISH">Language.SPANISH</a></span><span class="p">]</span>
-<span class="o">&gt;&gt;&gt;</span> <span class="n">detector</span> <span class="o">=</span> <span class="n"><a href="#LanguageDetectorBuilder.from_languages">LanguageDetectorBuilder.from_languages</a></span><span class="p">(</span><span class="o">*</span><span class="n">languages</span><span class="p">)</span><span class="o">.</span><span class="n">with_minimum_relative_distance</span><span class="p">(</span><span class="mf">0.7</span><span class="p">)</span><span class="o">.</span><span class="n">build</span><span class="p">()</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">detector</span> <span class="o">=</span> <span class="n"><a href="#LanguageDetectorBuilder.from_languages">LanguageDetectorBuilder.from_languages</a></span><span class="p">(</span><span class="o">*</span><span class="n">languages</span><span class="p">)</span><span class="o">.</span><span class="n">with_minimum_relative_distance</span><span class="p">(</span><span class="mf">0.9</span><span class="p">)</span><span class="o">.</span><span class="n">build</span><span class="p">()</span>
 <span class="o">&gt;&gt;&gt;</span> <span class="nb">print</span><span class="p">(</span><span class="n">detector</span><span class="o">.</span><span class="n">detect_language_of</span><span class="p">(</span><span class="s2">&quot;languages are awesome&quot;</span><span class="p">))</span>
 <span class="kc">None</span>
 </code></pre>
@@ -1210,26 +1206,20 @@ <h3 id="73-confidence-values">7.3 Confidence values</h3>
 <span class="o">&gt;&gt;&gt;</span> <span class="n">confidence_values</span> <span class="o">=</span> <span class="n">detector</span><span class="o">.</span><span class="n">compute_language_confidence_values</span><span class="p">(</span><span class="s2">&quot;languages are awesome&quot;</span><span class="p">)</span>
 <span class="o">&gt;&gt;&gt;</span> <span class="k">for</span> <span class="n">language</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">confidence_values</span><span class="p">:</span>
 <span class="o">...</span>     <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">language</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">value</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
-<span class="n">ENGLISH</span><span class="p">:</span> <span class="mf">0.99</span>
-<span class="n">FRENCH</span><span class="p">:</span> <span class="mf">0.32</span>
-<span class="n">GERMAN</span><span class="p">:</span> <span class="mf">0.15</span>
+<span class="n">ENGLISH</span><span class="p">:</span> <span class="mf">0.93</span>
+<span class="n">FRENCH</span><span class="p">:</span> <span class="mf">0.04</span>
+<span class="n">GERMAN</span><span class="p">:</span> <span class="mf">0.02</span>
 <span class="n">SPANISH</span><span class="p">:</span> <span class="mf">0.01</span>
 </code></pre>
 </div>
 
 <p>In the example above, a list is returned containing those languages which the
 calling instance of LanguageDetector has been built from, sorted by
-their confidence value in descending order. The values that the detector
-computes are part of a <strong>relative</strong> confidence metric, not of an absolute one.
-Each value is a number between 0.0 and 1.0.</p>
-
-<p>If the language is unambiguously identified by the rule engine, the value 1.0
+their confidence value in descending order. Each value is a probability between
+0.0 and 1.0. The probabilities of all languages will sum to 1.0.
+If the language is unambiguously identified by the rule engine, the value 1.0
 will always be returned for this language. The other languages will receive a
-value of 0.0. If the statistics engine is additionally needed, the most likely
-language will be returned with value 0.99 and the least likely language will
-be returned with value 0.01. All other languages get values assigned between
-0.01 and 0.99, denoting how less likely those languages are in comparison to
-the most likely language.</p>
+value of 0.0.</p>
 
 <p>There is also a method for returning the confidence value for one specific
 language only:</p>
@@ -1240,17 +1230,16 @@ <h3 id="73-confidence-values">7.3 Confidence values</h3>
 <span class="o">&gt;&gt;&gt;</span> <span class="n">detector</span> <span class="o">=</span> <span class="n"><a href="#LanguageDetectorBuilder.from_languages">LanguageDetectorBuilder.from_languages</a></span><span class="p">(</span><span class="o">*</span><span class="n">languages</span><span class="p">)</span><span class="o">.</span><span class="n">build</span><span class="p">()</span>
 <span class="o">&gt;&gt;&gt;</span> <span class="n">confidence_value</span> <span class="o">=</span> <span class="n">detector</span><span class="o">.</span><span class="n">compute_language_confidence</span><span class="p">(</span><span class="s2">&quot;languages are awesome&quot;</span><span class="p">,</span> <span class="n"><a href="#Language.FRENCH">Language.FRENCH</a></span><span class="p">)</span>
 <span class="o">&gt;&gt;&gt;</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">confidence_value</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
-<span class="mf">0.32</span>
+<span class="mf">0.04</span>
 </code></pre>
 </div>
 
 <p>The value that this method computes is a number between 0.0 and 1.0. If the
 language is unambiguously identified by the rule engine, the value 1.0 will
 always be returned. If the given language is not supported by this detector
-instance, the value 0.0 will always be returned. Otherwise, a value between
-0.01 and 0.99 will be returned.</p>
+instance, the value 0.0 will always be returned.</p>
 
-<h2 id="74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy loading</h2>
+<h3 id="74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy loading</h3>
 
 <p>By default, <em>Lingua</em> uses lazy-loading to load only those language models on
 demand which are considered relevant by the rule-based filter engine. For web
@@ -1266,7 +1255,7 @@ <h2 id="74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy load
 <p>Multiple instances of <code><a href="#LanguageDetector">LanguageDetector</a></code> share the same language models in
 memory which are accessed asynchronously by the instances.</p>
 
-<h2 id="75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode versus high accuracy mode</h2>
+<h3 id="75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode versus high accuracy mode</h3>
 
 <p><em>Lingua's</em> high detection accuracy comes at the cost of being noticeably slower
 than other language detectors. The large language models also consume significant
@@ -1294,7 +1283,7 @@ <h2 id="75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode ve
 the texts you want to classify you can almost always rule out certain languages as impossible
 or unlikely to occur.</p>
 
-<h2 id="76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detection of multiple languages in mixed-language texts</h2>
+<h3 id="76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detection of multiple languages in mixed-language texts</h3>
 
 <p>In contrast to most other language detectors, <em>Lingua</em> is able to detect multiple
 languages in mixed-language texts. This feature can yield quite reasonable results but
@@ -1325,7 +1314,7 @@ <h2 id="76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detectio
 is returned. Each entry in the list describes a contiguous single-language text section,
 providing start and end indices of the respective substring.</p>
 
-<h2 id="77-methods-to-build-the-languagedetector">7.7 Methods to build the LanguageDetector</h2>
+<h3 id="77-methods-to-build-the-languagedetector">7.7 Methods to build the LanguageDetector</h3>
 
 <p>There might be classification tasks where you know beforehand that your
 language data is definitely not written in Latin, for instance. The detection
@@ -1948,19 +1937,11 @@ <h5>Inherited Members</h5>
 <p>A list is returned containing those languages which the
 calling instance of LanguageDetector has been built from.
 The entries are sorted by their confidence value in
-descending order. The values that this method computes
-are part of a relative confidence metric, not of an
-absolute one. Each value is a number between 0.0 and 1.0.</p>
-
-<p>If the language is unambiguously identified by the rule
-engine, the value 1.0 will always be returned for this
-language. The other languages will receive a value of 0.0.
-If the statistics engine is additionally needed, the most
-likely language will be returned with value 0.99 and the
-least likely language will be returned with value 0.01.
-All other languages get values assigned between 0.01 and
-0.99, denoting how less likely those languages are in
-comparison to the most likely language.</p>
+descending order. Each value is a probability between
+0.0 and 1.0. The probabilities of all languages will sum to 1.0.
+If the language is unambiguously identified by the rule engine,
+the value 1.0 will always be returned for this language. The
+other languages will receive a value of 0.0.</p>
 
 <p>Args:
     text (str): The text for which to compute confidence values.</p>
@@ -1989,8 +1970,7 @@ <h5>Inherited Members</h5>
 computes is a number between 0.0 and 1.0. If the language is
 unambiguously identified by the rule engine, the value 1.0 will
 always be returned. If the given language is not supported by this
-detector instance, the value 0.0 will always be returned. Otherwise,
-a value between 0.01 and 0.99 will be returned.</p>
+detector instance, the value 0.0 will always be returned.</p>
 
 <p>Args:
     text (str): The text for which to compute the confidence value.</p>

diff --git a/docs/search.js b/docs/search.js
diff --git a/images/plots/barplot-average.png b/images/plots/barplot-average.png
diff --git a/images/plots/barplot-sentences.png b/images/plots/barplot-sentences.png
diff --git a/images/plots/barplot-single-words.png b/images/plots/barplot-single-words.png
diff --git a/images/plots/barplot-word-pairs.png b/images/plots/barplot-word-pairs.png
diff --git a/images/plots/boxplot-average.png b/images/plots/boxplot-average.png
diff --git a/images/plots/boxplot-sentences.png b/images/plots/boxplot-sentences.png
diff --git a/images/plots/boxplot-single-words.png b/images/plots/boxplot-single-words.png
diff --git a/images/plots/boxplot-word-pairs.png b/images/plots/boxplot-word-pairs.png
diff --git a/lingua/__init__.py b/lingua/__init__.py
@@ -324,17 +324,11 @@
 
 In the example above, a list is returned containing those languages which the
 calling instance of LanguageDetector has been built from, sorted by
-their confidence value in descending order. The values that the detector
-computes are part of a **relative** confidence metric, not of an absolute one.
-Each value is a number between 0.0 and 1.0.
-
+their confidence value in descending order. Each value is a probability between
+0.0 and 1.0. The probabilities of all languages will sum to 1.0.
 If the language is unambiguously identified by the rule engine, the value 1.0
 will always be returned for this language. The other languages will receive a
-value of 0.0. If the statistics engine is additionally needed, the most likely
-language will be returned with value 0.99 and the least likely language will
-be returned with value 0.01. All other languages get values assigned between
-0.01 and 0.99, denoting how less likely those languages are in comparison to
-the most likely language.
+value of 0.0.
 
 There is also a method for returning the confidence value for one specific
 language only:
@@ -352,10 +346,9 @@
 The value that this method computes is a number between 0.0 and 1.0. If the
 language is unambiguously identified by the rule engine, the value 1.0 will
 always be returned. If the given language is not supported by this detector
-instance, the value 0.0 will always be returned. Otherwise, a value between
-0.01 and 0.99 will be returned.
+instance, the value 0.0 will always be returned.
 
-## 7.4 Eager loading versus lazy loading
+### 7.4 Eager loading versus lazy loading
 
 By default, *Lingua* uses lazy-loading to load only those language models on
 demand which are considered relevant by the rule-based filter engine. For web
@@ -370,7 +363,7 @@
 Multiple instances of `LanguageDetector` share the same language models in
 memory which are accessed asynchronously by the instances.
 
-## 7.5 Low accuracy mode versus high accuracy mode
+### 7.5 Low accuracy mode versus high accuracy mode
 
 *Lingua's* high detection accuracy comes at the cost of being noticeably slower
 than other language detectors. The large language models also consume significant
@@ -397,7 +390,7 @@
 the texts you want to classify you can almost always rule out certain languages as impossible
 or unlikely to occur.
 
-## 7.6 Detection of multiple languages in mixed-language texts
+### 7.6 Detection of multiple languages in mixed-language texts
 
 In contrast to most other language detectors, *Lingua* is able to detect multiple
 languages in mixed-language texts. This feature can yield quite reasonable results but
@@ -428,7 +421,7 @@
 is returned. Each entry in the list describes a contiguous single-language text section,
 providing start and end indices of the respective substring.
 
-## 7.7 Methods to build the LanguageDetector
+### 7.7 Methods to build the LanguageDetector
 
 There might be classification tasks where you know beforehand that your
 language data is definitely not written in Latin, for instance. The detection