Update docs, plots and release notes
pemistahl committed Dec 30, 2022
1 parent 3b9b57c commit e707057
Showing 17 changed files with 92 additions and 103 deletions.
33 changes: 13 additions & 20 deletions README.md
@@ -7,7 +7,7 @@
[![supported languages](https://img.shields.io/badge/supported%20languages-75-green.svg)](#3-which-languages-are-supported)
[![docs](https://img.shields.io/badge/docs-API-yellowgreen)](https://pemistahl.github.io/lingua-py)
![supported Python versions](https://img.shields.io/badge/Python-%3E%3D%203.8-blue)
[![pypi](https://img.shields.io/badge/PYPI-v1.2.1-blue)](https://pypi.org/project/lingua-language-detector)
[![pypi](https://img.shields.io/badge/PYPI-v1.3.0-blue)](https://pypi.org/project/lingua-language-detector)
[![license](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
</div>

@@ -3149,7 +3149,7 @@ each possible language have to satisfy. It can be stated in the following way:
>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages)\
.with_minimum_relative_distance(0.7)\
.with_minimum_relative_distance(0.9)\
.build()
>>> print(detector.detect_language_of("languages are awesome"))
None
@@ -3175,25 +3175,19 @@ to the most likely one? These questions can be answered as well:
>>> confidence_values = detector.compute_language_confidence_values("languages are awesome")
>>> for language, value in confidence_values:
... print(f"{language.name}: {value:.2f}")
ENGLISH: 0.99
FRENCH: 0.32
GERMAN: 0.15
ENGLISH: 0.93
FRENCH: 0.04
GERMAN: 0.02
SPANISH: 0.01
```

In the example above, a list is returned containing those languages which the
calling instance of LanguageDetector has been built from, sorted by
their confidence value in descending order. The values that the detector
computes are part of a **relative** confidence metric, not of an absolute one.
Each value is a number between 0.0 and 1.0.

their confidence value in descending order. Each value is a probability between
0.0 and 1.0. The probabilities of all languages will sum to 1.0.
If the language is unambiguously identified by the rule engine, the value 1.0
will always be returned for this language. The other languages will receive a
value of 0.0. If the statistics engine is additionally needed, the most likely
language will be returned with value 0.99 and the least likely language will
be returned with value 0.01. All other languages get values assigned between
0.01 and 0.99, denoting how less likely those languages are in comparison to
the most likely language.
value of 0.0.

There is also a method for returning the confidence value for one specific
language only:
@@ -3204,14 +3198,13 @@ language only:
>>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
>>> confidence_value = detector.compute_language_confidence("languages are awesome", Language.FRENCH)
>>> print(f"{confidence_value:.2f}")
0.32
0.04
```

The value that this method computes is a number between 0.0 and 1.0. If the
language is unambiguously identified by the rule engine, the value 1.0 will
always be returned. If the given language is not supported by this detector
instance, the value 0.0 will always be returned. Otherwise, a value between
0.01 and 0.99 will be returned.
instance, the value 0.0 will always be returned.

### 9.4 Eager loading versus lazy loading

@@ -3281,7 +3274,7 @@ ENGLISH: 'A little bit is better than nothing.'
```

In the example above, a list of
[`DetectionResult`](https://github.com/pemistahl/lingua-py/blob/main/lingua/detector.py#L144)
[`DetectionResult`](https://github.com/pemistahl/lingua-py/blob/main/lingua/detector.py#L148)
is returned. Each entry in the list describes a contiguous single-language text section,
providing start and end indices of the respective substring.

@@ -3317,9 +3310,9 @@ LanguageDetectorBuilder.from_iso_codes_639_1(IsoCode639_1.EN, IsoCode639_1.DE)
LanguageDetectorBuilder.from_iso_codes_639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)
```

## 10. What's next for version 1.3.0?
## 10. What's next for version 1.4.0?

Take a look at the [planned issues](https://github.com/pemistahl/lingua-py/milestone/3).
Take a look at the [planned issues](https://github.com/pemistahl/lingua-py/milestone/4).

## 11. Contributions

9 changes: 9 additions & 0 deletions RELEASE_NOTES.md
@@ -1,3 +1,12 @@
## Lingua 1.3.0 (released on 30 Dec 2022)

### Improvements

- The min-max normalization method for the confidence values has been
replaced with applying the softmax function. This gives more realistic
probabilities. Big thanks to @Alex-Kopylov for proposing and implementing
this change. (#99)
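
The difference between the two normalization schemes named in this note can be sketched as follows. This is a minimal illustration, not Lingua's actual implementation: the function names, the `[0.01, 0.99]` target range (taken from the old README wording), and the per-language log-scores are all assumptions made for the example.

```python
import math


def min_max_normalize(scores, low=0.01, high=0.99):
    """Rescale scores linearly so the best gets `high` and the worst gets `low`.

    Hypothetical helper; mirrors the old documented behaviour only in spirit.
    """
    smallest, largest = min(scores.values()), max(scores.values())
    if largest == smallest:
        # All candidates are equally likely; avoid dividing by zero.
        return {name: high for name in scores}
    scale = (high - low) / (largest - smallest)
    return {name: low + (s - smallest) * scale for name, s in scores.items()}


def softmax(scores):
    """Turn scores into probabilities that sum to 1.0.

    Hypothetical helper; the maximum is subtracted first for numerical stability.
    """
    largest = max(scores.values())
    exps = {name: math.exp(s - largest) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Made-up log-scores for illustration only.
scores = {"ENGLISH": -10.0, "FRENCH": -13.0, "GERMAN": -14.0, "SPANISH": -15.0}

relative = min_max_normalize(scores)  # values spread across [0.01, 0.99]
probabilities = softmax(scores)       # values sum to 1.0

for name in scores:
    print(f"{name}: min-max={relative[name]:.2f} softmax={probabilities[name]:.2f}")
```

With min-max scaling, the values only express rank relative to the best and worst candidate, so they do not sum to 1.0. With softmax, they form a probability distribution over the candidate languages.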

## Lingua 1.2.1 (released on 27 Dec 2022)

### Bug Fixes
60 changes: 20 additions & 40 deletions docs/lingua.html
@@ -29,10 +29,6 @@ <h2>Contents</h2>
<li><a href="#5-why-is-it-better-than-other-libraries">5. Why is it better than other libraries?</a></li>
<li><a href="#6-how-to-add-it-to-your-project">6. How to add it to your project?</a></li>
<li><a href="#7-how-to-use">7. How to use?</a></li>
<li><a href="#74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy loading</a></li>
<li><a href="#75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode versus high accuracy mode</a></li>
<li><a href="#76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detection of multiple languages in mixed-language texts</a></li>
<li><a href="#77-methods-to-build-the-languagedetector">7.7 Methods to build the LanguageDetector</a></li>
</ul>


@@ -1184,7 +1180,7 @@ <h3 id="72-minimum-relative-distance">7.2 Minimum relative distance</h3>
<div class="pdoc-code codehilite">
<pre><span></span><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">lingua</span> <span class="kn">import</span> <span class="n">Language</span><span class="p">,</span> <span class="n">LanguageDetectorBuilder</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">languages</span> <span class="o">=</span> <span class="p">[</span><span class="n"><a href="#Language.ENGLISH">Language.ENGLISH</a></span><span class="p">,</span> <span class="n"><a href="#Language.FRENCH">Language.FRENCH</a></span><span class="p">,</span> <span class="n"><a href="#Language.GERMAN">Language.GERMAN</a></span><span class="p">,</span> <span class="n"><a href="#Language.SPANISH">Language.SPANISH</a></span><span class="p">]</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">detector</span> <span class="o">=</span> <span class="n"><a href="#LanguageDetectorBuilder.from_languages">LanguageDetectorBuilder.from_languages</a></span><span class="p">(</span><span class="o">*</span><span class="n">languages</span><span class="p">)</span><span class="o">.</span><span class="n">with_minimum_relative_distance</span><span class="p">(</span><span class="mf">0.7</span><span class="p">)</span><span class="o">.</span><span class="n">build</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">detector</span> <span class="o">=</span> <span class="n"><a href="#LanguageDetectorBuilder.from_languages">LanguageDetectorBuilder.from_languages</a></span><span class="p">(</span><span class="o">*</span><span class="n">languages</span><span class="p">)</span><span class="o">.</span><span class="n">with_minimum_relative_distance</span><span class="p">(</span><span class="mf">0.9</span><span class="p">)</span><span class="o">.</span><span class="n">build</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="nb">print</span><span class="p">(</span><span class="n">detector</span><span class="o">.</span><span class="n">detect_language_of</span><span class="p">(</span><span class="s2">&quot;languages are awesome&quot;</span><span class="p">))</span>
<span class="kc">None</span>
</code></pre>
@@ -1210,26 +1206,20 @@ <h3 id="73-confidence-values">7.3 Confidence values</h3>
<span class="o">&gt;&gt;&gt;</span> <span class="n">confidence_values</span> <span class="o">=</span> <span class="n">detector</span><span class="o">.</span><span class="n">compute_language_confidence_values</span><span class="p">(</span><span class="s2">&quot;languages are awesome&quot;</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="k">for</span> <span class="n">language</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">confidence_values</span><span class="p">:</span>
<span class="o">...</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">language</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">value</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
<span class="n">ENGLISH</span><span class="p">:</span> <span class="mf">0.99</span>
<span class="n">FRENCH</span><span class="p">:</span> <span class="mf">0.32</span>
<span class="n">GERMAN</span><span class="p">:</span> <span class="mf">0.15</span>
<span class="n">ENGLISH</span><span class="p">:</span> <span class="mf">0.93</span>
<span class="n">FRENCH</span><span class="p">:</span> <span class="mf">0.04</span>
<span class="n">GERMAN</span><span class="p">:</span> <span class="mf">0.02</span>
<span class="n">SPANISH</span><span class="p">:</span> <span class="mf">0.01</span>
</code></pre>
</div>

<p>In the example above, a list is returned containing those languages which the
calling instance of LanguageDetector has been built from, sorted by
their confidence value in descending order. The values that the detector
computes are part of a <strong>relative</strong> confidence metric, not of an absolute one.
Each value is a number between 0.0 and 1.0.</p>

<p>If the language is unambiguously identified by the rule engine, the value 1.0
their confidence value in descending order. Each value is a probability between
0.0 and 1.0. The probabilities of all languages will sum to 1.0.
If the language is unambiguously identified by the rule engine, the value 1.0
will always be returned for this language. The other languages will receive a
value of 0.0. If the statistics engine is additionally needed, the most likely
language will be returned with value 0.99 and the least likely language will
be returned with value 0.01. All other languages get values assigned between
0.01 and 0.99, denoting how less likely those languages are in comparison to
the most likely language.</p>
value of 0.0.</p>

<p>There is also a method for returning the confidence value for one specific
language only:</p>
@@ -1240,17 +1230,16 @@ <h3 id="73-confidence-values">7.3 Confidence values</h3>
<span class="o">&gt;&gt;&gt;</span> <span class="n">detector</span> <span class="o">=</span> <span class="n"><a href="#LanguageDetectorBuilder.from_languages">LanguageDetectorBuilder.from_languages</a></span><span class="p">(</span><span class="o">*</span><span class="n">languages</span><span class="p">)</span><span class="o">.</span><span class="n">build</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">confidence_value</span> <span class="o">=</span> <span class="n">detector</span><span class="o">.</span><span class="n">compute_language_confidence</span><span class="p">(</span><span class="s2">&quot;languages are awesome&quot;</span><span class="p">,</span> <span class="n"><a href="#Language.FRENCH">Language.FRENCH</a></span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">confidence_value</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
<span class="mf">0.32</span>
<span class="mf">0.04</span>
</code></pre>
</div>

<p>The value that this method computes is a number between 0.0 and 1.0. If the
language is unambiguously identified by the rule engine, the value 1.0 will
always be returned. If the given language is not supported by this detector
instance, the value 0.0 will always be returned. Otherwise, a value between
0.01 and 0.99 will be returned.</p>
instance, the value 0.0 will always be returned.</p>

<h2 id="74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy loading</h2>
<h3 id="74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy loading</h3>

<p>By default, <em>Lingua</em> uses lazy-loading to load only those language models on
demand which are considered relevant by the rule-based filter engine. For web
@@ -1266,7 +1255,7 @@ <h2 id="74-eager-loading-versus-lazy-loading">7.4 Eager loading versus lazy load
<p>Multiple instances of <code><a href="#LanguageDetector">LanguageDetector</a></code> share the same language models in
memory which are accessed asynchronously by the instances.</p>

<h2 id="75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode versus high accuracy mode</h2>
<h3 id="75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode versus high accuracy mode</h3>

<p><em>Lingua's</em> high detection accuracy comes at the cost of being noticeably slower
than other language detectors. The large language models also consume significant
@@ -1294,7 +1283,7 @@ <h2 id="75-low-accuracy-mode-versus-high-accuracy-mode">7.5 Low accuracy mode ve
the texts you want to classify you can almost always rule out certain languages as impossible
or unlikely to occur.</p>

<h2 id="76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detection of multiple languages in mixed-language texts</h2>
<h3 id="76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detection of multiple languages in mixed-language texts</h3>

<p>In contrast to most other language detectors, <em>Lingua</em> is able to detect multiple
languages in mixed-language texts. This feature can yield quite reasonable results but
@@ -1325,7 +1314,7 @@ <h2 id="76-detection-of-multiple-languages-in-mixed-language-texts">7.6 Detectio
is returned. Each entry in the list describes a contiguous single-language text section,
providing start and end indices of the respective substring.</p>

<h2 id="77-methods-to-build-the-languagedetector">7.7 Methods to build the LanguageDetector</h2>
<h3 id="77-methods-to-build-the-languagedetector">7.7 Methods to build the LanguageDetector</h3>

<p>There might be classification tasks where you know beforehand that your
language data is definitely not written in Latin, for instance. The detection
@@ -1948,19 +1937,11 @@ <h5>Inherited Members</h5>
<p>A list is returned containing those languages which the
calling instance of LanguageDetector has been built from.
The entries are sorted by their confidence value in
descending order. The values that this method computes
are part of a relative confidence metric, not of an
absolute one. Each value is a number between 0.0 and 1.0.</p>

<p>If the language is unambiguously identified by the rule
engine, the value 1.0 will always be returned for this
language. The other languages will receive a value of 0.0.
If the statistics engine is additionally needed, the most
likely language will be returned with value 0.99 and the
least likely language will be returned with value 0.01.
All other languages get values assigned between 0.01 and
0.99, denoting how less likely those languages are in
comparison to the most likely language.</p>
descending order. Each value is a probability between
0.0 and 1.0. The probabilities of all languages will sum to 1.0.
If the language is unambiguously identified by the rule engine,
the value 1.0 will always be returned for this language. The
other languages will receive a value of 0.0.</p>

<p>Args:
text (str): The text for which to compute confidence values.</p>
@@ -1989,8 +1970,7 @@ <h5>Inherited Members</h5>
computes is a number between 0.0 and 1.0. If the language is
unambiguously identified by the rule engine, the value 1.0 will
always be returned. If the given language is not supported by this
detector instance, the value 0.0 will always be returned. Otherwise,
a value between 0.01 and 0.99 will be returned.</p>
detector instance, the value 0.0 will always be returned.</p>

<p>Args:
text (str): The text for which to compute the confidence value.</p>
2 changes: 1 addition & 1 deletion docs/search.js

Large diffs are not rendered by default.

Binary file modified images/plots/barplot-average.png
Binary file modified images/plots/barplot-sentences.png
Binary file modified images/plots/barplot-single-words.png
Binary file modified images/plots/barplot-word-pairs.png
Binary file modified images/plots/boxplot-average.png
Binary file modified images/plots/boxplot-sentences.png
Binary file modified images/plots/boxplot-single-words.png
Binary file modified images/plots/boxplot-word-pairs.png
23 changes: 8 additions & 15 deletions lingua/__init__.py
@@ -324,17 +324,11 @@
In the example above, a list is returned containing those languages which the
calling instance of LanguageDetector has been built from, sorted by
their confidence value in descending order. The values that the detector
computes are part of a **relative** confidence metric, not of an absolute one.
Each value is a number between 0.0 and 1.0.
their confidence value in descending order. Each value is a probability between
0.0 and 1.0. The probabilities of all languages will sum to 1.0.
If the language is unambiguously identified by the rule engine, the value 1.0
will always be returned for this language. The other languages will receive a
value of 0.0. If the statistics engine is additionally needed, the most likely
language will be returned with value 0.99 and the least likely language will
be returned with value 0.01. All other languages get values assigned between
0.01 and 0.99, denoting how less likely those languages are in comparison to
the most likely language.
value of 0.0.
There is also a method for returning the confidence value for one specific
language only:
@@ -352,10 +346,9 @@
The value that this method computes is a number between 0.0 and 1.0. If the
language is unambiguously identified by the rule engine, the value 1.0 will
always be returned. If the given language is not supported by this detector
instance, the value 0.0 will always be returned. Otherwise, a value between
0.01 and 0.99 will be returned.
instance, the value 0.0 will always be returned.
## 7.4 Eager loading versus lazy loading
### 7.4 Eager loading versus lazy loading
By default, *Lingua* uses lazy-loading to load only those language models on
demand which are considered relevant by the rule-based filter engine. For web
@@ -370,7 +363,7 @@
Multiple instances of `LanguageDetector` share the same language models in
memory which are accessed asynchronously by the instances.
## 7.5 Low accuracy mode versus high accuracy mode
### 7.5 Low accuracy mode versus high accuracy mode
*Lingua's* high detection accuracy comes at the cost of being noticeably slower
than other language detectors. The large language models also consume significant
@@ -397,7 +390,7 @@
the texts you want to classify you can almost always rule out certain languages as impossible
or unlikely to occur.
## 7.6 Detection of multiple languages in mixed-language texts
### 7.6 Detection of multiple languages in mixed-language texts
In contrast to most other language detectors, *Lingua* is able to detect multiple
languages in mixed-language texts. This feature can yield quite reasonable results but
@@ -428,7 +421,7 @@
is returned. Each entry in the list describes a contiguous single-language text section,
providing start and end indices of the respective substring.
## 7.7 Methods to build the LanguageDetector
### 7.7 Methods to build the LanguageDetector
There might be classification tasks where you know beforehand that your
language data is definitely not written in Latin, for instance. The detection