Merge pull request #83 from allo-media/master
Version 2.5.0
rtxm authored Dec 9, 2022
2 parents 47dda52 + 636301c commit a24659d
Showing 26 changed files with 2,527 additions and 242 deletions.
4 changes: 2 additions & 2 deletions .circleci/config.yml
@@ -47,10 +47,10 @@ jobs:
name: build doc
command: |
. venv/bin/activate
cd doc && make html
cd docs && make html
- store_artifacts:
path: doc/_build/html/
path: docs/_build/html/
destination: html-doc

deploy:
166 changes: 149 additions & 17 deletions README.rst
@@ -6,8 +6,9 @@ text2num

``text2num`` is a python package that provides functions and parser classes for:

- parsing numbers expressed as words in French, English, Spanish and Portuguese and convert them to integer values;
- detect ordinal, cardinal and decimal numbers in a stream of French, English, Spanish and Portuguese words and get their decimal digit representations. Spanish does not support ordinal numbers yet.
- Parsing numbers expressed as words in French, English, Spanish, Portuguese, German and Catalan and converting them to integer values.
- Detecting ordinal, cardinal and decimal numbers in a stream of French, English, Spanish and Portuguese words and getting their decimal digit representations. NOTE: Spanish does not support ordinal numbers yet.
- Detecting ordinal, cardinal and decimal numbers in German text (BETA). NOTE: 'relaxed=False' is not supported yet (the parser behaves as if 'relaxed=True').

Compatibility
-------------
@@ -73,6 +74,26 @@ English examples:
>>> text2num("eighty-one", "en")
81
Russian examples:
.. code-block:: python
>>> from text_to_num import text2num
>>> text2num("пятьдесят один миллион пятьсот семьдесят восемь тысяч триста два", "ru")
51578302
>>> text2num("миллиард миллион тысяча один", "ru")
1001001001
>>> text2num("один миллиард один миллион одна тысяча один", "ru")
1001001001
>>> text2num("восемьдесят один", "ru")
81
Spanish examples:
.. code-block:: python
@@ -102,6 +123,41 @@ Portuguese examples:
>>> text2num("vinte e quatro milhões duzentos mil quarenta e sete", "pt")
24200047
German examples:
.. code-block:: python
>>> from text_to_num import text2num
>>> text2num("einundfünfzigmillionenfünfhundertachtundsiebzigtausenddreihundertzwei", "de")
51578302
>>> text2num("ein und achtzig", "de")
81
Catalan examples:
.. code-block:: python
>>> from text_to_num import text2num
>>> text2num('noranta-cinc', "ca")
95
>>> text2num('huitanta-u', "ca")
81
>>> text2num('mil nou-cents noranta-nou', "ca")
1999
>>> text2num("cinquanta-un milions cinc-cents setanta-vuit mil tres-cents dos", "ca")
51578302
>>> text2num('mil mil dos-cents', "ca")
ValueError: invalid literal for text2num: 'mil mil dos-cents'
Find and transcribe
~~~~~~~~~~~~~~~~~~~
@@ -113,23 +169,27 @@ French:
>>> from text_to_num import alpha2digit
>>> sentence = (
... "Huit cent quarante-deux pommes, vingt-cinq chiens, mille trois chevaux, "
... "douze mille six cent quatre-vingt-dix-huit clous.\n"
... "Quatre-vingt-quinze vaut nonante-cinq. On tolère l'absence de tirets avant les unités : "
... "soixante seize vaut septante six.\n"
... "Nombres en série : douze quinze zéro zéro quatre vingt cinquante-deux cent trois cinquante deux "
... "trente et un.\n"
... "Ordinaux: cinquième troisième vingt et unième centième mille deux cent trentième.\n"
... "Décimaux: douze virgule quatre-vingt dix-neuf, cent vingt virgule zéro cinq ; "
... "mais soixante zéro deux."
... )
>>> print(alpha2digit(sentence))
... "Huit cent quarante-deux pommes, vingt-cinq chiens, mille trois chevaux, "
... "douze mille six cent quatre-vingt-dix-huit clous.\n"
... "Quatre-vingt-quinze vaut nonante-cinq. On tolère l'absence de tirets avant les unités : "
... "soixante seize vaut septante six.\n"
... "Nombres en série : douze quinze zéro zéro quatre vingt cinquante-deux cent trois cinquante deux "
... "trente et un.\n"
... "Ordinaux: cinquième troisième vingt et unième centième mille deux cent trentième.\n"
... "Décimaux: douze virgule quatre-vingt dix-neuf, cent vingt virgule zéro cinq ; "
... "mais soixante zéro deux."
... )
>>> print(alpha2digit(sentence, "fr", ordinal_threshold=0))
842 pommes, 25 chiens, 1003 chevaux, 12698 clous.
95 vaut 95. On tolère l'absence de tirets avant les unités : 76 vaut 76.
Nombres en série : 12 15 004 20 52 103 52 31.
Ordinaux: 5ème 3ème 21ème 100ème 1230ème.
Décimaux: 12,99, 120,05 ; mais 60 02.
>>> sentence = "Cinquième premier second troisième vingt et unième centième mille deux cent trentième."
>>> print(alpha2digit(sentence, "fr", ordinal_threshold=3))
5ème premier second troisième 21ème 100ème 1230ème.
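Judging from the two French outputs above, ``ordinal_threshold`` appears to leave ordinals at or below the threshold as words and to transcribe the larger ones. The toy sketch below only illustrates that rule and is not how ``text2num`` is implemented; ``transcribe_ordinals`` and the small lookup table are hypothetical.

```python
# Toy illustration only: NOT the text2num implementation.
# Hand-written subset of French ordinals: word -> (value, digit form).
# The "1er" and "2nd" forms are common French abbreviations assumed here,
# not output observed from the library.
FRENCH_ORDINALS = {
    "premier": (1, "1er"),
    "second": (2, "2nd"),
    "troisième": (3, "3ème"),
    "cinquième": (5, "5ème"),
    "centième": (100, "100ème"),
}


def transcribe_ordinals(words, ordinal_threshold=3):
    """Replace known ordinals strictly above the threshold by their digit form."""
    result = []
    for word in words:
        entry = FRENCH_ORDINALS.get(word.lower())
        if entry is not None and entry[0] > ordinal_threshold:
            result.append(entry[1])
        else:
            # Unknown words and small ordinals are kept unchanged.
            result.append(word)
    return result
```

Under this reading, a threshold of 3 keeps "premier", "second" and "troisième" as words, while a threshold of 0 transcribes every ordinal.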
English:
@@ -142,7 +202,23 @@ English:
'On May 23rd, I bought 25 cows, 12 chickens and 125.40 kg of potatoes.'
Spanish (ordinals not supported):
Russian:
.. code-block:: python
>>> from text_to_num import alpha2digit
>>> # the fractional part does not handle qualifiers such as "пять десятых" (five tenths), "двенадцать сотых" (twelve hundredths), "so many hundred-thousandths", etc., so it is better to omit them
>>> text = "Двадцать пять коров, двенадцать сотен цыплят и сто двадцать пять точка сорок кг картофеля."
>>> alpha2digit(text, "ru")
'25 коров, 1200 цыплят и 125.40 кг картофеля.'
>>> text = "каждый пятый на первый второй расчитайсь!"
>>> alpha2digit(text, 'ru', ordinal_threshold=0)
'каждый 5ый на 1ый 2ой расчитайсь!'
Spanish (ordinals not supported yet):
.. code-block:: python
@@ -169,12 +245,68 @@ Portuguese:
>>> text = "Temos mais vinte graus dentro e menos quinze fora."
>>> alpha2digit(text, "pt")
'Temos +20 graus dentro e -15 fora.'
'Temos +20 graus dentro e -15 fora.'
>>> text = "Ordinais: quinto, terceiro, vigésimo, vigésimo primeiro, centésimo quarto"
>>> alpha2digit(text, "pt")
'Ordinais: 5º, terceiro, 20ª, 21º, 104º'
>>> alpha2digit(text, "pt")
'Ordinais: 5º, terceiro, 20ª, 21º, 104º'
German (BETA; note: the 'relaxed' parameter is not supported yet and behaves as 'True'):
.. code-block:: python
>>> from text_to_num import alpha2digit
>>> text = "Ich habe fünfundzwanzig Kühe, zwölf Hühner und einhundertfünfundzwanzig kg Kartoffeln gekauft."
>>> alpha2digit(text, "de")
'Ich habe 25 Kühe, 12 Hühner und 125 kg Kartoffeln gekauft.'
>>> text = "Die Temperatur beträgt minus fünfzehn Grad."
>>> alpha2digit(text, "de")
'Die Temperatur beträgt -15 Grad.'
>>> text = "Die Telefonnummer lautet plus dreiunddreißig neun sechzig null sechs zwölf einundzwanzig."
>>> alpha2digit(text, "de")
'Die Telefonnummer lautet +33 9 60 0 6 12 21.'
>>> text = "Der zweiundzwanzigste Januar zweitausendzweiundzwanzig."
>>> alpha2digit(text, "de")
'22. Januar 2022'
>>> text = "Es ist ein Buch mit dreitausend Seiten aber nicht das erste."
>>> alpha2digit(text, "de", ordinal_threshold=0)
'Es ist ein Buch mit 3000 Seiten aber nicht das 1..'
>>> text = "Pi ist drei Komma eins vier und so weiter, aber nicht drei Komma vierzehn :-p"
>>> alpha2digit(text, "de", ordinal_threshold=0)
'Pi ist 3,14 und so weiter, aber nicht 3 Komma 14 :-p'
Catalan:
.. code-block:: python
>>> from text_to_num import alpha2digit
>>> text = ("Huit-centes quaranta-dos pomes, vint-i-cinc gossos, mil tres cavalls, dotze mil sis-cents noranta-huit claus.\n Vuitanta-u és igual a huitanta-u.\n Nombres en sèrie: dotze quinze zero zero quatre vint cinquanta-dos cent tres cinquanta-dos trenta-u.\n Ordinals: cinquè tercera vint-i-uena centè mil dos-cents trentena.\n Decimals: dotze coma noranta-nou, cent vint coma zero cinc; però seixanta zero dos.")
>>> print(alpha2digit(text, "ca", ordinal_threshold=0))
842 pomes, 25 gossos, 1003 cavalls, 12698 claus.
81 és igual a 81.
Nombres en sèrie: 12 15 004 20 52 103 52 31.
Ordinals: 5è 3a 21a 100è 1230a.
Decimals: 12,99, 120,05; però 60 02.
>>> text = "Cinqué primera segona tercer vint-i-ué centena mil dos-cents trenté."
>>> print(alpha2digit(text, "ca", ordinal_threshold=3))
5é primera segona tercer 21é 100a 1230é.
>>> text = "Compràrem vint-i-cinc vaques, dotze gallines i cent vint-i-cinc coma quaranta kg de creïlles."
>>> alpha2digit(text, "ca")
'Compràrem 25 vaques, 12 gallines i 125,40 kg de creïlles.'
>>> text = "Fa més vint graus dins i menys quinze fora."
>>> alpha2digit(text, "ca")
'Fa +20 graus dins i -15 fora.'
Read the complete documentation on `ReadTheDocs <http://text2num.readthedocs.io/>`_.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion doc/conf.py → docs/conf.py
@@ -61,7 +61,7 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
language = "en"

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
16 changes: 13 additions & 3 deletions doc/contribute.rst → docs/contributing.rst
@@ -29,14 +29,14 @@ Install from sources
First, create and activate a virtual environment with the tool of your preference.

Then clone https://github.com/allo-media/text2num in your workspace.
If you are going to submit some patches, you should fork the project on Github and clone
If you are going to submit some patches, you should fork the project on GitHub and clone
your own fork in your workspace.

Finally, install the sources in-place::

python setup.py develop

You do that once. Then, any change you make to the code is immediatly visible when you import the modules.
You do that once. Then, any change you make to the code is immediately visible when you import the modules.

Run the tests
-------------
@@ -55,6 +55,16 @@ We also use mypy::
Submit changes
--------------

If you wish to submit changes, fork the projec on github, and clone your own fork locally.
If you wish to contribute code or documentation to the project, you should first open an issue
on https://github.com/allo-media/text2num/issues to describe what you intend to do, and
why:

* if it's a bug fix, link to the related issues or describe precisely, with examples, what the faulty behavior is;
* if it's a new feature, describe the use case, with examples, and why it matters;
* if it's new or updated documentation, describe precisely which parts you are going to edit, to avoid editing conflicts;
* if it's support for a new language, announce it clearly in order to avoid duplicate effort and to get help from other people interested in that language.

Once you get positive feedback on your issue, you can fork the project on GitHub and start working on the code.

All PRs should be made from a dedicated branch, not from *master*, please.
Please check your files are in proper Unicode encoded as UTF-8, and that the line endings follow the Unix convention (LF).
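The encoding and line-ending check can be automated before opening a PR. The following is a minimal sketch and not a tool shipped with the project: the name ``encoding_problems`` is hypothetical, and it simply reports files that are not valid UTF-8 or that contain carriage returns.

```python
from pathlib import Path


def encoding_problems(path):
    """Return a list of human-readable problems found in the file at `path`."""
    data = Path(path).read_bytes()
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8 (byte offset {exc.start})")
    # Unix line endings are LF only, so any CR byte is suspicious.
    if b"\r" in data:
        problems.append("contains CR characters (expected LF-only line endings)")
    return problems
```

An empty list means the file passed both checks; the function could be run over every changed file in a branch.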
2 changes: 1 addition & 1 deletion doc/index.rst → docs/index.rst
@@ -19,7 +19,7 @@ Welcome to text2num's documentation!
:caption: Contents:

quickstart
contribute
contributing
api
license

File renamed without changes.
File renamed without changes.
11 changes: 7 additions & 4 deletions setup.py
@@ -1,7 +1,7 @@
from setuptools import setup, find_packages


VERSION = "2.4.0"
VERSION = "2.5.0"


def readme():
@@ -12,7 +12,7 @@ def readme():
setup(
name="text2num",
version=VERSION,
description="Parse and convert numbers written in French, Spanish, English or Portuguese into their digit representation.",
description="Parse and convert numbers written in French, Spanish, English, Portuguese, German, Catalan or Russian into their digit representation.",
long_description=readme(),
classifiers=[
"Development Status :: 5 - Production/Stable",
@@ -23,9 +23,12 @@ def readme():
"Natural Language :: French",
"Natural Language :: English",
"Natural Language :: Spanish",
"Natural Language :: Portuguese"
"Natural Language :: Portuguese",
"Natural Language :: German",
"Natural Language :: Catalan",
"Natural Language :: Russian"
],
keywords="French, Spanish, English and Portuguese NLP words-to-numbers",
keywords="French Spanish English Portuguese German Catalan Russian NLP words-to-numbers",
url="https://github.com/allo-media/text2num",
author="Allo-Media",
author_email="[email protected]",