Fix/credits #150

Merged
merged 589 commits into from
May 10, 2021
Changes from 250 commits
Commits
c143d61
changing path in test biterm
amaleelhamri Sep 24, 2020
aa89f2f
adding comment TODO
amaleelhamri Sep 24, 2020
eb7ab37
updating path for files topic modeling
amaleelhamri Sep 24, 2020
69fea26
removing test for language detection
amaleelhamri Sep 24, 2020
7e572b7
adding pylint to requirements
amaleelhamri Sep 24, 2020
5354067
fixing lint issues on keyword_extractor.py
amaleelhamri Sep 24, 2020
d43eef4
fixing lint issues on ngrams.py
amaleelhamri Sep 24, 2020
13415e6
fixing lint issues on text_summary.py
amaleelhamri Sep 24, 2020
8d90917
fix lint issues on nautilus_nlp/analysis/visualize.py
amaleelhamri Sep 24, 2020
dae686b
linting config.py
amaleelhamri Sep 24, 2020
c467246
not downloading fasttext in CI
amaleelhamri Sep 24, 2020
22374e8
downloading spacy from requirements
amaleelhamri Sep 24, 2020
5f3fea6
installing in CI with pip3
amaleelhamri Sep 24, 2020
66f2e66
adding external requirements to fix CI
amaleelhamri Sep 24, 2020
b98b2af
adding external requirements to fix CI
amaleelhamri Sep 24, 2020
b4b4c35
removing ipython to fix CI
amaleelhamri Sep 24, 2020
eda21ec
removing pip3
amaleelhamri Sep 24, 2020
5b9d72c
rollback to old versions packages
amaleelhamri Sep 24, 2020
32e2bed
rolling back to old requirements and CI
amaleelhamri Sep 24, 2020
b309e35
removing make_dataset.py file
amaleelhamri Oct 1, 2020
4eb3d37
linting data_augmentation.py
amaleelhamri Oct 1, 2020
100067b
adding pylintrc
amaleelhamri Oct 1, 2020
2696280
linting social_preprocess.py
amaleelhamri Oct 1, 2020
9b147a8
linting text_preprocess.py
amaleelhamri Oct 1, 2020
01d6cd9
linting stopwords.py
amaleelhamri Oct 1, 2020
f164f86
linting token_preprocess.py
amaleelhamri Oct 1, 2020
c11ff9a
fixing phone_number tests
amaleelhamri Oct 1, 2020
1db6e33
fixing pylint phone_number
amaleelhamri Oct 1, 2020
6ce83b4
linting tokenizer.py
amaleelhamri Oct 1, 2020
db3f176
removing compat.py
amaleelhamri Oct 1, 2020
3abc58e
linting constants.py
amaleelhamri Oct 1, 2020
3c139ed
removing unused utils emoji
amaleelhamri Oct 1, 2020
194738c
fix pylint file_loader
amaleelhamri Oct 1, 2020
5529c6e
fixing pylint on phone_number.py
amaleelhamri Oct 1, 2020
9fa3b4c
removing uncalled utils vector_similarity
amaleelhamri Oct 1, 2020
78b48a0
linting test_biterm.py
amaleelhamri Oct 1, 2020
6e9613e
fixing lint file_loader
amaleelhamri Oct 1, 2020
853f66d
linting test_document_loader.py
amaleelhamri Oct 1, 2020
3a64275
linting tests/test_fix_bad_encoding
amaleelhamri Oct 1, 2020
75914c4
linting tests/test_preprocessor.py
amaleelhamri Oct 1, 2020
c895d52
linting tests/test_topic_modeling_short_text.py
amaleelhamri Oct 1, 2020
6b9a5e2
linting biterm_model.py
rafaelleaygalenq Oct 5, 2020
a3ab9a9
linting lda.py
rafaelleaygalenq Oct 5, 2020
812b49e
fix ipython
rafaelleaygalenq Oct 5, 2020
aaa4065
fix ipython version
rafaelleaygalenq Oct 5, 2020
ad49727
linting nmf_model.py
rafaelleaygalenq Oct 5, 2020
ba6505e
update docstring
rafaelleaygalenq Oct 5, 2020
3a0afe2
linting seanmf_model.py and fix tests
rafaelleaygalenq Oct 5, 2020
3cd41be
linting topic_modeling_short_text.py
rafaelleaygalenq Oct 5, 2020
5357ac9
adding linting to CI
amaleelhamri Oct 5, 2020
babcb3c
adding pylint to requirements
amaleelhamri Oct 5, 2020
9157732
replacing x variable with counter
amaleelhamri Oct 7, 2020
d2762e7
replacing bad return to lign
amaleelhamri Oct 7, 2020
e52e490
renaming variable words to nb_words
amaleelhamri Oct 7, 2020
b7ce7f7
adding example of variable countrylist in docstring
amaleelhamri Oct 7, 2020
fe33739
typo counttry > country
amaleelhamri Oct 7, 2020
97266d7
putting anonymous phone number in tests
amaleelhamri Oct 7, 2020
6683a73
putting anonymous phone number in tests
amaleelhamri Oct 7, 2020
59145ab
replacing logging.warning with warnings.warn
amaleelhamri Oct 8, 2020
95c4111
Merge pull request #123 from artefactory/refacto
amaleelhamri Oct 8, 2020
7de3ecc
refacto data augmentation functions
rafaelleaygalenq Oct 14, 2020
e598d95
fix linting
rafaelleaygalenq Oct 14, 2020
0aa06a8
refacto data aug functions, remove utterance notion
rafaelleaygalenq Oct 14, 2020
9cadf3a
rename function and clarify custom errors message
rafaelleaygalenq Oct 15, 2020
6e2571a
feat: add unit test
hugovasselin Oct 16, 2020
90d87f1
feat: add unit test
hugovasselin Oct 16, 2020
35954e2
feat: add test for kw extractor
hugovasselin Oct 16, 2020
20bf151
refacto: add typing
hugovasselin Oct 16, 2020
5f8d8f0
feat: add typing and docstring
hugovasselin Oct 16, 2020
68a9824
feat: add docstring and typing
hugovasselin Oct 16, 2020
d15d380
feat: add typing and docstr
hugovasselin Oct 16, 2020
e866ca3
feat: harmonize docstring + tying
hugovasselin Oct 16, 2020
06db34f
fix: small typo in docstring
hugovasselin Oct 16, 2020
d56c939
feat: typing + docstring
hugovasselin Oct 16, 2020
b787860
feat: harmonize docstrings
hugovasselin Oct 16, 2020
770ba5b
feat: harmonize docstring
hugovasselin Oct 16, 2020
90e1d09
refacto: harmonize docstring + typing
hugovasselin Oct 16, 2020
96bafb9
refacto: add f strings
hugovasselin Oct 18, 2020
3f0e03c
refacto: add typing + small typos
hugovasselin Oct 18, 2020
27c5501
refacto: harmonize optional in docstring
hugovasselin Oct 18, 2020
3556da2
fix: typing error
hugovasselin Oct 18, 2020
bf94722
added first version of actions
Oct 28, 2020
70d131d
modified called events
Oct 28, 2020
a57bffa
yaml syntax fix
Oct 28, 2020
23cd333
changed wd path
Oct 28, 2020
bb9fe1e
yml syntax fix
Oct 28, 2020
3699a58
refacto following comments
rafaelleaygalenq Nov 5, 2020
923c44b
add actions checkout to see if it can find requirements that way
Nov 5, 2020
04fa911
remove text output in clean entities function
rafaelleaygalenq Nov 5, 2020
1f5a643
upgrade pip and specify python version
Nov 5, 2020
b5af78d
add check of entities duplicates
rafaelleaygalenq Nov 5, 2020
dcba544
remove unused import
rafaelleaygalenq Nov 5, 2020
01efce4
update docstring clean entities
rafaelleaygalenq Nov 5, 2020
9e43d70
removing classes
amaleelhamri Nov 5, 2020
ec38260
removing duplicated test
amaleelhamri Nov 5, 2020
d2ffa48
Merge pull request #126 from artefactory/ra-data-aug
rafaelleaygalenq Nov 5, 2020
3341997
adding preprocessor object
amaleelhamri Nov 5, 2020
fce1548
modified test to see if ci working
Nov 6, 2020
7883e0d
fixed tests so CI works
Nov 6, 2020
556b415
removed travis
Nov 6, 2020
e12217b
updated ci
Nov 6, 2020
b3a3c6f
separated spacy models and fasttext in ci
Nov 6, 2020
290602e
removed fast text for ci temporarely
Nov 6, 2020
642f1a9
install spacy models after requirements
Nov 6, 2020
4cbafcd
removed spacy install in ci
Nov 9, 2020
b87766d
refacto: add typo
hugovasselin Nov 13, 2020
faef8d1
Merge branch 'dev' into refacto/hv-harmo-docstring
hugovasselin Nov 13, 2020
34cae88
feat: add docstring
hugovasselin Nov 13, 2020
8ee5605
fix: change error type
hugovasselin Nov 13, 2020
7ab683a
fix: typo
hugovasselin Nov 13, 2020
5c51fd2
refacto: import functions rather than everything
hugovasselin Nov 13, 2020
a823dde
refacto: black this shit out
hugovasselin Nov 13, 2020
648560e
fix: add missing param
hugovasselin Nov 13, 2020
68f3ed4
fix: add python 3.8 support
hugovasselin Nov 13, 2020
7e008f1
test: requirements
hugovasselin Nov 13, 2020
592d034
text: attempt to fix requirements
hugovasselin Nov 13, 2020
81ac21b
fix: fix unit test
hugovasselin Nov 13, 2020
4f659d3
fix: linter errors
hugovasselin Nov 13, 2020
1e77533
fix: linting issues
hugovasselin Nov 13, 2020
dd14562
Merge pull request #128 from artefactory/refacto/hv-harmo-docstring
amaleelhamri Nov 19, 2020
bb44253
[Cleaning] Removing old notebooks
Nov 14, 2020
7eff5f2
[Cleaning] Removing topic modeling and visualization files
Nov 14, 2020
c08024c
[Cleaning] Cleaning travis yaml and upgrading requirements
Nov 14, 2020
3053e4d
[Cleaning] Removing lang_id files and updating readme
Nov 14, 2020
1bf04e3
[Cleaning] Removing install requirements in setuppy
Nov 14, 2020
ffec646
[Cleaning] Keeping keyword extractor and cleaning requirements
Nov 20, 2020
f310ee7
[Remove] Remove test for removed function
Nov 20, 2020
059e760
[Fix] Add newline for lint
Nov 20, 2020
f3b9c26
Merge pull request #129 from artefactory/cleaning/cleaning-requirements
Bruce-at-Artefact Nov 23, 2020
f7661b3
added test coverage
Nov 23, 2020
3a105ec
fixed pytests version
Nov 23, 2020
17cc129
merged dev
Nov 24, 2020
16c6865
fixed conflict
Nov 24, 2020
f19daa5
[Init] New readme draft
Nov 26, 2020
e55577f
[Fix] Cleaning draft
Nov 26, 2020
80f4bcc
add test for data augmentation
rafaelleaygalenq Nov 26, 2020
480af82
update requirements for data augmentation tests
rafaelleaygalenq Nov 26, 2020
2e9ad32
refacto test data augmentation
rafaelleaygalenq Dec 2, 2020
8050eb4
add docstring for process_entities_and_text
rafaelleaygalenq Dec 3, 2020
f76f0c3
Merge pull request #127 from artefactory/feature/ci_workflow
amaleelhamri Dec 3, 2020
fe4862f
Merge branch 'dev' into feature/new-readme
Dec 3, 2020
b83e99b
remove torch and transformers from requirements
rafaelleaygalenq Dec 3, 2020
88d9343
fix linting
rafaelleaygalenq Dec 3, 2020
fdbb56f
Merge pull request #131 from artefactory/ra-data-aug
rafaelleaygalenq Dec 3, 2020
b2c68d7
merging with feature/new-readme
amaleelhamri Dec 3, 2020
bed4879
Merge branch 'dev' into feature/preprocessing_pipelines
amaleelhamri Dec 3, 2020
bb608e1
[FIX] fix conflicts in text_preprocess
rafaelleaygalenq Dec 14, 2020
7c9aac2
[FIX] fix tests
rafaelleaygalenq Dec 14, 2020
ce78c42
fix default preprocessing functions
amaleelhamri Jan 4, 2021
18f693a
adding docstring to preprocessor
amaleelhamri Jan 4, 2021
0c78865
add type hinting
amaleelhamri Jan 4, 2021
df54ddd
adding test for text preprocessor
amaleelhamri Jan 4, 2021
c551adf
remove __init__.py and .gitkeep
amaleelhamri Jan 4, 2021
6f2d3a2
remove __init__.py and .gitkeep
amaleelhamri Jan 4, 2021
75fa21e
Revert "remove __init__.py and .gitkeep"
amaleelhamri Jan 4, 2021
049d807
Revert "remove __init__.py and .gitkeep"
amaleelhamri Jan 4, 2021
bbb5bf1
adding python and pip versions
amaleelhamri Jan 4, 2021
eac588a
upgrade to python 3.7
amaleelhamri Jan 4, 2021
104bc2a
debug CI
amaleelhamri Jan 4, 2021
f3986f1
debug CI
amaleelhamri Jan 4, 2021
b6fa1c7
debug CI
amaleelhamri Jan 4, 2021
2a4bc91
debug CI
amaleelhamri Jan 4, 2021
c2a1a33
update spacy version
amaleelhamri Jan 4, 2021
5b80c77
update spacy models
amaleelhamri Jan 4, 2021
bcb041e
fix spacy version
amaleelhamri Jan 4, 2021
92a2543
fix pylint
amaleelhamri Jan 4, 2021
706e9ba
adding 3.6 3.7 and 3.8 versions in CI
amaleelhamri Jan 5, 2021
73a97fd
removing arg input_str
amaleelhamri Jan 5, 2021
a62391a
merigng social_functions and text_functions to functions param
amaleelhamri Jan 5, 2021
b51ca14
fix pylint
amaleelhamri Jan 5, 2021
e167a46
Merge pull request #132 from artefactory/feature/preprocessing_pipelines
amaleelhamri Jan 6, 2021
7af4cf4
Implemented correction in Tokenizer function in nautilus_nlp/utils/to…
tkumar19088 Jan 13, 2021
5e9b2cf
Merge pull request #134 from artefactory/loadingtokeniserspacy
amaleelhamri Jan 18, 2021
084b1fd
Merge pull request #135 from artefactory/dev
hugovasselin Jan 18, 2021
6f53999
refacto preprocessor
amaleelhamri Jan 22, 2021
ef9da2c
adding Preprocessor in root __init__
amaleelhamri Jan 22, 2021
e997930
reorg repo v1
rafaelleaygalenq Jan 25, 2021
c45e5ac
update import init
rafaelleaygalenq Jan 25, 2021
1ebb201
fix pylint
rafaelleaygalenq Jan 25, 2021
c646072
update ci with new name and fix pylint
rafaelleaygalenq Jan 25, 2021
bdf4f98
rm unused config files
rafaelleaygalenq Jan 25, 2021
962d2cf
Merge pull request #136 from artefactory/feature/refacto_preprocessor
amaleelhamri Jan 26, 2021
de7c7dc
reorg repo v1
rafaelleaygalenq Jan 25, 2021
625b59f
update import init
rafaelleaygalenq Jan 25, 2021
f6cd208
fix pylint
rafaelleaygalenq Jan 25, 2021
b601730
update ci with new name and fix pylint
rafaelleaygalenq Jan 25, 2021
c403924
rm unused config files
rafaelleaygalenq Jan 25, 2021
0f7012d
rebase master and fix conflicts
rafaelleaygalenq Jan 26, 2021
47cccd1
rebase master and fix conflicts
rafaelleaygalenq Jan 26, 2021
c0ffe8f
rename main folder and update readme
rafaelleaygalenq Jan 26, 2021
ca1c35f
Merge pull request #137 from artefactory/ra-reorg
rafaelleaygalenq Jan 26, 2021
13cc879
adding remove stopwords function for text
rafaelleaygalenq Jan 29, 2021
7e264cb
Merge pull request #138 from artefactory/feature/stopwords
rafaelleaygalenq Feb 3, 2021
8ba75e7
adding init files in all modules
amaleelhamri Feb 11, 2021
0d16fcb
Merge pull request #139 from artefactory/hotfix/init
amaleelhamri Feb 11, 2021
5acb46c
moving all lib essential folders in nautilus_nlp/
amaleelhamri Feb 12, 2021
c6b6c2b
fix CI
amaleelhamri Feb 12, 2021
da4ed48
fix CI with python3.7
amaleelhamri Feb 12, 2021
b756b84
fix CI
amaleelhamri Feb 12, 2021
a93b9e7
fix pylint
amaleelhamri Feb 12, 2021
ac00831
fix pylint
amaleelhamri Feb 12, 2021
bb0ce3f
adding python 3.6
amaleelhamri Feb 12, 2021
03e240e
adding python 3.8
amaleelhamri Feb 12, 2021
70faafb
Merge pull request #140 from artefactory/hotfix/init
amaleelhamri Feb 12, 2021
2255353
update readme and fix pipeline with arguments
rafaelleaygalenq Feb 3, 2021
cb16728
fix layout
rafaelleaygalenq Feb 4, 2021
c1c8a69
remove temporary text
rafaelleaygalenq Feb 4, 2021
638fbb9
update lib name
rafaelleaygalenq Feb 11, 2021
f75fd2a
updates with new name and minor fixes for json config files
rafaelleaygalenq Feb 12, 2021
7d6b51d
update readme with slogan
rafaelleaygalenq Feb 12, 2021
8519f35
update readme layout
rafaelleaygalenq Feb 12, 2021
ec53038
update link readme
rafaelleaygalenq Feb 12, 2021
dceebb6
update structure
rafaelleaygalenq Feb 12, 2021
ee446e0
fix typo readme
rafaelleaygalenq Feb 12, 2021
e504f96
update project organization readme
rafaelleaygalenq Feb 12, 2021
9fd8d21
Merge pull request #141 from artefactory/feature/readme
rafaelleaygalenq Feb 12, 2021
ed3ee85
update readme
rafaelleaygalenq Feb 12, 2021
2ced84d
rename ci_actions.yml > ci.yml
amaleelhamri Feb 15, 2021
a1a7b4e
update setup.py
amaleelhamri Feb 15, 2021
bcb3e61
adding requirements_dev
amaleelhamri Feb 15, 2021
e072f01
adding CD to publish package to pypi
amaleelhamri Feb 15, 2021
f86f36a
adding spacy model download in CI
amaleelhamri Feb 15, 2021
040a62e
adding pypi publication only when master is merged
amaleelhamri Feb 15, 2021
04ae61e
renaming classic > basic
amaleelhamri Feb 15, 2021
ec1ad63
nlpretext/social/preprocess.py
amaleelhamri Feb 15, 2021
da35f13
removing CD
amaleelhamri Feb 16, 2021
0772036
update link readme
amaleelhamri Feb 16, 2021
b9a7006
update remove_stopwords token with arg lang instead of list
amaleelhamri Feb 16, 2021
adf5739
update version 1.0.0
amaleelhamri Feb 16, 2021
09db3b4
add mosestekonizer version
amaleelhamri Feb 16, 2021
633af61
typo README
amaleelhamri Feb 16, 2021
2a95b1d
update project documentation
amaleelhamri Feb 16, 2021
7112402
fix Nautilus > NLPretext
amaleelhamri Feb 16, 2021
8180ba1
Merge pull request #142 from artefactory/feature/cd
amaleelhamri Feb 16, 2021
6660000
fix conflicts
rafaelleaygalenq Feb 16, 2021
4fbb350
add init files
rafaelleaygalenq Feb 16, 2021
3cbf230
update readme classic changed to basic
rafaelleaygalenq Feb 16, 2021
d2f6ccc
Merge pull request #143 from artefactory/fix/init
rafaelleaygalenq Feb 16, 2021
b2d22ea
change Licence to Apache
amaleelhamri Feb 17, 2021
789438b
correct license
amaleelhamri Feb 18, 2021
2909044
correct license
amaleelhamri Feb 18, 2021
275bb7c
Merge branch 'master' into feature/license
amaleelhamri Feb 18, 2021
c0acd7b
Merge pull request #144 from artefactory/feature/license
amaleelhamri Feb 18, 2021
68075d9
fix conflicts
rafaelleaygalenq May 10, 2021
84dec97
add credits
rafaelleaygalenq May 10, 2021
aaea0e0
clean test
rafaelleaygalenq May 10, 2021
4a470f9
fix test replace url
rafaelleaygalenq May 10, 2021
7545f26
fix: englify tests
hugovasselin May 10, 2021
4ca5127
fix: emoji test
hugovasselin May 10, 2021
45f00dc
fix: update version number
hugovasselin May 10, 2021
17 changes: 16 additions & 1 deletion README.md
@@ -152,7 +152,6 @@ print(example)

# Make HTML documentation


In order to make the html Sphinx documentation, you need to run at the nlpretext root path:
`sphinx-apidoc -f nlpretext -o docs/`
This will generate the .rst files.
@@ -184,3 +183,19 @@ You can now open the file index.html located in the build folder.
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
└── pylintrc <- The linting configuration file


# Credits

- [textacy](https://github.com/chartbeat-labs/textacy) for the following basic preprocessing functions:
  - `fix_bad_unicode`
  - `normalize_whitespace`
  - `unpack_english_contractions`
  - `replace_urls`
  - `replace_emails`
  - `replace_numbers`
  - `replace_currency_symbols`
  - `remove_punct`
  - `remove_accents`
  - `replace_phone_numbers` *(with some modifications of our own)*
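
As a rough illustration of what two of the credited helpers do, here is a minimal standalone sketch of whitespace normalization and URL masking. It is not the textacy or NLPretext code: the `sketch_` function names and the simplified regular expressions are assumptions made for this example, and the real patterns cover many more cases.

```python
import re

# Assumed, simplified patterns (hypothetical; not the library's own regexes).
_LINEBREAKS = re.compile(r"(?:\r\n|\n)+")
_SPACES = re.compile(r"[^\S\n]+")  # any whitespace except a newline
_SIMPLE_URL = re.compile(r"(?:https?://|www\.)\S+")


def sketch_normalize_whitespace(text: str) -> str:
    """Collapse runs of linebreaks to one newline and runs of spacing to one space."""
    text = _LINEBREAKS.sub("\n", text)
    text = _SPACES.sub(" ", text)
    return text.strip()


def sketch_replace_urls(text: str, replace_with: str = "*URL*") -> str:
    """Replace every match of the simplified URL pattern with a placeholder."""
    return _SIMPLE_URL.sub(replace_with, text)


if __name__ == "__main__":
    print(repr(sketch_normalize_whitespace("a  b\n\n\nc ")))
    print(sketch_replace_urls("read https://example.org/doc first"))
```

Because the URL pattern here simply runs to the next whitespace, it would also swallow trailing punctuation such as a comma; a production regex needs to be stricter.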

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
1.0.2
1.0.2
hugovasselin marked this conversation as resolved.
1 change: 1 addition & 0 deletions nlpretext/_config/constants.py
@@ -15,6 +15,7 @@
# limitations under the License
"""
Collection of regular expressions and other (small, generally useful) constants.
Credits to textacy for some of them: https://github.com/chartbeat-labs/textacy
"""
from __future__ import unicode_literals

50 changes: 50 additions & 0 deletions nlpretext/basic/preprocess.py
@@ -28,6 +28,11 @@

def normalize_whitespace(text) -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Given ``text`` str, replace one or more spacings with a single space, and
one or more linebreaks with a single newline. Also strip leading/trailing
whitespace.
@@ -106,6 +111,11 @@ def remove_eol_characters(text) -> str:

def fix_bad_unicode(text, normalization: str = "NFC") -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Fix unicode text that's "broken" using `ftfy
<http://ftfy.readthedocs.org/>`_;
this includes mojibake, HTML entities and other code cruft,
@@ -133,6 +143,11 @@ def fix_bad_unicode(text, normalization: str = "NFC") -> str:

def unpack_english_contractions(text) -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Replace *English* contractions in ``text`` str with their unshortened
forms.
N.B. The "'d" and "'s" forms are ambiguous (had/would, is/has/possessive),
@@ -173,6 +188,11 @@ def unpack_english_contractions(text) -> str:

def replace_urls(text, replace_with: str = "*URL*") -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Replace all URLs in ``text`` str with ``replace_with`` str.

Parameters
@@ -193,6 +213,11 @@ def replace_urls(text, replace_with: str = "*URL*") -> str:

def replace_emails(text, replace_with="*EMAIL*") -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Replace all emails in ``text`` str with ``replace_with`` str

Parameters
@@ -213,6 +238,11 @@ def replace_phone_numbers(text, country_to_detect: list,
replace_with: str = "*PHONE*",
method: str = "regex") -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Inspired code from textacy: https://github.com/chartbeat-labs/textacy
----

Replace all phone numbers in ``text`` str with ``replace_with`` str

Parameters
@@ -249,6 +279,11 @@ def replace_phone_numbers(text, country_to_detect: list,

def replace_numbers(text, replace_with="*NUMBER*") -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Replace all numbers in ``text`` str with ``replace_with`` str.

Parameters
@@ -267,6 +302,11 @@ def replace_numbers(text, replace_with="*NUMBER*") -> str:

def replace_currency_symbols(text, replace_with=None) -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Replace all currency symbols in ``text`` str with string specified by
``replace_with`` str.

@@ -294,6 +334,11 @@ def replace_currency_symbols(text, replace_with=None) -> str:

def remove_punct(text, marks=None) -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Remove punctuation from ``text`` by replacing all instances of ``marks``
with whitespace.

@@ -327,6 +372,11 @@ def remove_punct(text, marks=None) -> str:

def remove_accents(text, method: str = "unicode") -> str:
"""
----
Copyright 2016 Chartbeat, Inc.
Code from textacy: https://github.com/chartbeat-labs/textacy
----

Remove accents from any accented unicode characters in ``text`` str,
either by transforming them into ascii equivalents or removing them
entirely.
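
The `remove_punct` docstring in this diff describes replacing punctuation marks with whitespace, either for a caller-supplied set of `marks` or for all punctuation. A minimal sketch of that behaviour follows. It is an assumption-laden stand-in rather than the library's implementation: the name `sketch_remove_punct` is hypothetical, a run of listed marks is collapsed to one space, and "all punctuation" is taken to mean every character whose Unicode category starts with `P`.

```python
import re
import unicodedata


def sketch_remove_punct(text: str, marks: str = None) -> str:
    """Replace punctuation in ``text`` with spaces (hypothetical sketch)."""
    if marks:
        # Replace each run of the listed marks with a single space.
        return re.sub("[{}]+".format(re.escape(marks)), " ", text)
    # No marks given: replace every Unicode punctuation character
    # (categories Pc, Pd, Pe, Pf, Pi, Po, Ps) with one space each.
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


if __name__ == "__main__":
    print(sketch_remove_punct("john.doe@artefact.com", ".,;"))
    print(sketch_remove_punct("john.doe@artefact.com"))
```

With `marks=".,;"` the `@` survives because it is not in the list; with `marks=None` it is replaced as well, since `@` is classified as punctuation in Unicode.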
69 changes: 32 additions & 37 deletions tests/test_preprocessor.py
@@ -190,7 +190,7 @@ def test_get_stopwords():
@pytest.mark.parametrize(
"input_tokens, lang, expected_output",
[
(['I', 'like', 'when', 'you', 'move', 'your', 'body', '!'], "en", ['I', 'move', 'body', '!'])
(['I', 'like', 'this', 'song', 'very', 'much', '!'], "en", ['I', 'song', '!'])
],
)
def test_remove_stopwords_tokens(input_tokens, lang, expected_output):
@@ -201,7 +201,7 @@ def test_remove_stopwords_tokens(input_tokens, lang, expected_output):
@pytest.mark.parametrize(
"input_text, lang, expected_output",
[
('I like when you move your body !', 'en', 'I move body !'),
('I like this song very much !', 'en', 'I song !'),
('Can I get a beer?', 'en', 'Can I beer ?'),
('Je vous recommande ce film !', 'fr', 'Je recommande film !'),
('je vous recommande ce film !', 'fr', 'recommande film !'),
@@ -216,7 +216,7 @@ def test_remove_stopwords_text(input_text, lang, expected_output):
@pytest.mark.parametrize(
"input_text, lang, custom_stopwords, expected_output",
[
('I like when you move your body !', 'en', ['body'], 'I move !'),
('I like this song very much !', 'en', ['song'], 'I !'),
('Je vous recommande ce film la scène de fin est géniale !', 'fr',
['film', 'scène'], 'Je recommande fin géniale !'),
],
@@ -249,7 +249,6 @@ def test_remove_accents():
('proportienelle', 'proportienelle'),
('Pour plus de démocratie participative', 'Pour plus de démocratie participative'),
('Transparence de la vie public', 'Transparence de la vie public'),
('18 mois de trop....ca suffit macron', '18 mois de trop....ca suffit macron'),
('Egalité devant les infractions routières', 'Egalité devant les infractions routières')],)
def test_fix_bad_unicode(input_str, expected_str):
result = fix_bad_unicode(input_str)
@@ -287,14 +286,13 @@ def test_unpack_english_contractions(input_str, expected_str):
@pytest.mark.parametrize(
"input_str, expected_str",
[(
"Wan't to contribute to Nautilus? read https://github.com/artefactory/nautilus-nlp/blob/docs/CONTRIBUTING.md"\
"Wan't to contribute to NLPretext? read https://github.com/artefactory/NLPretext/blob/master/CONTRIBUTING.md"\
" first",
"Wan't to contribute to Nautilus? read *URL* first"),
("The ip address of my VM is http://34.76.182.5:8888", "The ip address of my VM is *URL*"),
"Wan't to contribute to NLPretext? read *URL* first"),
("If you go to http://internet.org, you will find a website hosted by FB.",
"If you go to *URL*, you will find a website hosted by FB."),
("Ishttps://waaaou.com/ available?", 'Is*URL* available?'),
("mailto:hugo.vasselin@artefact.com", '*URL*')])
("Ishttps://internet.org/ available?", 'Is*URL* available?'),
("mailto:john.doe@artefact.com", '*URL*')])
def test_replace_urls(input_str, expected_str):
result = replace_urls(input_str)
np.testing.assert_equal(result, expected_str)
@@ -303,10 +301,9 @@ def test_replace_urls(input_str, expected_str):
@pytest.mark.parametrize(
"input_str, expected_str",
[
("my email:hugo.vasselin@artefact.com", "my email:*EMAIL*"),
("my email:john.doe@artefact.com", "my email:*EMAIL*"),
("[email protected] is a temporary email", "*EMAIL* is a temporary email"),
("our emails used to be [email protected]", "our emails used to be *EMAIL*"),
("[email protected],C ton email bb?", '*EMAIL*,C ton email bb?')
("our emails used to be [email protected]", "our emails used to be *EMAIL*")
]
)
def test_replace_emails(input_str, expected_str):
@@ -317,17 +314,17 @@ def test_replace_emails(input_str, expected_str):
@pytest.mark.parametrize(
"input_str, expected_str",
[
("mon 06 bb: 0625093267", "mon 06 bb: *PHONE*"),
("mon 06 bb: 06.25.09.32.67", "mon 06 bb: *PHONE*"),
("call me at +33625093267", "call me at *PHONE*"),
("call me at +33 6 25 09 32 67", "call me at *PHONE*"),
("call me at +33 625 093 267", "call me at *PHONE*"),
("if this unit test doesn't work, call 3615 and says 'ROBIN'",
"if this unit test doesn't work, call *PHONE* and says 'ROBIN'"),
('(541) 754-3010 is a US. Phone', '*PHONE* is a US. Phone'),
('+1-541-754-3010 is an international Phone', '*PHONE* is an international Phone'),
('+1-541-754-3010 Dialed in the US', '*PHONE* Dialed in the US'),
('+1-541-754-3010 Dialed from Germany', '*PHONE* Dialed from Germany')
("mon 06: 0601020304", "mon 06: *PHONE*"),
("mon 06: 06.01.02.03.04", "mon 06: *PHONE*"),
("call me at +33601020304", "call me at *PHONE*"),
("call me at +33 6 01 02 03 04", "call me at *PHONE*"),
("call me at +33 601 020 304", "call me at *PHONE*"),
("if this unit test doesn't work, call 3615 and says 'HELP'",
"if this unit test doesn't work, call *PHONE* and says 'HELP'"),
('(541) 754-0000 is a US. Phone', '*PHONE* is a US. Phone'),
('+1-541-754-0000 is an international Phone', '*PHONE* is an international Phone'),
('+1-541-754-0000 Dialed in the US', '*PHONE* Dialed in the US'),
('+1-541-754-0000 Dialed from Germany', '*PHONE* Dialed from Germany')
]
)
def test_replace_phone_numbers(input_str, expected_str):
@@ -343,9 +340,8 @@ def test_replace_phone_numbers(input_str, expected_str):
"input_str, expected_str",
[
("123, 3 petits chats", "*NUMBER*, *NUMBER* petits chats"),
("l0ve 2 twa <3", "l0ve *NUMBER* twa <*NUMBER*"),
("Give me 45bucks!", "Give me *NUMBER*bucks!"),
("call me at +33625093267", "call me at *NUMBER*")
("call me at +33601020304", "call me at *NUMBER*")
]
)
def test_replace_numbers(input_str, expected_str):
@@ -384,9 +380,9 @@ def test_replace_currency_symbols(input_str, param, expected_str):
("Seriously.,.", '.,;', "Seriously "),
("Seriously...", '.,;', "Seriously "),
("Seriously.!.", '.,;', "Seriously ! "),
("hugo.vasselin@artefact.com", '.,;', "hugo vasselin@artefact com"),
("hugo.vasselin@artefact.com", None, "hugo vasselin artefact com"),
("hugo-vasselin@artefact.com", None, "hugo vasselin artefact com")
("john.doe@artefact.com", '.,;', "john doe@artefact com"),
("john.doe@artefact.com", None, "john doe artefact com"),
("john-doe@artefact.com", None, "john doe artefact com")
]
)
def test_remove_punct(input_str, param, expected_str):
@@ -397,27 +393,26 @@ def test_remove_punct(input_str, param, expected_str):
@pytest.mark.parametrize(
"input_str, expected_str",
[
("👉👌", ""),
("👌", ""),
("🎅🏿⌚", ""),
("🥖✊💦", ""),
("🥖🍷🇫🇷", ""),
("✊", ""),
("J'espère que les 🚓 vont pas lire ce test",
"J'espère que les vont pas lire ce test"),
("J'espère que les vont pas lire ce test🚓",
"J'espère que les vont pas lire ce test")
("Save 🐼 and 🐟",
"Save and "),
]
)
def test_remove_emoji(input_str, expected_str):
result = remove_emoji(input_str)
np.testing.assert_equal(result, expected_str)
assert len(result) == len(expected_str)
assert result == expected_str


@pytest.mark.parametrize(
"input_str, expected_str",
[
("👉👌", ":backhand_index_pointing_right::OK_hand:"),
("⚽️👌", ":soccer_ball::OK_hand:"),
("🎅🏿⌚", ":Santa_Claus_dark_skin_tone::watch:"),
("🥖✊💦", ":baguette_bread::raised_fist::sweat_droplets:"),
("🥖🍷🇫🇷", ":baguette_bread::wine_glass::France:"),
("✊", ":raised_fist:")
]
)
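
For readers without the library at hand, the behaviour checked by `test_remove_emoji` can be approximated with the sketch below. It is a hypothetical stand-in: `sketch_remove_emoji` strips characters from a handful of Unicode ranges chosen for this example, whereas the library's own `remove_emoji` is not shown in this diff and may work differently. The cases are written as a plain table with a single equality assertion, which already implies that the lengths match.

```python
import re

# Assumed coverage (not exhaustive): regional indicators, the main pictograph
# blocks, miscellaneous symbols and dingbats, plus the variation selector and
# zero-width joiner that appear inside emoji sequences.
_EMOJI = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"
    "\U0001F300-\U0001FAFF"
    "\u2600-\u27BF"
    "\uFE0F\u200D"
    "]+"
)


def sketch_remove_emoji(text: str) -> str:
    """Delete every character matched by the simplified emoji pattern."""
    return _EMOJI.sub("", text)


# (input, expected) pairs, using escapes so the source stays ASCII-only.
CASES = [
    ("\U0001F44C", ""),  # OK hand
    ("\u270A", ""),  # raised fist
    ("Save \U0001F43C and \U0001F41F", "Save  and "),  # panda, fish
]

for input_str, expected_str in CASES:
    # String equality implies equal length, so no separate len() check is needed.
    assert sketch_remove_emoji(input_str) == expected_str
```

Note that removing an emoji leaves the surrounding spaces in place, so the last case keeps two consecutive spaces and a trailing one.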