
Nautilus-NLP

The Nautilus NLP library aims to be a meta-library that helps you get started on handling your NLP use case.

This library can help you with:

1. Cleaning text data
2. Normalizing your dataset
3. Automatically training multiclass and multilabel classifiers
4. Discovering topics and clustering your data

You can find a list of the available features in this article.

Feature Request

As an Artefact user, you might be working on an NLP use case and wish to use Nautilus.

However, if you think Nautilus is lacking features that could be useful not only to your use case but also to others, feel free to file an issue with the label "Feature-request".

We will try to add it to the roadmap and implement it as soon as possible.

Installation

Beware: this package has been tested on Python 3.6 and 3.7, and will probably not work under Python 2.7, which reaches end of life on January 1, 2020.

To install this library you should first clone the repository:

git clone git@github.com:artefactory/nautilus-nlp.git && cd nautilus-nlp/

If you don't use the Docker container, we strongly advise you to do these steps in a virtual environment.
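
For example, a minimal setup using Python's built-in venv module (the environment name nautilus-env is just an illustration):

python3 -m venv nautilus-env
source nautilus-env/bin/activate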

First, you need to install the required dependencies:

pip install -r requirements.txt

Then you can install the library via pip:

pip install -e .

Once it's done, please install the additional required libraries:

Installation of the additional required libraries

If you want to leverage all the features of nautilus-nlp, you need to install other required libraries (such as FastText).

Install spaCy language models (highly recommended)

Installing additional spaCy models lets you handle many more languages for text processing features such as lemmatization and tokenization.

To do so, run (inside your virtual environment if you are using one):

bash nautilus_nlp/scripts/download_spacy_models.sh
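
Once the models are downloaded, you can load one directly with spaCy to check the install. A minimal sketch, assuming the script installed the small French model fr_core_news_sm (the exact set of models depends on the script):

import spacy

# load the French model installed by the script (assumed model name)
nlp = spacy.load("fr_core_news_sm")
doc = nlp("Les tempêtes tropicales causent des inondations.")
print([(token.text, token.lemma_) for token in doc])  # per-token lemmas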

Install FastText

Run:

bash nautilus_nlp/scripts/install_fasttext.sh
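
FastText can then be used from Python, for example for supervised text classification. A minimal sketch, assuming the official fasttext Python bindings are installed and a training file train.txt exists in fastText's __label__ format (both the file name and its contents are hypothetical):

import fasttext

# train.txt: one example per line, e.g. "__label__weather tropical storm flooding"
model = fasttext.train_supervised(input="train.txt")
print(model.predict("hurricane season starts in june"))  # (labels, probabilities)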

Install Lang Detect

Run:

bash nautilus_nlp/scripts/download_ft_langdetect.sh
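
This fetches fastText's pretrained language-identification model. A minimal sketch of querying it directly, assuming the script placed the model file lid.176.bin in the working directory (the actual file name and location depend on the script):

import fasttext

# load the pretrained language-identification model (assumed path)
lid_model = fasttext.load_model("lid.176.bin")
print(lid_model.predict("Bonjour tout le monde"))  # e.g. (('__label__fr',), array([...]))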

Install Mallet

Mallet is an implementation of the LDA algorithm, used for topic modeling; a usage sketch follows the install steps below.

  1. Install Mallet. Run:
bash nautilus_nlp/scripts/install_mallet.sh
  2. Install the Java JDK (required to run Mallet). Run:
bash nautilus_nlp/scripts/install_java.sh
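
Once Mallet and Java are installed, one way to run LDA on a tokenized corpus is through gensim's LdaMallet wrapper (gensim < 4.0). A minimal sketch, assuming gensim is installed and Mallet's binary sits at ./mallet-2.0.8/bin/mallet (both the path and the toy corpus are assumptions):

from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet

# toy tokenized corpus; in practice, feed in documents cleaned with the preprocessing tools above
docs = [["tropical", "storm", "flooding", "jamaica"],
        ["hurricane", "cyclone", "season", "atlantic"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

mallet_path = "./mallet-2.0.8/bin/mallet"  # assumed install location
lda = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())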

Handling installation errors

Problem when building FastText on macOS:

If you see this error when building FastText:

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

Then you should type this command in your terminal:

export MACOSX_DEPLOYMENT_TARGET=10.9

command 'gcc' failed with exit status 1

While running pip install -r requirements.txt, you might get the following error message:

command 'gcc' failed with exit status 1

To solve it, run the following command before installing requirements.txt:

conda install pyemd

Cannot uninstall 'package_name'

You might get the following error message while installing the library: Cannot uninstall 'PACKAGE_NAME'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

To fix this, just type pip install -e . --ignore-installed PACKAGE_NAME instead.

Installing nautilus-nlp on a Linux-based VM

If you are installing nautilus on a Linux-powered virtual machine (VM), you might want to install build essentials first. If you don't, you might experience issues installing Cython-based libraries such as spaCy, because the C compiler will be missing.

On a Ubuntu VM (tested on 16.04), you can run the following command before following the classic installation process above:

# install compilers and required software
sudo apt-get update && sudo apt-get install -y build-essential unzip git wget
# install conda
wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
bash Anaconda3-2019.03-Linux-x86_64.sh

Quick start

The notebooks folder contains various notebooks showing how to use this library.

Here is a quick example:

>>> from nautilus_nlp.preprocessing.preprocess import preprocess_text
>>> from nautilus_nlp.preprocessing.tokenizer import tokenize
>>> from nautilus_nlp.preprocessing.lemmatization import lemmatize_english_tokens
[nltk_data] Downloading package wordnet to /Users/hugo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
>>> text = """Tropical Storm Nicole was a short-lived and unusually asymmetric tropical cyclone that caused extensive flooding in Jamaica during the 2010 Atlantic hurricane season.\nSource:https://en.wikipedia.org/wiki/Tropical_Storm_Nicole_(2010)"""
>>> clean_text = preprocess_text(text, lowercase=True,
...                                         no_punct=True,
...                                         no_numbers=True,
...                                         no_stopwords='en',
...                                         no_urls=True,
...                                         replace_with='')
>>> clean_text
'tropical storm nicole short lived unusually asymmetric tropical cyclone caused extensive flooding jamaica atlantic hurricane season source'
>>> tokenized_text = tokenize(clean_text)
>>> tokenized_text
['tropical', 'storm', 'nicole', 'short', 'lived', 'unusually', 'asymmetric', 'tropical', 'cyclone', 'caused', 'extensive', 'flooding', 'jamaica', 'atlantic', 'hurricane', 'season', 'source']
>>> lemma_text = lemmatize_english_tokens(tokenized_text)
>>> lemma_text
['tropical', 'storm', 'nicole', 'short', 'live', 'unusually', 'asymmetric', 'tropical', 'cyclone', 'cause', 'extensive', 'flooding', 'jamaica', 'atlantic', 'hurricane', 'season', 'source']

Make HTML documentation

To build the HTML Sphinx documentation, first run the following at the nautilus_nlp root path to generate the .rst files:

sphinx-apidoc -f nautilus_nlp -o docs/

You can then generate the documentation with:

cd docs && make html

You can now open the file index.html located in docs/_build/html.

Project Organization


├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data               <- Scripts & bits to download datasets to try nautilus
│   ├── external
│   ├── interim
│   ├── processed
│   └── raw
├── docker             <- Where to build a docker image using this lib
├── docs               <- Sphinx HTML documentation
│   ├── _build
│   │   └── html
│   └── source
├── models
├── nautilus_nlp       <- Main Nautilus Package. This is where the code lives
│   ├── config
│   ├── data
│   ├── models
│   ├── preprocessing
│   ├── scripts
│   └── utils
├── notebooks          <- Various notebooks explaining how to use Nautilus_NLP library
├── tests              <- Where the tests live
│   └── testfolder_fileloader
├── wiki               <- Where the Markdown for the Wiki lives
├── setup.py           <- makes project pip installable (pip install -e .) so nautilus_nlp can be imported
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
                          generated with `pip freeze > requirements.txt`