laser-keep-alive

laser-keep-alive aims to provide a stable runtime environment for Language-Agnostic SEntence Representations (LASER), an open-source project from Facebook AI Research (FAIR).

Installation

Currently, installation is only possible from source.

git clone https://github.com/mingruimingrui/laser-keep-alive.git
cd laser-keep-alive
python setup.py install

To ensure hardware compatibility, an explicit installation of pytorch>=1.0 might be necessary.
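
One way to tell whether an explicit install is needed is to compare the installed version against the 1.0 minimum. The sketch below is only an illustration: `parse_version` and `meets_minimum` are hypothetical helpers, not part of laser-keep-alive, and the check is skipped gracefully when PyTorch is absent.

```python
import importlib.util
import re


def parse_version(version: str) -> tuple:
    """Extract the leading numeric components, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version: str, minimum: tuple = (1, 0)) -> bool:
    """Return True if `version` is at least `minimum`."""
    parsed = parse_version(version)
    # Pad short versions such as '1' so that (1,) compares as (1, 0).
    padded = parsed + (0,) * max(0, len(minimum) - len(parsed))
    return padded >= minimum


if importlib.util.find_spec("torch") is None:
    print("PyTorch is not installed; install a build that matches your hardware.")
else:
    import torch

    print("torch", torch.__version__, "ok:", meets_minimum(torch.__version__))
    print("CUDA available:", torch.cuda.is_available())
```

If the version check fails or CUDA is unexpectedly unavailable, installing a PyTorch build that matches your hardware before running `setup.py` is the likely fix.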

Basic Usage

Script Example

The easiest way to use this package in a Python script is to import the laser.SentenceEncoder class.

from laser import SentenceEncoder

# Loading the model
sent_encoder = SentenceEncoder(
    lang='en',
    model_path=path_to_model_file,
    bpe_codes=path_to_bpe_codes_file,
)

# Encode texts
# Given a List[str]
embeddings = sent_encoder.encode_sentences(list_of_texts)

# Where embeddings is a 2D np.ndarray
# of shape [num_texts, embedding_size]
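
Once you have the embeddings array, a common next step is comparing sentences with each other. The following is a minimal sketch, assuming NumPy is installed. `cosine_similarity_matrix` is a hypothetical helper, not part of the laser package, and random data stands in for real encoder output so the snippet runs without a model.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for a [num_texts, embedding_size] array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Stand-in data; in practice this would be the array returned by
# sent_encoder.encode_sentences(list_of_texts).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((3, 1024)).astype(np.float32)

similarity = cosine_similarity_matrix(embeddings)
print(similarity.shape)  # (3, 3)
```

Entry `[i, j]` of the result compares text `i` with text `j`; the diagonal is each text compared with itself.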

Commandline Tool

laser-keep-alive can also be run directly from the commandline.

$ python -m laser
usage: python -m laser [-h] {encode,filter} ...

Language-Agnostic SEntence Representations

positional arguments:
  {encode,filter}
    encode         Encode a text file line by line
    filter         Filter a parallel corpus based on similarity

optional arguments:
  -h, --help       show this help message and exit

At the moment, the following commandline routines are provided.

`encode`

Encodes a text file line by line into sentence embeddings. The supported output formats are .npy and .csv. With the pretrained model, each embedding has 1024 dimensions. In .npy output, that is 4096 bytes per sentence for np.float32 and 2048 bytes per sentence for np.float16.
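
The byte sizes follow from multiplying the embedding dimension by the width of each value. The sketch below works through that arithmetic with the standard library only; `embedding_bytes` is a hypothetical helper, and the figures cover the raw array data, not the small header that a .npy file also carries.

```python
import struct

EMBEDDING_SIZE = 1024  # dimension stated above for the pretrained model

# struct format codes: 'f' is a 4-byte float32, 'e' is a 2-byte float16.
BYTES_PER_VALUE = {
    "float32": struct.calcsize("f"),
    "float16": struct.calcsize("e"),
}


def embedding_bytes(num_sentences: int, dtype: str = "float32") -> int:
    """Bytes of raw embedding data for `num_sentences` rows (header excluded)."""
    return num_sentences * EMBEDDING_SIZE * BYTES_PER_VALUE[dtype]


print(embedding_bytes(1, "float32"))          # 4096
print(embedding_bytes(1, "float16"))          # 2048
print(embedding_bytes(1_000_000, "float16"))  # 2048000000
```

For a corpus of a million lines, the float16 output is therefore roughly 2 GB, which may help when choosing between the two precisions.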

`filter`

Filters a parallel corpus line by line. Keeps only sentence pairs whose Euclidean distance is below a threshold (default: 1.04). To apply a stricter filter, use a smaller threshold.
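
To illustrate the thresholding idea, here is a minimal sketch using toy vectors in place of real sentence embeddings. It is not the tool's actual implementation: `filter_pairs` is a hypothetical function, and it assumes "below" means strictly less than the threshold. It needs Python 3.8 or later for `math.dist`.

```python
import math
from typing import Iterable, Iterator, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]


def filter_pairs(pairs: Iterable[Pair], threshold: float = 1.04) -> Iterator[int]:
    """Yield the indices of pairs whose Euclidean distance is below `threshold`."""
    for index, (source_vec, target_vec) in enumerate(pairs):
        if math.dist(source_vec, target_vec) < threshold:
            yield index


# Toy 2-D vectors standing in for sentence embeddings.
toy_pairs = [
    ([0.0, 0.0], [0.6, 0.8]),  # distance 1.0 -> kept at the default threshold
    ([0.0, 0.0], [3.0, 4.0]),  # distance 5.0 -> dropped
    ([1.0, 1.0], [1.0, 1.5]),  # distance 0.5 -> kept
]

print(list(filter_pairs(toy_pairs)))        # [0, 2]
print(list(filter_pairs(toy_pairs, 0.75)))  # [2]
```

Lowering the threshold from 1.04 to 0.75 drops the first pair as well, which is the stricter behavior described above.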

Downloading Pretrained Model

Pretrained models are necessary since this repository does not provide training code.

Please refer to the download_models.sh script in this repository to download pretrained models.

Credits

Full credit goes to Holger Schwenk, the author of the LASER toolkit, and to FAIR. For more information regarding FAIR and LASER, please visit their webpages.

FAIR Website: https://ai.facebook.com/
FAIR Github: https://github.com/facebookresearch
LASER Github: https://github.com/facebookresearch/LASER/

If you like this project, please visit the LASER project page and give it a star ⭐.

License

laser-keep-alive is MIT-licensed and LASER is BSD-licensed. If you wish to use laser-keep-alive, please remember to include the copyright notice.

Citation

Please cite Holger Schwenk and Matthijs Douze (also creator of FAISS).

@inproceedings{Schwenk2017LearningJM,
  title={Learning Joint Multilingual Sentence Representations with Neural Machine Translation},
  author={Holger Schwenk and Matthijs Douze},
  booktitle={Rep4NLP@ACL},
  year={2017},
}
