langid-benchmark

A benchmark of off-the-shelf models that detect language in text. Contributions welcome.

Goal

  • To measure the speed, accuracy, and memory usage of language detection algorithms for online use cases. Batch use cases are out of scope.

Dataset options

  • Kaggle language detection dataset
    • Languages supported {'Chinese', 'Romanian', 'Persian', 'Korean', 'Pushto', 'Thai', 'Japanese', 'Indonesian', 'Portugese', 'Urdu', 'Swedish', 'Turkish', 'Latin', 'Hindi', 'Arabic', 'Spanish', 'English', 'Dutch', 'Estonian', 'Tamil', 'French', 'Russian'}
    • 22K records
    • There is a known issue: 17 of the ground-truth (GT) labels are wrong, but the impact on the benchmark is negligible. A minimal loading sketch follows this list.
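
A minimal sketch of loading the Kaggle dataset with pandas; the file name and the "Text"/"Language" column names are assumptions about the downloaded CSV, so adjust them to the actual file.

```python
# Hedged sketch: file name and column names are assumptions about the Kaggle CSV.
import pandas as pd

df = pd.read_csv("Language Detection.csv")      # assumed download name
print(len(df), "records")                       # ~22K records expected
print(df["Language"].value_counts())            # label distribution per language
```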

Dataset EDA

  • TBD

How to contribute?

  • Add a new class for each algorithm implementation, using benchmark_langid.py as a reference.
  • Reuse the Language Dictionary
  • Follow the CSV file formats for results so they are easy to collate later.
  • Use psutil resident set size (RSS) for memory usage during load + dummy predict (see usage examples in any of the algorithms, and the sketch after this list).
  • Please submit a PR so we can follow the review process.
  • Collaborators can serve as peer reviewers.
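
Below is a minimal sketch of what a new algorithm class could look like, wrapping the langid package as an example; the class name, detect() return shape, and the RSS-delta approach are illustrative assumptions rather than the repo's exact interface (benchmark_langid.py remains the reference).

```python
# Hedged sketch of an algorithm wrapper; names and the RSS-delta measurement
# are assumptions, not the repo's exact interface.
import time

import langid   # example off-the-shelf detector
import psutil


class BenchmarkLangid:
    def __init__(self):
        proc = psutil.Process()
        rss_before = proc.memory_info().rss
        langid.classify("warm up")                       # load + dummy predict
        rss_after = proc.memory_info().rss
        self.mem_mb = (rss_after - rss_before) / (1024 * 1024)  # RSS delta in MB

    def detect(self, text: str):
        start = time.perf_counter()
        lang, _score = langid.classify(text)             # returns (language, score)
        latency = time.perf_counter() - start
        return lang, latency
```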

Results

  • Run on a fresh GCP e2-medium (2 vCPU, 4 GB memory)
  • Result table (Mean/Max/Min/Median are per-prediction latencies):

| Algorithm    | Mean   | Max    | Min    | Median | Memory    | Accuracy |
|--------------|--------|--------|--------|--------|-----------|----------|
| Langid       | 0.0004 | 0.0687 | 0.0001 | 0.0003 | 34.43 MB  | 0.9543   |
| Fasttext_ftz | 0.0001 | 0.0013 | 0.0000 | 0.0001 | 0.81 MB   | 0.9673   |
| Fasttext_bin | 0.0001 | 0.0004 | 0.0000 | 0.0001 | 130.84 MB | 0.9751   |
| CLD3         | 0.0003 | 0.0024 | 0.0000 | 0.0002 | TBD       | 0.9557   |
| CLD2         | 0.0000 | 0.0004 | 0.0000 | 0.0000 | TBD       | 0.9308   |
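
The table columns can be collated from per-record runs roughly as follows; the CSV layout mirrors the table above, but the exact field names and file handling are assumptions, not the repo's enforced format.

```python
# Hedged sketch: collate per-record latencies and accuracy into one results row.
import csv
import os
import statistics


def write_result_row(path, algorithm, latencies, correct, total, mem_mb):
    row = {
        "algorithm": algorithm,
        "mean": round(statistics.mean(latencies), 4),
        "max": round(max(latencies), 4),
        "min": round(min(latencies), 4),
        "median": round(statistics.median(latencies), 4),
        "mem": f"{mem_mb:.2f} mb",
        "accuracy": round(correct / total, 4),
    }
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if new_file:
            writer.writeheader()          # write the header only once per results file
        writer.writerow(row)
```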
