Zemberek Python Examples

Zemberek Turkish NLP examples written in Python using the JPype package.

Zemberek is a Java-based natural language processing (NLP) tool created for the Turkish language. This repository contains the Python implementations of the official Zemberek examples for learning purposes.
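
To give a feel for how Python reaches the Java library, here is a minimal sketch of a JPype bridge. It is not taken from this repository's examples: the helper name analyze_word is hypothetical, and the sketch assumes that the Zemberek jar sits at bin/zemberek-full.jar (see Getting Started), that a Java runtime is installed, and that JPype1 is available.

```python
from pathlib import Path

# Assumed jar location, matching the layout described in Getting Started.
ZEMBEREK_JAR = Path("bin") / "zemberek-full.jar"


def analyze_word(word):
    """Return the morphological analyses of `word` as a list of strings.

    Hypothetical helper for illustration; requires JPype1, a Java runtime,
    and the Zemberek jar at ZEMBEREK_JAR.
    """
    # Imported lazily so that defining this helper does not require JPype.
    import jpype

    if not jpype.isJVMStarted():
        # Start the JVM with the Zemberek jar on the classpath.
        jpype.startJVM(classpath=[str(ZEMBEREK_JAR)])

    # Look up the Java class by its fully qualified name.
    TurkishMorphology = jpype.JClass("zemberek.morphology.TurkishMorphology")
    morphology = TurkishMorphology.createWithDefaults()

    # The analysis result is a Java Iterable, so Python can loop over it;
    # str() calls each item's Java toString().
    return [str(analysis) for analysis in morphology.analyze(word)]
```

Calling analyze_word("kalemi") from the repository root should return one string per candidate analysis. Creating the morphology object is slow, so real code would build it once and reuse it.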

Folder	Description
classification	fastText examples
core	histogram
morphology	stemming, lemmatization, diacritics analysis, POS tag analysis, morphological analysis, word generation, sentence disambiguation, informal word analysis, adding dictionary items
named-entitiy-recognition	on hold
normalization	document correction, noisy text normalization, spell checking
tokenization	sentence boundary detection, Turkish tokenization

Requirements

Python 3.6+

Getting Started

Clone this repository and cd into it.

$ git clone https://github.com/ozturkberkay/Zemberek-Python-Examples.git
$ cd Zemberek-Python-Examples

Install the required packages. Using virtualenv is highly encouraged!

$ python -m pip install --upgrade pip virtualenv
$ python -m virtualenv .env
$ # Windows: .env\Scripts\activate
$ source .env/bin/activate
$ python -m pip install -r requirements.txt

Download the required Zemberek files:

$ python -m downloader

Alternatively, you can manually download the data and version 0.17.1 of the Zemberek distribution from the official Zemberek Drive folder and place the files in the corresponding folders:

 .
 +-- bin
 |   +-- zemberek-full.jar
 +-- data
 |   +-- classification
 |       +-- news-title-category-set
 |       +-- news-title-category-set.lemmas
 |       +-- news-title-category-set.tokenized
 |   +-- dictionaries
 |   +-- lm
 |       +-- lm.2gram.slm
 |   +-- ner
 |   +-- normalization
 |       +-- ascii-map
 |       +-- lookup-from-graph
 |       +-- split
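
If you place the files by hand, a quick check like the following can confirm that nothing is missing. This is a sketch rather than part of the repository: the function name missing_paths is hypothetical, and the path list is simply transcribed from the tree above.

```python
from pathlib import Path

# Paths transcribed from the directory tree above, relative to the repo root.
EXPECTED_PATHS = [
    "bin/zemberek-full.jar",
    "data/classification/news-title-category-set",
    "data/classification/news-title-category-set.lemmas",
    "data/classification/news-title-category-set.tokenized",
    "data/dictionaries",
    "data/lm/lm.2gram.slm",
    "data/ner",
    "data/normalization/ascii-map",
    "data/normalization/lookup-from-graph",
    "data/normalization/split",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [path for path in EXPECTED_PATHS if not (base / path).exists()]
```

Running print(missing_paths()) from the repository root prints an empty list when every expected file and folder is in place.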

Usage

Run python -m main category.example args, where category.example names the example to run and args are its arguments. For example:

$ python -m main classification.simple_classification "Fenerbahçe bu maçı galibiyet ile sonlandırdı."
...

    News classification example. Trains a new model if there is no model
    available.

    Args:
        sentence (str): Sentence to classify.
    
Sentence: Fenerbahçe bu maçı galibiyet ile sonlandırdı.

Item 1: __label__spor 
Score 1: -0.009194993413984776

Item 2: __label__magazin 
Score 2: -6.12613582611084

Item 3: __label__kültür_sanat 
Score 3: -6.226541996002197
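
The scores are negative and the best match is the one closest to zero. Assuming they are natural-log probabilities (an assumption, not something this README states), they can be converted to ordinary probabilities with exp(). The sketch below uses the three scores from the output above.

```python
import math

# Scores copied from the sample output above.
scores = {
    "__label__spor": -0.009194993413984776,
    "__label__magazin": -6.12613582611084,
    "__label__kültür_sanat": -6.226541996002197,
}

# Assumption: each score is a natural-log probability, so exp() recovers it.
probabilities = {label: math.exp(score) for label, score in scores.items()}

# Print the labels from most to least likely.
for label, probability in sorted(
    probabilities.items(), key=lambda item: item[1], reverse=True
):
    print(f"{label}: {probability:.4f}")
```

Under that assumption, __label__spor carries roughly 99% of the probability mass, which matches the intuition that the sentence is about sports.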

Known Bugs

During model training, fastText prints error messages. Training still completes and the examples work, so the messages can be ignored.

Changelog

2020-12-05
- Automatic downloader for Zemberek files.
- Simple CLI entry-point to run the examples with custom data.
- JPype1 v1.2.0 upgrade. This should fix some memory leak issues.
- Code quality improvements.
- Fixes for broken links.
2019-10-29
- Zemberek v0.17.1 upgrade.
- JPype1 v0.7.0 upgrade.
- Code style changes.
- Bug-fixes.
- License is now the same as Zemberek's (Apache v2.0).
2018-12-01
- Classification, morphology, normalization and tokenization examples.
