Shreeshrii edited this page Mar 14, 2017 · 46 revisions

Fonts for Tesseract training

Tesseract can be trained on images of text rendered with a list of fonts. Those fonts must be installed on the host where the training process runs.

The required fonts are defined in training/language-specific.sh.
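As a rough illustration of what such a definition can look like, the sketch below assumes the font lists are Bash arrays of quoted font names, one per line with a trailing backslash. EXAMPLE_FONTS and the two font names are hypothetical placeholders, not entries copied from the script.

```shell
# Bash sketch; EXAMPLE_FONTS is a hypothetical name, not one taken from
# training/language-specific.sh.
EXAMPLE_FONTS=(
  "DejaVu Sans" \
  "DejaVu Serif" \
)

# Print one font name per line to confirm that names containing spaces
# stay intact as single array elements.
printf '%s\n' "${EXAMPLE_FONTS[@]}"
```

Quoting each name matters because most font names contain spaces.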

Find Fonts

To find fonts already installed on your system that can render a given training text, use the following command, changing the language code and directory locations to match your setup. The resulting fontslist.txt contains quoted font names, one per line and each followed by a backslash, that can be pasted into training/language-specific.sh.

text2image --find_fonts \
  --fonts_dir /usr/share/fonts \
  --text ../langdata/hin/hin.training_text \
  --min_coverage .9 \
  --outputbase ../langdata/hin/hin \
  |& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/  "/' >../langdata/hin/fontslist.txt
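The filtering stage of that command can be tried on its own. The sketch below wraps the same grep and sed calls in a hypothetical helper, format_font_lines, and feeds it made-up log lines. The sample lines only mimic the "name, then space and colon, then scores containing raw" shape that the sed pattern expects; they are an assumption, not captured text2image output.

```shell
# Hypothetical helper: same grep/sed stages as the command above.
# Keeps only lines mentioning "raw", cuts everything from " :" onward,
# then wraps the remaining font name in quotes with a trailing backslash.
format_font_lines() {
  grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/  "/'
}

# Made-up sample input (assumed shape, not real text2image output).
sample_log='Lohit Devanagari : 2000 hits = 99.00%, raw = 2100 = 98.00%
Total chars = 2150'

# The second line has no "raw" and is dropped by grep.
printf '%s\n' "$sample_log" | format_font_lines
```

Note that `|&` in the original command is a Bash shorthand for `2>&1 |`, so the command needs Bash rather than a plain POSIX shell.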

Font installation

Debian

On Debian GNU/Linux and similar distributions (Linux Mint, Ubuntu, ...), the required fonts can be installed as follows:

# AMHARIC_FONTS (todo)
# ANCIENT_GREEK_FONTS (todo)
# ARABIC_FONTS (todo)
# ARMENIAN_FONTS (todo)
# BENGALI_FONTS (todo)
# BURMESE_FONTS (todo)
# CHI_SIM_FONTS (todo)
# CHI_TRA_FONTS (todo)
# DEVANAGARI_FONTS (see also the external links below)
apt-get install fonts-deva
# EARLY_LATIN_FONTS (todo)
# FRAKTUR_FONTS (todo)
# GEORGIAN_FONTS (todo)
# GREEK_FONTS (todo)
# GUJARATI_FONTS (todo)
# HEBREW_FONTS (todo)
# JPN_FONTS (todo)
# KANNADA_FONTS (todo)
# KHMER_FONTS (todo)
# KOREAN_FONTS (todo)
# KURDISH_FONTS (todo)
# KYRGYZ_FONTS (todo)
# LAOTHIAN_FONTS (todo)
# LATIN_FONTS
apt-get install fonts-dejavu gsfonts ttf-mscorefonts-installer
# MALAYALAM_FONTS (todo)
# NEOLATIN_FONTS (still incomplete)
apt-get install fonts-ebgaramond fonts-gfs-didot fonts-gfs-didot-classic fonts-junicode
# NORTH_AMERICAN_ABORIGINAL_FONTS (todo)
# OLD_GEORGIAN_FONTS (todo)
# ORIYA_FONTS (todo)
# PERSIAN_FONTS (todo)
# PUNJABI_FONTS (todo)
# RUSSIAN_FONTS (todo)
# SINHALA_FONTS (todo)
# SYRIAC_FONTS (todo)
# TAMIL_FONTS (todo)
# TELUGU_FONTS (todo)
# THAANA_FONTS (todo)
# THAI_FONTS (todo)
# TIBETAN_FONTS (todo)
# VERTICAL_FONTS (todo)
# VIETNAMESE_FONTS (todo)

To list the installed fonts, run fc-list. See also the Debian wiki.
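To check quickly whether particular families are present after installation, the fc-list output can be filtered. The sketch below is a rough check under stated assumptions: missing_fonts is a hypothetical helper, the three family names are examples that the packages above are assumed to provide, and matching is by plain substring on family names only, so it does not verify individual styles such as bold or italic.

```shell
# Hypothetical helper: reads installed family names from stdin and prints
# each requested family that does not appear anywhere in that list.
missing_fonts() {
  installed=$(cat)
  for family in "$@"; do
    if ! printf '%s\n' "$installed" | grep -F -q -- "$family"; then
      printf '%s\n' "$family"
    fi
  done
}

# Only query fontconfig when fc-list is available on this host.
# "fc-list : family" prints just the family name(s) of every installed font.
if command -v fc-list >/dev/null 2>&1; then
  fc-list : family | missing_fonts "DejaVu Sans" "EB Garamond" "Junicode"
fi
```

An empty result means every requested family was found. Any name printed is a candidate for another apt-get install.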

Links

Sources of (mostly free) fonts

More information on fonts

Devanagari Fonts

Fraktur Fonts

As of 02/02/2020


These wiki pages are no longer maintained.

All pages were moved to tesseract-ocr/tessdoc.

The latest documentation is available at https://tesseract-ocr.github.io/.

