george-steel/maxent-learner

Maxent Phonotactic Learner

A tool for automatically inferring phonotactic grammars from a lexicon and using those grammars to generate random text, based on Hayes and Wilson's "A Maximum Entropy Model of Phonotactics and Phonotactic Learning". This package provides its functionality as a Haskell library, a command line tool (phono-learner-hw), and a GTK-based GUI (phono-learner-hw-gui). The library may be useful if you wish to use a custom set of candidate constraints beyond the generators offered by the two interfaces.

To compile this package, run stack build in the root of this repository (for just the command line tool, run stack build maxent-learner-hw). Compiling the GUI requires GTK3 to be installed (on Windows, use pacman in the MSYS2 environment that comes with stack). If you are initializing a new stack snapshot, you will also need to run stack install gtk2hs-buildtools before compilation of all dependencies can succeed.

Both the main package and the GUI package are also available on Hackage. Precompiled binaries for Windows and OSX are available in the Releases section on GitHub.

Lexicon format

If the collate option is used, raw phonetic text may be used, where whitespace separates words and punctuation is ignored (although it will separate segments if multi-character segments are defined). If the option is not used, words must be on their own lines and whitespace is ignored (but may be used to separate segments). Each word may optionally be followed by a tab character and an integer indicating its frequency.
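The non-collated format described above can be illustrated with a small Python sketch. This is not the tool's own parser (which is written in Haskell); the function names, the default frequency of 1, the greedy longest-match segmentation, and the error raised on unknown segments are all assumptions made for the example.

```python
def segment(chunk, inventory):
    """Split a whitespace-free chunk into known segments.

    Greedy longest-match is an assumption for this sketch, not
    necessarily the strategy used by phono-learner-hw.
    """
    longest = max(map(len, inventory))
    out, i = [], 0
    while i < len(chunk):
        # Try the widest candidate first so multi-character segments win.
        for width in range(min(longest, len(chunk) - i), 0, -1):
            piece = chunk[i:i + width]
            if piece in inventory:
                out.append(piece)
                i += width
                break
        else:
            raise ValueError(f"unknown segment at {chunk[i:]!r}")
    return out


def parse_lexicon_line(line, inventory):
    """Parse 'word' or 'word<TAB>frequency' into (segments, frequency)."""
    word, _, freq_text = line.rstrip("\n").partition("\t")
    # Assumed default: a word without an explicit frequency counts once.
    freq = int(freq_text) if freq_text.strip() else 1
    # Whitespace only separates segments; it is otherwise ignored.
    segments = [s for chunk in word.split() for s in segment(chunk, inventory)]
    return segments, freq


inventory = {"t", "s", "ts", "a"}  # hypothetical segment list
print(parse_lexicon_line("tsa\t12", inventory))  # (['ts', 'a'], 12)
print(parse_lexicon_line("t sa", inventory))     # (['t', 's', 'a'], 1)
```

The second call shows why whitespace matters when multi-character segments are defined: "t sa" forces t and s to be read as separate segments, while "tsa" is read with the affricate ts.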

Grammar format

Grammar files are auto-generated, so they should not need to be edited manually.

Blank lines and lines beginning with # are ignored. The first regular line must contain a length distribution (a list of (Length,Int) pairs). Subsequent lines contain a weight and a rule separated by a space. Rules are represented as a sequence of classes with + and * for repetition, ¬ for inversion, and # to indicate the presence of a word boundary.
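As a rough illustration of that line structure, the following Python sketch splits a grammar file into its length distribution and its weighted rules. It is not the tool's reader: the function name is hypothetical, the length distribution is kept as raw text because its exact syntax is not spelled out here, and the sample grammar is made up for the example.

```python
def read_grammar(text):
    """Return (length_distribution_text, [(weight, rule_text), ...])."""
    # Drop blank lines and comment lines (those beginning with '#').
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    if not lines:
        raise ValueError("grammar has no length distribution line")
    length_dist = lines[0]  # left unparsed on purpose
    rules = []
    for ln in lines[1:]:
        # The weight and the rule are separated by the first space.
        weight_text, _, rule = ln.partition(" ")
        rules.append((float(weight_text), rule))
    return length_dist, rules


# Hypothetical grammar text, for illustration only.
sample = """# comment line

[(1,2),(2,5)]
1.5 [+nasal][-vowel]
0.25 #[+nasal]
"""
dist, rules = read_grammar(sample)
print(dist)   # [(1,2),(2,5)]
print(rules)  # [(1.5, '[+nasal][-vowel]'), (0.25, '#[+nasal]')]
```

Because every rule line starts with its weight, a rule that contains # as a word boundary marker is not mistaken for a comment.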

Feature Table Format

To use a feature table other than the default IPA one, you may define it in CSV format (RFC 4180). The segment names are defined by the first row (they may be any strings as long as they are all distinct, i.e. no duplicate names) and the feature names are defined by the first column (they are not hard-coded). Data cells should contain +, -, or 0 for binary features and + or 0 for privative features (where we do not want a minus set that could form classes).

As a simple example, consider the following CSV file, defining three segments (a, n, and t), and two features (vowel and nasal).

     ,a,n,t
vowel,+,-,-
nasal,0,+,-

If a row contains a different number of cells (separated by commas) than the header line, it is rejected as invalid and does not define a feature (and will not be displayed in the formatted feature table). If the CSV that is entered has duplicate segment names, no segments, or no valid features, the entire table is rejected (indicated by a red border around the text area; green is normal) and the last valid table is used and displayed.
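The acceptance rules above can be sketched in Python as follows. This only illustrates the described behaviour and is not the tool's implementation: the function name, the stripping of whitespace around cells, and returning None for a rejected table are assumptions. Cell values are not checked here.

```python
import csv
import io


def load_feature_table(text):
    """Return (segments, features), or None if the whole table is rejected."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return None
    header = rows[0]
    # The first header cell is the corner cell; the rest name the segments.
    segments = [cell.strip() for cell in header[1:]]
    if not segments or len(set(segments)) != len(segments):
        return None  # no segments, or duplicate segment names
    features = {}
    for row in rows[1:]:
        if len(row) != len(header):
            continue  # wrong cell count: this row defines no feature
        values = [cell.strip() for cell in row[1:]]
        features[row[0].strip()] = dict(zip(segments, values))
    if not features:
        return None  # no valid features
    return segments, features


table = "     ,a,n,t\nvowel,+,-,-\nnasal,0,+,-\nbroken,+,-\n"
segments, features = load_feature_table(table)
print(segments)           # ['a', 'n', 't']
print(sorted(features))   # ['nasal', 'vowel']
```

A caller that mirrors the GUI's behaviour would keep using the previously accepted table whenever this function returns None.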

Command line usage

The command line tool (phono-learner-hw) has two commands: learn, which infers grammars, and gensalad, which generates random text using those grammars. The learn command takes the name of a lexicon file as an argument and outputs a grammar (note that this is quite slow). By default the candidates consist of single classes and bigrams; several more constraint types can be added with options. The gensalad command takes a grammar generated by learn and uses it to generate random text. Both commands can also take global options to output their final results to a file, to use a custom-defined feature table for the generation of natural classes, and to control how text is divided into segments.

The command line works as follows:

phono-learner-hw COMMAND [-t|--featuretable CSVFILE] [-n|--samples N] [-o|--output OUTFILE]

Option                             Description
-t, --featuretable CSVFILE         Use the features and segment list from a feature table in CSV format (a table for IPA is used by default).
-n, --samples N                    Number of samples to use for salad generation.
-o, --output OUTFILE               Record final output to OUTFILE as well as stdout.

phono-learner-hw learn LEXICON [--thresholds THRESHOLDS] [-f|--freqs] [-e|--edges] [-3|--trigrams COREFEATURES] [-l|--longdistance SKIPFEATURES] [GLOBALOPTIONS]

Option                             Description
-c, --collate                      Lexicon file contains raw text with words separated by whitespace.
--thresholds THRESHOLDS            Thresholds to use for candidate selection (default is [0.01, 0.1, 0.2, 0.3]).
-e, --edges                        Allow single classes and bigrams restricted to word boundaries.
-3, --trigrams COREFEATURES        Allow trigrams as long as at least one class is [] or [±x] where x is in COREFEATURES (space separated, in quotes).
-l, --longdistance SKIPFEATURES    Allow long-distance constraints of the form AB+C where A and C are classes and B = [] or [±x] with x in SKIPFEATURES.

phono-learner-hw gensalad GRAMMAR [GLOBALOPTIONS]

Option                             Description
--spaced                           Separate segments with spaces in output.
-s, --shuffle                      Shuffle generated output (sorted by default).

Example usage

The following two commands calculate a grammar from Hayes and Wilson's Shona test data, using their selection of trigram restrictions, and then generate random text from that grammar.

phono-learner-hw learn ShonaLearningData.txt -f -e -3 "syllabic consonantal sonorant" -t ShonaFeatures.csv -w -o ShonaGrammar.txt
phono-learner-hw gensalad ShonaGrammar.txt -t ShonaFeatures.csv -w -o ShonaSalad.txt

OSX build instructions

GTK can be installed using the jhbuild tool from the GTK website. For a quick install, run the following terminal commands:

cd ~
export PATH=~/.local/bin:$PATH
curl http://git.gnome.org/browse/gtk-osx/plain/gtk-osx-build-setup.sh -o gtk-osx-build-setup.sh
chmod +x gtk-osx-build-setup.sh
./gtk-osx-build-setup.sh
jhbuild bootstrap
jhbuild build meta-gtk-osx-bootstrap meta-gtk-osx-gtk3
git clone git://git.gnome.org/gtk-mac-bundler
cd gtk-mac-bundler
make install

Once GTK is installed, enter the GTK environment with the command jhbuild shell. Once inside the shell, navigate to this repository and run the following commands to build and package the app.

stack build
stack install
cd osx-bundle/
gtk-mac-bundler learner.bundle

This will place the command line executable in ~/.local/bin and create an app bundle in your Desktop folder. If this is your first run and stack complains about missing programs, run the following commands before attempting to build again.

stack setup
stack install gtk2hs-buildtools

Copyright © 2016-2017 George Steel and Peter Jurgec.

This project is supported by the University of Toronto Advancing Teaching and Learning in Arts and Science (ATLAS) grant to Peter Jurgec.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
