A tool for automatically inferring phonotactic grammars from a lexicon and using those grammars to generate random text, based on Hayes and Wilson's A Maximum Entropy Model of Phonotactics and Phonotactic Learning. This package provides its functionality as a Haskell library, a command line tool (phono-learner-hw), and a GTK-based GUI (phono-learner-hw-gui). The library may be useful if you wish to use a custom set of candidate constraints beyond the generators offered by the two interfaces.
To compile this package, run stack build in the root of this repository (for just the command line tool, run stack build maxent-learner-hw). Compiling the GUI requires GTK3 to be installed (on Windows, use pacman in the msys2 environment that comes with stack). If you are initializing a new stack snapshot, you will need to run stack install gtk2hs-buildtools before compilation of all dependencies can succeed.
Both the main package and the GUI package are also available on Hackage. Precompiled binaries for Windows and OSX are available in the Releases section on GitHub.
If the collate option is used, raw phonetic text may be used, where whitespace separates words and punctuation is ignored (although it will separate segments if multi-character segments are defined). If the option is not used, words must be on their own lines and whitespace is ignored (but may be used to separate segments). Each word may optionally be followed by a tab character and an integer indicating its frequency.
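The non-collated format described above can be sketched in Python as follows. This is only an illustration, not the tool's own parser: the function name `parse_lexicon` is hypothetical, a missing frequency is assumed to mean 1, and a word written without internal whitespace is assumed to consist of single-character segments (the real tool can recognise multi-character segments).

```python
def parse_lexicon(text):
    """Parse a non-collated lexicon: one word per line, optional TAB + frequency."""
    entries = []
    for line in text.splitlines():
        # Everything before the first tab is the word; the rest is the frequency.
        word, _, freq = line.partition("\t")
        word = word.strip()
        if not word:
            continue  # assumption: lines without a word are skipped
        if any(ch.isspace() for ch in word):
            # Whitespace inside a word only separates segments.
            segments = tuple(word.split())
        else:
            # Assumption: without separators, each character is one segment.
            segments = tuple(word)
        # Assumption: a word with no frequency column counts once.
        count = int(freq) if freq.strip() else 1
        entries.append((segments, count))
    return entries
```

For example, `parse_lexicon("t a n\t3\nnat\n")` yields one entry with frequency 3 and one with the assumed default frequency of 1.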
Grammar files are auto-generated, so they should not need to be edited manually.
Blank lines and lines beginning with # are ignored.
The first regular line must contain a length distribution (a list of (Length,Int) pairs).
Subsequent lines contain a weight and a rule separated by a space. Rules are represented as a sequence of classes with + and * for repetition, ¬ for inversion, and # to indicate the presence of a word boundary.
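The line structure of a grammar file can be read with a short Python sketch like the one below. It is not the tool's own reader: the function name `read_grammar` is hypothetical, the length distribution is kept as an unparsed string because its exact syntax is not specified here, and rules are kept as raw text rather than being parsed into classes.

```python
def read_grammar(text):
    """Split a grammar file into its length-distribution line and weighted rules."""
    # Blank lines and lines beginning with '#' are ignored.
    regular = [ln for ln in text.splitlines()
               if ln.strip() and not ln.startswith("#")]
    if not regular:
        raise ValueError("grammar has no length distribution line")
    # The first regular line holds the length distribution (left unparsed).
    length_dist = regular[0]
    rules = []
    for ln in regular[1:]:
        # Each later line is "<weight> <rule>", split at the first space.
        weight, rule = ln.split(" ", 1)
        rules.append((float(weight), rule))
    return length_dist, rules
```

A rule line always starts with its weight, so a rule that itself contains # (a word boundary) is not mistaken for a comment.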
To use a feature table other than the default IPA one, you may define it in CSV format (RFC 4180). The segment names are defined by the first row (they may be any strings as long as they are all distinct) and the feature names are defined by the first column (they are not hard-coded). Data cells should contain +, -, or 0 for binary features and + or 0 for privative features (where we do not want a minus set that could form classes).
As a simple example, consider the following CSV file, defining three segments (a, n, and t), and two features (vowel and nasal).
,a,n,t
vowel,+,-,-
nasal,0,+,-
If a row contains a different number of cells (separated by commas) than the header line, it is rejected as invalid and does not define a feature (and will not be displayed in the formatted feature table). If the CSV that is entered has duplicate segment names, no segments, or no valid features, the entire table is rejected (indicated by a red border around the text area; green is normal) and the last valid table is used and displayed.
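The acceptance rules above can be sketched in Python with the standard csv module. This is a minimal illustration rather than the tool's implementation: the function name `parse_feature_table` is hypothetical, `None` stands in for a rejected table, and cell values are not checked because the behaviour for cells other than +, -, and 0 is not described here.

```python
import csv
import io


def parse_feature_table(text):
    """Return (segments, features), or None if the whole table is rejected."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return None
    header = rows[0]
    segments = header[1:]  # the first cell of the header is the empty corner
    # Reject the table if there are no segments or any duplicate names.
    if not segments or len(set(segments)) != len(segments):
        return None
    features = {}
    for row in rows[1:]:
        # A row with a different cell count than the header defines no feature.
        if len(row) != len(header):
            continue
        features[row[0]] = dict(zip(segments, row[1:]))
    # Reject the table if no row defined a valid feature.
    return (segments, features) if features else None
```

Applied to the three-segment example above, this returns the segments a, n, and t together with the features vowel and nasal. A caller that mirrors the GUI would keep using the last table that parsed successfully whenever this returns `None`.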
The command line tool (phono-learner-hw) has two commands: learn, which infers grammars, and gensalad, which generates random text using those grammars. The learn command takes the name of a lexicon file as an argument and outputs a grammar (note that this is quite slow). By default the candidates consist of single classes and bigrams, and several more constraint types can be added with options. The gensalad command takes a grammar generated by learn and uses it to generate random text. Both commands can also take global options to output their final results to a file, to use a custom-defined feature table for the generation of natural classes, and to control how text is divided into segments.
The command line works as follows:
phono-learner-hw COMMAND [-t|--featuretable CSVFILE] [-n|--samples ARG] [-o|--output OUTFILE]
Option | Description |
---|---|
-t, --featuretable CSVFILE | Use the features and segment list from a feature table in CSV format (a table for IPA is used by default). |
-n, --samples N | Number of samples to use for salad generation. |
-o, --output OUTFILE | Record final output to OUTFILE as well as stdout. |
phono-learner-hw learn LEXICON [--thresholds THRESHOLDS] [-f|--freqs] [-e|--edges] [-3|--trigrams COREFEATURES] [-l|--longdistance SKIPFEATURES] [GLOBALOPTIONS]
Option | Description |
---|---|
-c, --collate | Lexicon file contains raw text with words separated by whitespace. |
--thresholds THRESHOLDS | Thresholds to use for candidate selection (default is [0.01, 0.1, 0.2, 0.3] ). |
-e, --edges | Allow single classes and bigrams restricted to word boundaries. |
-3, --trigrams COREFEATURES | Allows trigrams as long as at least one class is [] or [±x] where x is in COREFEATURES (space separated in quotes). |
-l, --longdistance SKIPFEATURES | Allows long-distance constraints of the form AB+C where A,C are classes and C = [] or [±x] with x in SKIPFEATURES. |
phono-learner-hw gensalad GRAMMAR [GLOBALOPTIONS]
Option | Description |
---|---|
--spaced | Separate segments with spaces in output. |
-s, --shuffle | Shuffle generated output (sorted by default). |
The following two commands calculate a grammar from Hayes and Wilson's Shona test data, using their selection of trigram restrictions, and then generate random text from that grammar.
phono-learner-hw learn ShonaLearningData.txt -f -e -3 "syllabic consonantal sonorant" -t ShonaFeatures.csv -w -o ShonaGrammar.txt
phono-learner-hw gensalad ShonaGrammar.txt -t ShonaFeatures.csv -w -o ShonaSalad.txt
On OSX, GTK can be installed using the jhbuild tool from the GTK website. For a quick install, run the following terminal commands:
cd ~
export PATH=~/.local/bin:$PATH
curl http://git.gnome.org/browse/gtk-osx/plain/gtk-osx-build-setup.sh -o gtk-osx-build-setup.sh
chmod +x gtk-osx-build-setup.sh
./gtk-osx-build-setup.sh
jhbuild bootstrap
jhbuild build meta-gtk-osx-bootstrap meta-gtk-osx-gtk3
git clone git://git.gnome.org/gtk-mac-bundler
cd gtk-mac-bundler
make install
Once GTK is installed, enter the GTK environment with the command jhbuild shell. Once inside the shell, navigate to this repository and run the following commands to build and package the app.
stack build
stack install
cd osx-bundle/
gtk-mac-bundler learner.bundle
This will place the command line executable in ~/.local/bin and create an app bundle in your Desktop folder. If this is your first run and stack complains about missing programs, run the following commands before attempting to build again.
stack setup
stack install gtk2hs-buildtools
Copyright © 2016-2017 George Steel and Peter Jurgec.
This project is supported by the University of Toronto Advancing Teaching and Learning in Arts and Science (ATLAS) grant to Peter Jurgec.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.