TODO: Document this code. The repository is located at https://github.com/asaparov/grammar.
This code implements the generative model of grammar as described in this paper. Given a logical form, the grammar generates a derivation tree top-down, by selecting production rules probabilistically conditioned on the logical form. The leaves of the derivation tree form the tokens of the output utterance.
If you use this code in your research, please cite:
Abulhair Saparov, Vijay Saraswat, and Tom M. Mitchell. 2017. A Probabilistic Generative Grammar for Semantic Parsing. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017).
To use the code, simply download the files into a folder named "grammar". Add this folder to the include path for the compiler, for example by using the -I
flag.
This library depends on core, math, and hdp. The code makes use of C++11
and is regularly tested with gcc 11
but I have previously compiled it with gcc 4.8
, clang 4.0
, and Microsoft Visual C++ 14.0 (2015)
. The code is intended to be platform-independent, so please create an issue if there are any compilation bugs.
grammar.h
contains the definitions of the basic data structures, such as production rules in struct rule
, nonterminals in struct nonterminal
, the grammar in struct grammar
, and the nodes of derivation trees in struct syntax_node
. These data structures are defined as template types, where the logical form type is given by the Semantics
typename parameter, and the distribution over production rules at each nonterminal is given by the RuleDistribution
template parameter.
struct null_semantics
in grammar.h
is an example of a Semantics
type where the logical forms are empty. pcfg_grammar.h
implements a RuleDistribution
type where the distribution over production rules does not depend on the logical form. The test program in pcfg_induction.cpp
uses this rule distribution along with null_semantics
to effectively perform grammar induction on a probabilistic context-free grammar.
hdp_grammar.h
defines struct hdp_rule_distribution
which provides another example of a RuleDistribution
type, where the distribution over production rules is given by a hierarchical Dirichlet process (HDP).
The parser and Metropolis-Hastings sampler (for training) are implemented in parser.h
. Since the structure of the two algorithms are so similar, they are implemented as a single algorithm, where the Mode
template parameter determines whether the algorithm is parsing or sampling. The function parse_result parse(...)
contains the main loop of this algorithm, where at every iteration, the search state is popped from the priority queue (agenda) and processed according to its type. This processing may add new search states to the priority queue.
morphology.h
implements a simple model of word morphology, where each word is modeled as an inflected root. The root may be inflected according to grammatical number, aspect, and tense.
There are two test programs in this repository.
pcfg_induction
creates an instance of the grammar where the logical forms are empty. So the grammar is a probabilistic context-free grammar and is purely syntactic. The program then trains the grammar on a handful of examples created from the reference grammar:E -> 0 E 1
,E -> 0 1
,E -> E E
. The program is defined inpcfg_induction.cpp
and can be compiled usingmake pcfg_induction
.math_induction
learns a grammar where the logical forms are simple arithmetic expressions such as1 + (1 * 0) / 1
. These logical expressions are modeled as simple binary trees, as defined intree_semantics.h
. This program uses an HDP to model the distribution over production rules at each nonterminal (struct hdp_rule_distribution
inhdp_grammar.h
). This program is located inmath_induction.cpp
and can be compiled usingmake math_induction
.