Read and process constituency trees in various formats.
- From crates.io:
cargo install lumberjack-utils
- From GitHub:
cargo install --git https://github.com/sebpuetz/lumberjack
- Convert treebank in NEGRA export 4 format to bracketed TueBa V2 format
lumberjack-conversion --input_file treebank.negra --input_format negra \
--output_format tueba --output_file treebank.tueba --projectivize
- Retain only root node,
NP
s andPP
s and print to simple bracketed format:
echo "NP PP" > filter_set.txt
lumberjack-conversion --input_file treebank.simple --input_format simple \
--output_format tueba --output_file treebank.filtered \
--filter filter_set.txt
- Convert from treebank in simple bracketed to CONLLX format and annotate parent tags of terminals as features.
lumberjack-conversion --input_file treebank.simple --input_format simple\
--output_format conllx --output_file treebank.conll --parent
- Modifications in the following order:
- Reattach all terminals with part-of-speech starting with
$
to the root node - Remove all nonterminals except the root,
S
s,NP
s,PP
s andVP
s - Assign unique identifiers based on the closest
S
to terminals - Insert nodes with label
label
above terminals that aren't dominated byNP
orPP
- Annotate label of parent node on terminals.
- Print to CONLLX format with annotations.
echo "S VP NP PP" > filter_set.txt
echo "NP PP" > insert_set.txt
echo "S" > id_set.txt
lumberjack-conversion --input_file treebank.simple --input_format simple\
--output_format conllx --insertion_set insert_set.txt \
--insertion_label label --id_set id_set.txt --reattach $\
--parent parent --output_file treebank.conllx
- read and projectivize trees from NEGRA format and print to simple bracketed format
use std::io::{BufReader, File};
use lumberjack::io::{NegraReader, PTBFormat};
use lumberjack::Projectivize;
fn print_negra(path: &str) {
let file = File::open(path).unwrap();
let reader = NegraReader::new(BufReader::new(file));
for tree in reader {
let mut tree = tree.unwrap();
tree.projectivize();
println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
}
}
- filter non-terminal nodes from trees in a treebank and print to simple bracketed format:
use lumberjack::{io::PTBFormat, Tree, TreeOps, util::LabelSet};
fn filter_nodes(iter: impl Iterator<Item=Tree>, set: LabelSet) {
for mut tree in iter {
tree.filter_nonterminals(|tree, nt| set.matches(tree[nt].label())).unwrap();
println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
}
}
- convert treebank in simple bracketed format to CONLLX with constituency structure encoded in the features field
use conllx::graph::Sentence;
use lumberjack::io::Encode;
use lumberjack::{Tree, TreeOps, UnaryChains};
fn to_conllx(iter: impl Iterator<Item=Tree>) {
for mut tree in iter {
tree.collaps_unary_chains().unwrap();
tree.annotate_absolute().unwrap();
println!("{}", Sentence::from(&tree));
}
}