upparse -- Unsupervised parsing and noun phrase identification

Elias Ponvert <[email protected]>
April 15, 2011

This software contains efficient implementations of hidden Markov models
(HMMs) and probabilistic right linear grammars (PRLGs) for unsupervised
partial parsing (also known as: unsupervised chunking, unsupervised NP
identification, unsupervised phrasal segmentation). These models are
particularly effective at noun phrase identification, and have been evaluated
at that task using corpora in English, German and Chinese.

In addition, this software package provides a driver script to manage a
cascade of chunkers to create full (unlabeled) constituent trees. This
strategy produces state-of-the-art unsupervised constituent parsing results
when evaluated against labeled constituent trees in English, German and
Chinese -- possibly others, those are just the ones we tried.

A description of the methods implemented in this project can be found in the
paper:

  Elias Ponvert, Jason Baldridge and Katrin Erk (2011), "Simple Unsupervised
  Grammar Induction from Raw Text with Cascaded Finite State Models", in
  Proceedings of the 49th Annual Meeting of the Association for Computational
  Linguistics: Human Language Technologies, Portland, Oregon, USA, June 2011.

If you use this system in academic research with published results, please
cite this paper, or use this BibTeX entry:

  @InProceedings{ponvert-baldridge-erk:2011:ACL,
    author    = {Ponvert, Elias and Baldridge, Jason and Erk, Katrin},
    title     = {Simple Unsupervised Grammar Induction from Raw Text with
                 Cascaded Finite State Models},
    booktitle = {Proceedings of the 49th Annual Meeting of the Association
                 for Computational Linguistics: Human Language Technologies},
    month     = {June},
    year      = {2011},
    address   = {Portland, Oregon, USA},
    publisher = {Association for Computational Linguistics}
  }

I. Installation and usage

The core of this system is implemented in Java 6 and makes use of Apache Ant
for project building and JUnit for unit testing. Most system interaction, and
all replication of reported results, is accomplished through a driver script
(scripts/chunk.py) implemented in Python 2.6. The system is designed to work
with the Penn Treebank, the Negra German treebank, and the Penn Chinese
Treebank with minimal data preparation.

The following assumes you're working in a Unix environment and interfacing
with the OS using bash (Bourne again shell); $ indicates the shell prompt.

A. Getting the source

If you are using the compiled distribution of this software, then this
section is not relevant to your needs. Also, if you have already acquired the
source code for this project via a source distribution (a zip or a tarball,
in other words), then you can skip to I.B Installation.

To acquire the most recent changes to this project, use Mercurial SCM; for
info see http://mercurial.selenic.com/ The following assumes you have
Mercurial installed, and hg refers to the Mercurial command, as usual. To
install the most recent version from Bitbucket, run:

  $ hg clone http://bitbucket.org/eponvert/upparse

By default, this will create a new directory called 'upparse'.

B. Installation

Using the command line, make sure you are in the source code directory, e.g.:

  $ cd upparse

To create an executable Jar file, run:

  $ ant jar

And that's it.

C. Using the convenience script chunk.py

For most purposes, including replicating reported results, the convenience
script chunk.py is the easiest way to use the system.
1. Chunking

To run the system on train and evaluation datasets, let's call them
WSJ-TRAIN.mrg and WSJ-EVAL.mrg, the command is:

  $ ./scripts/chunk.py -t WSJ-TRAIN.mrg -s WSJ-EVAL.mrg

You have to be in the project directory to run that command, at present.
Also: the script determines the file type from the file name suffix:

  .mrg  : Penn Treebank merged annotated files (POS and brackets)
  .fid  : Penn Chinese Treebank bracketed files (in UTF-8, see below)
  .penn : Negra corpus in Penn Treebank format

Any other files are assumed to be tokenized, UTF-8, tokens separated by
white-space, one sentence per line.

The command above prints numerical results of the experiment. To actually get
chunker output, use the -o flag to specify an output directory:

  $ ./scripts/chunk.py -t WSJ-TRAIN.mrg -s WSJ-EVAL.mrg -o out

If that directory already exists, you will be prompted to make sure you wish
to overwrite. This command creates the directory named 'out' and puts some
information in there:

  out/README  : some information about the parameters used in this experiment

  out/STATUS  : the iterations and some information about the experiment run;
                to track the progress of the experiment, run

                  $ tail -f out/STATUS

  out/RESULTS : evaluation results of the experiment

  out/OUTPUT  : the output of the model on the test dataset -- unlabeled
                chunk data where chunks are wrapped in parentheses

Typical RESULTS output looks like:

  Chunker-CLUMP Iter-78 : 76.2 / 63.9 / 69.5 ( 7354 / 2301 / 4147 ) [G = 2.46, P = 2.60]
  Chunker-NPS   Iter-78 : 76.8 / 76.7 / 76.7 ( 7414 / 2241 / 2251 ) [G = 2.69, P = 2.60]

This reports on the two evaluations described in the ACL paper: constituent
chunking (here called CLUMP) and NP identification (NPS). Iter-N reports the
number N of iterations of EM required before convergence. The next three
numbers are constituent precision, recall, and F-score respectively (here
76.2 / 63.9 / 69.5). The following three numbers (here 7354 / 2301 / 4147)
are the raw counts of true positives, false positives and false negatives,
respectively. Finally, in the square brackets is constituent length
information (here, G = 2.46, P = 2.60): the average constituent length of the
gold standard annotations (G) and the predicted constituents (P). The
predicted constituent length is the same for the two evaluations, since it's
the same output being evaluated against different annotations. (A short
snippet reproducing these scores from the raw counts is given in section 2
below, after the sample cascade evaluation output.)

The chunk.py script comes with a number of command-line options:

  -h              Print help message
  --help

  -t FILE         File input for training
  --train=FILE

  -s FILE         File input for evaluation and output
  --test=FILE

  -o DIR          Directory to write output
  --output=DIR

  -T X            File type. X may be WSJ, NEGRA, CTB or SPL for
  --input_type=X  sentence-per-line.

  -m MODEL        Model type. This may be:
  --model=MODEL
                  prlg-uni: PRLG with uniform parameter initialization

                  hmm-uni: HMM with uniform parameter initialization

                  prlg-2st: PRLG with "two-stage" initialization. This isn't
                    discussed in the ACL paper, but what this means is that
                    the sequence model is trained in a pseudo-supervised
                    fashion using the output of a simple heuristic model.

                  hmm-2st: HMM with two-stage initialization.

                  prlg-sup-clump: Train a PRLG model for constituent chunking
                    using gold standard treebank annotations; this uses
                    maximum likelihood model estimation with no iterations
                    of EM

                  hmm-sup-clump: Train an HMM model for constituent chunking
                    using gold standard treebank annotations.

                  prlg-sup-nps:
                  hmm-sup-nps: Train PRLG and HMM models for supervised NP
                    identification.
  -f N            Evaluate using only sentences of length N or less.
  --filter_test=N

  -P              Evaluate ignoring all phrasal punctuation as indicators of
  --nopunc        phrasal boundaries. Sentence boundaries are not ignored.

  -r              Run the model as a right-to-left HMM (or PRLG)
  --reverse

  -M              Java memory flag, e.g. -Xmx1g. Other Java options can be
  --memflag=M     specified with this option

  -E X            Run EM until the change in full dataset likelihood (negative
  --emdelta=X     log likelihood) is less than X; default = .0001

  -S X            Smoothing value, default = .1
  --smooth=X

  -c X            The coding used to encode constituents as tags. Options
  --coding=X      include:

                  BIO: Beginning-inside-outside tagset

                  BILO: Beginning-inside-last-outside tagset

                  BIO_GP: Simulate a second-order sequence model by using
                    current-tag/last-tag pairs

                  BIO_GP_NOSTOP: Simulate a second-order sequence model by
                    using current-tag/last-tag pairs, only not using paired
                    tags for STOP symbols (sentence boundaries and phrasal
                    punctuation)

  -I N            Run only N iterations
  --iter=N

  -C              Run a cascade of chunkers to produce tree output; this
  --cascade       produces a different set of output

2. Cascaded parsing

Running the chunk.py script with the -C (--cascade) option creates a cascade
of chunkers to produce unlabeled constituent tree output (or, hierarchical
bracket output). Each of the models in the cascade shares the same parameters
-- smoothing, tagset, etc. -- as specified by the other command-line options.

The -o parameter still instructs the script to write output to a specified
directory, but a different set of output is written. Assuming 'out' is the
specified output directory, several subdirectories of the following form are
created:

  out/cascade00
  out/cascade01
  etc.

Each contains further subdirectories:

  out/cascade01/train-out
  out/cascade01/test-out

These each contain the same chunking information fields as before (OUTPUT,
README, RESULTS, and STATUS, though RESULTS is empty for most). Each cascade
directory also contains updated train and evaluation files, e.g.

  out/cascade01/next-train
  out/cascade01/next-test

These are the datasets modified with pseudowords as stand-ins for chunks, as
described in the ACL paper. The expanded constituency parsing output on the
evaluation data -- in other words, the full bracketing for each level of the
cascade -- is written into each subdirectory, e.g. at

  out/cascade01/test-eval

Empirical evaluation for all levels is ultimately written to out/results.
This is a little difficult to read, since the different levels are not
strictly indicated. But it becomes easier when you filter it by evaluation.
For instance, to get PARSEEVAL evaluation of each level, run:

  $ grep 'asTrees' < out/results
  asTrees asTrees : 53.8 / 16.8 / 25.6
  asTrees asTrees : 53.7 / 25.7 / 34.8
  asTrees asTrees : 51.1 / 30.4 / 38.1
  asTrees asTrees : 50.5 / 32.6 / 39.7
  asTrees asTrees : 50.4 / 32.8 / 39.8
  asTrees asTrees : 50.4 / 32.8 / 39.8

For NP and PP identification at each level:

  $ grep 'NPs Recall' < out/results
  NPs Recall : 19.6 / 30.9 / 24.0
  NPs Recall : 13.0 / 31.6 / 18.5
  NPs Recall : 10.8 / 32.3 / 16.1
  NPs Recall : 10.1 / 33.0 / 15.4
  NPs Recall : 10.0 / 33.0 / 15.4
  NPs Recall : 10.0 / 33.0 / 15.4

  $ grep 'PPs Recall' < out/results
  PPs Recall : 8.1 / 33.6 / 13.1
  PPs Recall : 7.4 / 47.1 / 12.9
  PPs Recall : 6.1 / 47.9 / 10.8
  PPs Recall : 5.9 / 50.5 / 10.6
  PPs Recall : 5.9 / 51.1 / 10.6
  PPs Recall : 5.9 / 51.1 / 10.6

Since these evaluations consider all constituents output by the model, the
interesting metric here is recall (the middle number, here 33.6 to 51.1 for
PP recall).
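For reference, the three-number scores in RESULTS and out/results are
precision / recall / F-score, and the parenthesized counts in RESULTS are
true positives / false positives / false negatives. The following small
Python snippet is not part of the package; it is just a sketch showing how
the scores follow from the raw counts, reproducing the Chunker-CLUMP numbers
from section 1 above:

  # Minimal sketch: recompute precision / recall / F1 from the raw counts
  # reported in a RESULTS line, e.g. "... ( 7354 / 2301 / 4147 ) ..."
  # (true positives / false positives / false negatives).

  def prf(tp, fp, fn):
      """Return (precision, recall, F1) as percentages."""
      p = tp / float(tp + fp)
      r = tp / float(tp + fn)
      f = 2 * p * r / (p + r)
      return 100 * p, 100 * r, 100 * f

  if __name__ == '__main__':
      # Chunker-CLUMP counts from section 1: prints 76.2 / 63.9 / 69.5
      print('%.1f / %.1f / %.1f' % prf(7354, 2301, 4147))
      # Likewise, the first asTrees line above reports P = 53.8, R = 16.8,
      # and F = 2*P*R/(P+R) comes out to about 25.6.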
The last cascade directory -- in this example run, out/cascade06 -- is empty
except for the train-out subdirectory. This serves to indicate that the model
has converged: the model will produce no new constituents at subsequent
cascade levels. So, the final model output on the evaluation data is in the
second-to-last cascade directory, in test-eval. This is sentence-per-line,
with constituents indicated by parentheses. An additional bracket for the
sentence root is added to each sentence. To see a sample, run

  $ head out/cascade05/test-eval

assuming that cascade05 is the second-to-last cascade directory, as here.

D. Running from the Jar file

Running `ant jar` creates an executable jar file, which has much the same
functionality as the chunk.py script, but offers a couple more options. To
use it, run something like:

  $ java -Xmx1g -jar upparse.jar chunk \
      -chunkerType PRLG \
      -chunkingStrategy UNIFORM \
      -encoderType BIO \
      -emdelta .0001 \
      -smooth .1 \
      -output testout \
      -train wsj/train/*.mrg \
      -test wsj/23/*.mrg \
      -trainFileType WSJ \
      -testFileType WSJ

First of all, note that you can pass multiple files to upparse.jar, unlike
the chunk script. Using Bash (indeed, most shells), this means you can use
file name patterns like *.mrg. On the other hand, you have to specify the
file types directly using -trainFileType and -testFileType.

Here are most of the command line options for calling `java upparse.jar
chunk`:

  -chunkerType        HMM or PRLG (see 'model' option above)

  -chunkingStrategy   TWOSTAGE or UNIFORM (see 'model' option above)

  -encoderType T      Use tagset T to encode constituents, e.g. BIO, BILO,
  -G T                etc. (see 'coding' option above)

  -noSeg              Evaluate without using phrasal punctuation (see
                      'nopunc' option above)

  -train FILES        Train using specified files

  -test FILES         Evaluate using specified files

  -trainFileType      WSJ, NEGRA, CTB or SPL (for tokenized
                      sentence-per-line)

  -numtrain N         Train only using the first N sentences

  -filterTrain L      Train only using sentences of length L or less

  -filterTest L       Evaluate only on sentences of length L or less

  -output D           Output results to directory D (see above)

  -iterations N       Train using N iterations of EM

  -emdelta X          Train until EM converges, where convergence is when the
                      percent change in full dataset perplexity (negative log
                      likelihood) is less than X

  -smooth V           Set smoothing value to V (see above, and the ACL paper)

  -evalReportType X   Evaluation report type. Possible values are:
  -E X
                      PR    : Precision, recall and F-score
                      PRL   : Precision, recall, F-score and information
                              about constituent length
                      PRC   : Precision, recall, F-score and raw counts for
                              true positives, false positives and false
                              negatives
                      PRCL  : Precision, recall, F-score, raw counts and
                              constituent length information
                      PRLcsv: PRL output in CSV format to import into a
                              spreadsheet

  -evalTypes E1,E2..  Evaluation types. This also dictates the format used in
  -e E1,E2..          the output. Possible values are:

                      CLUMP : Evaluate using constituent chunks (see ACL
                              paper)
                      NPS   : Evaluate using NP identification
                      PPS   : Evaluate using prepositional phrase
                              identification
                      TREEBANKPREC : Evaluate constituents against a treebank
                              as precision on all treebank constituents. This
                              will also output recall and F-score: ignore
                              these.

  -outputType T       This parameter uses many of the same values as
                      -evalTypes. Values CLUMP, NPS and TREEBANKPREC all
                      output basic chunker output, using parentheses to
                      indicate chunk boundaries.
                      Other output formats are specified with:

                      UNDERSCORE : Output chunks as words separated by
                              underscores
                      UNDERSCORE4CCL : Output chunks as words separated by
                              underscores, and also indicate phrasal
                              punctuation with a semicolon (;) character

  -continuousEval     Track model performance on the evaluation dataset
                      through the learning process -- that is, evaluate after
                      each iteration of EM. This is interesting to see the
                      model's performance improve (or degrade) as it
                      converges. NOTE: when using this option, predictions
                      will be slightly different, since the vocabulary count
                      V incorporates terms only seen in the evaluation
                      dataset. For this reason, we do not use this option in
                      experiments whose numerical results we report in the
                      paper. The different numbers are reported in
                      out/RESULTS.

  -outputAll          Write model output on the evaluation data to disk for
                      each iteration of EM. Each model output is out/Iter-N
                      for iteration N. This can create a lot of files and
                      eats up some amount of disk over time.

  -reverse            Evaluate as a right-to-left sequence model

II. Replicating published results

If you have downloaded this code from a repository, then to replicate the
results reported in the ACL paper cited above, consider using the acl-2011
tag, which indicates the version of the code used to generate the reported
results. To do this, run:

  $ hg up acl-2011

A. Data preparation

In the ACL paper cited above, this system is evaluated using the following
resources:

  (WSJ)   The Wall Street Journal sections of Treebank-3 (The Penn Treebank
          release 3), from the Linguistic Data Consortium, LDC99T42.
          http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42

  (Negra) The Negra (German) corpus from Saarland University.
          http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/

  (CTB)   Chinese Treebank 5.0 from the Linguistic Data Consortium,
          LDC2005T01.
          http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T01

You can use the system on downloaded treebank data directly (with one caveat:
CTB must be converted to UTF-8). The convenience script chunk.py assumes that
training and evaluation datasets are single files, so the following describes
the experimental setup in terms of the choice of these subsets of the data.

1. WSJ

For WSJ we use sections 02-22 for training, section 24 for development and
section 23 for test. Assuming the Penn Treebank was downloaded and unzipped
into a directory called penn-treebank-rel3:

  $ cat penn-treebank-rel3/parsed/mrg/wsj/0[2-9]/*.mrg \
        penn-treebank-rel3/parsed/mrg/wsj/1[0-9]/*.mrg \
        penn-treebank-rel3/parsed/mrg/wsj/2[0-2]/*.mrg \
        > wsj-train.mrg

  $ cat penn-treebank-rel3/parsed/mrg/wsj/24/*.mrg > wsj-devel.mrg

  $ cat penn-treebank-rel3/parsed/mrg/wsj/23/*.mrg > wsj-test.mrg

2. Negra

The Negra corpus comes as one big file. For training we use the first 18602
sentences, for test we use the penultimate 1000 sentences and for development
we use the last 1000 sentences. Assuming that the corpus is downloaded and
unzipped into a directory called negra2:

  $ head -n 663736 negra2/negra-corpus.penn > negra-train.penn

  $ tail -n +663737 negra2/negra-corpus.penn \
      | head -n $((698585-663737)) > negra-test.penn

  $ tail -n +698586 negra2/negra-corpus.penn > negra-devel.penn

3. CTB

Assuming the CTB corpus is downloaded and unzipped into a directory called
ctb5, we use the bracket annotations in ctb5/data/bracketed.
To create the UTF-8 version of this resource that this code works with, use
the included gb2unicode.py script as follows:

  $ mkdir ctb5-utf8

  $ python scripts/gb2unicode.py ctb5/data/bracketed ctb5-utf8

For the train/development/test splits used in the ACL paper, we used the
split of Duan et al. in

  Xiangyu Duan, Jun Zhao, and Bo Xu. 2007. "Probabilistic models for
  action-based Chinese dependency parsing." In Proceedings of ECML/PKDD,
  Warsaw, Poland, September.

Specifically:

  Train: 001-815 ; 1001-1136
  Devel: 886-931 ; 1148-1151
  Test:  816-885 ; 1137-1147

To create these, run:

  $ cat ctb5-utf8/chtb_[0-7][0-9][0-9].fid \
        ctb5-utf8/chtb_81[0-5].fid \
        ctb5-utf8/chtb_10[0-9][0-9].fid \
        ctb5-utf8/chtb_11[0-2][0-9].fid \
        ctb5-utf8/chtb_113[0-6].fid \
        > ctb-train.fid

  $ cat ctb5-utf8/chtb_9[0-2][0-9].fid \
        ctb5-utf8/chtb_93[01].fid \
        ctb5-utf8/chtb_114[89].fid \
        ctb5-utf8/chtb_115[01].fid \
        > ctb-dev.fid

(Note: there are no files 886-900.)

  $ cat ctb5-utf8/chtb_81[6-9].fid \
        ctb5-utf8/chtb_8[2-7][0-9].fid \
        ctb5-utf8/chtb_88[0-5].fid \
        ctb5-utf8/chtb_113[7-9].fid \
        ctb5-utf8/chtb_114[0-7].fid \
        > ctb-test.fid

Quick tip: keep the file extensions that are used in the corpus files
themselves: .mrg for WSJ, .penn for Negra, and .fid for CTB. The chunk.py
script uses these extensions to guess the corpus type, if not specified
otherwise.

B. Replicating chunking results

Assuming TRAIN is the training file (for WSJ, Negra or CTB), and TEST is the
test file (or development file), then chunking results from the ACL paper are
replicated by executing the following:

for the PRLG:

  $ ./scripts/chunk.py -t TRAIN -s TEST -m prlg-uni

for the HMM:

  $ ./scripts/chunk.py -t TRAIN -s TEST -m hmm-uni

But these commands just print out final evaluation numbers. To see the output
of the runs, choose an output directory (e.g. testout -- but don't create the
directory) and run:

for the PRLG:

  $ ./scripts/chunk.py -t TRAIN -s TEST -m prlg-uni -o testout

for the HMM:

  $ ./scripts/chunk.py -t TRAIN -s TEST -m hmm-uni -o testout

Several files are created in the testout directory:

  testout/RESULTS is a text file with the results of the experiment

  testout/README  is a text file with some information about the parameters
                  used

  testout/STATUS  is the output of the experimental run (progress of the
                  experiment, the iterations and the model's estimate of the
                  dataset complexity for each iteration of EM), and any error
                  output. While running experiments, you can track progress
                  by running

                    $ tail -f testout/STATUS

  testout/OUTPUT  is the final output of the system on the TEST dataset.

To run experiments on the datasets while completely ignoring punctuation, use
the -P flag, e.g.:

  $ ./scripts/chunk.py -t TRAIN -s TEST -m prlg-uni -P

To vary the degree of smoothing, use the -S flag, e.g.:

  $ ./scripts/chunk.py -t TRAIN -s TEST -m prlg-uni -S .001

C. Replicating cascaded-chunking/parsing results

The same chunk.py script is used to drive the cascaded-chunking experiments,
using the -C flag. An output directory is required; if none is specified, the
script will use the directory named 'out' by default. All of the options for
chunking are available for cascaded chunking, since each step in the cascade
is a chunker initialized as before. In fact, for each step in the cascade,
the chunker is run twice: once to generate training material for the next
step in the cascade, and once to run on TEST for evaluation. (This is
obviously not an optimal setup, since the chunker has to train twice at each
level in the cascade.)
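To make the data flow between cascade levels concrete, here is a toy Python
sketch of the idea described above and in section I.C.2: after each chunking
pass, every chunk is collapsed into a single pseudoword and the shorter
sequence is fed to the next level. This is not the actual driver code --
chunk_level below is a made-up stand-in that just pairs adjacent tokens,
whereas the real system retrains an HMM or PRLG chunker at every level and
writes next-train / next-test files.

  # Toy illustration of the cascade: chunk, collapse chunks into
  # pseudowords, and repeat on the shorter sequence until it stops shrinking.

  def chunk_level(tokens):
      """Hypothetical chunker stand-in: groups adjacent tokens in pairs."""
      chunks = []
      i = 0
      while i < len(tokens):
          chunks.append(tokens[i:i + 2])
          i += 2
      return chunks

  def cascade(tokens):
      """Repeatedly chunk and collapse chunks into pseudowords."""
      level = 0
      while len(tokens) > 1:
          chunks = chunk_level(tokens)
          print('level %d: %s' % (level, chunks))
          # Each multi-word chunk becomes one pseudoword for the next level
          tokens = ['_'.join(c) if len(c) > 1 else c[0] for c in chunks]
          level += 1

  cascade('the dog barked at the cat'.split())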
This script will print numerical results to the screen as it operates. For
info about the final model output and the saved results file, see above.

III. Extending and contributing

upparse is open-source software, under an Apache license. The project is
currently hosted on Bitbucket:

  https://bitbucket.org/eponvert/upparse

From there you can download the source or clone the project. There is an
issue tracker for this project available at

  https://bitbucket.org/eponvert/upparse/issues

There is also a project Wiki at

  https://bitbucket.org/eponvert/upparse/wiki

though this resource contains much the same information as this README.

This code is released under the Apache License Version 2.0. See LICENSE.txt
for details.