Update README.md
momandine committed Aug 12, 2013
1 parent ab00d10 commit 680f051
Showing 1 changed file with 14 additions and 3 deletions.
17 changes: 14 additions & 3 deletions README.md
@@ -1,13 +1,22 @@
Collins's NLP Lab II: PROBABILISTIC CONTEXT-FREE GRAMMARS
========================================================

The goal is to generate parse trees for English sentences, which in this case happen to be trivia questions.
My job was to estimate, from training data and with smoothing, the probabilities of the rules that govern how the tree branches,
and then to implement the CKY dynamic programming algorithm, which recursively finds the subtree with the maximum probability
given my estimated probabilities.
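
As an illustration of the CKY step described above, here is a minimal sketch; it is not the code in cky_algo.py. It assumes a grammar in Chomsky normal form, trees written as `[label, first_subtree, second_subtree]` with the subtrees in sentence order, and a root symbol `"S"`. The names `cky`, `unary_q`, `binary_q` and `start` are hypothetical.

```python
from collections import defaultdict


def cky(words, unary_q, binary_q, start="S"):
    """Return (probability, tree) for the most probable parse, or (0.0, None).

    unary_q:  {(X, word): probability} for rules X -> word   (assumed layout)
    binary_q: {(X, Y, Z): probability} for rules X -> Y Z    (assumed layout)
    """
    n = len(words)
    if n == 0:
        return 0.0, None

    # best[(i, j)][X] = (probability, tree) of the best X spanning words[i..j]
    best = defaultdict(dict)

    # Base case: single-word spans are covered by the tag -> word rules.
    for i, word in enumerate(words):
        for (tag, w), q in unary_q.items():
            if w == word:
                best[(i, i)][tag] = (q, [tag, word])

    # Longer spans: try every split point and every binary rule,
    # keeping only the highest-probability entry per nonterminal.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for split in range(i, j):
                left = best.get((i, split), {})
                right = best.get((split + 1, j), {})
                for (x, y, z), q in binary_q.items():
                    if y in left and z in right:
                        p = q * left[y][0] * right[z][0]
                        if p > best[(i, j)].get(x, (0.0, None))[0]:
                            best[(i, j)][x] = (p, [x, left[y][1], right[z][1]])

    return best[(0, n - 1)].get(start, (0.0, None))
```

Looping over every rule for every split is the simplest way to write this; indexing the binary rules by their left-hand side would be the obvious speed-up for a real grammar.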

Author: Amandine Lee

Email: [email protected]

Files
------
PYTHON FILES CREATED BY ME:
PYTHON FILES CREATED BY ME **i.e. the most important parts**:

1. replace_rare_tree.py - Can be imported for member functions or run as a script for the given data files. Takes a JSON nested list representing a lexical tree (the training data) and a text file with the counts output by count_cfg_freq.py. Tallies the words that occur with a given tag < 5 times, creates a new training JSON file with those words replaced by '_RARE_'
1. replace_rare_tree.py - Can be imported for its functions or run as a script on the given data files. Takes a JSON nested list representing a parse tree (the training data) and a text file with the counts output by count_cfg_freq.py. Tallies the words that occur with a given tag fewer than 5 times and creates a new training JSON file with those words replaced by '_RARE_'.
2. probability_generator.py - A class that stores the counts of different rules from training data, and can be called to calculate probabilities from those counts.
3. cky_algo.py - Script that implements the CKY algorithm, calculating the most probable parse trees for newline-separated sentences, and writes JSON-encoded trees to a file.
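
The rare-word replacement in item 1 could be sketched roughly as follows. This is a simplified illustration, not the contents of replace_rare_tree.py: it tallies (tag, word) pairs directly from the trees instead of reading the counts file, and the function names and the `threshold` parameter are hypothetical.

```python
import json
from collections import Counter

RARE = "_RARE_"


def is_leaf(tree):
    """A leaf is stored as ["TAG", "word"]."""
    return len(tree) == 2 and isinstance(tree[1], str)


def count_words(tree, counts):
    """Tally (tag, word) pairs at the leaves of a nested-list tree."""
    if is_leaf(tree):
        counts[(tree[0], tree[1])] += 1
    else:
        for child in tree[1:]:
            count_words(child, counts)


def replace_rare(tree, counts, threshold=5):
    """Return a copy of the tree with infrequent words swapped for _RARE_."""
    if is_leaf(tree):
        tag, word = tree
        return [tag, word if counts[(tag, word)] >= threshold else RARE]
    return [tree[0]] + [replace_rare(c, counts, threshold) for c in tree[1:]]


def rewrite_lines(lines, threshold=5):
    """Take one JSON tree per line and return the rewritten JSON lines."""
    trees = [json.loads(line) for line in lines if line.strip()]
    counts = Counter()
    for tree in trees:
        count_words(tree, counts)
    return [json.dumps(replace_rare(t, counts, threshold)) for t in trees]
```

Called on the lines of a training file, `rewrite_lines` returns the lines of the new training file; the two passes (count first, then replace) are needed because a word's total is not known until every tree has been seen.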

GIVEN PYTHON FILES:

@@ -16,6 +25,7 @@ GIVEN PYTHON FILES:
3. pretty_print_tree.py - Makes indented versions of trees. Takes single-line-tree format files.

GIVEN TEXT FILES:

1. parse_train.dat - Each line represents a sentence, parsed into its parse tree and stored in JSON format as nested lists. In each list, the first element is the node's data (its tag) and the second and third elements are its two branches; the nesting terminates with a terminal (an actual word) and its tag, stored as ["TAG", "word"]
2. cfg.counts - Original counts from training data. Each line represents one piece of data, as: <count> <count-type> <nonterminal/terminal symbols...>
3. parse_dev.dat - Each line is a sentence to be analyzed.
@@ -24,5 +34,6 @@ GIVEN TEXT FILES:
6. tree.example - A single tree in JSON as an example
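
To make the cfg.counts layout concrete, the sketch below reads `<count> <count-type> <symbols...>` lines and turns them into maximum-likelihood rule probabilities, q(X -> rhs) = count(X -> rhs) / count(X). It is an illustration only, not the class in probability_generator.py: the count-type name `NONTERMINAL` and the class and method names are assumptions, and no smoothing beyond whatever `_RARE_` replacement was already applied to the counts is performed.

```python
from collections import defaultdict


class RuleProbabilities:
    """Stores counts read from a counts file and derives probabilities."""

    def __init__(self):
        self.nonterminal = defaultdict(int)  # count(X)
        self.rule = defaultdict(int)         # count(X -> rhs), keyed by (X, *rhs)

    def add_line(self, line):
        """Parse one '<count> <count-type> <symbols...>' line."""
        fields = line.split()
        if len(fields) < 3:
            return  # skip blank or malformed lines
        count, kind, symbols = int(fields[0]), fields[1], tuple(fields[2:])
        if kind == "NONTERMINAL":  # assumed count-type name
            self.nonterminal[symbols[0]] += count
        else:
            # Any other count-type is treated as a rule whose first symbol
            # is the left-hand side.
            self.rule[symbols] += count

    def q(self, *symbols):
        """Probability of the rule symbols[0] -> symbols[1:]."""
        total = self.nonterminal.get(symbols[0], 0)
        if total == 0:
            return 0.0
        return self.rule.get(tuple(symbols), 0) / total
```

For example, after loading lines that record NP four times and the rule NP -> DET NOUN three times, `q("NP", "DET", "NOUN")` returns 0.75.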

GENERATED TEXT FILES:

1. new.counts - Counts with _RARE_ type
2.
