Update README.md

momandine · Aug 12, 2013 · 680f051
1 parent ab00d10
commit 680f051
Showing 1 changed file with 14 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -1,13 +1,22 @@
 Collins's NLP Lab II: PROBABILISTIC CONTEXT FREE GRAMMARS
+========================================================
+
+The goal is to generate parse trees for English sentences, which in this case happen to be trivia questions.
+My job was to estimate the probabilities of the rules governing the branching of the tree, using training data with smoothing,
+and then to implement the CKY dynamic programming algorithm, which recursively finds the subtree with the maximum probability
+given my estimated probabilities.
 
 Author: Amandine Lee
+
 Email: [email protected]
 
 Files
 ------
-PYTHON FILES CREATED BY ME:
+PYTHON FILES CREATED BY ME **i.e. the most important parts**:
 
-1. replace_rare_tree.py - Can be imported for member functions or run as a script for the given data files. Takes a JSON nested list representing a lexical tree (the training data) and a text file with the counts output by count_cfg_freq.py. Tallies the words that occur with a given tag < 5 times, creates a new training JSON file with those words replaced by '_RARE_'
+1. replace_rare_tree.py - Can be imported for member functions or run as a script for the given data files. Takes a JSON nested list representing a parse tree (the training data) and a text file with the counts output by count_cfg_freq.py. Tallies the words that occur with a given tag fewer than 5 times and creates a new training JSON file with those words replaced by '_RARE_'.
+2. probability_generator.py - A class that stores the counts of different rules from training data, and can be called to calculate probabilities from those counts.
+3. cky_algo.py - Script that implements the CKY algorithm, calculating the most probable parse trees for newline-separated sentences, and writes JSON-encoded trees to a file.
 
 GIVEN PYTHON FILES:
 
@@ -16,6 +25,7 @@ GIVEN PYTHON FILES:
 3. pretty_print_tree.py - Makes indented versions of trees. Takes single-line-tree format files.
 
 GIVEN TEXT FILES:
+
 1. parse_train.dat - Each line represents a sentence, parsed into its lexical tree, stored in JSON format. The first is the data, the second the right branch, the third the left branch, until it terminates with a terminal (actual word) and its tag, stored as ["TAG", "word"].
 2. cfg.counts - Original counts from training data. Each line represents one piece of data, as: <count> <count-type> <nonterminal/terminal symbols...>
 3. parse_dev.dat - Each line is a sentence to be analyzed.
@@ -24,5 +34,6 @@ GIVEN TEXT FILES:
 6. tree.example - A single tree in JSON as an example
 
 GENERATED TEXT FILES
+
 1. new.counts - Counts with _RARE_ type
-2. 
+
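
Note on the CKY step this README describes: the following is a minimal sketch of how a most-probable parse can be recovered with CKY for a grammar in Chomsky normal form. It is not the code in cky_algo.py. The function name `cky`, the dictionaries `binary_q` and `unary_q`, the `vocab` set, the default start symbol "S", and the toy grammar are all assumptions made for illustration. Trees are built as [label, left, right] with ["TAG", "word"] leaves; that branch order is also an assumption and may need swapping to match the data files.

```python
import json


def cky(words, binary_q, unary_q, vocab, start="S"):
    """Return (probability, tree) for the most probable parse.

    binary_q maps (X, Y, Z) -> q(X -> Y Z); unary_q maps (X, word) -> q(X -> word).
    Both are hypothetical structures for this sketch. Returns (0.0, None) when
    no parse rooted in `start` exists.
    """
    n = len(words)
    # Words outside the known vocabulary are looked up as _RARE_ (assumed convention).
    tokens = [w if w in vocab else "_RARE_" for w in words]

    pi = {}  # (i, j, X) -> best probability of X spanning words i..j
    bp = {}  # (i, j, X) -> tree achieving that probability

    # Base case: spans of length one use the lexical rules.
    for i, (tok, word) in enumerate(zip(tokens, words)):
        for (tag, w), q in unary_q.items():
            if w == tok:
                pi[i, i, tag] = q
                bp[i, i, tag] = [tag, word]  # keep the original word in the tree

    # Recursive case: build longer spans from shorter ones.
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            for (x, y, z), q in binary_q.items():
                for s in range(i, j):  # split point: Y covers i..s, Z covers s+1..j
                    p = q * pi.get((i, s, y), 0.0) * pi.get((s + 1, j, z), 0.0)
                    if p > pi.get((i, j, x), 0.0):
                        pi[i, j, x] = p
                        bp[i, j, x] = [x, bp[i, s, y], bp[s + 1, j, z]]

    return pi.get((0, n - 1, start), 0.0), bp.get((0, n - 1, start))


# Toy grammar, invented for this example only.
binary_q = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
unary_q = {("NP", "dogs"): 0.5, ("NP", "_RARE_"): 0.5, ("V", "chase"): 1.0}
vocab = {"dogs", "chase"}

prob, tree = cky(["dogs", "chase", "cats"], binary_q, unary_q, vocab)
print(prob, json.dumps(tree))
```

Here "cats" is not in `vocab`, so it is scored with the `_RARE_` rule while the original word is kept in the output tree. For long sentences, summing log probabilities instead of multiplying raw probabilities would avoid floating-point underflow.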