TextGrounder: a system for connecting language to space and time.

See `README.textgrounder` for an introduction to TextGrounder and to the Geolocate subproject. This file describes exactly how to run the applications in Geolocate.

========================================
Specifying where the corpora are located
========================================

The corpora are located using the environment variable TG_CORPUS_DIR. If this is unset and TG_GROUPS_DIR is set, TG_CORPUS_DIR will be initialized to the 'corpora' subdirectory of that directory. If you are running directly on Maverick, you can set TG_ON_MAVERICK to 'yes', which will set things up appropriately to use the corpora located in /work/01683/benwing/maverick/corpora.

====================
Running under Hadoop
====================

To run Geolocate under Hadoop with data in HDFS rather than locally, you will need to copy the data into HDFS, which you can do using 'tg-copy-data-to-hadoop'. You need to copy two things: the corpus or corpora you want to run on, and some ancillary data needed for TextGrounder (basically the stop lists). You run the command as follows:

$ tg-copy-data-to-hadoop CORPUS ...

where CORPUS is the name of a corpus, similar to what is specified when running 'tg-geolocate'. Additionally, the pseudo-corpus 'textgrounder' will copy the ancillary TextGrounder data. For example, to copy the Portuguese Wikipedia for March 15, 2012, as well as the ancillary data, run the following:

$ tg-copy-data-to-hadoop textgrounder ptwiki-20120315

Other possibilities for CORPUS are 'geotext' (the Twitter GeoText corpus), any other corpus listed in the corpus directory (TG_CORPUS_DIR), any tree containing corpora (all corpora underneath will be copied and the directory structure preserved), or any absolute path to a corpus or tree of corpora.

Then, run as follows:

$ TG_USE_HDFS=yes tg-geolocate --hadoop geotext output

If you use '--verbose' as follows, you can see exactly which options are being passed to the underlying 'textgrounder' script:

$ TG_USE_HDFS=yes tg-geolocate --hadoop --verbose geotext output

By default, the data copied using 'tg-copy-data-to-hadoop' and referenced by 'tg-geolocate' or 'textgrounder' is placed in the directory 'textgrounder-data' under your home directory on HDFS. You can change this by setting TG_HADOOP_DIR before running 'tg-copy-data-to-hadoop'.

==================
Obtaining the Data
==================

If you don't have the data already (you do if you have access to the Comp Ling machines), download and unzip the processed Wikipedia/Twitter data from http://web.corral.tacc.utexas.edu/utcompling/wing-baldridge-2011/. There are two sets of data to download:

* The processed Wikipedia data, in `enwiki-20100905/`. You need at least the files ending in '.data.txt.bz2' and '.schema.txt'.
* The processed Twitter data, in `twitter-geotext/`. You need at least the subdirectories 'docthresh-*'.

Download the data files. It is generally recommended that you create a directory and set `TG_GROUPS_DIR` to point to it; then put the Wikipedia and Twitter data underneath the `$TG_GROUPS_DIR/corpora` subdirectory. Alternatively, `TG_CORPUS_DIR` can be used to point directly to where the corpora are stored. In both cases, the directory storing the corpora should have subdirectories 'enwiki-20100905' and 'twitter-geotext'; underneath the former should be the downloaded '*.data.txt.bz2' and '*.schema.txt' files, and underneath the latter should be the 'docthresh-*' subdirectories.
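For concreteness, the expected layout looks roughly like the following sketch (the directory you point `TG_GROUPS_DIR` at is your own choice, and the exact file names under `enwiki-20100905` are whatever you downloaded):

{{{
$TG_GROUPS_DIR/
  corpora/
    enwiki-20100905/
      <the downloaded *.data.txt.bz2 and *.schema.txt files>
    twitter-geotext/
      docthresh-*/
}}}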
(Alternatively, if you are running on a UTexas Comp Ling machine, or a machine with a copy of the relevant portions of /groups/corpora and /groups/projects in the same places, set `TG_ON_COMP_LING_MACHINES` to a non-empty value and the relevant variables will be initialized for you. If you are running on the UTexas Maverick cluster, set `TG_ON_MAVERICK` to a non-empty value and the variables will be initialized appropriately for Maverick.)
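As a concrete sketch (the path `~/tg-data` is just an arbitrary example, not a required location), the environment could be set up like this in your shell:

{{{
# arbitrary example location; corpora are then expected under $TG_GROUPS_DIR/corpora
export TG_GROUPS_DIR=$HOME/tg-data

# or point directly at the directory holding the corpora instead:
# export TG_CORPUS_DIR=$HOME/tg-data/corpora

# on the UTexas machines, one of these can be used in place of the above:
# export TG_ON_COMP_LING_MACHINES=yes
# export TG_ON_MAVERICK=yes
}}}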
The Wikipedia data was generated from [http://download.wikimedia.org/enwiki/20100904/enwiki-20100904-pages-articles.xml.bz2 the original English-language Wikipedia dump of September 4, 2010]. The Twitter data was generated from [http://www.ark.cs.cmu.edu/GeoText/ the Geo-tagged Microblog corpus] created by [http://aclweb.org/anthology-new/D/D10/D10-1124.pdf Eisenstein et al. (2010)].

===========================
Replicating the experiments
===========================

The code in Geolocate.scala does the actual geolocating. Although this code is written in Scala (and compiled to JVM bytecode that could conceivably be run directly using `java`), in practice it's much more convenient to use either the `textgrounder` driver script or some other, even higher-level front-end script. `textgrounder` sets up the paths correctly so that all libraries, etc. will be found, and takes an application to run, knowing how to map that application to the actual class that implements it. Each application typically takes various command-line arguments, and `textgrounder` itself also takes various command-line options (given *before* the application name), which mostly control operation of the JVM.

In this case, document geotagging can be invoked directly with `textgrounder` using `textgrounder geolocate-document`, but the normal route is to go through a front-end script. The following is a list of the front-end scripts available:

* `tg-geolocate` is the script you probably want to use. It takes a CORPUS parameter to specify which corpus you want to act on (currently recognized: `geotext`; a Wikipedia corpus, e.g. `enwiki-20120307` or `ptwiki-20120315`; `wikipedia`, which picks some "default" Wikipedia corpus, specifically `enwiki-20100905` (the same one used for the original Wing+Baldridge paper, and quite old by now); and `geotext-wiki`, which is a combination of both the `wikipedia` and `geotext` corpora). This sets up additional arguments to specify the data files for the corpus/corpora to be loaded/evaluated, and the language of the data, e.g. to select the correct stopwords list. The application to run is specified by the `--app` option; if omitted, it defaults to `geolocate-document` (another possibility is `generate-kml`). For the Twitter corpora, an additional option `--doc-thresh NUM` can be used to specify the threshold, i.e. the minimum number of documents that a vocabulary item must be seen in; vocabulary seen in fewer documents than that is ignored (or rather, converted to an OOV token). Additional arguments to both the app and `textgrounder` itself can be given. Configuration values (e.g. indicating where to find Wikipedia and Twitter, given the above environment variables) are read from `config-geolocate` in the TextGrounder `bin` directory; additional site-specific configuration will be read from `local-config-geolocate`, if you create that file in the `bin` directory. There's a `sample.local-config-geolocate` file in that directory giving a sample local config file.

* `tg-generate-kml` is exactly the same as `tg-geolocate --app generate-kml` but easier to type.

* `run-nohup` is a script for wrapping other scripts. The wrapped script is run using `nohup`, so that a long-running experiment will not get terminated if your shell session ends. In addition, starting times and arguments, along with all output, are logged to a file with a unique name that does not already exist, where the name incorporates the name of the underlying script run, the current time and date, an optional ID string (specified using the `-i` or `--id` argument), and possibly an additional number needed to ensure that the file is unique -- it will refuse to overwrite an existing file. This ID is useful for identifying different experiments using the same script. The experiment runner `run-geolocate-exper.py`, which allows iterating over different parameter settings, generates an ID based on the current parameter settings.

* `python/run-geolocate-exper.py` is a framework for running a series of experiments on similar arguments. It was used extensively in running the experiments for the paper.

You can invoke `tg-geolocate wikipedia` with no options, and it will do something reasonable: it will attempt to geolocate the entire dev set of the old English Wikipedia corpus, using KL divergence as a strategy, with a grid size of 1 degree. Options you may find useful (which also apply to `textgrounder geolocate` and all front ends):

* `--degrees-per-cell NUM`, `--dpc NUM`: Set the size of a cell in degrees, which can be a fractional value.

* `--eval-set SET`: Set the split to evaluate on, either "dev" or "test".

* `--strategy STRAT ...`: Set the strategy to use for geolocating. Sample strategies are `partial-kl-divergence` ("KL Divergence" in the paper), `average-cell-probability` ("ACP" in the paper), `naive-bayes-with-baseline` ("Naive Bayes" in the paper), and `baseline` (any of the baselines). You can specify multiple `--strategy` options on the command line, and the specified strategies will be tried one after the other.

* `--baseline-strategy STRAT ...`: Set the baseline strategy to use for geolocating. (It's a separate argument because some strategies use a baseline strategy as a fallback, and in those cases, both the strategy and baseline strategy need to be given.) Sample strategies are `link-most-common-toponym` ("??" in the paper), `num-documents` ("??" in the paper), and `random` ("Random" in the paper). You can specify multiple `--baseline-strategy` options, just like for `--strategy`.

* `--num-training-docs`, `--num-test-docs`: One way of controlling how much work is done. These specify the maximum number of documents (training and testing, respectively) to load/evaluate.

* `--max-time-per-stage SECS`, `--mts SECS`: Another way of controlling how much work is done. Set the maximum amount of time to spend in each "stage" of processing. A value of 300 will load enough to give you fairly reasonable results but not take too much time running.

* `--skip-initial N`, `--every-nth N`: A final way of controlling how much work is done. `--skip-initial` specifies a number of test documents to skip at the beginning before starting to evaluate. `--every-nth` processes only every Nth document rather than all, if N > 1. Used judiciously, these options can be used to split up a long run.

An additional argument specific to the Twitter front ends is `--doc-thresh`, which specifies the threshold (in number of documents) below which vocabulary is ignored. See the paper for more details.
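Putting a few of these options together, a hedged example (the particular parameter values and the ID string are arbitrary illustrations, and this assumes `run-nohup` takes the wrapped command and its arguments directly after its own options):

{{{
# evaluate the GeoText dev set with two strategies at a 5-degree grid,
# wrapped in run-nohup so the run survives the end of the shell session
run-nohup -i kl-vs-acp tg-geolocate geotext \
  --strategy partial-kl-divergence --strategy average-cell-probability \
  --degrees-per-cell 5 --eval-set dev
}}}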
==================
Extracting results
==================

A few scripts are provided to extract the results (i.e. mean and median errors) from a series of runs with different parameters, and output the results either directly or sorted by error distance:

* `extract-raw-results.sh` extracts results from a number of runs of any of the above front-end programs. It extracts the mean and median errors from each specified file, computes the average mean/median error, and outputs a line giving the errors along with relevant parameters for that particular run.

* `extract-results.sh` is similar but also sorts by distance (both median and mean, as well as average mean/median), to see which parameter combinations gave the best results.

===============
Specifying data
===============

Data is specified using the `--input-corpus` argument, which takes a directory. The corpus generally contains one or more "views" on the raw data comprising the corpus, with different views corresponding to differing ways of representing the original text of the documents -- as raw, word-split text (i.e. a list, in order, of the "tokens" that occur in the text, where punctuation marks count as their own tokens and hyphenated words may be split into separate tokens); as unigram word counts (for each unique word -- or more properly, token -- the number of times that word occurred); as bigram word counts (similar to unigram counts, but pairs of adjacent tokens rather than single tokens are counted); etc.

Each such view has a schema file and one or more document files. The schema file is a short file describing the structure of the document files. The document files contain all the data for describing each document, including title, split (training, dev or test) and other metadata, as well as the text or word counts that are used to create the textual distribution of the document. The document files are laid out in a very simple database format, consisting of one document per line, where each line is composed of a fixed number of fields, separated by TAB characters. (E.g. one field would list the title, another the split, another all the word counts, etc.) A separate schema file lists the name of each expected field. Some of these names (e.g. "title", "split", "text", "coord") have pre-defined meanings, but arbitrary names are allowed, so that additional corpus-specific information can be provided (e.g. retweet info for tweets that were retweeted from some other tweet, redirect info when a Wikipedia article is a redirect to another article, etc.). An illustrative sketch of this layout is given after the field lists below.

Additional data files (which are automatically handled by the `tg-geolocate` script) can be specified using `--stopwords-file` and `--whitelist-file`. The stopwords file is a list of stopwords (one per line), i.e. words to be ignored when generating a distribution from the word counts in the counts file. The whitelist file is the logical opposite of the stopwords file; if given, only words in the file will be used when generating a distribution from the word counts in the counts file. You don't normally have to specify these arguments at all, as the whitelist file is optional in any case and a default stopwords file will be retrieved from inside the TextGrounder distribution if necessary.
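For illustration, a minimal sketch of pointing the lower-level application directly at a corpus (the paths are placeholders; in normal use `tg-geolocate` supplies these arguments for you):

{{{
# hypothetical paths -- substitute your own corpus and stopwords locations
textgrounder geolocate-document \
  --input-corpus $TG_CORPUS_DIR/enwiki-20100905 \
  --stopwords-file /path/to/stopwords.english
}}}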
The following is a list of the generally-applicable defined fields:

* `title`: Title of the document. Must be unique within a given corpus, and must be present. If no title exists (e.g. for a unique tweet), but an ID exists, use the ID. If neither exists, make up a unique number or unique identifying string of some sort.

* `id`: The (usually) numeric ID of the document, if such a thing exists. Currently used only when printing out documents. For Wikipedia articles, this corresponds to the internally-assigned ID.

* `split`: One of the strings "training", "dev", "test". Must be present.

* `corpus`: Name of the corpus (e.g. "enwiki-20111007" for the English Wikipedia of October 7, 2011). Must be present. The combination of title and corpus uniquely identifies a document in the presence of documents from multiple corpora.

* `coord`: Coordinates of a document, or blank. If specified, the format is two floats separated by a comma, giving latitude and longitude, respectively (positive for north and east, negative for south and west).

The following is a list of fields specific to Wikipedia:

* `redir`: If this document is a Wikipedia redirect article, this specifies the title of the document redirected to; otherwise, blank. Its main use in document geotagging is in computing the incoming link count of a document (see below).

* `incoming_links`: Number of incoming links, or blank if unknown. This specifies the number of links pointing to the document from anywhere else. This is primarily used as part of certain baselines (`internal-link` and `link-most-common-toponym`). Note that the actual incoming link count of a Wikipedia article includes the incoming link counts of any redirects to that article.

* `namespace`: For Wikipedia articles, the namespace of the article. Articles not in the `Main` namespace have the namespace attached to the beginning of the article name, followed by a colon (but not all articles with a colon in them have a namespace prefix). The main significance of this field is that articles not in the `Main` namespace are ignored. For documents not from Wikipedia, this field should be blank.

* `is_list_of`, `is_disambig`, `is_list`: These fields should have the value "yes" or "no". They are Wikipedia-specific fields identifying, respectively, whether the article title is "List of ..."; whether the article is a Wikipedia "disambiguation" page; and whether the article is a list of any type, which includes the previous two categories as well as some others. None of these fields are currently used.
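To make the layout concrete, here is a purely invented fragment of a schema file and a matching document line. The field order, the use of a raw-text view, and all values are illustrative assumptions, not taken from a real corpus; TAB separators are rendered here as whitespace.

{{{
# (annotations like this line are not part of the file format)
# a hypothetical schema file, listing the expected field names in order:
title	id	split	corpus	coord	text

# a matching hypothetical document line (fields separated by TAB characters):
Austin	1998	training	enwiki-20100905	30.267,-97.743	austin is the capital of texas ...
}}}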
====================
Generating KML files
====================

It is possible to generate KML files showing the distribution of particular words over the Earth's surface, using `tg-generate-kml` (e.g. `tg-generate-kml wikipedia --kml-words mountain,beach,war`). The resulting KML files can be viewed using [http://earth.google.com Google Earth]. The only necessary argument is `--kml-words`, a comma-separated list of the words to generate distributions for. Each word is saved in a file named by appending the word to whatever is specified using `--kml-prefix`. Another argument is `--kml-transform`, which is used to specify a function to apply to transform the probabilities in order to make the distinctions among them more visible. It can be one of `none`, `log` and `logsquared` (the latter actually computes the negative of the squared log). The argument `--kml-max-height` can be used to specify the heights of the bars in the graph. It is also possible to specify the colors of the bars in the graph by modifying constants given in `Geolocate.scala`, near the beginning (`class KMLParameters`).

For example, for the Twitter corpus, running at different levels of the document threshold for discarding words, and for the four words "cool", "coo", "kool" and "kewl", the following command plots the distribution of each of the words across cells of size 1x1 degrees. `--mts=300` is more for debugging and stops loading further data for generating the distribution after 300 seconds (5 minutes) have passed. It's unnecessary here but may be useful if you have an enormous amount of data (e.g. all of Wikipedia).

{{{
for x in 0 5 40; do
  tg-geolocate geotext --doc-thresh $x --mts=300 --degrees-per-cell=1 \
    --mode=generate-kml --kml-words='cool,coo,kool,kewl' \
    --kml-prefix=kml-dist.$x.none. --kml-transform=none
done
}}}

Another example, just for the words "cool" and "coo", but with different kinds of transformation of the probabilities:

{{{
for x in none log logsquared; do
  tg-geolocate geotext --doc-thresh 5 --mts=300 --degrees-per-cell=1 \
    --mode=generate-kml --kml-words='cool,coo' \
    --kml-prefix=kml-dist.5.$x. --kml-transform=$x
done
}}}