See `README.textgrounder` for an introduction to TextGrounder and to the
Geolocate subproject. This file describes how to preprocess raw data from
other sources to generate corpora for use in TextGrounder.
=============
Output format
=============
The normal output format from the various preprocessing applications is
called the "textdb" format. This is output by (e.g.) ParseTweets.
This format stores data as a tab-separated database, with one item
per line. Each textdb corpus consists of one or more files containing
data, ending in '.data.txt' (or '.data.txt.bz2' or '.data.txt.gz',
if the data is compressed; the code to read this format automatically
recognizes the compression suffixes and decompresses on-the-fly). In
addition, there is a single file called a "schema", whose name ends in
'.schema.txt'. This specifies the name of each of the fields, as well
as the name and value of any "fixed" fields (which, conceptually,
have the same value for all rows, and are used to store extra info
about the corpus). Generally, there will be multiple output files when
there are multiple reducers in Hadoop, since each reducer produces its
own output file. Storing the data this way makes it easy to read the
files from Hadoop. Large data files are typically stored compressed,
and are automatically uncompressed as they are read in.
The following formats are used for individual fields:
* Integers and floating-point values, output in the obvious way.
* Timestamps, output as long integers specifying milliseconds since the
Unix Epoch (Jan 1, 1970).
* Lat/long coordinates, output as a comma-separated pair of floating-point
values.
* Single strings. URL encoding is used to encode TAB characters (%09 = field
separator), newlines (%0A = record separator), the percent sign (%25), and
any other characters that may be used as separators (e.g. greater-than
signs, spaces, and colons in various cases below).
* Sequences of strings. The individual strings are URL-encoded and separated
by >> signs.
* "Count maps", i.e. maps from strings to integers (e.g. counts of words).
The format is "STRING:COUNT STRING:COUNT ...", where individual strings are
URL-encoded. N-grams use the format "WORD1:WORD2:...:COUNT" for an
individual n-gram. (If additional parameters need to be specified, they
should be given in a format such as ":PARAM:VALUE", with an initial colon.
This facility isn't currently used.)
All text should be encoded in UTF-8.
The schema file is currently in the following format:
-- The first line lists each field name separated by a tab.
-- Additional lines specify "fixed fields", which are conceptually like
normal fields but have the same value for every row. There is one
line per fixed field, with the field name, a tab, and the field value.
Optionally, there may be another tab followed by an "English" version
of the field name, for display purposes.
Possibly at some point it may be necessary to specify the field type as well
as its name.
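As an illustration, the following is a minimal sketch (in Python) of how a
textdb corpus in the format described above might be read. It is not part of
TextGrounder (whose own readers are written in Scala and also handle
compressed and multi-part data files); the file and function names here are
purely illustrative.

    # Sketch only: read a textdb corpus consisting of FOO.schema.txt
    # and FOO.data.txt, in the format described above.
    from urllib.parse import unquote

    def read_schema(schema_file):
        """Return (field_names, fixed_fields) from a *.schema.txt file."""
        with open(schema_file, encoding='utf-8') as f:
            lines = [line.rstrip('\n') for line in f]
        field_names = lines[0].split('\t')
        fixed_fields = {}
        for line in lines[1:]:
            parts = line.split('\t')   # NAME, VALUE, optional "English" name
            fixed_fields[parts[0]] = parts[1]
        return field_names, fixed_fields

    def decode_count_map(value):
        """Decode a count-map field of the form 'STRING:COUNT STRING:COUNT ...'."""
        counts = {}
        for item in value.split(' '):
            if item:
                # rpartition handles n-gram entries like WORD1:WORD2:COUNT
                word, _, count = item.rpartition(':')
                counts[unquote(word)] = int(count)
        return counts

    def read_data(data_file, field_names):
        """Yield one {field: value} dict per row of a *.data.txt file."""
        with open(data_file, encoding='utf-8') as f:
            for line in f:
                yield dict(zip(field_names, line.rstrip('\n').split('\t')))

    # Example usage (file and field names hypothetical):
    fields, fixed = read_schema('mycorpus.schema.txt')
    for row in read_data('mycorpus.data.txt', fields):
        counts = decode_count_map(row['counts'])   # assuming a 'counts' field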
========================
Downloading Twitter data
========================
There's a script called `twitter-pull` inside of the Twools package
(https://bitbucket.org/utcompling/twools), which can be used for downloading
tweets from the Twitter Streaming API. This automatically handles retrying
after errors along with exponential backoff (absolutely necessary when doing
Twitter streaming, or you will get locked out temporarily), and starts a new
file after a certain time period (by default, one day), to keep the resulting
BZIP-ped files from getting overly large.
WARNING: As of Jun 11, 2013, the Twitter API used in this script no longer
works. We are in the process of rewriting it to use the newer Twitter APIs.
===============
Parsing Twitter
===============
ParseTweets (opennlp.textgrounder.preprocess.ParseTweets) is a preprocessing
application that parses tweets in various ways, optionally filtering and/or
grouping them by user, time, etc. Numerous options are provided.
ParseTweets can do a lot of things. By default, it reads in tweets in JSON
format (the raw format from Twitter) and converts them into textdb format
(see above). However, it can also handle different formats on input
and output, as well as group or filter the tweets. For example, it is
possible to read in tweets in textdb format, as well as write out tweets in
JSON format when no grouping is done.
When reading tweets in JSON format, duplicate tweets are automatically
filtered out; filtering is by user ID. This is important because Twitter
often returns duplicate tweets, and it also allows overlapping scrapes of
the same source data to be combined. However, when reading data in other
formats, no deduplication is done, because user IDs often aren't available
(e.g. when tweets have been grouped).
------ Input and output formats ------
Input and output formats are specified using '--input-format' and
'--output-format', respectively. Currently, the following formats are handled:
* "textdb" format, on both input and output.
* JSON format, on both input and output, but output in JSON format is not
possible when tweets are grouped.
* "raw-lines" format on input, treating the text of each line as if it
were a tweet. This makes it possible to do n-gram processing and other
such operations on raw text.
* "stats" format on output. This outputs statistics on the tweets in
"textdb" format.
Further control over the fields output in textdb format is possible using
'--output-fields' (specifying which fields to output). Output of n-grams
is possible using the '--max-ngram' option, specifying the maximum size
of n-grams to output (the default is 1, meaning output only unigrams).
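For example, a hypothetical invocation (using the 'textgrounder' front end
described just below) that converts raw JSON tweets into a textdb corpus
containing bigrams as well as unigrams might look like this; the exact
values accepted by '--input-format' and '--output-format' should be checked
against the application's '--help' output:
$ textgrounder --hadoop run opennlp.textgrounder.preprocess.ParseTweets \
    --input-format json --output-format textdb --max-ngram 2 \
    input-tweets output-corpus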
In general, code written using Scoobi should "just work", and require little
or no more effort to get working than anything else that uses Hadoop.
Specifically, you have to compile the code, build an assembly exactly as you
would do for running other Hadoop code (see above), copy the data into HDFS,
and run. For example, you can run `TwitterPull.scala` as follows using
the TextGrounder front end:
$ textgrounder --hadoop run opennlp.textgrounder.preprocess.TwitterPull input output
or you can run it directly using `hadoop`:
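For example, assuming an assembly jar built as for other Hadoop code (the
jar path shown here is only illustrative):
$ hadoop jar /path/to/textgrounder-assembly.jar opennlp.textgrounder.preprocess.TwitterPull input output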
------ Grouping ------
Grouping is specified using '--grouping' and can be done in the following ways:
* No grouping (the default).
* By user. When doing this, the location (lat/long coordinates) of the group
is determined by the earliest tweet in the group that has a location.
(The 'geo-timestamp' field is used to help in tracking this.)
* By time slice. The '--timeslice' argument specifies the size of the time
slices, and all tweets within a given slice are aggregated.
* By file. All tweets in a given input file are aggregated. This is
particularly useful when outputting statistics, or when doing n-gram
processing on raw text.
------ Filtering ------
When filtering tweets, tweets can be filtered either by the presence of
particular words or word sequences in the text of a tweet or based on the
timestamp of the tweet. Arbitrary boolean expressions can be specified to
express more complex filters. Filtering can be done either on the individual
tweet level ('--filter-tweets') or the group level ('--filter-groups'); in
the latter case, a group will be passed through if any tweet in the group
matches the filter. Matching is normally case-insensitive, but case-sensitive
matching is possible using the '--cfilter-tweets' or '--cfilter-groups'
options.
It is possible to filter by group without actually grouping the tweets
together in the output; this is done by specifying the appropriate type of
grouping using '--grouping', and then specifying '--ungrouped-output'.
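For example, a hypothetical invocation that groups tweets by user but keeps
only groups in which some tweet mentions certain words might look like this
(the exact values accepted by '--grouping' and the exact filter syntax
should be checked against the application's '--help' output):
$ textgrounder --hadoop run opennlp.textgrounder.preprocess.ParseTweets \
    --grouping user --filter-groups "austin OR dallas" \
    input-tweets output-corpus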
=================================
Other apps for processing Twitter
=================================
There's a script `python/run-process-geotext` used to preprocess the GeoText
corpus of Eisenstein et al. (2010) into a textdb corpus. This is a
front end for `python/twitter_geotext_process.py` and runs similarly to
the Wikipedia preprocessing code below.
============================
Parsing Wikipedia dump files
============================
*NOTE*: Parsing raw Wikipedia dump files is not easy. A better approach
might have been to download and run the MediaWiki software that generates
the HTML that is actually output. As it is, there may be occasional
errors in processing. For example, the code that extracts the geotagged
coordinate of an article uses various heuristics to find the coordinate
in the various templates that might specify it, and other heuristics to
choose the correct coordinate if there is more than one. In some cases,
this will fetch the correct coordinate even if the MediaWiki software
fails to find it (due to slightly incorrect formatting in the article);
but in other cases, it may find a spurious coordinate. (This happens
particularly for articles that mention a coordinate but don't happen to
be themselves tagged with a coordinate, or when two coordinates are
mentioned in an article. FIXME: We make things worse here by picking the
first coordinate and ignoring `display=title`. See comments in
`get_coord()` about how to fix this.)
------ Quick start ------
Use the following to download a Wikipedia dump and preprocess it to create
a corpus for input to TextGrounder:
$ download-preprocess-wiki WIKITAG
where WIKITAG is something like 'dewiki-20120225', which names an existing
version of Wikipedia on http://dumps.wikipedia.org. This downloads the
given Wikipedia dump into a subdirectory of the current directory
named WIKITAG, and then preprocesses it into a corpus of the format
needed by TextGrounder.
------ How to rerun a single step ------
The 'download-preprocess-wiki' script executes 5 separate steps, in order:
1. download: Download Wikipedia corpus from the web
2. permute: Permute the corpus randomly
3. preprocess: Convert the corpus into preprocessed files
4. convert: Convert into a textdb corpus
5. set-permissions: Set permissions appropriately
These steps can also be run individually by specifying them as additional
arguments to 'download-preprocess-wiki'. For example, to rerun only the
last three steps, do this:
$ download-preprocess-wiki preprocess convert set-permissions
This can be useful when the process crashes due to a bug or runs out
of time on Maverick (esp. when processing the English Wikipedia).
------ In detail ------
There are a large number of steps that go into preprocessing a Wikipedia
dump, and a number of different files containing various sorts of
information. Not all of them actually go into the final corpus normally
used by TextGrounder. Some of the complexity of this process is due to
the fact that it was created bit-by-bit and evolved over time at the same
time that TextGrounder itself did.
The preprocessing process involves downloading the dump file from
http://dumps.wikipedia.org (which has a name like
enwiki-20120211-pages-articles.xml.bz2), and going through a series of
steps that involve generating various files, terminating with files in the
standard TextGrounder corpus format, which will have names similar to the
following:
1. enwiki-20120211-permuted-dev-unigram-counts.txt (the actual data for the
documents in the dev set)
2. enwiki-20120211-permuted-dev-unigram-counts-schema.txt (a small "schema"
file describing the fields contained in the data file, plus certain other
information)
3. a similar pair of files for the training and test sets.
Note that part of the process involves permuting (i.e. randomly reordering)
the articles in the dump file, leading to the creation of a new permuted
dump file with a name like enwiki-20120211-permuted-pages-articles.xml.bz2.
(The order of articles in the raw, unpermuted dump file is highly non-random.)
This is done so that the training, dev and test sets can be extracted in
a simple fashion and so that partial results can be reported during the
testing phase, with an expectation that they will reliably reflect the
final results.
From highest to lowest, there are the following levels of scripts:
LEVEL 1: HIGHEST LEVEL
This consists of the script 'download-preprocess-wiki', which does
everything from start to finish, downloading a dump, permuting it,
preprocessing it into various output files, using them to generate
a TextGrounder corpus, and setting appropriate permissions. It uses
'wget' to download the dump, 'chmod' to set permissions and otherwise
uses scripts from level 2 ('permute-dump', 'preprocess-dump',
'convert-corpus-to-latest').
LEVEL 2: SECONDARY LEVEL
This consists of scripts to execute each of the discrete steps run by
'download-preprocess-wiki'. Each of these scripts in turn consists of
various discrete steps, and runs scripts from level 3 to do them.
The following scripts exist on this level:
1. 'permute-dump' converts a raw dump file into a permuted dump file.
The format of the permuted dump file is exactly the same as the original
dump file but the articles are in a randomized order (which will be
different each time the permuting happens). This is done so that the
later process of splitting the articles into training, dev and test
subsets, and the subsets themselves, will not be affected by any
artifacts resulting from the original non-random ordering of articles
in the raw dump file.
2. 'preprocess-dump' takes a raw dump and processes it in various ways in
order to generate a number of output files. For historical reasons,
a TextGrounder corpus isn't directly generated but rather an older
data format containing equivalent information. Eventually, this should
be fixed.
In addition, 'preprocess-dump' can generate some extra data files
that aren't currently used by TextGrounder, but this isn't done by
default.
'preprocess-dump' uses some lower-level scripts to generate the data
files. Most important among them is 'run-processwiki'.
3. 'convert-corpus-to-latest' converts the older format generated by
'preprocess-dump' to the TextGrounder corpus format.
LEVEL 3: TERTIARY LEVEL
This consists of scripts to execute the substeps needed by the primary
discrete steps of 'download-preprocess-wiki'. Some of
these scripts use even lower-level scripts.
The following scripts exist on this level:
1. 'run-processwiki' is a driver script that can be run in various modes
to generate various sorts of preprocessed data files from the raw
Wikipedia dump. It runs the actual preprocessing script (processwiki.py)
with the appropriate arguments, providing the dump file as standard input.
It is used by 'preprocess-dump' and 'permute-dump'.
2. 'run-permute' implements the dump-file permutation needed by 'permute-dump'.
3. 'run-convert-corpus' implements the conversion done by
'convert-corpus-to-latest'.
LEVEL 4: LOWEST LEVEL
This consists of some lowest-level scripts, particularly those that directly
process a dump file. These scripts take command-line arguments specifying
modes to run in and files containing ancillary information of various sorts,
including information previously generated by the same script when run in a
different mode. The dump file is provided as the standard input of the
scripts rather than through a command-line argument, and the scripts
generate data files on standard output. The scripts have no concept of any
naming conventions for the input or output files; it is the job of
higher-level scripts to keep track of the various files.
1. 'processwiki.py' is the actual script that creates the data files.
2. 'permute_wiki.py' is the actual script that handles permutation of a
dump file, as well as splitting, which is also done on the permuted
dump file. It has three modes:
-- 'permute' generates a permutation of the article metadata file.
-- 'split' splits the raw dump file into separate splits (usually 20),
dividing the articles according to the corresponding split of the
permuted article metadata file but otherwise preserving the original
ordering. This mode is also used on the permuted dump file, to
allow for parallel execution of further steps.
-- 'sort' sorts the articles in a given split file according to the
ordering in the permuted article metadata file.
A fourth step then concatenates the sorted split files. The process is
structured this way so that it can run on a machine with somewhat limited
memory and so that it can be parallelized. (A sketch of this approach
appears just after this list.)
3. 'generate_combined.py' does something like an SQL join, combining
metadata from three separate passes of 'processwiki.py' into a single
combined article metadata file.
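The permute/split/sort/concatenate approach used by 'permute_wiki.py' is
essentially an external (on-disk) sort of randomly reordered records. The
following is a minimal sketch of the idea in Python; it works on generic
one-record-per-line files keyed by their first tab-separated field (assumed
unique) rather than on actual Wikipedia dumps, and none of the names in it
come from the real scripts.

    # Sketch only: permute a large record file without holding it all in memory.
    import random

    def permute_and_split(infile, num_splits=20):
        # Step 1 ('permute'): read only the record keys and permute them.
        with open(infile) as f:
            keys = [line.split('\t', 1)[0] for line in f]
        random.shuffle(keys)
        order = {key: pos for pos, key in enumerate(keys)}
        # Step 2 ('split'): write each record to the piece its permuted
        # position falls into, preserving original order within each piece.
        piece_size = (len(keys) + num_splits - 1) // num_splits
        pieces = [open('%s.split%d' % (infile, i), 'w') for i in range(num_splits)]
        with open(infile) as f:
            for line in f:
                pos = order[line.split('\t', 1)[0]]
                pieces[pos // piece_size].write(line)
        for p in pieces:
            p.close()
        return order, num_splits

    def sort_and_concatenate(infile, outfile, order, num_splits):
        # Steps 3/4 ('sort' + concatenate): each piece fits in memory, so sort
        # it by permuted position and append it to the final output.
        with open(outfile, 'w') as out:
            for i in range(num_splits):
                with open('%s.split%d' % (infile, i)) as f:
                    lines = sorted(f, key=lambda l: order[l.split('\t', 1)[0]])
                out.writelines(lines)

    order, n = permute_and_split('records.txt')
    sort_and_concatenate('records.txt', 'records-permuted.txt', order, n)

The real scripts apply the same idea to a bzipped XML dump, where a "record"
is an entire article and the permuted article metadata file supplies the
ordering.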
FURTHER COMMENTS
For 'run-processwiki' and 'preprocess-dump', the dump file used for input
and the various generated files are stored into and read from the current
directory. 'download-preprocess-wiki' creates a subdirectory of the
current directory with the name of the dump being processed (e.g.
'enwiki-20120317').
When directly running any script below the top-level script
'download-preprocess-wiki', it's strongly suggested that
1) The dump file be made read-only using 'chmod a-w'.
2) A separate directory be created to hold the various generated preprocessed
data files. Either the dump file should be moved into this directory
or a symlink to the dump file created in the directory.
Note also the file 'config-geolocate'. This is normally sourced into the
various scripts, and sets a number of environment variables, including those
naming the dump file and the various intermediate data files produced.
This relies on the following environment variables:
WP_VERSION Specifies which dump file to use, e.g. "enwiki-20111007".
USE_PERMUTED If set to "no", uses the non-permuted version of the dump
file. If set to "yes", always tries to use the permuted
version. If blank or unset, uses the permuted version if it
appears to exist, the non-permuted version otherwise.
MAX_DEV_TEST_SIZE If set to a number, the dev and test sets will have at most
this many instances, with the remainder in the training set.
These variables can be set on the command line used to execute
'run-processwiki'.
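For example, a step of 'run-processwiki' (step names are described below)
might be invoked against the permuted dump like this:
$ WP_VERSION=enwiki-20111007 USE_PERMUTED=yes run-processwiki combined-article-data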
------ In even more detail ------
NOTE: This may be somewhat out of date.
The following describes in detail the set of steps executed by
'download-preprocess-wiki' to generate a corpus. It shows how individual
steps could be rerun if needed. We assume that we are working with the
dump named "enwiki-20111007".
1. A new directory 'enwiki-20111007' is created to hold the downloaded dump
file and all the generated files. The dump is downloaded into this
directory as follows:
wget -nd http://dumps.wikimedia.org/enwiki/20111007/enwiki-20111007-pages-articles.xml.bz2
In all future steps, we work with this new directory as the current
directory.
2. The basic article data file 'enwiki-20111007-document-data.txt' is
created, giving metadata on the articles in the raw dump file
'enwiki-20111007-pages-articles.xml.bz2', one article per line:
USE_PERMUTED=no run-processwiki article-data
Note that, in this and future invocations of 'run-processwiki', we set
the environment variable WP_VERSION to 'enwiki-20111007' to tell it
the prefix of the dump file and the generated output files.
3. A permuted dump file 'enwiki-20111007-permuted-pages-articles.xml.bz2' is
generated. This uses the article data file from the previous step to
specify the list of articles. Basically, this list is permuted randomly,
then divided into 20 parts according to the new order. Then, the dump file
is processed serially article-by-article, outputting the article into one
of the 20 split files, according to the part it belongs in. Then, each
split file is sorted in memory and written out again, and finally the
parts are concatenated and compressed. This avoids having to read the
entire dump file into memory and sort it, which may not be possible.
run-permute all
4. Split the permuted dump into 20 parts, for faster processing. This could
probably re-use the parts from the previous step, but currently it is
written to be an independent operation.
run-processwiki split-dump
Note also that from here on out, we set USE_PERMUTED to 'yes' to make
sure that we use the permuted dump file. ('run-processwiki' attempts to
auto-detect whether to use the permuted file if USE_PERMUTED isn't set,
but it's safer to specify this explicitly.)
5. Generate a combined article data file from the permuted dump file.
run-processwiki combined-article-data
Note that from here on out, we set NUM_SIMULTANEOUS to 20 to indicate that
we should do parallel operation on the 20 split parts. This will spawn off
20 processes to handle the separate parts, except in certain cases where
the parts cannot be processed separately.
This step actually has a number of subparts:
a. Generate 'enwiki-20111007-permuted-article-data.txt', a basic article
data file for the permuted dump. This is like the article data file
generated above for the raw, non-permuted dump. It has one line of
metadata for each article, with a simple database-like format with
fields separated by tab characters, where the first line specifies the
name of each field.
b. Generate 'enwiki-20111007-permuted-coords.txt', specifying the
coordinates (geotags) of articles in the dump file, when such
information can be extracted. Extracting the information involves a
large amount of tricky and tedious parsing of the articles.
c. Generate 'enwiki-20111007-permuted-links-only-coord-documents.txt',
listing, for all articles with coordinates, the count of incoming
internal links pointing to them from some other article. Also contains,
for each such article, a list of all the anchor texts used when pointing
to that article, along with associated counts. (This additional
information is not currently used.)
d. Generate 'enwiki-20111007-permuted-combined-document-data.txt'.
This combines the information from the previous three steps. This
file has the same format as the basic article data file, but has two
additional fields, specifying the coordinates and incoming link count.
Unlike the basic article data file, it only lists articles with
coordinates.
6. Generate 'enwiki-20111007-permuted-counts-only-coord-documents.txt',
the word-counts file for articles with coordinates. This is the remaining
info needed for creating a TextGrounder corpus.
run-processwiki coord-counts
7. Generate some additional files, not currently needed:
a. 'enwiki-20111007-permuted-counts-all-documents.txt': Word counts for
all articles, not just those with coordinates.
b. 'enwiki-20111007-permuted-text-only-coord-documents.txt': The actual
text, pre-tokenized into words, of articles with coordinates.
c. 'enwiki-20111007-permuted-text-all-documents.txt': The actual text,
pre-tokenized into words, of all articles.
8. Remove the various intermediate split files created by the above steps.
9. Generate the corpus files. This involves combining the metadata in
'enwiki-20111007-permuted-combined-document-data.txt' with the word-count
information in 'enwiki-20111007-permuted-counts-only-coord-documents.txt'
into a combined file with one line per document; splitting this data into
three files, one each for the training, dev and test sets; and creating
corresponding schema files.
This is done as follows:
convert-corpus-to-latest enwiki-20111007
------ How to add additional steps or rerun a single low-level step ------
If all the preprocessing has already been done for you, and you simply want
to run a single step, then you don't need to do all of the above steps.
However, it's still strongly recommended that you do your work in a fresh
directory, and symlink the dump file into that directory -- in this case the
*permuted* dump file. We use the permuted dump file for experiments because
the raw dump file has a non-uniform distribution of articles, and so we can't
e.g. count on our splits being uniformly distributed. Randomly permuting
the dump file and article lists takes care of that. The permuted dump file
has a name like
enwiki-20111007-permuted-pages-articles.xml.bz2
For example, if you want to change processwiki.py to generate bigrams, and then
run it to generate the bigram counts, you might do this:
1. Note that there are currently options `output-coord-counts` to output
unigram counts only for articles with coordinates (which are the only ones
needed for standard document geotagging), and `output-all-counts` to
output unigram counts for all articles. You want to add corresponding
options for bigram counts -- either something like
`output-coord-bigram-counts` and `output-all-bigram-counts`, or an option
`--n-gram` to specify the N-gram size (1 for unigrams, 2 for bigrams,
3 for trigrams if that's implemented, etc.). *DO NOT* in any circumstance
simply hack the code so that it automatically outputs bigrams instead of
unigrams -- such code CANNOT be incorporated into the repository, which
means your mods will become orphaned and unavailable for anyone else.
2. Modify 'config-geolocate' so that it has additional sets of environment
variables for bigram counts. For example, after these lines:
COORD_COUNTS_SUFFIX="counts-only-coord-documents.txt"
ALL_COUNTS_SUFFIX="counts-all-documents.txt"
you'd add
COORD_BIGRAM_COUNTS_SUFFIX="bigram-counts-only-coord-documents.txt"
ALL_BIGRAM_COUNTS_SUFFIX="bigram-counts-all-documents.txt"
Similarly, after these lines:
OUT_COORD_COUNTS_FILE="$DUMP_PREFIX-$COORD_COUNTS_SUFFIX"
OUT_ALL_COUNTS_FILE="$DUMP_PREFIX-$ALL_COUNTS_SUFFIX"
you'd add
OUT_COORD_BIGRAM_COUNTS_FILE="$DUMP_PREFIX-$COORD_BIGRAM_COUNTS_SUFFIX"
OUT_ALL_BIGRAM_COUNTS_FILE="$DUMP_PREFIX-$ALL_BIGRAM_COUNTS_SUFFIX"
And then you'd do the same thing for IN_COORD_COUNTS_FILE and
IN_ALL_COUNTS_FILE.
3. Modify 'run-processwiki', adding new targets ("steps")
'coord-bigram-counts' and 'all-bigram-counts'. Here, you would just
copy the existing lines for 'coord-counts' and 'all-counts' and modify
them appropriately.
4. Now finally you can run it:
WP_VERSION=enwiki-20111007 run-processwiki coord-bigram-counts
This generates the bigram counts for geotagged articles -- the minimum
necessary for document geotagging.
Actually, since the above might take a while and generate a fair amount
of diagnostic output, you might want to run it in the background
under nohup, so that it won't die if your terminal connection suddenly
dies. One way to do that is to use the TextGrounder 'run-nohup' script:
WP_VERSION=enwiki-20111007 run-nohup --id do-coord-bigram-counts run-processwiki coord-bigram-counts
Note that the '--id do-coord-bigram-counts' is optional; all it does is
insert the text "do-coord-bigram-counts" into the name of the file in which
stdout and stderr output is stored. This file will have a name beginning
'run-nohup.' and ending with a timestamp. The beginning and ending of the
file will indicate the starting and ending times, so you can see how long
it took.
If you want to generate bigram counts for all articles, you could use a
similar command line, although it might take a couple of days to complete.
If you're on Maverick, where you only have 24-hour time slots (FIXME:
This was true under Longhorn -- is it still the case for Maverick?),
you might consider using the "divide-and-conquer" mode. The first thing
is to split the dump file, like this:
WP_VERSION=enwiki-20111007 run-processwiki split-dump
This takes maybe 45 mins and splits the whole dump file into 20 pieces.
(Controllable through NUM_SPLITS.)
Then, for each operation you want to do in divide-and-conquer mode, run it
with NUM_SIMULTANEOUS set to something more than 1, e.g.
WP_VERSION=enwiki-20111007 NUM_SIMULTANEOUS=20 run-processwiki all-bigram-counts
(although you probably want to wrap it in 'run-nohup'). Essentially,
this runs 20 simultaneous run-processwiki processes (which fits well with
the workhorse Maverick machines, since they are 20-core), one on each of
the 20 splits, and then concatenates the results together at the end.
You can set a NUM_SIMULTANEOUS that's lower than the number of splits,
and you get only that much simultaneity.
------ Input files, intermediate files, output files ------
The following are the files in a directory, from the perspective of a dump
named `enwiki-20131104`, meaning it's a dump of the English Wikipedia made
approximately November 4, 2013.
1. `enwiki-20131104-pages-articles.xml.bz2`
Raw dump of article text, downloaded from
http://download.wikimedia.org/enwiki/20131104/
Coordinates (geotags) of articles are extracted from this text when such
information is present; extracting it involves a large amount of tricky
and tedious parsing of the articles.
Parsing is done using `preprocess-dump`, its subscript `run-processwiki`,
and the actual script `processwiki.py` that does the work, along with
various ancillary scripts.
2. `enwiki-20131104-permuted-pages-articles.xml.bz2`
Permuted dump. This was created by scanning the raw dump, extracting
a list of articles, randomly permuting the list, and sorting the whole
dump according to that order. This is actually done by splitting the
dump file into pieces (20 pieces when run on Maverick, with 20 cores),
loading each piece into memory and writing out according to the
permuted order, and then doing a disk merge of the resulting sorted pieces.
Note that this essentially does a merge sort on disk, and as a result
can be run on a system where the uncompressed dump file cannot fit into
main memory. See `run-permute` and `permute_wiki.py`.
3. `enwiki-20131104-permuted-combined-document-info.txt`
Flat-file database giving information on each article, such as its name,
ID, whether it's a redirect, the namespace it's in, whether it's a
disambiguation page, the split (training/dev/test) it's in, the
latitude/longitude coordinate, and the number of incoming links.
It is generated from three files using the analog of an SQL JOIN operation:
1. `enwiki-20131104-permuted-document-info.txt`
This file is generated from the permuted dump and has all info except
the latitude/longitude coordinates and the internal links.
2. `enwiki-20131104-permuted-coords.txt`
Lists articles and their extracted coordinates.
3. `enwiki-20131104-permuted-links-only-coord-documents.txt`
This file lists, for each article, how many incoming links point to that
article. It also lists, for each article, the different anchor texts
used to link to that article and counts of how often they are used.
(This info isn't currently used elsewhere but could be. For example,
it might be useful in constructing a disambiguator for mapping surface
forms to canonical forms for geographic entities -- i.e. mapping
toponyms to locations. This is actually used in the LISTR resolver
in FieldSpring, I think.)
4. `enwiki-20131104-permuted-counts-only-coord-documents.txt`
For each article with a coordinate, this gives unigram counts for all words
in the text of the article, one word per line. The preprocessing scripts
have an elaborate mechanism to try to sort out "useful" text from
directives. When in doubt, they err on the side of inclusion rather than
exclusion, and thus may include some directives by mistake.
5. `enwiki-20131104-permuted-training.data.txt`
`enwiki-20131104-permuted-dev.data.txt`
`enwiki-20131104-permuted-test.data.txt`
Combination of the info in `enwiki-20131104-permuted-document-info.txt` and
`enwiki-20131104-permuted-counts-only-coord-documents.txt`, with one
line per article, split into training/dev/test files according to the
splits given in `enwiki-20131104-permuted-document-info.txt`.
6. `enwiki-20131104-permuted-training.schema.txt`
`enwiki-20131104-permuted-dev.schema.txt`
`enwiki-20131104-permuted-test.schema.txt`
Corresponding schema files for the `.data.txt` files described just above.
The combination of a schema and data file makes a TextDB database, a type
of flat-file database used in TextGrounder.
Other data that could be useful:
-- Map of all redirects (e.g. "U.S." -> United States)
-- Map of all disambiguation pages and the articles linked to
-- The inverse surface->article mapping, i.e. for each article, list of
surface forms that link to the article, along with counts
-- List of all pairs of articles that mutually link to each other, along with
counts: Useful for indicating closely linked articles
-- Tables of categories (constructed from articles with a [[Category:foo]]
link in them) and lists (constructed from "List of foo" articles)