Some taken some given #202

Open
wants to merge 110 commits into
base: master
110 commits
e2d332e
Compacted Headers
HjalmarrSv Dec 13, 2019
053fde3
Update WikiExtractor.py
HjalmarrSv Dec 13, 2019
59e597c
Update WikiExtractor.py
HjalmarrSv Dec 13, 2019
5870040
Update WikiExtractor.py
HjalmarrSv Dec 13, 2019
d2e6330
fixed added errors
HjalmarrSv Dec 13, 2019
eeda954
works but errors may exist as well as quirks
HjalmarrSv Dec 13, 2019
8d10750
Added Example
HjalmarrSv Dec 13, 2019
3b6a710
Added 2 examples
HjalmarrSv Dec 13, 2019
ef9c07a
-- spacefree now also removes line = " "
HjalmarrSv Dec 14, 2019
7b2abd4
Changed --spacefree to --squeeze-blank (same as in cat --squeeze-blan…
HjalmarrSv Dec 20, 2019
de81a98
Update
HjalmarrSv Dec 23, 2019
866e073
Update README.md
HjalmarrSv Dec 23, 2019
22da7ea
--no-title as alias for --titlefree
HjalmarrSv Dec 28, 2019
8378de6
--no-title should work as --titlefree
HjalmarrSv Dec 28, 2019
dc56429
--no-templates as alias for --no_templates
HjalmarrSv Dec 28, 2019
7198d0a
temporarily change --no-title to -no-title
HjalmarrSv Dec 28, 2019
67a0ce7
Update WikiExtractor.py
HjalmarrSv Dec 29, 2019
49ad988
About cleaning
HjalmarrSv Dec 29, 2019
bb68ed7
Update README.md
HjalmarrSv Dec 29, 2019
4f192ed
Update README.md
HjalmarrSv Dec 29, 2019
d2ce1a8
spelling
HjalmarrSv Dec 31, 2019
ea31c41
Added mediawiki links in comments
HjalmarrSv Dec 31, 2019
67903ab
Implement josecannete/wikiextractorforBERT
HjalmarrSv Jan 3, 2020
a32e7b1
Fix: separate docs
HjalmarrSv Jan 3, 2020
3325153
--for-bert
HjalmarrSv Jan 3, 2020
8bed747
Added better example
HjalmarrSv Jan 4, 2020
b4420f3
Update README.md
HjalmarrSv Jan 4, 2020
703249a
Update README.md
HjalmarrSv Jan 4, 2020
58d4352
Concatenate many files to one file
HjalmarrSv Jan 4, 2020
bfda827
Template expansion
HjalmarrSv Jan 4, 2020
d7eb406
broken middoc interwiki link
HjalmarrSv Jan 4, 2020
a4e4fdc
Augmented key regex to catch plus/minus signs
HjalmarrSv Jan 4, 2020
82f6d3c
New example for Bert and json
HjalmarrSv Jan 4, 2020
5ae4bcd
Force 'utf-8' encoding
HjalmarrSv Jan 4, 2020
c6b5d39
Remove inline flags from the middle of a regex
HjalmarrSv Jan 4, 2020
41fdefc
Update README.md
HjalmarrSv Jan 4, 2020
b2c4a47
Option to restrict to specific pages by title
HjalmarrSv Jan 5, 2020
2f80d1c
--max_articles
HjalmarrSv Jan 6, 2020
9683248
bug fixes
HjalmarrSv Jan 6, 2020
761f67b
max_articles, remove-html-tags in tested example
HjalmarrSv Jan 6, 2020
aef4bc8
remove mapframe, maplink
HjalmarrSv Jan 6, 2020
7c37e62
discard score tags
HjalmarrSv Jan 6, 2020
fe2b6f2
cgi.escape with html.escape
HjalmarrSv Jan 7, 2020
02757c6
Handle broken pipe and keyboard interrupts
HjalmarrSv Jan 7, 2020
050b586
fixed the last introduced bug
HjalmarrSv Jan 7, 2020
a63ab3a
Rollback of code for termination
HjalmarrSv Jan 7, 2020
0b9091b
--verbose and <BR>
HjalmarrSv Jan 8, 2020
ea73ede
bug fix + a bit of removing of debug comments
HjalmarrSv Jan 8, 2020
e190f33
Update README.md
HjalmarrSv Jan 8, 2020
5ab7485
Update README.md
HjalmarrSv Jan 8, 2020
48dc351
Update README.md
HjalmarrSv Jan 9, 2020
ff7cf4d
templates_only option and dropNested within and between ref tags
HjalmarrSv Jan 9, 2020
3255fae
currently no install
HjalmarrSv Jan 12, 2020
eaf3d34
--raw, --abstract_only
HjalmarrSv Jan 16, 2020
6088107
Update README.md
HjalmarrSv Jan 16, 2020
d831277
bugfixes
HjalmarrSv Jan 16, 2020
a3cc036
fixed breaking errors
HjalmarrSv Jan 19, 2020
7047392
Example on cirrus dump
HjalmarrSv Jan 19, 2020
de50af0
text only option
HjalmarrSv Jan 19, 2020
7b8a9d1
Example on text only
HjalmarrSv Jan 19, 2020
e399a97
Remove code making gzip conditional
HjalmarrSv Jan 20, 2020
e3657c1
Add urlbase to verbosity
HjalmarrSv Jan 20, 2020
d875198
only one space before caret
HjalmarrSv Jan 20, 2020
c4321d0
remove broken sentences
HjalmarrSv Jan 21, 2020
8a78aea
--sentences and --raw
HjalmarrSv Jan 23, 2020
24e6e85
fix tab error
HjalmarrSv Jan 23, 2020
6c64e75
--sentences, now at least 2 per article
HjalmarrSv Jan 23, 2020
00a322e
Update README.md
HjalmarrSv Jan 24, 2020
8bb7497
bug fix
HjalmarrSv Jan 24, 2020
3b8bb70
Make visible what is not supported currently
HjalmarrSv Jan 25, 2020
6d755bf
Added switch: '__NOGLOBAL__'
HjalmarrSv Jan 25, 2020
49e3cd8
Added {{grammar: ...}}
HjalmarrSv Jan 27, 2020
34ee0dc
disambiguate command line parameter
HjalmarrSv Jan 27, 2020
fd1c7ca
Update WikiExtractor.py
HjalmarrSv Jan 27, 2020
bbdb4a8
revert on 'small'
HjalmarrSv Jan 27, 2020
62d2b06
basic support for 'formatnum'
HjalmarrSv Jan 28, 2020
7ccd3f8
added example
HjalmarrSv Jan 28, 2020
63278c7
refresh comments
HjalmarrSv Jan 28, 2020
0950b89
--decimalcomma; decimal separator comma (,) for formatnum
HjalmarrSv Jan 29, 2020
8910474
bug fix
HjalmarrSv Jan 29, 2020
a729546
a little better formatnum
HjalmarrSv Jan 29, 2020
fbb1199
version update
HjalmarrSv Jan 29, 2020
effcc94
clean up templatestyles
HjalmarrSv Jan 29, 2020
063d3d1
preparing for #dateformat tags
HjalmarrSv Feb 2, 2020
40bd592
clean '! ...' - lines
HjalmarrSv Feb 2, 2020
c235357
this way it actually works
HjalmarrSv Feb 2, 2020
970f94b
set 'options.cleaned = True'
HjalmarrSv Feb 2, 2020
7a824b8
Additional cleaning needed
HjalmarrSv Feb 2, 2020
56d9852
improve cleaning
HjalmarrSv Feb 3, 2020
43a9b9d
basic formatdate and dateformat parsing
HjalmarrSv Feb 3, 2020
c4136d0
remove multiples of '/n'
HjalmarrSv Feb 3, 2020
d90b9e7
remove empty ( )
HjalmarrSv Feb 3, 2020
9c7d502
Update comments in code
HjalmarrSv Feb 5, 2020
01668eb
use raw, especially where there are \
HjalmarrSv Feb 6, 2020
3f672b0
expand formatnum functionality
HjalmarrSv Feb 6, 2020
1b2a695
the obvious bugs
HjalmarrSv Feb 6, 2020
e974947
too buggy - works when function shortcut, else not
HjalmarrSv Feb 6, 2020
3133906
Update comments: Lua wants!
HjalmarrSv Feb 7, 2020
d0f2529
fix for revid
HjalmarrSv Feb 28, 2020
4a4650d
creates article files instead
HjalmarrSv Feb 29, 2020
392fddf
Update README.md
HjalmarrSv Feb 29, 2020
bc781f2
Adding comments
HjalmarrSv Feb 29, 2020
86adb42
If / gives problems in an os that uses \ in dir path
HjalmarrSv Mar 2, 2020
1f89728
fix for en, new folder structure
HjalmarrSv Mar 7, 2020
69d0e76
Update README.md
HjalmarrSv Mar 7, 2020
1f70817
Updated
HjalmarrSv Mar 8, 2020
f3a1892
Update README.md
HjalmarrSv Mar 8, 2020
0654bab
Update cirrus-extract.py
HjalmarrSv Mar 13, 2020
fbeef46
Update cirrus-extract.py
HjalmarrSv May 19, 2020
c423d2d
Update README.md
HjalmarrSv Oct 13, 2021
94 changes: 78 additions & 16 deletions README.md
@@ -1,14 +1,28 @@
# WikiExtractor
[WikiExtractor.py](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) is a Python script that extracts and cleans text from a [Wikipedia database dump](http://download.wikimedia.org/).

The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library. Python 2 may no longer work properly; testing may be needed.

For further information, see the [project Home Page](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) or the [Wiki](https://github.com/attardi/wikiextractor/wiki).

# Wikipedia Cirrus Extractor

`cirrus-extractor.py` is a version of the script that performs extraction from a Wikipedia Cirrus dump.
Cirrus dumps contain text with already expanded templates.
Cirrus dumps contain text with already expanded templates. The Cirrus extractor therefore does not suffer from WikiExtractor's somewhat inadequate template expansion, and may be used instead until that has been fixed. Note, however, that some expanded templates, such as the one for stub articles, are not useful.

<b>Examples:</b><br>
json output: python3 cirrus-extract.py -o wiki/test wiki/wiki-20191104-cirrussearch-content.json.gz<br>
text output: python3 cirrus-extract.py -o wiki/test -t wiki/wiki-20191104-cirrussearch-content.json.gz

Text output contains no titles or other metadata, only the article texts separated by empty lines.
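
A minimal sketch of how such text output could be split back into individual articles. It assumes that an article never contains an empty line itself, and the function name `split_articles` is hypothetical (it is not part of `cirrus-extract.py`):

```python
def split_articles(text):
    """Split extractor text output into articles on runs of empty lines.

    Assumption: articles are separated by one or more empty lines and
    never contain an empty line themselves.
    """
    articles, current = [], []
    for line in text.splitlines():
        if line.strip():
            # Non-empty line: belongs to the article being collected.
            current.append(line)
        elif current:
            # Empty line after content: the current article is complete.
            articles.append("\n".join(current))
            current = []
    if current:
        # Last article when the text does not end with an empty line.
        articles.append("\n".join(current))
    return articles
```

Usage could look like `split_articles(open(path, encoding="utf-8").read())`, where `path` points to one of the output files.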

<b>Some additional switches are:</b><br>
--raw : basically no cleaning.<br>
--sentences : basic sentence-based cleaning, splitting on dot and space and producing at least two sentences ending with a dot per article. It can be tricked by dots in names, etc.
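
As a rough illustration of what dot-and-space sentence cleaning can look like (this is not the code from `cirrus-extract.py`; the function name `keep_complete_sentences` and the `min_sentences` parameter are hypothetical):

```python
def keep_complete_sentences(text, min_sentences=2):
    """Keep only chunks that end with a dot; return "" if too few remain.

    Hypothetical sketch: it splits on ". " (dot followed by space), so dots
    in names or abbreviations are wrongly treated as sentence ends.
    """
    # Turn every ". " into a line break while keeping the dot on the chunk.
    chunks = [c.strip() for c in text.replace(". ", ".\n").split("\n")]
    # Drop broken sentences, i.e. chunks that do not end with a dot.
    sentences = [c for c in chunks if c.endswith(".")]
    if len(sentences) < min_sentences:
        return ""
    return " ".join(sentences)
```

A name such as "J. Smith" is split after "J." by this approach, which is the kind of trickery the switch description warns about.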

<b>If you want, or do not want, every article in a separate file</b><br>
Change line 53 accordingly. Note that if you want a directory structure other than ./A/ABC/abc..., you need to change the code; the relevant places are commented (lines 123-127). Please also look at line 281 for file name variations.<br>
Example: python3 cirrus-extract.py -o wiki/test -t --sentences wiki/wiki-20191216-cirrussearch-content.json.gz

Cirrus dumps are available at:
[cirrussearch](http://dumps.wikimedia.org/other/cirrussearch/).
@@ -24,25 +38,65 @@ In order to speed up processing:

## Installation

The script may be invoked directly, however it can be installed by doing:

(sudo) python setup.py install
Currently no installation. The script may be invoked directly.

## Usage
The script is invoked with a Wikipedia dump file as an argument.
The output is stored in several files of similar size in a given directory.
Each file will contains several documents in this [document format](http://medialab.di.unipi.it/wiki/Document_Format).

usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html]
[-l] [-s] [--lists] [-ns ns1,ns2]
[--templates TEMPLATES] [--no-templates] [-r]
[--min_text_length MIN_TEXT_LENGTH]
[--filter_category path_of_categories_file]
[--filter_disambig_pages] [-it abbr,b,big]
[-de gallery,timeline,noinclude] [--keep_tables]
[--processes PROCESSES] [-q] [--debug] [-a] [-v]
[--log_file]
input
usage: WikiExtractor.py <br>
[-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html]<br>
[-l] [-s] [--headersfooters] [--noLineAfterHeader]<br>
[-no-title] [--squeeze_blank] [--for-bert]<br>
[--remove-special-tokens] [--remove-html-tags]<br>
[--point-separated]<br>
[--restrict_pages_to RESTRICT_PAGES_TO]<br>
[--max_articles MAX_ARTICLES] [--verbose] [--lists]<br>
[-ns ns1,ns2] [--templates TEMPLATES] [--no-templates]<br>
[-r] [--min_text_length MIN_TEXT_LENGTH]<br>
[--filter_disambig_pages] [-it abbr,b,big]<br>
[-de gallery,timeline,noinclude] [--keep_tables]<br>
[--processes PROCESSES] [-q] [--debug] [-a]<br>
[--log_file LOG_FILE] [-v]<br>
[--filter_category FILTER_CATEGORY]<br>
input

## Examples (tested for "correct" output)
<b>Debug and testing (short and fast):</b>
python3 WikiExtractor.py -o wiki/test --templates templat.txt --max_articles 10 --verbose wiki/wiki-20191101-pages-articles.xml<br>
<b>Debug and testing (more info on screen and a log):</b> python3 WikiExtractor.py -o wiki/test --templates templat.txt --max_articles 10 --verbose --debug --log_file log.txt wiki/wiki-20191101-pages-articles.xml

<b>JSON (most extracted information):</b>
python3 WikiExtractor.py -o wiki/test --filter_disambig_pages --templates templat.txt --titlefree --json --min_text_length 100 wiki/wiki-20191101-pages-articles.xml<br>
python3 WikiExtractor.py -o wiki/test --filter_disambig_pages --templates templat.txt --json --for-bert --min_text_length 100 wiki/wiki-20191101-pages-articles.xml

<b>Text only with "extra cleaning" (change --min_text_length to suit your use cases):</b>
python3 WikiExtractor.py -o wiki/test --filter_disambig_pages --no_templates --remove-html-tags --remove-special-tokens --min_text_length 100 wiki/wiki-20191101-pages-articles.xml

<b>Other combinations:</b>
python3 WikiExtractor.py -o wiki/test --headersfooters --titlefree --squeeze-blank wiki/wiki-20191101-pages-articles.xml<br>
python3 WikiExtractor.py -o wiki/test --titlefree --squeeze-blank wiki/wiki-20191101-pages-articles.xml<br>
python3 WikiExtractor.py -o wiki/test --noLineAfterHeader --squeeze-blank wiki/wiki-20191101-pages-articles.xml<br>
python3 WikiExtractor.py -o wiki/test --for-bert wiki/wiki-20191101-pages-articles.xml<br>
python3 WikiExtractor.py -o wiki/test --filter_disambig_pages --no_templates --for-bert --min_text_length 100 wiki/wiki-20191101-pages-articles.xml<br>
python3 WikiExtractor.py -o wiki/test --filter_disambig_pages --templates templat.txt --titlefree --json --for-bert --min_text_length 100 wiki/wiki-20191101-pages-articles.xml<br>
python3 WikiExtractor.py -o wiki/test --filter_disambig_pages --templates templat.txt --squeeze-blank --titlefree --max_articles 10 --remove-html-tags --min_text_length 100 wiki/wiki-20191101-pages-articles.xml<br>

<b>Postprocessing</b>
After running the extractor, the output may need further cleaning. On Linux you may use any of the following examples. Please copy all the files to a safe place first. ANY ERROR IN THE CODE WILL DESTROY YOUR TEXT. You can be sure your text will be destroyed many times before you find the right cleaning scripts.<br>
left trim on one file: sed -i 's/^[ ]*//g' YOURTEXT<br>
right trim on one file: sed -i 's/[ ]*$//g' YOURTEXT<br>
If you want to work on many files at a time, use the following (do NOT have any other files in the folder or subfolders):<br>
left trim on all files in folder or subfolder: find wiki/* -type f -exec sed -i 's/^[ ]*//g' {} \;<br>
right trim on all files in folder or subfolder: find wiki/* -type f -exec sed -i 's/[ ]*$//g' {} \;<br>
remove a line that starts with < and ends with > on all files in folder or subfolder: find wiki/* -type f -exec sed -E -i '/^<[^<]*>$/d' {} \;<br>
remove a line that starts with ( and ends with ) on all files in folder or subfolder: find wiki/* -type f -exec sed -E -i '/^[(][^(]*[)]$/d' {} \;<br>
Search the Internet for variations and for how to use these commands on other operating systems. One variation is to drop the "-i" (in-place) option and write the changes to new files instead, although this is not very useful if you do more than one cleaning operation.
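
For those who prefer not to edit files in place, the four sed operations above could be sketched in Python roughly as follows. This is an illustrative equivalent, not part of the extractor; the names `clean_lines` and `clean_file` are hypothetical, and the cleaned text is written to a new file so the source stays untouched:

```python
import re
from pathlib import Path

TAG_LINE = re.compile(r"^<[^<]*>$")        # same pattern as the sed example
PAREN_LINE = re.compile(r"^[(][^(]*[)]$")  # same pattern as the sed example


def clean_lines(lines):
    """Trim spaces on both sides, then drop tag-only and parenthesis-only lines."""
    for line in lines:
        # Like the sed examples, only spaces are trimmed (not tabs).
        line = line.rstrip("\n").strip(" ")
        if TAG_LINE.match(line) or PAREN_LINE.match(line):
            continue
        yield line


def clean_file(src, dst):
    """Write a cleaned copy of src to dst; src itself is not modified."""
    text = Path(src).read_text(encoding="utf-8")
    cleaned = "\n".join(clean_lines(text.split("\n")))
    Path(dst).write_text(cleaned, encoding="utf-8")
```

Because all steps run in one pass, writing to a new file does not cost an extra copy per cleaning operation.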

For those use cases where only one large file is needed, on Linux use: cat --squeeze-blank wiki/\*/\* > wiki/wiki.txt
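
On systems without `cat`, the same concatenation could be sketched in Python as below. The function names `squeeze_blank` and `concatenate` are hypothetical, and the sketch assumes the output files sit exactly two levels below the source directory, as in `wiki/*/*`:

```python
from pathlib import Path


def squeeze_blank(lines):
    """Yield lines, collapsing each run of empty lines into a single one."""
    previous_blank = False
    for line in lines:
        blank = line.strip("\n") == ""
        if blank and previous_blank:
            continue  # skip repeated empty lines
        previous_blank = blank
        yield line


def concatenate(src_dir, dst_file):
    """Concatenate all files matching src_dir/*/* into dst_file."""

    def all_lines():
        for path in sorted(Path(src_dir).glob("*/*")):
            if path.is_file():
                with open(path, encoding="utf-8") as f:
                    yield from f

    with open(dst_file, "w", encoding="utf-8") as out:
        out.writelines(squeeze_blank(all_lines()))
```

For example, `concatenate("wiki", "wiki/wiki.txt")` would mirror the command above; the destination is not picked up as input because it is only one level below `wiki`.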



Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
@@ -59,7 +113,7 @@ Each file will contains several documents in this [document format](http://media

{"id": "", "revid": "", "url":"", "title": "", "text": "..."}

Template expansion requires preprocesssng first the whole dump and
Template expansion requires preprocessing first the whole dump and
collecting template definitions.

positional arguments:
@@ -116,6 +170,14 @@ Each file will contains several documents in this [document format](http://media
from the article text
--keep_tables Preserve tables in the output article text
(default=False)
--headersfooters Adds header and footer to each article
(default=False)
--noLineAfterHeader Does not add line below title. Title is directly on article.
(default=False)
--titlefree No titles on articles
(default=False)
--squeeze-blank Minimize empty lines, that is, only empty lines are before/after title.
(default=False)

Special:
-q, --quiet suppress reporting progress info