Dev #296

Open
wants to merge 3 commits into
base: master

207 changes: 40 additions & 167 deletions README.md
@@ -1,196 +1,69 @@
# WikiExtractor
[WikiExtractor.py](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) is a Python script that extracts and cleans text from a [Wikipedia database backup dump](https://dumps.wikimedia.org/), e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.
## 2.1 wikiextractor

The tool is written in Python and requires Python 3 but no additional library.
**Warning**: problems have been reported on Windows due to poor support for `StringIO` in the Python implementation on Windows.
### 2.1.1 Introduction

For further information, see the [Wiki](https://github.com/attardi/wikiextractor/wiki).
Generates training data from a Wikipedia corpus.

# Wikipedia Cirrus Extractor
### 2.1.2 GitHub link

`cirrus-extractor.py` is a version of the script that performs extraction from a Wikipedia Cirrus dump.
Cirrus dumps contain text with already expanded templates.
[https://github.com/attardi/wikiextractor](https://github.com/attardi/wikiextractor)

Cirrus dumps are available at:
[cirrussearch](http://dumps.wikimedia.org/other/cirrussearch/).
### 2.1.3 Environment setup

# Details
#### 2.1.3.1 Python environment

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.
Install a Python environment. Installing through Anaconda is recommended, with Python 3.7.

In order to speed up processing:
conda create -n wikiextractor python=3.7

- multiprocessing is used for dealing with articles in parallel
- a cache is kept of parsed templates (only useful for repeated extractions).
Activate the environment:

## Installation
conda activate wikiextractor

The script may be invoked directly:
#### 2.1.3.2 Clone the code

python -m wikiextractor.WikiExtractor <Wikipedia dump file>
git clone https://github.com/attardi/wikiextractor.git

It can also be installed from PyPI with:
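**Note**: the two `pdb`-related lines in *./wikiextractor/extract.py* must be commented out; otherwise the run stops in the debugger whenever templates are expanded.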

pip install wikiextractor

or locally with:
#### 2.1.3.3 Install dependencies

(sudo) python setup.py install
pip install wikiextractor

The installer also installs two scripts for direct invocation:
### 2.1.4 Download the raw corpus files

wikiextractor (equivalent to python -m wikiextractor.WikiExtractor)
extractPage (to extract a single page from a dump)
#### 2.1.4.1 Link

## Usage
Link: [https://dumps.wikimedia.org/zhwiki](https://dumps.wikimedia.org/zhwiki)
Replacing zhwiki with enwiki gives the English corpus, replacing it with frwiki gives the French corpus, and so on.
For the list of language codes, see the [**ISO 639-1 language list**](https://baike.baidu.com/item/ISO%20639-1/8292914?fr=aladdin).

### Wikiextractor
The script is invoked with a Wikipedia dump file as an argument:
#### 2.1.4.2 Download the raw corpus

python -m wikiextractor.WikiExtractor <Wikipedia dump file> [--templates <extracted template file>]
Taking English as an example, download the following file:
[https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)
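
The language substitution described above can be sketched as a small helper. This is only an illustration: `dump_url` is a hypothetical name, and the URL pattern is generalized from the English example, on the assumption that other languages follow the same file naming.

```python
def dump_url(lang: str) -> str:
    """Build the URL of the latest pages-articles dump for a language code.

    Assumption: every wiki follows the same naming scheme as the English
    example (`<lang>wiki-latest-pages-articles.xml.bz2`).
    """
    wiki = f"{lang.strip().lower()}wiki"
    return f"https://dumps.wikimedia.org/{wiki}/latest/{wiki}-latest-pages-articles.xml.bz2"


if __name__ == "__main__":
    # Print the dump URLs for Chinese, English and French.
    for code in ("zh", "en", "fr"):
        print(dump_url(code))
```
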

The option `--templates` extracts the templates to a local file, which can be reloaded to reduce the time to perform extraction.
### 2.1.5 Run the command

The output is stored in several files of similar size in a given directory.
Each file will contain several documents in this [document format](https://github.com/attardi/wikiextractor/wiki/File-Format).
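
As a rough illustration of consuming the default output, the sketch below splits the content of one output file into documents, assuming the `<doc id="" url="" title="">` ... `</doc>` layout with the attributes in exactly that order. The helper name `iter_docs` and the regular expression are illustrative and not part of wikiextractor.

```python
import re
from typing import Dict, Iterator

# Assumption: each document starts with <doc id="" url="" title=""> on its
# own line and ends with </doc>; attribute order is fixed.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(content: str) -> Iterator[Dict[str, str]]:
    """Yield one dict (id, url, title, text) per <doc> element in `content`."""
    for match in DOC_RE.finditer(content):
        yield match.groupdict()
```

Each yielded dictionary carries the three attributes plus the article body under the `text` key.
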
Place the downloaded dump in the project root directory, then run the following command:

```
usage: wikiextractor [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] [-l] [-ns ns1,ns2]
[--templates TEMPLATES] [--no-templates] [--html-safe HTML_SAFE] [--processes PROCESSES]
[-q] [--debug] [-a] [-v]
input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

<doc id="" url="" title="">
...
</doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as json objects, one per line, with
the following structure

{"id": "", "revid": "", "url": "", "title": "", "text": "..."}

The program performs template expansion by preprocessing the whole dump and
collecting template definitions.

positional arguments:
input XML wiki dump file

optional arguments:
-h, --help show this help message and exit
--processes PROCESSES
Number of processes to use (default 79)

Output:
-o OUTPUT, --output OUTPUT
directory for extracted files (or '-' for dumping to stdout)
-b n[KMG], --bytes n[KMG]
maximum bytes per output file (default 1M)
-c, --compress compress output files using bzip
--json write output in json format instead of the default <doc> format

Processing:
--html produce HTML output, subsumes --links
-l, --links preserve links
-ns ns1,ns2, --namespaces ns1,ns2
accepted namespaces
--templates TEMPLATES
use or create file containing templates
--no-templates Do not expand templates
--html-safe HTML_SAFE
use to produce HTML safe output within <doc>...</doc>

Special:
-q, --quiet suppress reporting progress info
--debug print debug info
-a, --article analyze a file containing a single article (debug option)
-v, --version print program version
python -m wikiextractor.WikiExtractor \
-b 100M \
--processes 4 \
--json \
-o data \
<downloaded dump file>.bz2
```
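
For scripted runs, the same invocation could be wrapped in Python. The sketch below is only one possible wrapper: `build_command` and `run_extraction` are hypothetical helper names, the option values mirror the command above, and the dump file name in the demo is a placeholder.

```python
import subprocess
import sys
from typing import List


def build_command(dump_path: str, output_dir: str = "data",
                  max_bytes: str = "100M", processes: int = 4) -> List[str]:
    """Assemble the wikiextractor invocation with the options used above."""
    return [
        sys.executable, "-m", "wikiextractor.WikiExtractor",
        "-b", max_bytes,                 # maximum bytes per output file
        "--processes", str(processes),   # number of worker processes
        "--json",                        # one JSON object per line
        "-o", output_dir,                # directory for extracted files
        dump_path,
    ]


def run_extraction(dump_path: str, **options) -> None:
    """Run the extractor and raise CalledProcessError if it fails."""
    subprocess.run(build_command(dump_path, **options), check=True)


if __name__ == "__main__":
    # Placeholder file name; only prints the command instead of running it.
    print(" ".join(build_command("enwiki-latest-pages-articles.xml.bz2")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.
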

Saving templates to a file will speed up performing extraction the next time,
assuming template definitions have not changed.

Option `--no-templates` significantly speeds up the extractor, avoiding the cost
of expanding [MediaWiki templates](https://www.mediawiki.org/wiki/Help:Templates).

For further information, visit [the documentation](http://attardi.github.io/wikiextractor).

### Cirrus Extractor

~~~
usage: cirrus-extract.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [-ns ns1,ns2] [-q]
[-v]
input

Wikipedia Cirrus Extractor:
Extracts and cleans text from a Wikipedia Cirrus dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

<doc id="" url="" title="" language="" revision="">
...
</doc>

positional arguments:
input Cirrus Json wiki dump file

optional arguments:
-h, --help show this help message and exit

Output:
-o OUTPUT, --output OUTPUT
directory for extracted files (or '-' for dumping to
stdout)
-b n[KMG], --bytes n[KMG]
maximum bytes per output file (default 1M)
-c, --compress compress output files using bzip

Processing:
-ns ns1,ns2, --namespaces ns1,ns2
accepted namespaces

Special:
-q, --quiet suppress reporting progress info
-v, --version print program version
~~~

### extractPage
Extract a single page from a Wikipedia dump file.

~~~
usage: extractPage [-h] [--id ID] [--template] [-v] input

Wikipedia Page Extractor:
Extracts a single page from a Wikipedia dump file.

positional arguments:
input XML wiki dump file

optional arguments:
-h, --help show this help message and exit
--id ID article number
--template template number
-v, --version print program version
~~~

## License
The code is made available under the [GNU Affero General Public License v3.0](LICENSE).
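`-o` specifies the output directory, `--processes` sets the number of processes to use (the default is derived from the machine's CPU count), and `-b` caps the size of each generated file (default 1M; the larger the file, the more articles it holds). The final argument is the name of the compressed raw dump to process. When the run finishes, the output directory contains several subdirectories, each holding some of the generated files.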
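
A minimal sketch of reading the result back, assuming the run used `--json` without `-c`, so every file under the output directory is plain text with one JSON object per line. `iter_articles` is a hypothetical helper, not part of wikiextractor.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_articles(output_dir: str) -> Iterator[Dict[str, str]]:
    """Yield every article found in the files below `output_dir`."""
    for path in sorted(Path(output_dir).rglob("*")):
        if not path.is_file():
            continue  # skip the generated subdirectories themselves
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # ignore blank lines
                    yield json.loads(line)


if __name__ == "__main__":
    # "data" matches the -o value used in the command above.
    for article in iter_articles("data"):
        print(article["id"], article["title"])
```

Because the function is a generator, articles are read one line at a time instead of being loaded into memory all at once.
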

## Reference
If you find this code useful, please cite it in publications as:
| Option | Meaning |
| ------- | ---------------------- |
| `-o` | Output directory |
| `-b` | Maximum size of each generated file |
| `--processes` | Number of processes |
| `--json` | Write output in JSON format |

~~~
@misc{Wikiextractor2015,
author = {Giuseppe Attardi},
title = {WikiExtractor},
year = {2015},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/attardi/wikiextractor}}
}
~~~

4 changes: 2 additions & 2 deletions wikiextractor/extract.py
@@ -26,7 +26,7 @@
from html.entities import name2codepoint
import logging
import time
import pdb # DEBUG
#import pdb # DEBUG

# ----------------------------------------------------------------------

@@ -82,7 +82,7 @@ def clean(extractor, text, expand_templates=False, html_safe=True):
if expand_templates:
# expand templates
# See: http://www.mediawiki.org/wiki/Help:Templates
pdb.set_trace() # DEBUG
#pdb.set_trace() # DEBUG
text = extractor.expandTemplates(text)
else:
# Drop transclusions (template, parser functions)