WikiExtractor_To_the_one_text

A simple extension to the Python script that extracts and cleans text from a Wikipedia database dump. Most of the code is from WikiExtractor.

## Installation

(sudo) python setup.py install

## Usage

python WikiExtractor.py Wiki_dump.xml -options

Example: python WikiExtractor.py enwiki-latest-pages-articles.xml -b 500K -o extracted

For detailed options, see WikiExtractor.

To merge the extracted files into a single text file:

python To_the_one_text.py Input_directory Name_of_the_single_output_file
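
For readers who want a rough idea of what combining an extraction directory into one file involves, here is a minimal sketch. It is not the actual code of To_the_one_text.py. The function name `merge_directory` is hypothetical, and the sketch assumes the input directory holds UTF-8 text files in nested subdirectories (for example `extracted/AA/wiki_00`), which is how WikiExtractor typically lays out its output.

```python
import os
import sys


def merge_directory(input_dir, output_path):
    """Concatenate every file under input_dir into output_path.

    Hypothetical helper for illustration only; returns the number of files merged.
    """
    out_abs = os.path.abspath(output_path)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(input_dir):
            dirs.sort()  # visit subdirectories in a stable, sorted order
            for name in sorted(files):
                path = os.path.join(root, name)
                if os.path.abspath(path) == out_abs:
                    continue  # skip the output file if it sits inside input_dir
                # Assumption: every input file is UTF-8 text.
                with open(path, encoding="utf-8") as src:
                    for line in src:  # stream line by line to keep memory use low
                        out.write(line)
                count += 1
    return count


if __name__ == "__main__":
    # Mirrors the two positional arguments shown above.
    if len(sys.argv) == 3:
        merged = merge_directory(sys.argv[1], sys.argv[2])
        print(f"Merged {merged} files into {sys.argv[2]}")
```

Streaming line by line rather than reading each file whole keeps memory use flat, which matters when the merged output of a full Wikipedia dump runs to many gigabytes.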