3. Preprocessing the Corpus

akoksal edited this page May 1, 2018 · 2 revisions

To train a word2vec model with the gensim library, the corpus must be a text file with one document per line and no punctuation. The output file should therefore contain every article, each on its own line. The gensim library provides methods for this preprocessing step; here, however, the tokenize function is modified for the Turkish language. Run preprocess.py to convert your Wikipedia dump into this format. It takes two arguments: the first is the path to the Wikipedia dump (still compressed, not extracted), and the second is the path to the output file. For example:

python3 preprocess.py trwiki-20180101-pages-articles.xml.bz2 wiki.tr.txt
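
To illustrate why the tokenizer needs Turkish-specific handling, here is a minimal standard-library sketch. It is not the contents of preprocess.py: the names `tr_lower`, `tokenize_tr`, `to_line` and `write_corpus`, as well as the token length limits, are assumptions made for this example. The underlying issue is that Python's built-in `str.lower()` turns "I" into "i" and "İ" into "i" followed by a combining dot, whereas Turkish expects "ı" and "i".

```python
import re

# Turkish case mapping applied before str.lower(): "I" -> "ı", "İ" -> "i".
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})

# Runs of letters only; digits, underscores and punctuation act as separators.
_TOKEN_RE = re.compile(r"[^\W\d_]+")


def tr_lower(text):
    """Lowercase text using the Turkish dotted/dotless i rules."""
    return text.translate(_TR_LOWER).lower()


def tokenize_tr(text, token_min_len=2, token_max_len=15):
    """Split text into lowercase word tokens, dropping punctuation.

    The length limits are illustrative defaults, not values taken from
    preprocess.py.
    """
    tokens = _TOKEN_RE.findall(tr_lower(text))
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]


def to_line(text):
    """Turn one article into a single space-separated line of tokens."""
    return " ".join(tokenize_tr(text))


def write_corpus(articles, output_path):
    """Write an iterable of article strings, one article per line."""
    with open(output_path, "w", encoding="utf-8") as out:
        for text in articles:
            line = to_line(text)
            if line:  # skip articles that yield no tokens
                out.write(line + "\n")
```

For instance, `to_line("Işık, İstanbul'da!")` produces a single line of lowercase tokens with the punctuation removed, which is the one-article-per-line shape described above.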