
Add a proper corpus tokenizer #24

Closed
gsoul opened this issue Nov 17, 2017 · 2 comments

gsoul commented Nov 17, 2017

While working on big-corpus MT training I noticed that the current implementation lacks a proper tokenization script: the script that is currently available only splits on whitespace. So I wrote one for myself based on the spaCy library. I'm not sure what your thoughts are about 3rd-party tools, but if you're OK with it, I can make a pull request.

"""Standalone script to tokenize a corpus based on Spacy NLP library."""

from __future__ import print_function

import argparse
import sys
import spacy

reload(sys)
sys.setdefaultencoding('utf-8')


def main():

  parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
  parser.add_argument(
      "--lang", default="en",
      help="Language of your text.")
  parser.add_argument(
      "--delimiter", default=" ",
      help="Token delimiter for text serialization.")
  args = parser.parse_args()


  nlp = spacy.load(args.lang, disable=['parser', 'tagger', 'ner'])

  lines = []
  for line in sys.stdin:
    line = line.strip().decode("utf-8")

    tokens = nlp(line, disable=['parser', 'tagger', 'ner'])
    merged_tokens = args.delimiter.join([str(token) for token in tokens])
    print(merged_tokens)


if __name__ == "__main__":
  main()

The usage:
python -m bin.tokenize_text_spacy < data/PathTo/giga-fren.release2.fixed.en > data/PathTo/giga-fren.release2.token.en
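
Note that the script assumes the spaCy model for the chosen language is already installed (for spaCy 2.x this would be something like python -m spacy download en; the exact command depends on your spaCy version).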

Performance:
~1.1 GB of text per hour, or ~6-8 million sentences per hour.
The 22.5 million sentences for En<->Fr were processed in ~3 hours for English and ~4.5 hours for French.

@guillaumekln
Contributor

Thanks for sharing that.

We have some advanced tokenization tools in OpenNMT-lua. For now, people are encouraged to use those to prepare their data. We are trying to centralize everything in a C++ implementation and hopefully provide wrappers to use this in Lua and Python. So I would prefer integrating the OpenNMT tokenization instead of external tokenizers like this one.
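
For reference, once the Python wrapper is available, using the OpenNMT tokenizer from a script could look roughly like the sketch below. The pyonmttok package name, the "conservative" mode, and the joiner_annotate option are assumptions about that wrapper, not something shipped in this repository.

import sys

import pyonmttok

# Rule-based tokenization; joiner_annotate inserts joiner markers so the
# output can be detokenized back to the original text.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

for line in sys.stdin:
  tokens, _ = tokenizer.tokenize(line.strip())
  print(" ".join(tokens))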

@gsoul
Author

gsoul commented Nov 17, 2017

I see, thanks.

And after a brief look at the tokenizer, it seems to me that it's language-independent. That's a great thing in general, but a language-dependent parser might, in theory, give better results. Though its influence on MT results might be insignificant; I'm not sure here.
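
For illustration, this is the kind of language-specific behavior I mean: the English spaCy tokenizer applies exception rules (e.g. for contractions) that a purely language-independent splitter would not. A minimal sketch, assuming the English model is installed:

import spacy

# Only the tokenizer is needed; the exception rules for contractions are
# part of the English language data.
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])

print([token.text for token in nlp("I don't think so.")])
# ['I', 'do', "n't", 'think', 'so', '.']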
