
Add a proper corpus tokenizer #24

Closed
gsoul opened this issue Nov 17, 2017 · 2 comments

gsoul commented Nov 17, 2017

While working on big-corpus MT training I noticed that the current implementation lacks a proper tokenization script: the script that is currently available only splits on whitespace. So I wrote one for myself based on the spaCy library. I'm not sure what your thoughts are about 3rd-party tools, but if you're OK with it, I can make a pull request.

"""Standalone script to tokenize a corpus based on Spacy NLP library."""

from __future__ import print_function

import argparse
import sys
import spacy

reload(sys)
sys.setdefaultencoding('utf-8')


def main():

  parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
  parser.add_argument(
      "--lang", default="en",
      help="Language of your text.")
  parser.add_argument(
      "--delimiter", default=" ",
      help="Token delimiter for text serialization.")
  args = parser.parse_args()


  nlp = spacy.load(args.lang, disable=['parser', 'tagger', 'ner'])

  lines = []
  for line in sys.stdin:
    line = line.strip().decode("utf-8")

    tokens = nlp(line, disable=['parser', 'tagger', 'ner'])
    merged_tokens = args.delimiter.join([str(token) for token in tokens])
    print(merged_tokens)


if __name__ == "__main__":
  main()

The usage:
python -m bin.tokenize_text_spacy < data/PathTo/giga-fren.release2.fixed.en > data/PathTo/giga-fren.release2.token.en
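
Note that the script assumes the spaCy model for the chosen language is already installed (for spaCy 2.x this would be something like python -m spacy download en; the exact command depends on your spaCy version).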

Performance:
~1.1 GB of text per hour, or ~6-8 million sentences per hour.
The 22.5 million sentences for En<->Fr were processed in ~3 hours for English and ~4.5 hours for French.

@guillaumekln
Contributor

Thanks for sharing that.

We have some advanced tokenization tools in OpenNMT-lua. For now, people are encouraged to use those to prepare their data. We are trying to centralize everything in a C++ implementation and hopefully provide wrappers to use this in Lua and Python. So I would prefer integrating the OpenNMT tokenization instead of external tokenizers like this one.
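
For reference, once the Python wrapper is available, using the OpenNMT tokenizer from a script could look roughly like the sketch below. The pyonmttok package name, the "conservative" mode, and the joiner_annotate option are assumptions about that wrapper, not something shipped in this repository.

import sys

import pyonmttok

# Rule-based tokenization; joiner_annotate inserts joiner markers so the
# output can be detokenized back to the original text.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

for line in sys.stdin:
  tokens, _ = tokenizer.tokenize(line.strip())
  print(" ".join(tokens))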

@gsoul
Author

gsoul commented Nov 17, 2017

I see, thanks.

And after a brief look at the tokenizer, it seems to me that it's language-independent. That's a great thing in general, but a language-dependent parser might, in theory, give better results. Though its influence on MT results might be insignificant; I'm not sure here.
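
For illustration, this is the kind of language-specific behavior I mean: the English spaCy tokenizer applies exception rules (e.g. for contractions) that a purely language-independent splitter would not. A minimal sketch, assuming the English model is installed:

import spacy

# Only the tokenizer is needed; the exception rules for contractions are
# part of the English language data.
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])

print([token.text for token in nlp("I don't think so.")])
# ['I', 'do', "n't", 'think', 'so', '.']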
