While working on big-corpus MT training I noticed that the current implementation lacks a proper tokenization script. The available script only splits on whitespace, so I wrote one for myself based on the spaCy library. I'm not sure what your thoughts are about 3rd-party tools, but if you're OK with it, I can make a pull request.
"""Standalone script to tokenize a corpus based on Spacy NLP library."""from __future__ importprint_functionimportargparseimportsysimportspacyreload(sys)
sys.setdefaultencoding('utf-8')
defmain():
parser=argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument(
"--lang", default="en",
help="Language of your text.")
parser.add_argument(
"--delimiter", default=" ",
help="Token delimiter for text serialization.")
args=parser.parse_args()
nlp=spacy.load(args.lang, disable=['parser', 'tagger', 'ner'])
lines= []
forlineinsys.stdin:
line=line.strip().decode("utf-8")
tokens=nlp(line, disable=['parser', 'tagger', 'ner'])
merged_tokens=args.delimiter.join([str(token) fortokenintokens])
print(merged_tokens)
if__name__=="__main__":
main()
Usage:
python -m bin.tokenize_text_spacy < data/PathTo/giga-fren.release2.fixed.en > data/PathTo/giga-fren.release2.token.en
Performance:
~1.1 GB of text per hour, or ~6-8 million sentences per hour.
The 22.5 million En<->Fr sentences were processed in ~3 hours for English and ~4.5 hours for French.
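For what it's worth, throughput could probably be improved by batching documents with spaCy's nlp.pipe instead of calling nlp on each line. Below is a minimal sketch of such a variant of the inner loop; the helper name and batch_size value are my own choices and I have not benchmarked them:

# Sketch: batched processing of the input stream via nlp.pipe.
# tokenize_stream and batch_size=1000 are illustrative assumptions.
def tokenize_stream(nlp, stream, delimiter=" ", batch_size=1000):
    lines = (line.strip().decode("utf-8") for line in stream)
    for doc in nlp.pipe(lines, batch_size=batch_size):
        print(delimiter.join([str(token) for token in doc]))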
We have some advanced tokenization tools in OpenNMT-lua. For now, people are encouraged to use those to prepare their data. We are trying to centralize everything in a C++ implementation and hopefully provide wrappers to use this in Lua and Python. So I would prefer integrating the OpenNMT tokenization instead of external tokenizers like this one.
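Purely as an illustration, a hypothetical sketch of what such a Python wrapper could look like; the module name, class, and options below are assumptions, not an existing API:

# Hypothetical interface only: neither the module name nor the option
# names are guaranteed to match the eventual OpenNMT tokenizer bindings.
import opennmt_tokenizer  # assumed package name

tokenizer = opennmt_tokenizer.Tokenizer(mode="conservative", joiner_annotate=True)
tokens = tokenizer.tokenize(u"Hello, world!")
print(" ".join(tokens))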
After a brief look at that tokenizer, it seems to be language-independent. That is a great thing in general, but a language-dependent tokenizer might, in theory, give better results, though its influence on MT quality might be insignificant; I'm not sure here.
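To illustrate what I mean by language-dependent behaviour, here is a small example, assuming the English spaCy model is installed:

# English-specific rules split contractions, which a language-agnostic
# whitespace or punctuation tokenizer would not do.
import spacy

nlp_en = spacy.load("en", disable=["parser", "tagger", "ner"])
print([str(t) for t in nlp_en(u"I can't do it.")])
# expected output along the lines of: ['I', 'ca', "n't", 'do', 'it', '.']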