Enabling in-memory inputs for training a new tokenizer #88

yharuvi · 2020-01-19T13:37:12Z

Hi,
Thanks for the release!
I was wondering whether it's possible to feed the BPETokenizer.train() method with input other than a list of file names. To be more specific, I'd like to feed it an in-memory data structure such as a Pandas Series or a list of lists (each representing a doc).
Is that possible without being forced to write to a .txt file?
Thx!

ksopyla · 2020-07-02T14:42:33Z

Similar to #198

kkpsiren · 2020-08-19T06:17:24Z

Any hints on how to do this?

n1t0 · 2020-10-20T20:49:54Z

Closing in favor of duplicate #198

n1t0 closed this as completed Oct 20, 2020
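
For anyone landing here before reading #198: the thread itself gives no answer, so the following is only a minimal sketch of the temp-file workaround the question hoped to avoid. The helper names `iter_texts` and `dump_to_tempfile` are hypothetical, not part of `tokenizers`. The commented-out training call assumes the `tokenizers` package is installed and that `ByteLevelBPETokenizer.train(files=..., vocab_size=...)` matches your version. Newer releases may also offer an iterator-based training method, so check the docs for your version.

```python
import os
import tempfile
from typing import Iterable, Iterator, Sequence, Union

# A document is either a plain string or a sequence of string tokens.
Doc = Union[str, Sequence[str]]


def iter_texts(docs: Iterable[Doc]) -> Iterator[str]:
    """Yield one string per document; token lists are joined with spaces."""
    for doc in docs:
        yield doc if isinstance(doc, str) else " ".join(doc)


def dump_to_tempfile(docs: Iterable[Doc]) -> str:
    """Write one document per line to a temporary .txt file; return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as fh:
        for text in iter_texts(docs):
            # Keep each document on a single line.
            fh.write(text.replace("\n", " ") + "\n")
    return fh.name


# Works with any iterable of documents, e.g. a list of lists or a pandas Series.
docs = [["hello", "world"], "a plain string document"]
path = dump_to_tempfile(docs)

# Assumption: `tokenizers` is installed and exposes this API in your version.
# from tokenizers import ByteLevelBPETokenizer
# tokenizer = ByteLevelBPETokenizer()
# tokenizer.train(files=[path], vocab_size=1000)

os.remove(path)  # clean up once training is done
```

This still touches the disk, but the file is temporary and removed right after training, so no permanent .txt file needs to be managed by hand.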