Enabling in-memory inputs for training a new tokenizer #88

Closed
yharuvi opened this issue Jan 19, 2020 · 3 comments

Comments


yharuvi commented Jan 19, 2020

Hi,
Thanks for the release!
I was wondering whether it's possible to feed the BPETokenizer.train() method with input other than a list of file names. To be more specific, I'd like to pass it an in-memory data structure such as a Pandas Series or a list of lists (each inner list representing a document).
Is that possible without having to write the data to a .txt file first?
Thx!


ksopyla commented Jul 2, 2020

Similar to #198

@kkpsiren

Any hints on how to do this?

n1t0 (Member) commented Oct 20, 2020

Closing in favor of duplicate #198
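
For readers landing on this thread: a minimal sketch of training from in-memory data, assuming a later release of the tokenizers library that provides `Tokenizer.train_from_iterator` (it was not part of the API when this issue was opened). The helper names `iter_documents` and `train_bpe_in_memory` are hypothetical and exist only for illustration, and the choice of a whitespace pre-tokenizer and an `[UNK]` token is an assumption, not something stated in the thread.

```python
from typing import Iterable, Iterator, List, Union

# A document is either a plain string or a list of string pieces.
Doc = Union[str, List[str]]


def iter_documents(docs: Iterable[Doc]) -> Iterator[str]:
    """Yield one string per document from an in-memory collection.

    Strings pass through unchanged; lists of strings are joined with
    single spaces. Any iterable works, e.g. a list of lists or a
    pandas Series of strings.
    """
    for doc in docs:
        if isinstance(doc, str):
            yield doc
        else:
            yield " ".join(doc)


def train_bpe_in_memory(docs: Iterable[Doc], vocab_size: int = 1000):
    """Hypothetical helper: train a BPE tokenizer without writing files.

    Assumes a tokenizers release with Tokenizer.train_from_iterator.
    """
    # Imported lazily so iter_documents stays usable without the library.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    # The iterator is consumed lazily, so nothing is written to disk.
    tokenizer.train_from_iterator(iter_documents(docs), trainer=trainer)
    return tokenizer
```

With this in place, something like `train_bpe_in_memory(my_series)` or `train_bpe_in_memory(list_of_token_lists)` should cover both cases from the original question, provided every entry is a string or a list of strings.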

n1t0 closed this as completed Oct 20, 2020