Partial support of compressed corpora in FastText model #3246
Thanks for reporting. Yes, the optimized/compiled `corpus_file` path only reads plain text. What do you suggest the expected outcome for compressed input should be?
Hello Radim and thanks for the prompt reply! The ideal solution for me is of course support for compressed files in the compiled code (actually I just bought 2 TB of space for my Google Colab exactly because of this :)). Working with compressed files might even be faster, especially on slow IO, when reading is a bottleneck. But I understand that proper support might be too much of a burden to add (or at least to add right now).
Okay, thanks. Are you able to open a PR?
I do not have a solution yet (otherwise I'd be opening a PR rather than reporting an issue). I will look into it (one option is to use https://github.com/ahupp/python-magic, but I'm not sure we need an extra dependency for such a small use case, what do you think?).

On a similar note, when I'm initializing training like this:

```python
model = FastText()
logger.info("Building vocabulary...")
model.build_vocab(corpus_file=str(settings.corpus_path))
model.train(
    corpus_file=str(settings.corpus_path),
    total_words=model.corpus_total_words,
    total_examples=model.corpus_count,
)
```

I'm receiving strange log messages like this:
This is still for the case where your `corpus_file` is compressed? If so, nonsensical log statistics are expected: the (correct) vocabulary counts collected by `build_vocab` (which decompresses the file) don't match what the optimized training code sees when it reads the compressed bytes as if they were plain text.
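A stdlib-only illustration (not gensim code) of why the statistics look nonsensical: a reader that decompresses the file sees the real tokens, while a reader that treats the same file as plain text sees compressed bytes as garbage "tokens".

```python
import bz2
import os
import tempfile

# Write a small "corpus", bz2-compressed.
text = "the quick brown fox jumps over the lazy dog\n" * 100
with tempfile.NamedTemporaryFile(delete=False, suffix=".bz2") as tmp:
    tmp.write(bz2.compress(text.encode("utf-8")))

# A vocab-building step that auto-decompresses (as smart_open does)
# sees the real tokens...
with bz2.open(tmp.name, "rt", encoding="utf-8") as f:
    vocab_tokens = f.read().split()

# ...while a reader that treats the same file as plain text (like the
# compiled training loop) sees compressed bytes as meaningless "tokens".
with open(tmp.name, "rb") as f:
    raw_tokens = f.read().split()

os.remove(tmp.name)
print(len(vocab_tokens))                      # 900 real tokens
print(len(raw_tokens) == len(vocab_tokens))   # False
```

Any per-token statistics computed over the two streams will therefore disagree, which is exactly the mismatch the log lines reflect.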
Yeah, extra dependencies are not great. Simply looking at the file extension (using the built-in `mimetypes` module) of the `corpus_file` might be good enough.
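A minimal sketch of that extension check using only the stdlib `mimetypes` module (the helper name here is hypothetical, not gensim API):

```python
import mimetypes

def looks_compressed(path: str) -> bool:
    """Guess from the extension whether a corpus file is compressed.

    mimetypes.guess_type returns (type, encoding); the encoding slot is
    'gzip', 'bzip2', etc. for known compressed extensions, else None.
    """
    _mime_type, encoding = mimetypes.guess_type(path)
    return encoding is not None

print(looks_compressed("corpus.txt.gz"))   # True
print(looks_compressed("corpus.txt"))      # False
```

This only inspects the filename, so it is cheap but, as discussed below, only as reliable as the extension itself.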
No, I'm not :) Everything is uncompressed now :)
Okay, deal.
Notably, the difficulty with using arbitrary compressed files for `corpus_file` lies in the optimized, compiled reader. Obviously it shouldn't fail this way - with no clean error message, and nonsensical log lines based on the garbage data seen by the training code. Fixes could include one or more of:

- detecting likely-compressed input and failing fast with a clear error (or warning);
- actually supporting compressed files in the compiled `corpus_file` path.
Well, I'm planning to implement the first idea, simply because I don't have enough knowledge of the gensim internals to play with the rest.

On a separate (but somewhat related) note, we've implemented a script on top of gensim to perform grid training using various combinations of hyperparameters. Maybe it'll be useful for gensim (as an example for the docs or something).
@piskvorky sorry for disturbing you again. I've checked the known MIME types on my laptop and the relevant part looks like this:

It seems to me that it's not a very good idea to rely on `mimetypes` (the mapping might differ greatly between machines). My proposal is to do a simple validation of the extracted extension against a list of known file extensions (or to read the first ~100 bytes of the file) to determine whether it's binary or textual.
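The read-the-first-bytes variant could look like the sketch below; the magic numbers for gzip/bz2/xz are well known, but the function name and the exact byte budget are just illustrative choices:

```python
import bz2
import os
import tempfile

# Well-known magic numbers for common compression formats.
_MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}

def sniff_compression(path):
    """Return the detected compression format name, or None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in _MAGIC.items():
        if head.startswith(magic):
            return name
    return None

# Quick demo on a temporary bz2 file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bz2") as tmp:
    tmp.write(bz2.compress(b"some corpus text\n"))
fmt = sniff_compression(tmp.name)
os.remove(tmp.name)
print(fmt)  # bzip2
```

Unlike the extension check, this works even when a compressed file is misnamed, at the cost of one extra `open` per corpus file.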
Sure – something like https://github.com/ahupp/python-magic might be handy. I have a battle-tested implementation if needed. A simple MIME-type check of the file extension seems "good enough" too, although less robust.
I'd think the safest / most-urgent approach would check for, & warn about, the common error of supplying a file in a known-compressed format - either by extension or some other heuristic. (Perhaps just: the exact same heuristic that causes smart_open to auto-decompress.) Actual good plain-text data could come in a file with any extension, and nearly any first 4 'magic' bytes - so trying to be strict according to other assumptions could be bad. Other potential heuristics could be considered too.
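A sketch of that minimal warning, keyed off the extensions smart_open auto-decompresses by default (the `.gz`/`.bz2` list and the function name are assumptions for this sketch; newer smart_open versions expose the registry programmatically):

```python
import warnings

# Extensions smart_open decompresses transparently by default
# (assumed here; the exact registry can vary by smart_open version).
_AUTO_DECOMPRESSED = (".gz", ".bz2")

def warn_if_compressed(corpus_file: str) -> bool:
    """Warn when a corpus_file path matches a known-compressed extension."""
    if corpus_file.lower().endswith(_AUTO_DECOMPRESSED):
        warnings.warn(
            f"{corpus_file} appears to be compressed; the optimized "
            "corpus_file training path would read it as raw bytes."
        )
        return True
    return False

print(warn_if_compressed("corpus.txt.bz2"))  # True (and emits a warning)
print(warn_if_compressed("corpus.txt"))      # False
```

Because it only warns on the known-bad cases, it cannot mis-flag legitimate plain-text corpora with unusual extensions.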
This is a great point. It's in the "good enough" direction, because smart_open does rely on the file extension only. The ideal solution would be to support compressed files in `CythonLineSentence` itself.
Yes, totally agree, brilliant idea.
@piskvorky sorry for bugging you again. Am I correct that the only compressed types smart_open supports by default are bz2/gz? I quickly grepped through the gensim repo to double-check.
I think so. @mpenkov WDYT?
A simple warning when the affected file extensions (as per that smart_open default list) are passed as `corpus_file` would already help.

Given the duplication, limitations, & bugs in the `corpus_file` mode, I'd rather see effort go into making the classic corpus-iterable path faster.
Ok, here is my humble attempt.
Indeed. But we don't know how to do that.
Not exactly! But we also don't know how exactly to let the classic corpus-iterable path reach the same throughput. I keep mentioning this because if others agree there's a fair chance it'd work, it may not be much more total effort than designing/documenting/debugging extra conventions/steps for compressed/multifile `corpus_file` input.
Problem description

It seems that the vocab part of the FastText model is totally fine with compressed files. I.e. you can pass a bzipped file to the `build_vocab` method and be perfectly fine - until you try to train the model on the same file. It seems that `train_epoch_sg`/`train_epoch_cbow` (or more precisely, `CythonLineSentence`) don't support the same logic and can only accept text files. Nevertheless, you can pass a bzip file to the train method and it won't object at all. It'll even train on it. But as you can imagine, the results are peculiar.

This is a bit misleading. While I understand why it happens, it took me some time to debug why my vectors were rubbish.
Steps/code/corpus to reproduce
Versions
Please provide the output of: