Partial support of compressed corpora in FastText model #3246

Closed
dchaplinsky opened this issue Oct 9, 2021 · 19 comments · Fixed by #3258

Comments

@dchaplinsky
Contributor

Problem description

It seems that the vocab part of the FastText model is totally fine with compressed files, i.e. you can pass a bzipped file to the build_vocab method and be perfectly fine, until you try to train the model on the same file. It seems that train_epoch_sg/train_epoch_cbow (or, more precisely, CythonLineSentence) doesn't support the same logic and can only accept text files.

Nevertheless, you can pass a bzip file to the train method and it won't object at all. It'll even train on it. But, as you can imagine, the results are peculiar.

This is a bit misleading. While I understand why it happens, it took me some time to debug why my vectors were rubbish.

Steps/code/corpus to reproduce

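    from gensim.models import FastText

    model = FastText()
    # build_vocab() transparently decompresses the bzipped corpus
    model.build_vocab(corpus_file="/my/corpus.txt.bz2")

    # train() reads the same file as raw bytes, so the vectors come out as rubbish
    model.train(
        corpus_file="/my/corpus.txt.bz2",
        total_words=model.corpus_total_words,
        total_examples=model.corpus_count,
        epochs=model.epochs,  # train() requires an explicit epochs count
    )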

Versions

Please provide the output of:

macOS-11.5.2-x86_64-i386-64bit
Python 3.8.12 (default, Aug 31 2021, 04:09:21)
[Clang 12.0.5 (clang-1205.0.22.9)]
Bits 64
NumPy 1.21.2
SciPy 1.7.1
gensim 4.1.2
FAST_VERSION 0
@piskvorky
Owner

piskvorky commented Oct 9, 2021

Thanks for reporting. Yes, the optimized/compiled corpus_file code path does not support compressed inputs, unlike the rest of Gensim (which supports compressed inputs transparently, thanks to its use of smart_open).

What do you suggest the expected outcome for corpus_file should be?

@dchaplinsky
Contributor Author

Hello Radim and thanks for the prompt reply!

The ideal solution for me is, of course, support for compressed files in the compiled code (actually, I just bought 2TB of space for my Google Colab exactly because of this :)). Working with compressed files might be faster, especially on slow IO, when reading is the bottleneck.

But I understand that adding proper support might be too much of a burden (or at least too much to do now).
An alternative solution might be a warning issued by the compiled reader when it reads (presumably) binary files, plus a note in the documentation.

@piskvorky
Owner

An alternative solution might be a warning that is issued by the compiled reader when reading (presumably) binary files and a note in the documentation

Okay, thanks. Are you able to open a PR?

@dchaplinsky
Contributor Author

dchaplinsky commented Oct 10, 2021

I do not have a solution yet (otherwise I'd be opening a PR rather than reporting an issue). I will look into it (one option is to use https://github.com/ahupp/python-magic, but I'm not sure if we need an extra dependency just for such a small use-case, what do you think?).

On a similar note, when I'm initializing training like this:

    model = FastText()

    logger.info("Building vocabulary...")
    model.build_vocab(corpus_file=str(settings.corpus_path))

    model.train(
        corpus_file=str(settings.corpus_path),
        total_words=model.corpus_total_words,
        total_examples=model.corpus_count,
        epochs=model.epochs,  # train() requires an explicit epochs count
    )

I'm receiving strange log messages like this:

gensim.models.word2vec:INFO 2021-10-10 12:55:38,419 EPOCH 2 - PROGRESS: at 123.22% examples, 233055 words/s, in_qsize -1, out_qsize 1

@piskvorky
Owner

piskvorky commented Oct 10, 2021

I'm receiving strange log messages like this:

This is still for the case where your settings.corpus_path is compressed?

If so, nonsensical log statistics are expected: the (correct) model.X stats created in build_vocab() do not match the (incorrect) stats that train() calculates from the mishandled binary stream.

Will look into it (one option is to use https://github.com/ahupp/python-magic, but I'm not sure if we need an extra dependency just for such a small use-case, what do you think?).

Yeah, extra dependencies are not great. Simply looking at the file extension (using built-in mimetypes) of the corpus_file argument might cover most use-cases – including yours – at near-zero cost.
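
A minimal sketch of the extension-based check suggested here, using only the standard library. `mimetypes.guess_type` returns a `(type, encoding)` pair, and the encoding slot is filled in for common compression suffixes such as `.gz` and `.bz2`. The helper name `compression_of` is hypothetical and not part of Gensim.

```python
import mimetypes


def compression_of(path):
    """Return the compression encoding implied by the file name, or None.

    Hypothetical helper: it looks only at the extension, never at the content.
    """
    _mime_type, encoding = mimetypes.guess_type(path)
    return encoding


print(compression_of("/my/corpus.txt.bz2"))
print(compression_of("/my/corpus.txt"))
```

Because only the name is inspected, a compressed file saved without a telltale extension would slip through; that is the "good enough" trade-off at near-zero cost.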

@dchaplinsky
Contributor Author

dchaplinsky commented Oct 10, 2021

This is still for the case where your settings.corpus_path is compressed?

No, it isn't :) Everything is uncompressed now :)

Yeah, extra dependencies are not great. Simply looking at the file extension (using built-in mimetypes)

Okay, deal.

@gojomo
Collaborator

gojomo commented Oct 20, 2021

Notably, the difficulty with using arbitrary compressed files for train() in the corpus_file path is that the code requires random seeks to separate parts of the file, which are tricky in compressed formats without either (a) extra support up-front; or (b) lots of extra decompression from the front.

Obviously it shouldn't fail this way - with no clean error message, and nonsensical log lines based on the data seen by train() not matching the tallies collected in build_vocab() (which has no problem using compressed data, because it only needs one front-to-back pass with no seeks).
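
The mismatch between the two steps can be reproduced without Gensim. The following sketch (standard library only, with a made-up toy corpus) counts lines once through a decompressing reader, as build_vocab() effectively does, and once over the raw bytes, as the corpus_file training path effectively does.

```python
import bz2
import os
import tempfile

# Toy corpus: 1000 short sentences, one per line.
text = "".join(f"sentence number {i} of the corpus\n" for i in range(1000))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt.bz2")
    with bz2.open(path, "wt", encoding="utf-8") as fout:
        fout.write(text)

    # What a decompressing reader sees: the original sentences.
    with bz2.open(path, "rt", encoding="utf-8") as fin:
        decompressed_lines = sum(1 for _ in fin)

    # What a reader of the raw bytes sees: compressed binary data,
    # split wherever a newline byte happens to occur.
    with open(path, "rb") as fin:
        raw_lines = sum(1 for _ in fin)

print(decompressed_lines, raw_lines)
```

The two counts disagree, which is why progress figures computed against the build_vocab() tallies stop making sense during training.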

Fixes could include one or more of:

  • make train() warn or fail if supplied a compressed file
  • turn off build_vocab()'s auto-decompressing, so at least both steps see the same raw data - and users hit the problem earlier. (Can smart_open toggle off the magic decompress-if-compressed feature?)
  • extend corpus_file mode to accept a list of balanced files, 1 (or N) per worker thread, so that each thread can just read a full file, even if compressed, front-to-back without seeks or seek-workarounds. Provide tools to split/join such balanced corpus segments as an extra step.
  • start the process of deprecating corpus_file if something like the latest proposal-to-narrow-the-GIL (see here) seems likely to land in a future 3.X Python and makes the corpus_iterable approach competitive with corpus_file. This would eliminate the duplication (& perhaps other bugs/gaps; see "Number of Sentences in corpusfile don't match trained sentences" #2693, "Doc2vec corpus_file mode skips some documents during training" #2757, and "there is no log when i use word2vec by corpus_file" #2342) in the code for corpus_file.
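
As a rough illustration of the third option in the list above, a splitting tool could look like the sketch below. The names `split_corpus` and `read_part` are hypothetical, nothing like this exists in Gensim, and round-robin assignment balances the parts by line count only, not by word count.

```python
import bz2
import os
import tempfile


def split_corpus(lines, out_dir, n_parts):
    """Write `lines` round-robin into `n_parts` bz2 files and return their paths."""
    paths = [os.path.join(out_dir, f"part-{i:03d}.txt.bz2") for i in range(n_parts)]
    outputs = [bz2.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        for i, line in enumerate(lines):
            outputs[i % n_parts].write(line.rstrip("\n") + "\n")
    finally:
        for out in outputs:
            out.close()
    return paths


def read_part(path):
    """One worker's job: a single front-to-back pass over its own file."""
    with bz2.open(path, "rt", encoding="utf-8") as fin:
        return [line.split() for line in fin]


with tempfile.TemporaryDirectory() as tmp:
    paths = split_corpus(["a b c", "d e", "f", "g h"], tmp, n_parts=2)
    counts = [len(read_part(p)) for p in paths]

print(counts)
```

Each worker would then need only sequential reads of its own part, so no seeking inside a compressed stream is required.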

@dchaplinsky
Contributor Author

Well, I'm planning to implement the first idea, simply because I don't have enough knowledge of the gensim internals to play with the rest.

On a separate (but somewhat related) note, we've implemented a script on top of gensim to perform grid training using various combinations of hyperparams. Maybe it'll be useful for gensim (as an example for the docs or something).

@dchaplinsky
Contributor Author

@piskvorky sorry for disturbing you again.

I've checked the known mime types on my laptop and the relevant part looks like this:

 '.csv': 'text/csv',
 '.html': 'text/html',
 '.htm': 'text/html',
 '.txt': 'text/plain',
 '.bat': 'text/plain',
 '.c': 'text/plain',
 '.h': 'text/plain',
 '.ksh': 'text/plain',
 '.pl': 'text/plain',

It seems to me that it's not a very good idea to rely on mimetypes (the mapping might differ greatly between machines).
Also, bat/c/h/ksh/pl don't seem relevant to me. On the other hand, types like md (markdown) and (probably) csv/tsv should be included.

My proposal is to do a simple validation of the extracted extension against a list of file extensions (or to read the first 100 bytes of the file) to determine whether it's binary or textual.
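
One way to read "the first 100 bytes" is to compare them against the leading signatures of known compressed formats. The sketch below covers only gzip and bzip2; the helper names are hypothetical, and this is not necessarily what the eventual fix does.

```python
import bz2
import gzip

# Leading "magic" bytes of two common compressed formats.
MAGIC_PREFIXES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
}


def sniff_compression(head):
    """Return the format name if `head` starts with a known signature, else None."""
    for prefix, name in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return name
    return None


def sniff_file(path, n_bytes=100):
    """Apply sniff_compression() to the first `n_bytes` bytes of a file."""
    with open(path, "rb") as fin:
        return sniff_compression(fin.read(n_bytes))


print(sniff_compression(bz2.compress(b"some text")))
print(sniff_compression(gzip.compress(b"some text")))
print(sniff_compression(b"some text"))
```

Unlike an extension check, this catches a compressed file with a misleading name, at the cost of opening the file once more.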

@piskvorky
Owner

piskvorky commented Oct 25, 2021

Sure – something like https://github.com/ahupp/python-magic might be handy.

I have a battle-tested implementation of is_plaintext() + sniff_encoding() in PII Tools that doesn't rely on any external libraries. It's not open source, unfortunately, but I can try to offer you tips where I can, if necessary.

A simple MIME-type check of file extension seems "good enough" too, although less robust.

@gojomo
Collaborator

gojomo commented Oct 25, 2021

I'd think the safest/most-urgent approach would check for, & warn about, the common error of supplying a known-compressed format file - either by extension or some other heuristic. (Perhaps just: the exact same heuristic that causes smart_open to auto-decompress, and thus for things to seem successful through the build_vocab() step.)

Actual good plain-text data could come in a file with any extension, and nearly any 1st 4 'magic' bytes - so trying to be strict according to other assumptions could be bad.

Other potential heuristics:

  • if you can turn off smart_open's auto-decompression, the resulting 'vocabulary' (of binary junk) is likely to look very different from any real data, in (say) the number of CTRL-characters/unprintable-characters in tokens, or perhaps in other statistical measures (of token length, etc).
  • alternatively, just scanning 1st N bytes of a file for CTRL-chars might work fine (unless some 2-byte/4-byte encodings use them as prefixes/escapes, of which I'm not sure)
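
The second heuristic above might be sketched as follows. The function names, the 4096-byte window, and the 1% threshold are all assumptions chosen for illustration, not tuned values.

```python
# ASCII control bytes, excluding tab, line feed and carriage return.
CONTROL_BYTES = frozenset(range(32)) - {9, 10, 13}


def control_byte_ratio(data):
    """Fraction of bytes in `data` that are unexpected ASCII control characters."""
    if not data:
        return 0.0
    return sum(b in CONTROL_BYTES for b in data) / len(data)


def looks_binary(path, n_bytes=4096, threshold=0.01):
    """Hypothetical heuristic: flag a file whose first bytes hold many control characters."""
    with open(path, "rb") as fin:
        return control_byte_ratio(fin.read(n_bytes)) > threshold


utf8_ratio = control_byte_ratio("plain UTF-8 text: привіт\n".encode("utf-8"))
utf16_ratio = control_byte_ratio("hi".encode("utf-16-le"))
print(utf8_ratio, utf16_ratio)
```

UTF-8 text passes cleanly because its multi-byte sequences use only bytes of 0x80 and above. The caveat about wider encodings is real, though: ASCII characters encoded as UTF-16 carry NUL bytes, so such a file would be flagged as binary.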

@piskvorky
Owner

piskvorky commented Oct 25, 2021

the exact same heuristic that causes smart_open to auto-decompress

This is a great point. It's in the "good enough" direction, because smart_open does rely on the file extension only.

Ideal solution would be to support compressed files in corpus_file code path too, not just in build_vocab(). But I understand that's extra work.

@dchaplinsky
Contributor Author

dchaplinsky commented Oct 25, 2021 via email

@dchaplinsky
Contributor Author

@piskvorky sorry for bugging again.

Am I correct that the only types smart_open supports by default are bz2/gz:
https://github.com/RaRe-Technologies/smart_open/blob/35d80d3bec5324c19427ce49fa8284f5b1c2c112/smart_open/compression.py#L146

I quickly grepped through gensim repo for register_compressor and found nothing on top of those default compressors.
So I'll simply rely on https://github.com/RaRe-Technologies/smart_open/blob/35d80d3bec5324c19427ce49fa8284f5b1c2c112/smart_open/compression.py#L33 and I'm fine?

@piskvorky
Owner

I think so. @mpenkov WDYT?

@gojomo
Collaborator

gojomo commented Oct 26, 2021

A simple warning when the affected file-extensions (as per that smart_open list) are detected seems a quick & easy improvement.
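
A minimal sketch of such a warning is shown below. It hard-codes the two extensions discussed in this thread as an assumption; a real implementation would presumably take the list from smart_open rather than duplicating it. The function name `warn_if_compressed` is hypothetical.

```python
import os
import warnings

# Assumption: mirrors the bz2/gz defaults discussed in this thread.
COMPRESSED_EXTENSIONS = (".bz2", ".gz")


def warn_if_compressed(corpus_file):
    """Warn when `corpus_file` has a known-compressed extension.

    Hypothetical check; returns True if a warning was issued.
    """
    _, ext = os.path.splitext(corpus_file)
    if ext.lower() in COMPRESSED_EXTENSIONS:
        warnings.warn(
            f"{corpus_file} looks compressed; the corpus_file training path "
            "reads raw bytes, so decompress the file first.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling `warn_if_compressed("/my/corpus.txt.bz2")` at the top of the training entry point would surface the problem before any vectors are trained on binary junk.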

Ideal solution would be to support compressed files in corpus_file code path too, not just in build_vocab(). But I understand that's extra work.

Given the duplication, limitations, & bugs in the corpus_file path (per my earlier comment), I'd still suggest that an even-more-ideal solution would improve the corpus_iterable path to be performance-competitive with corpus_file, then deprecate corpus_file entirely.

@dchaplinsky
Contributor Author

Ok, here is my humble attempt.
Just let me know if it should be warning (how do I assert for warnings?) or exception.
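
On asserting warnings: with the standard `unittest` module, `assertWarns` is the usual tool, and `warnings.catch_warnings(record=True)` covers the "no warning expected" case. The sketch below uses a stand-in function, `check_corpus_file`, which is hypothetical and not the code from the PR.

```python
import unittest
import warnings


def check_corpus_file(path):
    """Stand-in for the code under test (hypothetical, not a Gensim function)."""
    if path.endswith((".bz2", ".gz")):
        warnings.warn("corpus_file looks compressed", UserWarning)


class TestCompressedCorpusWarning(unittest.TestCase):
    def test_warns_on_compressed_file(self):
        # assertWarns fails the test unless a matching warning is issued.
        with self.assertWarns(UserWarning):
            check_corpus_file("/my/corpus.txt.bz2")

    def test_silent_on_plain_text(self):
        # Record every warning and check that none was issued.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            check_corpus_file("/my/corpus.txt")
        self.assertEqual(caught, [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCompressedCorpusWarning)
result = unittest.TextTestRunner().run(suite)
```

If the check ends up raising an exception instead, `assertRaises` is the direct counterpart.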

@piskvorky
Owner

piskvorky commented Oct 27, 2021

even-more-ideal solution would improve the corpus_iterable path to be performance-competitive with corpus_file

Indeed. But we don't know how to do that.

@gojomo
Collaborator

gojomo commented Oct 27, 2021

even-more-ideal solution would improve the corpus_iterable path to be performance-competitive with corpus_file

Indeed. But we don't know how to do that.

Not exactly! But we also don't know how exactly to let corpus_file work on one or more compressed files, nor the fixes for its outstanding bugs.
I suspect the strategy of writing an indexes-only corpus to a file that's then mmapped for multithread access, which I suggested when corpus_file was in progress (1st here & with more details here), would nearly match (& possibly exceed!) the corpus_file training throughput.

I keep mentioning this because, if others agree there's a fair chance it'd work, it may not be much more total effort than designing/documenting/debugging extra conventions/steps for compressed/multifile corpus_file. But if a corpus_iterable breakthrough were achieved, it would then render any such corpus_file fixes/improvements superfluous. In fact, lots of duplication/complexity/special-casing could then be discarded... and even supporting a backward-compatible corpus_file parameter might then just be a small wrapper re-feeding the contents of that file back through the improved corpus_iterable path.
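
A toy illustration of the general idea of an indexes-only, memory-mapped corpus follows. It is not the design from the linked comments (those links are not preserved here), and every name in it is an assumption: tokens are replaced by integer indexes, written to a flat binary file, and worker threads then decode their own sentence ranges straight from the mmap.

```python
import array
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

sentences = [["the", "cat", "sat"], ["the", "dog"], ["a", "cat", "ran", "off"], ["a", "dog", "sat"]]

# Map each token to an integer index (a stand-in for the model's vocabulary).
vocab = {}
for sentence in sentences:
    for token in sentence:
        vocab.setdefault(token, len(vocab))

# Flat array of token indexes, plus the start offset of every sentence.
tokens = array.array("I", (vocab[t] for s in sentences for t in s))
offsets = [0]
for sentence in sentences:
    offsets.append(offsets[-1] + len(sentence))


def read_sentences(mm, first, last):
    """Decode sentences [first, last) from the mmapped index file."""
    size = tokens.itemsize
    out = []
    for i in range(first, last):
        chunk = array.array("I")
        chunk.frombytes(mm[offsets[i] * size:offsets[i + 1] * size])
        out.append(chunk.tolist())
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.idx")
    with open(path, "wb") as fout:
        tokens.tofile(fout)
    with open(path, "rb") as fin, mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Each worker thread decodes its own range of sentences by offset.
        with ThreadPoolExecutor(max_workers=2) as pool:
            halves = list(pool.map(lambda r: read_sentences(mm, *r), [(0, 2), (2, 4)]))

decoded = halves[0] + halves[1]
print(decoded)
```

Because the file holds fixed-width integers, any worker can jump to any sentence by arithmetic on offsets, regardless of how the original text was stored or compressed.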
