Partial support of compressed corpora in FastText model #3246

dchaplinsky · 2021-10-09T11:31:27Z

Problem description

Seems that vocab part of the FastText model are totally fine with compressed files. I.e you can pass bzipped file to build_vocab method and be perfectly fine. Until you try to train the model on the same file. It seems that train_epoch_sg/train_epoch_cbow (or more precisely, CythonLineSentence) doesn't support the same logic and only can accept text files.

Nevertheless you can pass bzip file to train method and it won't object at all. It'll even train on it. But as you can imagine, results are peculiar

This is a bit misleading. While I understand why it happens it took me sometime to debug why my vectors are rubbish.

Steps/code/corpus to reproduce

    model = FastText()
    model.build_vocab(corpus_file="/my/corpus.txt.bz2")

    model.train(
        corpus_file="/my/corpus.txt.bz2",
        total_words=model.corpus_total_words,
        total_examples=model.corpus_count,
    )

Versions

Please provide the output of:

macOS-11.5.2-x86_64-i386-64bit
Python 3.8.12 (default, Aug 31 2021, 04:09:21)
[Clang 12.0.5 (clang-1205.0.22.9)]
Bits 64
NumPy 1.21.2
SciPy 1.7.1
gensim 4.1.2
FAST_VERSION 0

The text was updated successfully, but these errors were encountered:

piskvorky · 2021-10-09T16:48:54Z

Thanks for reporting. Yes, the optimized/compiled corpus_file code path does not support compressed inputs, unlike the rest of Gensin (which supports compressed inputs transparently, thanks to its use of smart_open).

What do you suggest the expected outcome for corpus_file should be?

dchaplinsky · 2021-10-10T08:59:29Z

Hello Radim and thanks for the prompt reply!

Ideal solution for me is of course support of compressed files in compiled code (actually I just bought 2TB of space for my google colab exactly because of this :)). Working with compressed files might be faster, especially on slow io, when reading is a bottleneck.

But I understand, that probably this might be too much of burden to add the proper support (or at least to do it now).
An alternative solution might be a warning that is issued by the compiled reader when reading (presumably) binary files and a note in the documentation

piskvorky · 2021-10-10T12:28:16Z

An alternative solution might be a warning that is issued by the compiled reader when reading (presumably) binary files and a note in the documentation

Okay, thanks. Are you able to open a PR?

dchaplinsky · 2021-10-10T13:53:01Z

I do not have a solution yet (otherwise I'd opening PR rather than reporting an issue). Will look into it (one option is to use https://github.com/ahupp/python-magic, but I'm not sure if we need an extra dependency just for such a small use-case, what do you think?).

On a similar note, when I'm initializing training like this:

    model = FastText()

    logger.info("Building vocabulary...")
    model.build_vocab(corpus_file=str(settings.corpus_path))

    model.train(
        corpus_file=str(settings.corpus_path),
        total_words=model.corpus_total_words,
        total_examples=model.corpus_count,
    )

I'm receiving strange log messages like this:

gensim.models.word2vec:INFO 2021-10-10 12:55:38,419 EPOCH 2 - PROGRESS: at 123.22% examples, 233055 words/s, in_qsize -1, out_qsize 1

piskvorky · 2021-10-10T14:02:30Z

I'm receiving strange log messages like this:

This is still for the case where your settings.corpus_path is compressed?

If so, nonsensical log statistics are expected: the (correct) model.X stats created in build_vocab() do not match the (incorrect) stats that train() calculates from the mishandled binary stream.

Will look into it (one option is to use https://github.com/ahupp/python-magic, but I'm not sure if we need an extra dependency just for such a small use-case, what do you think?).

Yeah, extra dependencies are not great. Simply looking at the file extension (using built-in mimetypes) of the corpus_file argument might cover most use-cases – including yours – at near-zero cost.

dchaplinsky · 2021-10-10T14:06:13Z

This is still for the case where your settings.corpus_path is compressed?

No, I'm not :) Everything is uncompressed now :)

Yeah, extra dependencies are not great. Simply looking at the file extension (using built-in mimetypes)

Okay, deal.

gojomo · 2021-10-20T17:30:44Z

Notably, the difficulty with using arbitrary compressed files for train() in the corpus_file path is that code requires random-seeks to separate parts of the file, which are tricky in compressed formats without either (a) extra support up-front; or (b) lots of extra decompression from the front.

Obviously it shouldn't fail this way - with no clean error message, and nonsensical log lines based on the data seen by train() not matching the tallies collected in build_vocab() (which has no problem using compressed data, because it only needs one front-to-back pass with no seeks).

Fixes could include one or more of:

make train() warn or fail if supplied a compressed file
turn off build_vocab()'s auto-decompressing, so at least both steps see the same raw data - and users hit the problem earlier. (Can smart_open toggle off the magic decompress-if-compressed feature?)
extend corpus_file mode to accept a list of balanced files, 1 (or N) per worker thread, so that each thread can just read a full files, even if compressed, front-to-end without seeks or seek-workarounds. Provide tools to split/join such balanced corpus segments as an extra step.
start the process of deprecating corpus_file if something like the latest proposal-to-narrow-the-GIL (see here seems likely to land in a future 3.X Python, and makes the corpus_iterable approach competitive with corpus_file. This would eliminate the duplication (& perhaps other bugs/gaps, see Number of Sentences in corpusfile don't match trained sentences. #2693, Doc2vec corpus_file mode skips some documents during training #2757, there is no log when i use word2vec by corpus_file #2342) in the code for corpus_file.

dchaplinsky · 2021-10-20T20:49:42Z

Well, I'm planning to implement the first idea, simply because I don't have enough knowledge in the gensim internals to play with the rest.

On a separate (but somewhat) related note, we've implemented a script on top of gensim to perform a grid training using various combinations of hyperparams. Maybe it'll be useful for gensim (as an example for docs or something).

dchaplinsky · 2021-10-25T12:27:32Z

@piskvorky sorry for disturbing you again.

I've checked the known mime types on my laptop and relevant part of it looks like this:

 '.csv': 'text/csv',
 '.html': 'text/html',
 '.htm': 'text/html',
 '.txt': 'text/plain',
 '.bat': 'text/plain',
 '.c': 'text/plain',
 '.h': 'text/plain',
 '.ksh': 'text/plain',
 '.pl': 'text/plain',

It seems to me that it's not a very good idea to rely on mimetypes (the mapping might differ greatly on different machines).
Also, bat/c/h/ksp/pl doesn't seems relevant to me. On the other hand, types like md (markdown) and (probably) csv/tsv should be included.

My proposal is to make a simple validation on extracted extension against the list of file extensions (or read first 100 bytes of the file) to determine if it's binary or textual.

piskvorky · 2021-10-25T12:45:43Z

Sure – something like https://github.com/ahupp/python-magic might be handy.

I have a battle-tested implementation if is_plaintext() + sniff_encoding() in PII Tools that doesn't rely on any external libraries. It's not open source unfortunately, but I can try to offer you tips where I can, if necessary.

A simple MIME-type check of file extension seems "good enough" too, although less robust.

gojomo · 2021-10-25T18:32:16Z

I'd think the safest/most-urgent approach would check for, & warn about, the common error of supplying a known-compressed format file - either by extension or some other heuristic. (Perhaps just: the exact same heuristic that causes smart_open to auto-decompress, and thus for things to seem successful through the build_vocab() step.)

Actual good plain-text data could come in a file with any extension, and nearly any 1st 4 'magic' bytes - so trying to be strict according to other assumptions could be bad.

Other potential heuristics:

if you can turn off smart_open's auto-decompression, the resulting 'vocabulary' (of binary junk) is likely to look very different from any real data, in (say) the number of CTRL-characters/unprintable-characters in tokens, or perhaps in other statistical measures (of token length, etc).
alternatively, just scanning 1st N bytes of a file for CTRL-chars might work fine (unless some 2-byte/4-byte encodings use them as prefixes/escapes, of which I'm not sure)

piskvorky · 2021-10-25T18:59:56Z

the exact same heuristic that causes smart_open to auto-decompress

This is a great point. It's in the "good enough" direction, because smart_open does rely on the file extension only.

Ideal solution would be to support compressed files in corpus_file code path too, not just in build_vocab(). But I understand that's extra work.

dchaplinsky · 2021-10-25T20:55:17Z

Yes, totally agree, brilliant idea.

…

On Mon, Oct 25, 2021 at 10:00 PM Radim Řehůřek ***@***.***> wrote: the exact same heuristic that causes smart_open to auto-decompress This is a good point. It's in the "good enough" direction, because smart_open relies on the extension. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3246 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABAA4WDUQU46NLLIBVR77DUIWSLRANCNFSM5FVGPQUA> .

dchaplinsky · 2021-10-26T14:38:50Z

@piskvorky sorry for bugging again.

Am I correct, that the only types that smart_open supports by default are bz2/gz:
https://github.com/RaRe-Technologies/smart_open/blob/35d80d3bec5324c19427ce49fa8284f5b1c2c112/smart_open/compression.py#L146

I quickly grepped through gensim repo for register_compressor and found nothing on top of those default compressors.
So I'll simply rely on https://github.com/RaRe-Technologies/smart_open/blob/35d80d3bec5324c19427ce49fa8284f5b1c2c112/smart_open/compression.py#L33 and I'm fine?

piskvorky · 2021-10-26T20:30:42Z

I think so. @mpenkov WDYT?

gojomo · 2021-10-26T21:49:49Z

A simple warning when the affected file-extensions (as per that smart_open list) are detected seems a quick & easy improvement.

Ideal solution would be to support compressed files in corpus_file code path too, not just in build_vocab(). But I understand that's extra work.

Given the duplication. limitations, & bugs in the corpus_file path (per my earlier comment), I'd still suggest that an even-more-ideal solution would improve the corpus_iterable path to be performance-competitive with corpus_file, then deprecate corpus_file entirely.

dchaplinsky · 2021-10-26T22:03:19Z

Ok, here is my humble attempt.
Just let me know if it should be warning (how do I assert for warnings?) or exception.

piskvorky · 2021-10-27T07:14:29Z

even-more-ideal solution would improve the corpus_iterable path to be performance-competitive with corpus_file

Indeed. But we don't know how to do that.

gojomo · 2021-10-27T18:07:25Z

even-more-ideal solution would improve the corpus_iterable path to be performance-competitive with corpus_file

Indeed. But we don't know how to do that.

Not exactly! But we also don't know how exactly to let corpus_file work on one or more compressed files, nor the fixes for its outstanding bugs.
I suspect the strategy of writing a indexes-only corpus to a file that's then mmapped for multithread access, that I suggested when corpus_file was in progress (1st here & with more details here), would nearly match (& possibly exceed!) the corpus_file training throughput.

I keep mentioning this because if others agree there's a fair chance it'd work, it may not be much more total effort than designing/documenting/debugging extra conventions/steps for compressed/multifile corpus_file. But a corpus_iterable breakthrough achieved, it would then render any such corpus_file fixes/.improvements superfluous. In fact, lots of duplication/complexity/special-casing could then be discarded... and even supporting a backward-compatible corpus_file parameter might then just be a small wrapper re-feeding the contents of that file back through the improved corpus_iterable path.

dchaplinsky mentioned this issue Oct 26, 2021

Adding another check to _check_corpus_sanity for compressed files, adding test #3258

Merged

mpenkov closed this as completed in #3258 Dec 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial support of compressed corpora in FastText model #3246

Partial support of compressed corpora in FastText model #3246

dchaplinsky commented Oct 9, 2021

piskvorky commented Oct 9, 2021 •

edited

Loading

dchaplinsky commented Oct 10, 2021

piskvorky commented Oct 10, 2021

dchaplinsky commented Oct 10, 2021 •

edited

Loading

piskvorky commented Oct 10, 2021 •

edited

Loading

dchaplinsky commented Oct 10, 2021 •

edited

Loading

gojomo commented Oct 20, 2021 •

edited

Loading

dchaplinsky commented Oct 20, 2021

dchaplinsky commented Oct 25, 2021

piskvorky commented Oct 25, 2021 •

edited

Loading

gojomo commented Oct 25, 2021

piskvorky commented Oct 25, 2021 •

edited

Loading

dchaplinsky commented Oct 25, 2021 via email

dchaplinsky commented Oct 26, 2021

piskvorky commented Oct 26, 2021

gojomo commented Oct 26, 2021 •

edited

Loading

dchaplinsky commented Oct 26, 2021

piskvorky commented Oct 27, 2021 •

edited

Loading

gojomo commented Oct 27, 2021 •

edited

Loading

Partial support of compressed corpora in FastText model #3246

Partial support of compressed corpora in FastText model #3246

Comments

dchaplinsky commented Oct 9, 2021

Problem description

Steps/code/corpus to reproduce

Versions

piskvorky commented Oct 9, 2021 • edited Loading

dchaplinsky commented Oct 10, 2021

piskvorky commented Oct 10, 2021

dchaplinsky commented Oct 10, 2021 • edited Loading

piskvorky commented Oct 10, 2021 • edited Loading

dchaplinsky commented Oct 10, 2021 • edited Loading

gojomo commented Oct 20, 2021 • edited Loading

dchaplinsky commented Oct 20, 2021

dchaplinsky commented Oct 25, 2021

piskvorky commented Oct 25, 2021 • edited Loading

gojomo commented Oct 25, 2021

piskvorky commented Oct 25, 2021 • edited Loading

dchaplinsky commented Oct 25, 2021 via email

dchaplinsky commented Oct 26, 2021

piskvorky commented Oct 26, 2021

gojomo commented Oct 26, 2021 • edited Loading

dchaplinsky commented Oct 26, 2021

piskvorky commented Oct 27, 2021 • edited Loading

gojomo commented Oct 27, 2021 • edited Loading

piskvorky commented Oct 9, 2021 •

edited

Loading

dchaplinsky commented Oct 10, 2021 •

edited

Loading

piskvorky commented Oct 10, 2021 •

edited

Loading

dchaplinsky commented Oct 10, 2021 •

edited

Loading

gojomo commented Oct 20, 2021 •

edited

Loading

piskvorky commented Oct 25, 2021 •

edited

Loading

piskvorky commented Oct 25, 2021 •

edited

Loading

gojomo commented Oct 26, 2021 •

edited

Loading

piskvorky commented Oct 27, 2021 •

edited

Loading

gojomo commented Oct 27, 2021 •

edited

Loading