native fastText (unsupervised) in gensim #1471

Closed · prakhar2b opened this issue Jul 6, 2017 · 17 comments
@prakhar2b (Contributor)

Currently, gensim has a wrapper for fastText. As discussed here, we need to implement the training code (subword n-grams, hashing trick) for unsupervised fastText in gensim, in Python. As fastText is only a slight modification of word2vec, we will need to refactor the word2vec code to properly reuse the overlapping parts.

However, fastText outputs two files, .vec and .bin, which are the formats of the original C++ implementation. Should the Python implementation in gensim provide pickle (.pkl) output instead?

This thread is intended to discuss and streamline all the requirements and deliverables regarding native fastText in gensim.
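To make the two ideas above concrete, here is a rough, illustrative Python sketch (not gensim code) of character n-gram extraction and the hashing trick that maps each n-gram into a fixed number of buckets. The FNV-1a constants mirror the ones in Facebook's dictionary.cc; the n-gram range and bucket count are just the fastText defaults, and the real implementation has a few extra edge cases around boundary characters.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of `word`, wrapped in '<' and '>' as fastText does."""
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams


def ngram_bucket(ngram, num_buckets=2000000):
    """Hash an n-gram into one of `num_buckets` slots (FNV-1a-style hash)."""
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF  # keep it a 32-bit value
    return h % num_buckets


print(char_ngrams('where', max_n=4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
print(ngram_bucket('<wh'))  # some index in [0, num_buckets)
```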

@jayantj (Contributor) commented Jul 6, 2017

Hi @prakhar2b, let's simply use the pickle-style format from utils.SaveLoad that we already use for word2vec models to persist the fastText models to disk.
Writing out the models in .bin format is a useful feature, but it can come later.
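For illustration, a minimal sketch of that persistence approach, assuming the new class inherits (directly or via Word2Vec) from gensim's utils.SaveLoad; the class body here is a placeholder, not the real model:

```python
from gensim import utils

class FastText(utils.SaveLoad):
    """Placeholder model; the real class would also hold the vocab and vectors."""
    def __init__(self, vector_size=100):
        self.vector_size = vector_size

model = FastText()
model.save('fasttext.model')             # pickle-style persistence via SaveLoad
loaded = FastText.load('fasttext.model')
print(loaded.vector_size)                # 100
```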

Also, as discussed, please look at the word2vec code in detail, figure out what is needed for fastText, formulate a clear plan of action and post it here. It should contain details about -

  1. Class structure
    • Do we subclass Word2Vec, create a common base class for Word2Vec and FastText, or use composition? (A minimal sketch of the subclassing option follows this comment.)
  2. Refactoring/code reuse
    • Analyze whether it makes sense to reuse code from the word2vec training methods (I think it would, there should be a lot of overlap)
    • Analyze which methods can be reused, and how (or whether) they would have to be refactored to make them reusable
  3. Integrating with the existing FastText wrapper
    • For actual use of the model (word similarity etc.), we should be able to reuse a lot of code from the FastText wrapper
  4. API of the new class (IMO, it's going to be pretty much the same as the FastText wrapper class)

IMO, this design process is just as challenging and important as writing the code itself, and it would be good if you spent a good amount of time to come up with a clear plan.
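As a reference point for the class-structure question, a minimal sketch of what the subclassing option (1) could look like. All names and hooks below are hypothetical and only illustrate the shape of the design, not the final gensim API:

```python
from gensim.models.word2vec import Word2Vec

class FastText(Word2Vec):
    """Hypothetical subclass: word2vec training loop plus subword n-gram buckets."""

    def __init__(self, sentences=None, min_n=3, max_n=6, bucket=2000000, **kwargs):
        self.min_n = min_n      # smallest character n-gram length
        self.max_n = max_n      # largest character n-gram length
        self.bucket = bucket    # number of hash buckets for n-gram vectors
        # Word2Vec.__init__ builds the vocab and trains if sentences are given,
        # which would go through the overridden build_vocab below.
        super(FastText, self).__init__(sentences=sentences, **kwargs)

    def build_vocab(self, sentences, **kwargs):
        # Reuse word2vec's vocabulary scan, then set up the n-gram buckets.
        super(FastText, self).build_vocab(sentences, **kwargs)
        self._init_ngram_vectors()

    def _init_ngram_vectors(self):
        # Hypothetical helper: allocate `self.bucket` extra vector rows for
        # the hashed n-grams (omitted here).
        pass
```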

@piskvorky (Owner) commented Jul 7, 2017

Awesome feature! Let me add that having FastText in gensim will open up other unsupervised possibilities, such as sent2vec in #1376.

@gojomo (Collaborator) commented Jul 9, 2017

Even though the gensim mission is more 'unsupervised', the addition of known-labels in the FastText-for-classification mode is such a small delta I would suggest it be in-scope. It's really just adding in another kind of known-data, during training, as possible 'target' outputs of the internal NN, that may make the resulting vectors better. (Potentially, even if not then using the resulting word-vecs for the exact same classification problem, the inclusion of these extra targets during training may have made the word-vecs better for other tasks.)

Also, it will otherwise be a constant exception-to-be-mentioned, in docs/support: "Yes gensim implements FastText except not FastText mode X".

@dsouzadaniel

Is this issue still open?

@prakhar2b (Contributor, Author)

@dsouzadaniel yes, this is part of an ongoing Google Summer of Code project.

@prakhar2b (Contributor, Author) commented Jul 11, 2017

I looked further into the fastText and word2vec code, and this is how I plan to approach it:

  1. Class structure / code reuse

As fastText is a slight modification of word2vec, we will mostly be reusing the word2vec training code with very slight modifications. So, I think we should create two modules: one holding the common code moved out of word2vec (it's better to decouple word2vec and fastText as much as possible, IMO), and a second, fasttext.py, for fastText-specific code like subword n-grams and the hashing trick.

The training code from fasttext.cc / model.cc is very similar to code in word2vec.py, such as the train_batch_cbow and train_batch_sg functions and the sampling functions; the main difference is how the input vector is composed from subword vectors (a rough sketch of that follows this comment). The n-gram code from dictionary.cc / matrix.cc needs to be written in Python in fasttext.py.

  2. Integrating with the existing fastText wrapper

IMO, it would be better to move the Python code (for loading, the hashing trick, etc.) from the wrapper into the native fastText module, and then import it in the wrapper, rather than the other way around.

  3. API

I think the API should be similar to word2vec's; something like this should work:

```python
model = FastText(sentences, model, size, window, ...)
model.wv['example']  # similar to model.wv in word2vec
model.save(fname)
model = FastText.load(fname)
```

cc @jayantj @piskvorky
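A rough sketch of the "very slight modification" mentioned above: in the skip-gram/CBOW step, the input representation for a word is no longer a single row but the average of its word vector and its n-gram bucket vectors, with the gradient then spread back over all of those rows. Array and variable names here are illustrative, not the actual gensim internals:

```python
import numpy as np

def fasttext_input_vector(word_index, ngram_bucket_indices, syn0_vocab, syn0_ngrams):
    """Compose the fastText-style input vector for one word.

    syn0_vocab  -- matrix of per-word vectors (as in word2vec)
    syn0_ngrams -- matrix of per-bucket n-gram vectors (the fastText addition)
    """
    rows = np.vstack([syn0_vocab[word_index], syn0_ngrams[ngram_bucket_indices]])
    return rows.mean(axis=0)  # fastText averages the word row and the subword rows

# Toy usage with random matrices of dimensionality 4:
rng = np.random.RandomState(0)
syn0_vocab = rng.rand(10, 4)
syn0_ngrams = rng.rand(100, 4)
vec = fasttext_input_vector(3, [17, 42, 99], syn0_vocab, syn0_ngrams)
print(vec.shape)  # (4,)
```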

@piskvorky (Owner)

Sounds good -- it's a good idea to start with a PR that shows the new proposed package structure and refactoring. In clear (unoptimized) Python to start with, for concept clarity and to make discussions easier.

What is that model.ft in 3. API though? I'd prefer not to use obscure acronyms / variable names, unless it's really standard terminology. Isn't there something more descriptive?

@prakhar2b (Contributor, Author)

@piskvorky oh, model.ft was a mistake. I thought the wv in model.wv stood for word2vec.

@prakhar2b (Contributor, Author)

@gojomo yes, regarding fastText supervised classification, I think we should later incorporate labeledw2v #1153 into the fastText implementation from this PR. Currently, just like Facebook's implementation, gensim's fastText will have two modes, skipgram and cbow; we will add a supervised mode later with labeledw2v (in a different PR, maybe). This is the plan as of now.

@piskvorky (Owner) commented Jul 13, 2017

Oh, I see. I'd say naming the variable wv was also unfortunate (word_vectors better).

@menshikh-iv how about we change the name to word_vectors, in all documentation, but keep wv as an alias to word_vectors too, for backward compatibility?

@piskvorky (Owner) commented Jul 13, 2017

@prakhar2b People are reporting segfaults and limitations of the FB fastText implementation (e.g. how to continue training). A clean, flexible, supported implementation in Python is long overdue, I'd say :)

@menshikh-iv (Contributor)

@piskvorky yes, we can do this. Do you think the abbreviation wv is confusing our users?

@piskvorky (Owner)

I think so, yes. At least it is to me, and I am a user too :)

@gojomo (Collaborator) commented Jul 13, 2017

Re: un-abbreviating wv

To fully communicate genericness across all uses, the property could also be called token_vectors. Depending on the general style preferences for/against any abbreviations, it could be wordvecs or tokenvecs. (For Doc2Vec the very KeyedVectors-like subcomponent that holds the doc-vectors is named docvecs.)

Aliases may need to be handled carefully given the SaveLoad/pickling approach, both across versions and to prevent duplicate writing of the same info. (Though perhaps, the syn0 -> wv.syn0 changes already paved the way for that.)

@piskvorky (Owner) commented Jul 14, 2017

Good point on being careful with pickling! (Although I think (un)pickling handles such references correctly, it's worth double-checking.)

Possible alternatives: token_vectors, word_vectors, vectors (more generic/universal?), embeddings...?
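For the alias idea discussed above, a minimal sketch of one way to avoid duplicating state under pickling/SaveLoad: expose word_vectors as a property that simply forwards to wv. Because a property lives on the class rather than in the instance __dict__, pickling the instance stores the underlying KeyedVectors only once. The class and attribute below are stand-ins, not the real gensim code:

```python
class Word2Vec(object):                  # stand-in for the real gensim class
    def __init__(self, keyed_vectors):
        self.wv = keyed_vectors          # the KeyedVectors instance in real code

    @property
    def word_vectors(self):
        """More descriptive alias for `wv`; no extra state ends up pickled."""
        return self.wv

    @word_vectors.setter
    def word_vectors(self, value):
        self.wv = value


model = Word2Vec(keyed_vectors={'example': [0.1, 0.2]})  # toy stand-in object
assert model.word_vectors is model.wv                    # same object, not a copy
```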

@menshikh-iv (Contributor) commented Aug 14, 2017

Current PR for this is #1525

@menshikh-iv (Contributor)

Resolved in #1525
