native fastText (unsupervised) in gensim #1471

Closed · prakhar2b opened this issue Jul 6, 2017 · 17 comments
@prakhar2b (Contributor)

Currently, gensim has a wrapper for fastText. As discussed here, we need to implement the training code (subword n-grams, hashing trick) for unsupervised fastText in gensim, in Python. As fastText is only a slight modification of word2vec, we will need to refactor the word2vec code to properly reuse the overlapping parts.

However, fastText outputs two files, .vec and .bin, which are the formats of the original C++ implementation. Should the Python implementation in gensim provide pickle (.pkl) output instead?

This thread is intended to discuss and streamline all the requirements and deliverables regarding native fastText in gensim.
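To make the two ideas above concrete, here is a rough, illustrative Python sketch (not gensim code) of character n-gram extraction and the hashing trick that maps each n-gram into a fixed number of buckets. The FNV-1a constants mirror the ones in Facebook's dictionary.cc; the n-gram range and bucket count are just the fastText defaults, and the real implementation has a few extra edge cases around boundary characters.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of `word`, wrapped in '<' and '>' as fastText does."""
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams


def ngram_bucket(ngram, num_buckets=2000000):
    """Hash an n-gram into one of `num_buckets` slots (FNV-1a-style hash)."""
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF  # keep it a 32-bit value
    return h % num_buckets


print(char_ngrams('where', max_n=4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
print(ngram_bucket('<wh'))  # some index in [0, num_buckets)
```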

@jayantj (Contributor) commented Jul 6, 2017

Hi @prakhar2b, let's simply use the pickle-style format from utils.SaveLoad that we already use for word2vec models to persist the fastText models to disk.
Writing out the models in .bin format is a useful feature, but it can come later.
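For illustration, a minimal sketch of that persistence approach, assuming the new class inherits (directly or via Word2Vec) from gensim's utils.SaveLoad; the class body here is a placeholder, not the real model:

```python
from gensim import utils

class FastText(utils.SaveLoad):
    """Placeholder model; the real class would also hold the vocab and vectors."""
    def __init__(self, vector_size=100):
        self.vector_size = vector_size

model = FastText()
model.save('fasttext.model')             # pickle-style persistence via SaveLoad
loaded = FastText.load('fasttext.model')
print(loaded.vector_size)                # 100
```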

Also, as discussed, please look at the word2vec code in detail, figure out what is needed for fastText, formulate a clear plan of action and post it here. It should contain details about -

  1. Class structure
    • Do we subclass Word2Vec, create a common base class for Word2Vec and FastText, or use composition? (A minimal sketch of the subclassing option follows this comment.)
  2. Refactoring/code reuse
    • Analyze whether it makes sense to reuse code from the word2vec training methods (I think it would, there should be a lot of overlap)
    • Analyze which methods can be reused, and how (or whether) they would have to be refactored to make them reusable
  3. Integrating with the existing FastText wrapper
    • For actual use of the model (word similarity etc.), we should be able to reuse a lot of code from the FastText wrapper
  4. API of the new class (IMO, it's going to be pretty much the same as the FastText wrapper class)

IMO, this design process is just as challenging and important as writing the code itself, and it would be good if you spent a good amount of time to come up with a clear plan.
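As a reference point for the class-structure question, a minimal sketch of what the subclassing option (1) could look like. All names and hooks below are hypothetical and only illustrate the shape of the design, not the final gensim API:

```python
from gensim.models.word2vec import Word2Vec

class FastText(Word2Vec):
    """Hypothetical subclass: word2vec training loop plus subword n-gram buckets."""

    def __init__(self, sentences=None, min_n=3, max_n=6, bucket=2000000, **kwargs):
        self.min_n = min_n      # smallest character n-gram length
        self.max_n = max_n      # largest character n-gram length
        self.bucket = bucket    # number of hash buckets for n-gram vectors
        # Word2Vec.__init__ builds the vocab and trains if sentences are given,
        # which would go through the overridden build_vocab below.
        super(FastText, self).__init__(sentences=sentences, **kwargs)

    def build_vocab(self, sentences, **kwargs):
        # Reuse word2vec's vocabulary scan, then set up the n-gram buckets.
        super(FastText, self).build_vocab(sentences, **kwargs)
        self._init_ngram_vectors()

    def _init_ngram_vectors(self):
        # Hypothetical helper: allocate `self.bucket` extra vector rows for
        # the hashed n-grams (omitted here).
        pass
```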

@piskvorky (Owner) commented Jul 7, 2017

Awesome feature! Let me add that having FastText in gensim will open up other unsupervised possibilities, such as sent2vec in #1376.

@gojomo (Collaborator) commented Jul 9, 2017

Even though the gensim mission is more 'unsupervised', the addition of known-labels in the FastText-for-classification mode is such a small delta I would suggest it be in-scope. It's really just adding in another kind of known-data, during training, as possible 'target' outputs of the internal NN, that may make the resulting vectors better. (Potentially, even if not then using the resulting word-vecs for the exact same classification problem, the inclusion of these extra targets during training may have made the word-vecs better for other tasks.)

Also, it will otherwise be a constant exception-to-be-mentioned, in docs/support: "Yes gensim implements FastText except not FastText mode X".

@dsouzadaniel

Is this issue still open?

@prakhar2b (Contributor, Author)

@dsouzadaniel yes, this is part of an ongoing Google Summer of Code project.

@prakhar2b (Contributor, Author) commented Jul 11, 2017

I looked further into the fastText and word2vec code, and this is how I plan to approach it:

  1. Class structure / code reuse

As fastText is a slight modification of word2vec, we will mostly be reusing the word2vec training code with very slight modifications. So, I think we should create two modules: one holding the common code moved out of word2vec (it's better to decouple word2vec and fastText as much as possible, IMO), and a second, fasttext.py, for fastText-specific code like subword n-grams and the hashing trick.

The training code from fasttext.cc / model.cc is very similar to code in word2vec.py, such as the train_batch_cbow and train_batch_sg functions and the sampling functions; the main difference is how the input vector is composed from subword vectors (a rough sketch of that follows this comment). The n-gram code from dictionary.cc / matrix.cc needs to be written in Python in fasttext.py.

  2. Integrating with the existing fastText wrapper

IMO, it would be better to move the Python code (for loading, the hashing trick, etc.) from the wrapper into the native fastText module, and then import it in the wrapper, rather than the other way around.

  3. API

I think the API should be similar to word2vec's; something like this should work:

```python
model = FastText(sentences, model, size, window, ...)
model.wv['example']  # similar to model.wv in word2vec
model.save(fname)
model = FastText.load(fname)
```

cc @jayantj @piskvorky
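A rough sketch of the "very slight modification" mentioned above: in the skip-gram/CBOW step, the input representation for a word is no longer a single row but the average of its word vector and its n-gram bucket vectors, with the gradient then spread back over all of those rows. Array and variable names here are illustrative, not the actual gensim internals:

```python
import numpy as np

def fasttext_input_vector(word_index, ngram_bucket_indices, syn0_vocab, syn0_ngrams):
    """Compose the fastText-style input vector for one word.

    syn0_vocab  -- matrix of per-word vectors (as in word2vec)
    syn0_ngrams -- matrix of per-bucket n-gram vectors (the fastText addition)
    """
    rows = np.vstack([syn0_vocab[word_index], syn0_ngrams[ngram_bucket_indices]])
    return rows.mean(axis=0)  # fastText averages the word row and the subword rows

# Toy usage with random matrices of dimensionality 4:
rng = np.random.RandomState(0)
syn0_vocab = rng.rand(10, 4)
syn0_ngrams = rng.rand(100, 4)
vec = fasttext_input_vector(3, [17, 42, 99], syn0_vocab, syn0_ngrams)
print(vec.shape)  # (4,)
```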

@piskvorky (Owner)

Sounds good -- it's a good idea to start with a PR that shows the new proposed package structure and refactoring. In clear (unoptimized) Python to start with, for concept clarity and to make discussions easier.

What is that model.ft in 3. API though? I'd prefer not to use obscure acronyms / variable names, unless it's really standard terminology. Isn't there something more descriptive?

@prakhar2b (Contributor, Author)

@piskvorky oh, model.ft was a mistake. I thought the wv in model.wv stood for word2vec.

@prakhar2b (Contributor, Author)

@gojomo yes, regarding fastText supervised classification, I think we should later incorporate labeledw2v #1153 into the fastText implementation from this PR. Currently, just like Facebook's implementation, gensim's fastText will have two modes, skipgram and cbow; we will add a supervised mode later with labeledw2v (in a different PR, maybe). This is the plan as of now.

@piskvorky (Owner) commented Jul 13, 2017

Oh, I see. I'd say naming the variable wv was also unfortunate (word_vectors better).

@menshikh-iv how about we change the name to word_vectors, in all documentation, but keep wv as an alias to word_vectors too, for backward compatibility?

@piskvorky (Owner) commented Jul 13, 2017

@prakhar2b People are reporting segfaults and limitations of the FB fastText implementation (e.g. how to continue training). A clean, flexible, supported implementation in Python is long overdue, I'd say :)

@menshikh-iv (Contributor)

@piskvorky yes, we can do this. Do you think the abbreviation wv is confusing our users?

@piskvorky (Owner)

I think so, yes. At least it is to me, and I am a user too :)

@gojomo (Collaborator) commented Jul 13, 2017

Re: un-abbreviating wv

To fully communicate genericness across all uses, the property could also be called token_vectors. Depending on the general style preferences for/against any abbreviations, it could be wordvecs or tokenvecs. (For Doc2Vec the very KeyedVectors-like subcomponent that holds the doc-vectors is named docvecs.)

Aliases may need to be handled carefully given the SaveLoad/pickling approach, both across versions and to prevent duplicate writing of the same info. (Though perhaps, the syn0 -> wv.syn0 changes already paved the way for that.)

@piskvorky (Owner) commented Jul 14, 2017

Good point on being careful with pickling! (Although I think (un)pickling handles such references correctly, it's worth double-checking.)

Possible alternatives: token_vectors, word_vectors, vectors (more generic/universal?), embeddings...?
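For the alias idea discussed above, a minimal sketch of one way to avoid duplicating state under pickling/SaveLoad: expose word_vectors as a property that simply forwards to wv. Because a property lives on the class rather than in the instance __dict__, pickling the instance stores the underlying KeyedVectors only once. The class and attribute below are stand-ins, not the real gensim code:

```python
class Word2Vec(object):                  # stand-in for the real gensim class
    def __init__(self, keyed_vectors):
        self.wv = keyed_vectors          # the KeyedVectors instance in real code

    @property
    def word_vectors(self):
        """More descriptive alias for `wv`; no extra state ends up pickled."""
        return self.wv

    @word_vectors.setter
    def word_vectors(self, value):
        self.wv = value


model = Word2Vec(keyed_vectors={'example': [0.1, 0.2]})  # toy stand-in object
assert model.word_vectors is model.wv                    # same object, not a copy
```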

@menshikh-iv (Contributor) commented Aug 14, 2017

Current PR for this is #1525

@menshikh-iv (Contributor)

Resolved in #1525
