native fastText (unsupervised) in gensim #1471
Hi @prakhar2b Let's simply use the pickle-style format from Also, as discussed, please look at the word2vec code in detail, figure out what is needed for fastText, formulate a clear plan of action and post it here. It should contain details about -
IMO, this design process is just as challenging and important as writing the code itself, and it would be good if you spent a good amount of time to come up with a clear plan. |
Awesome feature! Let me add that having FastText in gensim will open up other unsupervised possibilities, such as sent2vec in #1376. |
Even though the gensim mission is more 'unsupervised', the addition of known-labels in the FastText-for-classification mode is such a small delta I would suggest it be in-scope. It's really just adding in another kind of known-data, during training, as possible 'target' outputs of the internal NN, that may make the resulting vectors better. (Potentially, even if not then using the resulting word-vecs for the exact same classification problem, the inclusion of these extra targets during training may have made the word-vecs better for other tasks.) Also, it will otherwise be a constant exception-to-be-mentioned, in docs/support: "Yes gensim implements FastText except not FastText mode X". |
Is this issue still open? |
@dsouzadaniel yes, this is part of an ongoing Google Summer of Code project. |
I looked further into the fastText and word2vec code, and this is how I plan to approach it:
As fastText is a slight modification of word2vec, we will mostly be using the word2vec training code with very slight modifications. So, I think we should create The training code in fasttext.cc/model.cc is very similar to the code in word2vec.py, like functions
IMO, it would be better to move the Python code (for loading, the hashing trick, etc.) from the wrapper into native fastText, and then import it in the wrapper, rather than the other way around.
I think the API should be somewhat similar to word2vec.
|
Sounds good -- it's a good idea to start with a PR that shows the new proposed package structure and refactoring. In clear (unoptimized) Python to start with, for concept clarity and to make discussions easier. What is that |
@piskvorky ohh, |
@gojomo yes, regarding fastText supervised classification, I think we should later incorporate labeledw2v #1153 into the fastText implementation from this PR. Currently, just like Facebook's implementation, gensim's fastText will have two param |
Oh, I see. I'd say naming the variable @menshikh-iv how about we change the name to |
@prakhar2b People are reporting segfaults and limitations of the FB fastText implementation (how to continue training). A clean, flexible, supported implementation in Python is long overdue I'd say :) |
@piskvorky yes, we can do this, you think that abbreviation |
I think so, yes. At least it is to me, and I am a user too :) |
Re: un-abbreviating To fully communicate genericness across all uses, the property could also be called Aliases may need to be handled carefully given the |
Good point on being careful with pickling! (I think (un)pickle handles such references correctly, but it's worth double checking.) Possible alternatives: |
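One way to double check the pickling concern is a small round-trip test. The sketch below is only an illustration under stated assumptions: the key names `vectors` and `alias` are hypothetical stand-ins (not gensim attribute names), and a plain dict stands in for an object's attribute dictionary. It relies on the fact that `pickle` memoizes objects within a single dump, so two names bound to the same object still share one object after loading.

```python
import pickle

# Hypothetical state: two names bound to the SAME list object,
# standing in for an attribute and its alias on a model instance.
vectors = [0.1, 0.2, 0.3]
state = {"vectors": vectors, "alias": vectors}

# Round-trip through pickle in memory.
restored = pickle.loads(pickle.dumps(state))

# pickle's memo table keeps the shared reference intact:
# both keys still point at one object, not two copies.
assert restored["alias"] is restored["vectors"]

# Mutating through one name is visible through the other.
restored["alias"].append(0.4)
assert restored["vectors"] == [0.1, 0.2, 0.3, 0.4]
```

This only covers aliases stored as references in the pickled state. An alias implemented some other way (for example, as a property on the class) would need its own check, including against files pickled by older versions.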
Current PR for this is #1525 |
Resolved in #1525 |
Currently, gensim has a wrapper for fastText. As discussed here, we need to implement the training code (subword n-grams, hashing trick) for unsupervised fastText natively in gensim, in Python. As fastText is only a slight modification of word2vec, we will need to refactor the word2vec code so that the overlapping code is properly reused.
However, fastText outputs two files, .vec and .bin, following the original C++ implementation. Should the Python implementation in gensim provide pkl format output?
This thread is intended to discuss and streamline all the requirements and deliverables regarding native fastText in gensim.
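For readers unfamiliar with the two ingredients named above, here is a minimal sketch in unoptimized Python. It is an assumption-laden illustration, not the gensim or Facebook code: the function names are hypothetical, the defaults (n-gram lengths 3 to 6, 2,000,000 buckets) mirror fastText's documented defaults, and the hash is plain 32-bit FNV-1a, which may not be bit-identical to the C++ implementation for non-ASCII input.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def fnv1a_32(text):
    """Plain 32-bit FNV-1a hash over the UTF-8 bytes of `text`."""
    h = 2166136261  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h


def ngram_bucket(ngram, num_buckets=2000000):
    """Hashing trick: map an n-gram to a row index in a fixed-size table."""
    return fnv1a_32(ngram) % num_buckets


# Each word is represented by its n-grams; each n-gram shares a vector row
# with every other n-gram that hashes to the same bucket.
trigrams = char_ngrams("cat", min_n=3, max_n=3)
buckets = [ngram_bucket(g) for g in trigrams]
```

The hashing trick bounds memory: the n-gram table has a fixed number of rows regardless of how many distinct n-grams the corpus contains, at the cost of occasional collisions.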