[WIP] Updating Gensim's Word2vec-Keras integration #7
Conversation
Example usage of the code added in this PR:

```python
import shorttext
from gensim.models import word2vec as w2v

# train a word2vec model on a plain-text corpus, one sentence per line
input_text_file = "word_vectors_training_data.txt"
input_data = w2v.LineSentence(input_text_file)
w2v_model = w2v.Word2Vec(input_data, size=300)
w2v = w2v_model.wv

# train the Keras-based classifier on labeled keyword data
trainclassdict = shorttext.data.subjectkeywords()
kmodel = shorttext.classifiers.frameworks.CNNWord2Vec(len(trainclassdict.keys()), w2v)
classifier = shorttext.classifiers.VarNNWord2VecClassifier(w2v)
classifier.train(trainclassdict, kmodel)

print(classifier.score('artificial intelligence'))
```
@stephenhky I have refactored the code in the latest commit by removing some redundant files and classes. I have also added a boolean variable …
```python
from keras.preprocessing.sequence import pad_sequences
```

```python
@cio.compactio({'classifier': 'nnlibvec'}, 'nnlibvec', ['_classlabels.txt', '.json', '.h5'])
class VarNNWord2VecClassifier:
```
I know VarNNEmbedVecClassification was a lousy name for the original class. I see you named this new class after it, but maybe you want to refactor it to another name. On the other hand, gensim supports various embedded vectors, and thus I want the name not to be restricted to word2vec. Do you want to come up with a new name?
```python
:param wvmodel: Word2Vec model
:param vecsize: length of the embedded vectors in the model (Default: 300)
:param maxlen: maximum number of words in a sentence (Default: 15)
:type wvmodel: gensim.models.word2vec.Word2Vec
```
The type needs to be changed to gensim.models.keyedvectors.KeyedVectors.
```python
:param classdict: training data
:return: a tuple of three, containing a list of class labels, matrix of embedded word vectors, and corresponding outputs
:type classdict: dict
:rtype: (list, numpy.ndarray, list)
```
the second element of the returned type needs to be changed
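For context, the shape of such a conversion can be sketched as follows. This is a hypothetical, simplified helper (not the PR's actual implementation, which returns embedded word vectors rather than raw texts as its second element): it flattens a {label: [texts]} dict into parallel lists of inputs and one-hot outputs.

```python
def convert_trainingdata(classdict):
    """Flatten {label: [texts]} into (classlabels, x_train, y_train),
    where y_train holds one-hot vectors over the class labels."""
    classlabels = sorted(classdict.keys())
    x_train, y_train = [], []
    for i, label in enumerate(classlabels):
        for text in classdict[label]:
            x_train.append(text)
            onehot = [0] * len(classlabels)
            onehot[i] = 1
            y_train.append(onehot)
    return classlabels, x_train, y_train

labels, x, y = convert_trainingdata({"ai": ["neural net"], "math": ["algebra"]})
```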
```python
# convert classdict to training input vectors
self.classlabels, x_train, y_train = self.convert_trainingdata_matrix(classdict)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train)
```
Is there a particular reason for choosing the tokenizer provided by keras?
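For readers unfamiliar with it: the core job of a tokenizer like keras's Tokenizer is to build a word-to-index vocabulary from the training texts. A minimal plain-Python sketch of that behavior (illustrative only, not the actual keras implementation):

```python
from collections import Counter

def build_word_index(texts):
    """Build a 1-based word-to-index map, most frequent words first,
    mimicking the vocabulary a keras-style Tokenizer constructs."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # index 0 is conventionally reserved for padding
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

word_index = build_word_index(["machine learning", "deep learning models"])
# "learning" appears twice, so it receives the lowest index
```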
Thanks for making the code reusable and backward compatible. Right now, with this implementation, a trained model is saved without specifying whether use_gensim is True or not. Similarly, when a saved model is being loaded, this information is missing. But it is important, so somehow there has to be a way to specify it.
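One way such a flag could be persisted is a JSON sidecar saved and loaded alongside the model files. This is a hypothetical sketch (the helper names are invented; shorttext's actual I/O goes through the compactio decorator shown above):

```python
import json
import os
import tempfile

def save_classifier_config(path, use_gensim):
    # hypothetical helper: store the flag in a JSON sidecar next to the model files
    with open(path, "w") as f:
        json.dump({"use_gensim": use_gensim}, f)

def load_classifier_config(path):
    # read the flag back when the saved model is loaded
    with open(path) as f:
        return json.load(f)["use_gensim"]

config_path = os.path.join(tempfile.mkdtemp(), "nnlibvec_config.json")
save_classifier_config(config_path, True)
restored = load_classifier_config(config_path)
```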
Let me authorize this pull request first, but remember that the I/O still needs to be done.
@stephenhky Sure. I'll update the code to include the modifications to the …
Thanks. Also do the same thing for the …
Sure. :)
What I mean is that, whatever changes you have done to …
This PR adds support for integration of Keras with Gensim's word2vec model using the get_embedding_layer function added to Gensim in PR#1248.
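Conceptually, get_embedding_layer exposes the trained word vectors as the weight matrix of a Keras Embedding layer. A rough numpy sketch of assembling such a weight matrix (illustrative only, not gensim's actual code):

```python
import numpy as np

def build_embedding_weights(word_vectors, word_index, vecsize):
    """Assemble a (vocab_size + 1, vecsize) weight matrix from a
    {word: vector} mapping; row 0 stays zero for the padding index."""
    weights = np.zeros((len(word_index) + 1, vecsize))
    for word, idx in word_index.items():
        if word in word_vectors:
            weights[idx] = word_vectors[word]
    return weights

# toy 3-dimensional vectors standing in for a trained word2vec model
vectors = {"cat": np.ones(3), "dog": np.full(3, 2.0)}
weights = build_embedding_weights(vectors, {"cat": 1, "dog": 2}, 3)
```

In Keras, a matrix like this would typically be passed as the initial weights of an Embedding layer with the layer frozen (trainable=False), so the pretrained vectors are not overwritten during classifier training.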