Update sklearn API for Gensim models #1473
Conversation
```python
from gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel import LsiTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldaseqmodel import LdaSeqTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_w2vmodel import W2VTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_atmodel import AuthorTopicTransformer
```
---
As long as we're making breaking renames/refactors for clarity/concision, these package names have a lot of redundancy, too. `gensim.sklearn_integration.sklearn_wrapper_gensim_atmodel` repeats both `sklearn` and `gensim`, and `integration` and `wrapper` have overlapping meanings.

So perhaps just: `gensim.sklearn_wrappers.atmodel`? And that's if `at` is already easily recognized from use elsewhere as an abbreviation; otherwise `gensim.sklearn_wrappers.authortopicmodel` or even `gensim.sklearn_wrappers.authortopic`?

Or maybe even, given that these wrappers are each pretty slight in size, they could go: (1) into the respective associated model file – that is, `W2VTransformer` goes into `gensim.models.word2vec` (and probably loses its abbreviation, to match the non-abbreviated names used in that file's classes); or (2) into a single `sklearn_wrappers` file, alongside the `BaseSklearnWrapper`?

(And regarding `BaseSklearnWrapper` – not sure it adds much beyond the alternative of using sklearn's own `BaseEstimator` and `TransformerMixin`. It forces more rigor in overriding the necessary abstract methods, which sklearn itself never does, but has less functionality in `set_params()` and other aspects of introspectability.)
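The alternative mentioned here, relying directly on sklearn's own `BaseEstimator` and `TransformerMixin`, might look roughly like the following sketch. The class and parameter names are hypothetical, not gensim's actual wrapper code:

```python
# Hypothetical sketch: a wrapper built directly on sklearn's base classes,
# with no custom BaseSklearnWrapper in between.
from sklearn.base import BaseEstimator, TransformerMixin

class DummyTopicTransformer(BaseEstimator, TransformerMixin):
    """Illustrative stand-in for a gensim model wrapper."""

    def __init__(self, num_topics=100):
        # Stored under the same name as the __init__ argument,
        # so BaseEstimator.get_params() can discover it.
        self.num_topics = num_topics

    def fit(self, X, y=None):
        # A real wrapper would train the underlying gensim model here.
        self.gensim_model_ = object()
        return self

    def transform(self, docs):
        # A real wrapper would return topic vectors; this just returns zeros.
        return [[0.0] * self.num_topics for _ in docs]

# get_params()/set_params() are inherited from BaseEstimator for free:
t = DummyTopicTransformer(num_topics=5)
print(t.get_params())   # {'num_topics': 5}
t.set_params(num_topics=7)
print(t.num_topics)     # 7
```

`TransformerMixin` additionally supplies `fit_transform()` by composing the `fit` and `transform` defined above.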
---
cc @piskvorky for overall library naming/organization priorities
---
I vote for `from gensim.models.atmodel import AuthorTopicTransformer`, but would agree with `from gensim.sklearn_api import AuthorTopicTransformer` too. It is neither a wrapper nor an integration, but just an API.
---
If we were to share the same module for gensim/sklearn classes, we'd have to tiptoe around the sklearn imports, because sklearn is not a dependency of gensim.

An extra subpackage sounds cleaner to me: one subpackage for sklearn / keras / spark / whatever API. Not imported automatically from gensim `__init__`, so that users need to import it explicitly, after installing sklearn/keras/spark/tensorflow/whatever.

And if the subpackage becomes too unwieldy or complex (not the case with sklearn now), a separate library would make sense to me too, to decouple the release and maintenance cycle.
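One common way to keep such an optional subpackage from breaking a plain `import gensim` is an explicit dependency check at import time. A minimal sketch of that pattern follows; the `require` helper is hypothetical, not gensim code:

```python
import importlib

def require(package_name):
    """Raise a helpful ImportError if an optional dependency is missing.

    Hypothetical helper for an optional subpackage's __init__.py.
    """
    try:
        importlib.import_module(package_name)
    except ImportError:
        raise ImportError(
            "%s is required for this optional subpackage; "
            "install it before importing." % package_name
        )

require("json")  # stdlib module, always importable: passes silently
try:
    require("no_such_package_xyz")
except ImportError as err:
    print("caught:", err)
```

Placed at the top of the subpackage's `__init__.py`, this turns a confusing downstream failure into a clear installation hint, while the rest of gensim stays importable without sklearn.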
---
I vote for `from gensim.sklearn_api import AuthorTopicTransformer` with the subpackage `sklearn_api`.

We will not have sklearn as a dependency, but we will have a short and unique import path for the sklearn wrappers.
gensim/sklearn_api/basemodel.py
Outdated
```diff
@@ -11,7 +11,7 @@
 from abc import ABCMeta, abstractmethod


-class BaseSklearnWrapper(object):
+class BaseTransformer(object):
```
---
With this refactoring... I'm again wondering, is this class even helpful? Its `set_params()` does less than the implementation that would be inherited from sklearn's own `BaseEstimator`. The other methods just serve to enforce that any subclass override certain methods – a rigor that is possibly useful, but that sklearn does not, itself, impose within its own superclass hierarchy.
---
@gojomo scikit-learn indeed does not impose such a hierarchy among its classes. However, as you stated above, in addition to not having to implement `set_params()` every time, we know which functions are to be implemented for each model. Also, if in the future we decide to refine/improve the `set_params()` function and make it more sophisticated (e.g. as in sklearn's `BaseEstimator`), we would only have to do it once in the base class, rather than having to make the same change in the sklearn-api class of each Gensim model. Hence, having an abstract base class and deriving classes for Gensim models from it would help us in writing new sklearn-api classes in the future as well.
---
But the existing `set_params()` looks worse than just doing nothing (and inheriting from `BaseEstimator`, as all current subclasses of this class already do) – unless I'm missing something. So there's no need to plan for a future where "we decide to refine/improve" it, and do such improvements in the base class. We can make that decision now, delete 8 lines of code from the base class, and all the subclasses automatically get the better & more standard (for sklearn) implementation of `set_params()`.

And then, the only benefit of this shared superclass is forcing an error as a reminder of what to implement. But if sklearn doesn't do that itself, is it that important to do so? It's also over-enforcing – it is not a strict requirement that all transformers support `partial_fit()`; not every algorithm will necessarily support that.
---
In my head I expect all sklearn models to have `partial_fit`, `fit`, `transform` and to be usable in a pipeline. We constantly get questions on the mailing list about how to update model `X`, for almost every sklearn algo. So: raise a `NotImplementedError` with an explanation for word2vec and doc2vec ("use the gensim API update function, but it's not recommended as it hasn't been researched"), and throw an explicit error saying that RandomProj can't do it because of algorithmic limitations.

Also, the `set_params` method shouldn't be abstract and shouldn't be defined in all the children – it's the same code. Having a common `set_params` makes the base useful.
---
OK, providing a place to include an error/warning is a potential reason to implement `partial_fit()` even where the underlying algorithm can't (or shouldn't casually) be used that way. BUT, simply not implementing a method is also a pretty good way of signalling that something isn't implemented – and indeed that seems to be what scikit-learn itself does. For that reason, it's even possible other code will use introspection about whether an sklearn pipeline object implements `partial_fit()` as a test of support, in which case having an implemented method that just errors could be suboptimal.

But with `set_params()`, again, why have any lesser-featured implementation here, when every subclass (so far) inherits from scikit-learn's own `BaseEstimator`, which provides a more fully-featured implementation? Why wouldn't we want fewer limitations (and match the source library's behavior) for zero lines of code, rather than more limitations with more code?
---
Hi, jumping in the conversation a little late here, but I do agree that the old `BaseSklearnWrapper` does very little, since:

- The default `get_params` and `set_params` from sklearn's `BaseEstimator` are very reasonable and should only be overridden when necessary; if all the concrete wrapper classes inherit from `BaseEstimator` (as they do now), then there's no need to force them to re-implement these methods.
- @tmylk `partial_fit` is absolutely not a requirement for a class to be compatible with sklearn `Pipeline`, since lots of algorithms are not online in nature. If there is a particular algorithm that most people are confused about, then the warning should be raised for that class only.
---
I would struggle to find a gensim algorithm for which I haven't seen a "is it possible to update the model post-training?" question on the mailing list...
---
Right, it's always asked, but not always supported. The scikit-learn convention for not supporting incremental training is to simply not implement `partial_fit()`. Why not adopt the same convention for scikit-learn compatible classes here?
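The convention being argued for can be sketched as follows: estimators that don't support incremental training simply omit `partial_fit()`, and callers probe for it by introspection. All class names here are illustrative, not real sklearn or gensim classes:

```python
class BatchOnlyTransformer:
    """Illustrative estimator with no incremental-training support."""
    def fit(self, X, y=None):
        return self
    # No partial_fit defined: the absence itself signals "unsupported".

class OnlineTransformer:
    """Illustrative estimator that does support incremental training."""
    def fit(self, X, y=None):
        return self
    def partial_fit(self, X, y=None):
        return self

def supports_incremental_training(est):
    # The introspection-based probe other code might use.
    return hasattr(est, "partial_fit")

print(supports_incremental_training(BatchOnlyTransformer()))  # False
print(supports_incremental_training(OnlineTransformer()))     # True
```

Under this convention, a `partial_fit` that exists only to raise an error would make the `hasattr` probe give a misleading answer, which is the downside raised earlier in the thread.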
---
I agree with @gojomo's point of view here. The point of these adapters is to assume the "philosophy" of another package, whatever structure or idiosyncrasies it may have – the least amount of surprises and novelty for users of that package.

It is not the place of the adapters to introduce their own philosophy on class structures or missing methods or whatever.
---
I have removed the `BaseTransformer` class now. Thanks for the useful feedback and suggestions. :)
gensim/sklearn_api/w2vmodel.py
Outdated
```diff
@@ -83,4 +83,5 @@ def transform(self, words):
         return np.reshape(np.array(X), (len(words), self.size))

     def partial_fit(self, X):
-        raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer")
+        raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer since 'update()' function for Word2Vec class is experimental. "
```
---
Is this correct? Why would the `update()` method of word2vec be experimental?
---
Yes, it's correct.
---
I don't even see any `update()` method. Can you elaborate?
---
@piskvorky The original error message was not correct (since there is no `update()` function in the `Word2Vec` class, as you correctly pointed out) and I have updated the message now. My apologies for the confusion.

This change was made in reference to this comment by @gojomo in one of the older PRs.
---
@chinmayapancholi13 thanks for the clarification.

So the concern is about initializing the vocabulary. That's a separate concern from incremental training, which is fully supported and not experimental (in fact, it's the same code as non-incremental training).

I don't think `partial_fit` is a high priority, but at the same time, users ask for it all the time, especially in combination with incremental vocab updates (which are indeed experimental).

So I see two directions (both non-critical, nice-to-have) here:

- support incremental training with a fixed, pre-defined vocabulary
  - implies finding a natural way to initialize the vocabulary beforehand
  - implies finding a natural way to control the learning rate (alpha) during the incremental calls
- support fully online training, including updating the vocabulary incrementally
  - implies changing the w2v algo(s) to support this (hashing trick with fixed hash space? what would this do to HS? or do we not support HS in this online mode?)
  - still implies finding a way to control alpha
---
Ah, I see. So the issue is actually "updating the vocab" (and not "training") being experimental. I have updated the error message in `partial_fit` for `W2VTransformer` accordingly.

Thanks for your explanation and suggestions about resolving this concern. I guess it would be nice to create an issue for this. :)
gensim/sklearn_api/w2vmodel.py
Outdated
```diff
@@ -101,4 +83,4 @@ def transform(self, words):
         return np.reshape(np.array(X), (len(words), self.size))

     def partial_fit(self, X):
-        raise NotImplementedError("'partial_fit' has not been implemented for SklW2VModel")
+        raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer since updating vocabulary incrementally for Word2Vec model is experimental.")
```
---
"However the model can be updated with a fixed vocab using Gensim API call"
---
Done
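Putting the two suggestions together, the resulting `partial_fit` behaviour might look like this sketch. The class is a stand-in, and the message paraphrases the wording discussed above rather than quoting the merged code:

```python
class W2VTransformerSketch:
    """Illustrative stand-in for the sklearn-api Word2Vec transformer."""

    def partial_fit(self, X):
        # Incremental vocab updates are experimental in gensim's Word2Vec,
        # so the sklearn-style partial_fit deliberately refuses to run and
        # points users at the supported fixed-vocab update path instead.
        raise NotImplementedError(
            "'partial_fit' has not been implemented for W2VTransformer, "
            "since updating the vocabulary incrementally for a Word2Vec "
            "model is experimental. However, the model can be updated with "
            "a fixed vocabulary using the gensim API call."
        )

try:
    W2VTransformerSketch().partial_fit(["some", "words"])
except NotImplementedError as err:
    print("raised:", err)
```

Note the tension with the `hasattr(est, "partial_fit")` probe discussed earlier: this design makes the method exist but always fail, trading introspection accuracy for a more helpful error message.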
```diff
@@ -295,19 +292,6 @@ def testTransform(self):
         self.assertEqual(transformed_vecs.shape[0], 1)
         self.assertEqual(transformed_vecs.shape[1], self.model.num_topics)

-    def testSetGetParams(self):
```
---
Please rewrite these tests to check that the gensim model inside the wrapper is affected. Check `gensim_model` after the `fit` call.
---
Done.
---
Looks good. Thanks
@menshikh-iv LGTM please
gensim/sklearn_api/__init__.py
Outdated
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2011 Radim Rehurek <[email protected]>
```
---
@chinmayapancholi13 please add yourself as the author. This is your baby :)
---
Hehe. Ok! 😄
""" | ||
|
||
|
||
from .ldamodel import LdaTransformer # noqa: F401 |
---
What is all that `noqa: F401` for? Is it really necessary?
---
A line carrying a `noqa` comment doesn't trigger flake8 errors. Writing `noqa: F401` saves us from flake8 errors saying that we have imports (e.g. `LdaTransformer`) which we haven't used anywhere in the file.
---
It feels wrong to litter the gensim code with such constructs -- the gensim code is correct; this is essentially working around some idiosyncrasy (bug?) of an unrelated library.

By the way, how come we don't get these errors from all the other `__init__` imports in gensim? Or do we? CC @menshikh-iv
---
@piskvorky Because flake8 analyzes only the diff every time; if you change the same lines in any other `__init__` file, you will get the same error.
---
OK, so the fake warnings will go away immediately? Why did we add these comments then? Let's remove them.
docs/notebooks/sklearn_api.ipynb
Outdated
````json
"metadata": {},
"source": [
"The wrappers available (as of now) are :\n",
"* LdaModel (```gensim.sklearn_api.ldaModel.SklLdaModel```),which implements gensim's ```LDA Model``` in a scikit-learn interface\n",
````
---
Haven't these been updated?
---
Yes. I had changed the transformer names in the code in the ipynb but missed updating it here. Thanks a lot for pointing this out. :)
@menshikh-iv May I ask for a date of a new release with this new API?

@tmylk ~ first half of September
This PR does the following things:

- `set_params()` & `get_params()` functions.
- `transform` implementation by making use of the `sparse2full` function.
- `testConsistencyWithGensimModel` test for LdaModel.