Update sklearn API for Gensim models #1473
Conversation
```python
from gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel import LsiTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldaseqmodel import LdaSeqTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_w2vmodel import W2VTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_atmodel import AuthorTopicTransformer
```
---
As long as we're making breaking renames/refactors for clarity/concision, these package names have a lot of redundancy, too. `gensim.sklearn_integration.sklearn_wrapper_gensim_atmodel` repeats both `sklearn` and `gensim`, and `integration` and `wrapper` have overlapping meanings.

So perhaps just: `gensim.sklearn_wrappers.atmodel`? And that's if `at` is already easily recognized from use elsewhere as an abbreviation; otherwise `gensim.sklearn_wrappers.authortopicmodel` or even `gensim.sklearn_wrappers.authortopic`?

Or maybe even, given that these wrappers are each pretty slight in size, they could go: (1) into the respective associated model file – that is, `W2VTransformer` goes into `gensim.models.word2vec` (and probably loses its abbreviation, to match the non-abbreviated names used in that file's classes); or (2) into a single `sklearn_wrappers` file, alongside the `BaseSklearnWrapper`?

(And regarding `BaseSklearnWrapper` – not sure it adds much beyond the alternative of using sklearn's own `BaseEstimator` and `TransformerMixin`. It forces more rigor in overriding the necessary abstract methods, which sklearn itself never does, but has less functionality in `set_params()` and other aspects of introspectability.)
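The alternative mentioned here, relying directly on sklearn's own `BaseEstimator` and `TransformerMixin`, might look roughly like the following sketch. The class and parameter names are hypothetical, not gensim's actual wrapper code:

```python
# Hypothetical sketch: a wrapper built directly on sklearn's base classes,
# with no custom BaseSklearnWrapper in between.
from sklearn.base import BaseEstimator, TransformerMixin

class DummyTopicTransformer(BaseEstimator, TransformerMixin):
    """Illustrative stand-in for a gensim model wrapper."""

    def __init__(self, num_topics=100):
        # Stored under the same name as the __init__ argument,
        # so BaseEstimator.get_params() can discover it.
        self.num_topics = num_topics

    def fit(self, X, y=None):
        # A real wrapper would train the underlying gensim model here.
        self.gensim_model_ = object()
        return self

    def transform(self, docs):
        # A real wrapper would return topic vectors; this just returns zeros.
        return [[0.0] * self.num_topics for _ in docs]

# get_params()/set_params() are inherited from BaseEstimator for free:
t = DummyTopicTransformer(num_topics=5)
print(t.get_params())   # {'num_topics': 5}
t.set_params(num_topics=7)
print(t.num_topics)     # 7
```

`TransformerMixin` additionally supplies `fit_transform()` by composing the `fit` and `transform` defined above.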
---
cc @piskvorky for overall library naming/organization priorities
---
I vote for `from gensim.models.atmodel import AuthorTopicTransformer`, but would agree with `from gensim.sklearn_api import AuthorTopicTransformer` too. It is neither a wrapper nor an integration, but just an API.
---
If we were to share the same module for gensim/sklearn classes, we'd have to tiptoe around the sklearn imports, because sklearn is not a dependency of gensim.

An extra subpackage sounds cleaner to me: one subpackage for sklearn / keras / spark / whatever API. Not imported automatically from gensim `__init__`, so that users need to import it explicitly, after installing sklearn/keras/spark/tensorflow/whatever.

And if the subpackage becomes too unwieldy or complex (not the case with sklearn now), a separate library would make sense to me too, to decouple the release and maintenance cycle.
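One common way to keep such an optional subpackage from breaking a plain `import gensim` is an explicit dependency check at import time. A minimal sketch of that pattern follows; the `require` helper is hypothetical, not gensim code:

```python
import importlib

def require(package_name):
    """Raise a helpful ImportError if an optional dependency is missing.

    Hypothetical helper for an optional subpackage's __init__.py.
    """
    try:
        importlib.import_module(package_name)
    except ImportError:
        raise ImportError(
            "%s is required for this optional subpackage; "
            "install it before importing." % package_name
        )

require("json")  # stdlib module, always importable: passes silently
try:
    require("no_such_package_xyz")
except ImportError as err:
    print("caught:", err)
```

Placed at the top of the subpackage's `__init__.py`, this turns a confusing downstream failure into a clear installation hint, while the rest of gensim stays importable without sklearn.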
---
I vote for `from gensim.sklearn_api import AuthorTopicTransformer` with the subpackage `sklearn_api`.

We will not have sklearn as a dependency, but we will have a short and unique import path for the sklearn wrappers.
gensim/sklearn_api/basemodel.py
Outdated
```diff
@@ -11,7 +11,7 @@
 from abc import ABCMeta, abstractmethod


-class BaseSklearnWrapper(object):
+class BaseTransformer(object):
```
---
With this refactoring... I'm again wondering, is this class even helpful? Its `set_params()` does less than the implementation that would be inherited from sklearn's own `BaseEstimator`. The other methods just serve to enforce that any subclass override certain methods – a rigor that is possibly useful, but that sklearn does not, itself, impose within its own superclass hierarchy.
---
@gojomo scikit-learn indeed does not impose such a hierarchy among its classes. However, as you stated above, in addition to not having to implement `set_params()` every time, we know which functions are to be implemented for each model. Also, if in the future we decide to refine/improve the `set_params()` function and make it more sophisticated (e.g. as in sklearn's `BaseEstimator`), we would only have to do it once in the base class, rather than having to make the same change in the sklearn-api class of each Gensim model. Hence, having an abstract base class and deriving classes for Gensim models from it would help us in writing new sklearn-api classes in the future as well.
---
But the existing `set_params()` looks worse than just doing nothing (and inheriting from `BaseEstimator`, as all current subclasses of this class already do) – unless I'm missing something. So there's no need to plan for a future where "we decide to refine/improve" it, and do such improvements in the base class. We can make that decision now, delete 8 lines of code from the base class, and all the subclasses automatically get the better & more standard (for sklearn) implementation of `set_params()`.

And then, the only benefit of this shared superclass is forcing an error as a reminder of what to implement. But if sklearn doesn't do that itself, is it that important to do so? It's also over-enforcing – it is not a strict requirement that all transformers support `partial_fit()`; not every algorithm will necessarily support that.
---
In my head I expect all sklearn models to have `partial_fit`, `fit`, `transform` and to be usable in a pipeline. We constantly get questions on the mailing list about how to update model `X`, for almost every sklearn algo. So: raise a `NotImplementedError` with an explanation for word2vec and doc2vec ("use the gensim API update function, but it's not recommended as it hasn't been researched"), and throw an explicit error saying that RandomProj can't do it because of algorithmic limitations.

Also, the `set_params` method shouldn't be abstract and shouldn't be defined in all the children – it's the same code. Having a common `set_params` makes the base useful.
---
OK, providing a place to include an error/warning is a potential reason to implement `partial_fit()` even where the underlying algorithm can't (or shouldn't casually) be used that way. BUT, simply not implementing a method is also a pretty good way of signalling that something isn't implemented – and indeed that seems to be what scikit-learn itself does. For that reason, it's even possible other code will use introspection about whether an sklearn pipeline object implements `partial_fit()` as a test of support, in which case having an implemented method that just errors could be suboptimal.

But with `set_params()`, again, why have any lesser-featured implementation here, when every subclass (so far) inherits from scikit-learn's own `BaseEstimator`, which provides a more fully-featured implementation? Why wouldn't we want fewer limitations (and match the source library's behavior) for zero lines of code, rather than more limitations with more code?
---
Hi, jumping in the conversation a little late here, but I do agree that the old `BaseSklearnWrapper` does very little, since:

- The default `get_params` and `set_params` from sklearn's `BaseEstimator` are very reasonable and should only be overridden when necessary; if all the concrete wrapper classes inherit from `BaseEstimator` (as they do now), then there's no need to force them to re-implement these methods.
- @tmylk `partial_fit` is absolutely not a requirement for a class to be compatible with sklearn `Pipeline`, since lots of algorithms are not online in nature. If there is a particular algorithm that most people are confused about, then the warning should be raised for that class only.
---
I would struggle to find a gensim algorithm for which I haven't seen a "is it possible to update the model post-training?" question on the mailing list...
---
Right, it's always asked, but not always supported. The scikit-learn convention for not supporting incremental training is to simply not implement `partial_fit()`. Why not adopt the same convention for scikit-learn compatible classes here?
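The convention being argued for can be sketched as follows: estimators that don't support incremental training simply omit `partial_fit()`, and callers probe for it by introspection. All class names here are illustrative, not real sklearn or gensim classes:

```python
class BatchOnlyTransformer:
    """Illustrative estimator with no incremental-training support."""
    def fit(self, X, y=None):
        return self
    # No partial_fit defined: the absence itself signals "unsupported".

class OnlineTransformer:
    """Illustrative estimator that does support incremental training."""
    def fit(self, X, y=None):
        return self
    def partial_fit(self, X, y=None):
        return self

def supports_incremental_training(est):
    # The introspection-based probe other code might use.
    return hasattr(est, "partial_fit")

print(supports_incremental_training(BatchOnlyTransformer()))  # False
print(supports_incremental_training(OnlineTransformer()))     # True
```

Under this convention, a `partial_fit` that exists only to raise an error would make the `hasattr` probe give a misleading answer, which is the downside raised earlier in the thread.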
---
I agree with @gojomo's point of view here. The point of these adapters is to assume the "philosophy" of another package, whatever structure or idiosyncrasies it may have – the least amount of surprises and novelty for users of that package.

It is not the place of the adapters to introduce their own philosophy on class structures or missing methods or whatever.
---
I have removed the `BaseTransformer` class now. Thanks for the useful feedback and suggestions. :)
gensim/sklearn_api/w2vmodel.py
Outdated
```diff
@@ -83,4 +83,5 @@ def transform(self, words):
         return np.reshape(np.array(X), (len(words), self.size))

     def partial_fit(self, X):
-        raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer")
+        raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer since 'update()' function for Word2Vec class is experimental. "
```
---
Is this correct? Why would the `update()` method of word2vec be experimental?
---
Yes, it's correct.
---
I don't even see any `update()` method. Can you elaborate?
---
@piskvorky The original error message was not correct (since there is no `update()` function in the `Word2Vec` class, as you correctly pointed out) and I have updated the message now. My apologies for the confusion.

This change was made in reference to this comment by @gojomo in one of the older PRs.
---
@chinmayapancholi13 thanks for the clarification.

So the concern is about initializing the vocabulary. That's a separate concern from incremental training, which is fully supported and not experimental (in fact, it's the same code as non-incremental training).

I don't think `partial_fit` is a high priority, but at the same time, users ask for it all the time, especially in combination with incremental vocab updates (which are indeed experimental).

So I see two directions (both non-critical, nice-to-have) here:

- support incremental training with a fixed, pre-defined vocabulary
  - implies finding a natural way to initialize the vocabulary beforehand
  - implies finding a natural way to control the learning rate (alpha) during the incremental calls
- support fully online training, including updating the vocabulary incrementally
  - implies changing the w2v algo(s) to support this (hashing trick with fixed hash space? what would this do to HS? or do we not support HS in this online mode?)
  - still implies finding a way to control alpha
---
Ah, I see. So the issue is actually "updating the vocab" (and not "training") being experimental. I have updated the error message in `partial_fit` for `W2VTransformer` accordingly.

Thanks for your explanation and suggestions about resolving this concern. I guess it would be nice to create an issue for this. :)
gensim/sklearn_api/w2vmodel.py
Outdated
```diff
@@ -101,4 +83,4 @@ def transform(self, words):
         return np.reshape(np.array(X), (len(words), self.size))

     def partial_fit(self, X):
-        raise NotImplementedError("'partial_fit' has not been implemented for SklW2VModel")
+        raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer since updating vocabulary incrementally for Word2Vec model is experimental.")
```
---
"However the model can be updated with a fixed vocab using Gensim API call"
---
Done
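Putting the two suggestions together, the resulting `partial_fit` behaviour might look like this sketch. The class is a stand-in, and the message paraphrases the wording discussed above rather than quoting the merged code:

```python
class W2VTransformerSketch:
    """Illustrative stand-in for the sklearn-api Word2Vec transformer."""

    def partial_fit(self, X):
        # Incremental vocab updates are experimental in gensim's Word2Vec,
        # so the sklearn-style partial_fit deliberately refuses to run and
        # points users at the supported fixed-vocab update path instead.
        raise NotImplementedError(
            "'partial_fit' has not been implemented for W2VTransformer, "
            "since updating the vocabulary incrementally for a Word2Vec "
            "model is experimental. However, the model can be updated with "
            "a fixed vocabulary using the gensim API call."
        )

try:
    W2VTransformerSketch().partial_fit(["some", "words"])
except NotImplementedError as err:
    print("raised:", err)
```

Note the tension with the `hasattr(est, "partial_fit")` probe discussed earlier: this design makes the method exist but always fail, trading introspection accuracy for a more helpful error message.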
```diff
@@ -295,19 +292,6 @@ def testTransform(self):
         self.assertEqual(transformed_vecs.shape[0], 1)
         self.assertEqual(transformed_vecs.shape[1], self.model.num_topics)

-    def testSetGetParams(self):
```
---
Please rewrite these tests to check that the gensim model inside the wrapper is affected. Check `gensim_model` after the `fit` call.
---
Done.
---
Looks good. Thanks
@menshikh-iv LGTM please
gensim/sklearn_api/__init__.py
Outdated
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2011 Radim Rehurek <[email protected]>
```
---
@chinmayapancholi13 please add yourself as the author. This is your baby :)
---
Hehe. Ok! 😄
""" | ||
|
||
|
||
from .ldamodel import LdaTransformer # noqa: F401 |
---
What is all that `noqa: F401` for? Is it really necessary?
---
A line carrying a `noqa` comment doesn't trigger flake8 errors. Writing `noqa: F401` saves us from flake8 errors saying that we have imports (e.g. `LdaTransformer`) which we haven't used anywhere in the file.
---
It feels wrong to litter the gensim code with such constructs -- the gensim code is correct; this is essentially working around some idiosyncrasy (bug?) of an unrelated library.

By the way, how come we don't get these errors from all the other `__init__` imports in gensim? Or do we? CC @menshikh-iv
---
@piskvorky Because flake8 analyzes only the diff every time; if you change the same lines in any other `__init__` file, you will get the same error.
---
OK, so the fake warnings will go away immediately? Why did we add these comments then? Let's remove them.
docs/notebooks/sklearn_api.ipynb
Outdated
````json
"metadata": {},
"source": [
"The wrappers available (as of now) are :\n",
"* LdaModel (```gensim.sklearn_api.ldaModel.SklLdaModel```),which implements gensim's ```LDA Model``` in a scikit-learn interface\n",
````
---
Haven't these been updated?
---
Yes. I had changed the transformer names in the code in the ipynb but missed updating it here. Thanks a lot for pointing this out. :)
@menshikh-iv May I ask for a date of a new release with this new API?

@tmylk ~ first half of September
This PR does the following things:

- `set_params()` & `get_params()` functions.
- `transform` implementation by making use of the `sparse2full` function.
- `testConsistencyWithGensimModel` test for LdaModel.