[SPARK-22922][ML][PySpark] Pyspark portion of the fit-multiple API #20058
Conversation
Test build #85318 has finished for PR 20058 at commit
Test build #85319 has finished for PR 20058 at commit
Test build #85443 has finished for PR 20058 at commit
reviewing now
Just minor comments. Thanks!
python/pyspark/ml/tests.py
Outdated
@@ -2359,6 +2359,21 @@ def test_unary_transformer_transform(self):
        self.assertEqual(res.input + shiftVal, res.output)


class TestFit(unittest.TestCase):
nit: How about EstimatorTest since this is testing part of the Estimator API?
python/pyspark/ml/base.py
Outdated
from pyspark.sql.types import StructField, StructType


class FitMutlipleIterator(object):
typo: Mutliple -> Multiple
python/pyspark/ml/base.py
Outdated
class FitMutlipleIterator(object):
    """
    Used by default implementation of Estimator.fitMultiple to produce models in a thread safe
    iterator.
It'd be nice to document what fitSingleModel should do, plus what the iterator returns.
nit: How about renaming numModel -> numModels ?
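For readers following the thread, a minimal sketch of what such a thread-safe iterator could look like. The class and argument names come from the diff under review; the body is an illustrative reconstruction, not the merged Spark code:

```python
import threading

class FitMultipleIterator(object):
    """
    Thread-safe iterator over fitted models.

    fitSingleModel: a callable that takes an index into the param-map
        sequence and returns the model fit with that param map.
    numModels: total number of models to fit.

    next() returns (index, model) tuples; with multiple threads the
    indices may be returned out of order.
    """
    def __init__(self, fitSingleModel, numModels):
        self.fitSingleModel = fitSingleModel
        self.numModels = numModels
        self.counter = 0
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only the index bookkeeping is locked; the (possibly slow)
        # model fit itself runs outside the lock.
        with self.lock:
            index = self.counter
            if index >= self.numModels:
                raise StopIteration("Fitted all models.")
            self.counter += 1
        return index, self.fitSingleModel(index)

    next = __next__  # Python 2 compatibility
```

Because only the counter update is serialized, several threads can call `next()` concurrently and fit different models at the same time.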
python/pyspark/ml/base.py
Outdated
    using `params[index]`. Params maps may be fit in an order different than their
    order in params.

    .. note:: Experimental
Let's use .. note:: DeveloperApi too.
python/pyspark/ml/base.py
Outdated
    .. note:: Experimental
    """
    def fitSingleModel(index):
        return self.fit(dataset, params[index])
Shall we make a copy of the Estimator before defining fitSingleModel, to be extra safe in case some other thread modifies the Params in this Estimator before a call to fit()? You can do self.copy() beforehand to get a copy.
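A toy illustration of why copying first matters. `ToyEstimator` and its parameters are hypothetical stand-ins for this sketch, not Spark classes; the point is that the `self.copy()` snapshot happens eagerly, before any fitting, so later mutation of the original estimator cannot leak into the fits:

```python
import copy

class ToyEstimator(object):
    """Hypothetical stand-in for an ML Estimator; not a Spark class."""
    def __init__(self, shift=0):
        self.shift = shift

    def copy(self):
        return copy.deepcopy(self)

    def fit(self, dataset, shift_override=None):
        # The "model" here is just the shifted data.
        shift = self.shift if shift_override is None else shift_override
        return [x + shift for x in dataset]

    def fitMultiple(self, dataset, paramMaps):
        # Snapshot the estimator before any fitting begins, so a later
        # mutation of self by another thread cannot leak into the fits.
        estimator = self.copy()

        def fitSingleModel(index):
            return estimator.fit(dataset, paramMaps[index])

        return ((i, fitSingleModel(i)) for i in range(len(paramMaps)))

est = ToyEstimator(shift=1)
models = est.fitMultiple([1, 2], [None, 5])
est.shift = 100  # mutation after fitMultiple; must not affect the results
results = dict(models)
```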
python/pyspark/ml/tuning.py
Outdated
@@ -31,6 +31,17 @@
    'TrainValidationSplitModel']


def parallelFitTasks(est, train, eva, validation, epm):
How about a brief doc string?
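A sketch of what such a docstring could look like, attached to an illustrative implementation. The parameter names come from the diff; the function body is an assumption about how the tasks would be built on top of fitMultiple, not the merged code:

```python
def parallelFitTasks(est, train, eva, validation, epm):
    """
    Creates a list of callables which can be called from different threads
    to fit and evaluate an estimator in parallel.

    :param est: Estimator, the estimator to be fit.
    :param train: DataFrame, training data set used for fitting.
    :param eva: Evaluator, used to compute the metric of each fitted model.
    :param validation: DataFrame, validation data set used for evaluation.
    :param epm: Sequence of ParamMaps to fit with.
    :return: list of callables, each returning an (index, metric) pair.
    """
    modelIter = est.fitMultiple(train, epm)

    def singleTask():
        # Each task pulls the next (index, model) pair from the shared,
        # thread-safe iterator, then evaluates that model.
        index, model = next(modelIter)
        metric = eva.evaluate(model.transform(validation, epm[index]))
        return index, metric

    return [singleTask] * len(epm)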
python/pyspark/ml/base.py
Outdated
    Fits a model to the input dataset for each param map in params.

    :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    :param params: A list/tuple of param maps.
Let's explicitly check that this is a list or tuple and throw a good error message if not.
I changed the docstring to Sequence instead of list/tuple, is that ok? Do you want to explicitly restrict the input to be a list or tuple?
Is there another Sequence type this could be other than list or tuple?
python/pyspark/ml/base.py
Outdated
from pyspark.sql.types import StructField, StructType


class FitMutlipleIterator(object):
What about making this FitMutlipleIterator class an inner class of the default fitMultiple implementation? I don't think it will have any other usage outside of that method.
I'm open to this, but I didn't initially do it this way because I've been bitten by nested classes in Python before. There are subtle issues with nested classes in Python; the one that comes to mind is serialization (which isn't an issue here), but that's not the only one.
@jkbradley @WeichenXu123 I made FitMultipleIterator a private class, is that good enough or should I make it internal to the fitMultiple method?
Test build #85481 has finished for PR 20058 at commit
Force-pushed 209278d to fe3d6bd
Test build #85483 has finished for PR 20058 at commit
I have some initial quick questions, but this looks interesting :)
python/pyspark/ml/base.py
Outdated
@@ -47,6 +86,28 @@ def _fit(self, dataset):
        """
        raise NotImplementedError()

    @since("2.3.0")
    def fitMultiple(self, dataset, params):
So in Scala Spark we use the fit function rather than separate functions. Also the params name is different than the Scala one. Any reason for the difference?
Check out the discussion on the JIRA and the linked design doc. Basically, we need the same argument types but different return types from what the current fit() method provides. (It's a somewhat long chain of discussion stemming from adding the "parallelism" Param to meta-algorithms in master.)
We couldn't use fit because it would have the same signature as the existing fit method but return a different type (Iterator[(Int, Model)] instead of Seq[Model]). I was trying to be consistent with Estimator.fit, which uses the name params, which is different from the name of the same argument in Scala :/. Happy to change it.
That's a good point that we could rename "params" to be clearer in this new API. How about "paramMaps"?
I made this change.
python/pyspark/ml/base.py
Outdated
    def fitSingleModel(index):
        return estimator.fit(dataset, params[index])

    return FitMultipleIterator(fitSingleModel, len(params))
So what's the benefit of FitMultipleIterator vs. using imap_unordered?
The idea is you should be able to do something like this:

    pool = ...
    modelIter = estimator.fitMultiple(params)
    rng = range(len(params))
    for index, model in pool.imap_unordered(lambda _: next(modelIter), rng):
        pass

That's pretty much how I've set up cross validator to use it: https://github.com/apache/spark/pull/20058/files/fe3d6bddc3e9e50febf706d7f22007b1e0d58de3#diff-cbc8c36bfdd245e4e4d5bd27f9b95359R292
The reason for setting it up this way is so that, when appropriate, Estimators can implement their own optimized fitMultiple methods that just need to return an "iterator" (a class with __iter__ and __next__). For example, models that use maxIter and maxDepth params.
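A self-contained, runnable version of that pattern, using a ThreadPool and a toy stand-in for the iterator fitMultiple would return. ModelIter is hypothetical, and real fitted models are replaced by placeholder values:

```python
import threading
from multiprocessing.pool import ThreadPool

class ModelIter(object):
    """Toy thread-safe stand-in for the iterator fitMultiple returns."""
    def __init__(self, numModels):
        self.numModels = numModels
        self.counter = 0
        self.lock = threading.Lock()

    def __next__(self):
        with self.lock:
            if self.counter >= self.numModels:
                raise StopIteration
            index = self.counter
            self.counter += 1
        # A real iterator would fit and return a model; use a placeholder.
        return index, index * 10

    next = __next__  # Python 2 compatibility

modelIter = ModelIter(4)
pool = ThreadPool(2)
# Each worker pulls the next (index, model) pair from the shared iterator.
results = sorted(pool.imap_unordered(lambda _: next(modelIter), range(4)))
pool.close()
pool.join()
```

A ThreadPool (rather than a process pool) is what makes the shared-iterator trick work: the workers see the same `modelIter` object, so nothing needs to be pickled.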
Test build #85529 has finished for PR 20058 at commit
LGTM
What changes were proposed in this pull request?
Adds a fitMultiple API to Estimator with a default implementation, and updates the ml.tuning meta-estimators to use this API.

How was this patch tested?

Unit tests.