Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24155][ML] Instrumentation improvements for clustering #21218

Closed
wants to merge 4 commits into from

Conversation

lu-wang-dl
Copy link
Contributor

What changes were proposed in this pull request?

changed the instrument for all of the clustering methods

How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request.

@lu-wang-dl lu-wang-dl changed the title [SPARK-24155][ML] Instrument improvements for clustering [SPARK-24155][ML] Instrumentation improvements for clustering May 2, 2018
@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90075 has finished for PR 21218 at commit 07c20a4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -423,6 +423,8 @@ class GaussianMixture @Since("2.0.0") (
val summary = new GaussianMixtureSummary(model.transform(dataset),
$(predictionCol), $(probabilityCol), $(featuresCol), $(k), logLikelihood)
model.setSummary(Some(summary))
instr.logNamedValue("logLikelihood", logLikelihood)
instr.logNamedValue("clusterSizes", summary.clusterSizes.toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clusterSizes.toString will get an unreadable object address string.
We need to print the content of the array. I suggest add a method like:
def logNamedArray[T](array: Array[T]): Unit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WeichenXu123 The function clusterSizes.mkString(", ") could change the array to a string, separating each String in the array with comma. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also ok I think.

@@ -378,6 +378,7 @@ class KMeans @Since("1.5.0") (
model.transform(dataset), $(predictionCol), $(featuresCol), $(k))

model.setSummary(Some(summary))
instr.logNamedValue("clusterSizes", summary.clusterSizes.toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@SparkQA
Copy link

SparkQA commented May 3, 2018

Test build #90152 has finished for PR 21218 at commit 9053281.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -278,6 +279,7 @@ class BisectingKMeans @Since("2.0.0") (
val summary = new BisectingKMeansSummary(
model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
model.setSummary(Some(summary))
instr.logNamedValue("clusterSizes", summary.clusterSizes.mkString(", "))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires parsing the result string if we want to do some analysis. How do we do this in other places? Do we always log in JSON format, e.g.? cc: @WeichenXu123 @jkbradley

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We log Params in JSON format, but we don't have a precedent for logging Array values. I'd be OK with ducking the issue by not logging this value, or with logging it as JSON. If the latter, then we could add a method Instrumentation.logNamedValue taking Array types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg. @ludatabricks could you update the value to mkString("[", ",", "]") for now and leave a TODO to extend logNamedValue?

@SparkQA
Copy link

SparkQA commented May 11, 2018

Test build #90529 has finished for PR 21218 at commit 4e2cb81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented May 14, 2018

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 075d678 May 14, 2018
@lu-wang-dl lu-wang-dl deleted the SPARK-23686-1 branch May 16, 2018 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants