Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22750][SQL] Reuse mutable states when possible #19940

Closed
wants to merge 9 commits into from

Conversation

mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented Dec 10, 2017

What changes were proposed in this pull request?

The PR introduces a new method addImmutableStateIfNotExists to CodeGenerator to allow reusing and sharing the same global variable between different Expressions. This helps reducing the number of global variables needed, which is important to limit the impact on the constant pool.

How was this patch tested?

added UTs

@SparkQA
Copy link

SparkQA commented Dec 10, 2017

Test build #84695 has finished for PR 19940 at commit 978bfd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

@cloud-fan @kiszk @viirya might you please help reviewing this? Thanks.

@viirya
Copy link
Member

viirya commented Dec 11, 2017

A high level question is, do we need to share mutable status, if we can compact global variables into array later?

Will sharing mutable status increase the difficulty of debugging codegen in the future?

@mgaido91
Copy link
Contributor Author

@viirya we have seen that using arrays affects performance. Thus if we can reduce their usage it is better.

I don't think that debugging is harder. These variables I made shared are never assigned, but in the initialization. Do you have an other opinion? Or are you thinking for something specific?

def addSingleMutableState(
javaType: String,
variableName: String,
initCode: String = ""): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we support different initCode for the same name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not supported.

variableName: String,
initCode: String = ""): Unit = {
if (!singleMutableStates.contains(variableName)) {
addMutableState(javaType, variableName, initCode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add an assert here to make sure initCode is same with the previous one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want, I can add it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also check if the java type is the same. If one expression uses the same name with different type, we should alert it early.

@viirya
Copy link
Member

viirya commented Dec 11, 2017

@mgaido91 Do you mean the shared global variables are required to be only assigned once (initialization) and never changed again?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Dec 11, 2017

@viirya this is the requirement I followed in this change which ensures that it is safe to share the variable across all the operators, since all the access are read only and there cannot be influences. Maybe this might be relaxed in the future, but if we follow this requirement, we are sure that this is safe.

@viirya
Copy link
Member

viirya commented Dec 11, 2017

Shall we make them as final variables to guarantee this? I think this is an important requirement to prevent something wrong when wrongly using the shared global variables.

@viirya
Copy link
Member

viirya commented Dec 11, 2017

Oh, the initialization is not right away in declaration.

@SparkQA
Copy link

SparkQA commented Dec 11, 2017

Test build #84714 has finished for PR 19940 at commit ce76015.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Dec 11, 2017

I have one question. We are implementing #19811 to compact mutable states. When it will be merged, does this PR can reduce large number of constant pool entries?

@mgaido91
Copy link
Contributor Author

@kiszk for instance it can remove one entry for every timestamp function (to_timestamp or from_utc_timestamp). Of course #19811 is the most important PR, because it solves the problem. But I think we all agree that if we can avoid to waste global variables it is better and there have been many PRs to avoid the usage of global variables. This is one of the many.

@mgaido91
Copy link
Contributor Author

Jenkins, retest this please

@kiszk
Copy link
Member

kiszk commented Dec 12, 2017

Got it. It seems to be . I have two questions.

  1. Is this mutable? It looks it is global, but immutable. IIUC, is there any other better API name?
  2. As @viirya pointed out here, why don't we declare final?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Dec 12, 2017

  1. Yes, to ensure that there are no problems, variables' value should not be changed. I agree that we can find a better API name, to enforce this: what about addImmutableStateIfNotExists? Or do you have any suggestion?
  2. Because as @viirya pointed out here, they are not initialized in the declaration, but in the init method of the generated class, thus it cannot be made final unfortunately.

@kiszk
Copy link
Member

kiszk commented Dec 12, 2017

For 2., I noticed there are two types of initialization. One is in init method by using addPartitionInitializationStatement. The other is in constructor that includes an initialization statement in addSingleMutableState. Can we declare final for latter cases?

@SparkQA
Copy link

SparkQA commented Dec 12, 2017

Test build #84775 has finished for PR 19940 at commit ce76015.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

I don't think so, it would be very risky, since splitExpressions might put the initialization outside the constructor.
What about 1? Any thought?

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84846 has finished for PR 19940 at commit 90f517c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

@cloud-fan @kiszk @viirya any more comments on this?

@@ -170,6 +170,14 @@ class CodegenContext {
val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
mutable.ArrayBuffer.empty[(String, String, String)]

/**
* A map containing the mutable states which have been defined so far using
* `addSingleMutableState`. Each entry contains the name of the mutable state as key and its
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is addSingleMutableState old one?

@@ -401,4 +401,16 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
ctx.addReferenceObj("foo", foo)
assert(ctx.mutableStates.isEmpty)
}

test("SPARK-22750: addSingleMutableState") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is addSingleMutableState old one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thanks, good catch!

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84956 has finished for PR 19940 at commit e2725d2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

the test failure is not related to the PR. It looks the same R failure we had some days ago, I thought it was solved:

checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : 
  dims [product 22] do not match the length of object [0]

@mgaido91
Copy link
Contributor Author

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84972 has finished for PR 19940 at commit e2725d2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 20, 2017

Test build #85202 has finished for PR 19940 at commit c650196.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 21, 2017

Test build #85212 has finished for PR 19940 at commit 6143b22.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

Jenkins, retest this please

@mgaido91
Copy link
Contributor Author

the test errors are unrelated to this change. Any other comments @cloud-fan @kiszk @viirya ?

@SparkQA
Copy link

SparkQA commented Dec 21, 2017

Test build #85252 has finished for PR 19940 at commit 6143b22.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Dec 21, 2017

Test build #85261 has finished for PR 19940 at commit 6143b22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* `addImmutableStateIfNotExists`. Each entry contains the name of the mutable state as key and
* its Java type and init code as value.
*/
val singleMutableStates: mutable.Map[String, (String, String)] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

immutableStates?

@@ -1193,7 +1196,8 @@ case class ToUTCTimestamp(left: Expression, right: Expression)
val dtu = DateTimeUtils.getClass.getName.stripSuffix("$")
val tzTerm = ctx.addMutableState(tzClass, "tz",
v => s"""$v = $dtu.getTimeZone("$tz");""")
val utcTerm = ctx.addMutableState(tzClass, "utc",
val utcTerm = "tzUTC"
ctx.addImmutableStateIfNotExists(tzClass, utcTerm,
v => s"""$v = $dtu.getTimeZone("UTC");""")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated question: in the codebase sometimes we use UTC sometimes we use GMT, is it corrected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there is no difference between them in practice. But I think that being consistent would be better for readability

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Dec 21, 2017

Test build #85273 has finished for PR 19940 at commit 7397cf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in c6f01ca Dec 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants