[SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state #19811

kiszk · 2017-11-24T08:20:13Z

What changes were proposed in this pull request?

This PR is a follow-up to #19518. It tries to reduce the number of constant pool entries used for accessing mutable state.
There are two directions:

Primitive-type variables should be allocated in the outer class for better performance. Otherwise, this PR allocates an array.
The length of each allocated array is at most 32768, so that accessing an element does not consume a constant pool entry for the index (e.g. mutableStateArray[32767]).

These directions were determined in the following discussions.

[1], [2], [3], [4], [5]
[6], [7], [8]

This PR modifies the addMutableState function in CodeGenerator to check whether the declared state can be easily initialized and compacted into an array. We identify three types of state that cannot be compacted:

Primitive-type state (ints, booleans, etc.), if the number of such states does not exceed the threshold
Multi-dimensional array types
State declared with inline = true

When useFreshName = false, the given name is used.
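
The rule above can be sketched roughly as follows. This is a simplified Java illustration, not the PR's Scala implementation: the class name MutableStateSketch, the array naming scheme, the per-type bookkeeping, and the tiny threshold value are all assumptions made for the demo. Only the 32768 array limit and the three non-compacted cases come from the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MutableStateSketch {
    // Assumed, deliberately tiny value for the demo; the real threshold is not stated here.
    static final int OUTER_CLASS_VARIABLES_THRESHOLD = 2;
    // Upper bound on the length of a compacted array, taken from the description.
    static final int MUTABLESTATEARRAY_SIZE_LIMIT = 32768;

    private static final Set<String> PRIMITIVES = new HashSet<>(Arrays.asList(
        "boolean", "byte", "char", "short", "int", "long", "float", "double"));

    private int inlinedCount = 0;
    // Number of array slots handed out so far, per Java type (hypothetical bookkeeping).
    private final Map<String, Integer> slotsPerType = new HashMap<>();

    /** Returns the expression that generated code would use to access the state. */
    String addMutableState(String javaType, String name, boolean forceInline) {
        boolean inline = forceInline
            || javaType.contains("[][]")
            || (PRIMITIVES.contains(javaType) && inlinedCount < OUTER_CLASS_VARIABLES_THRESHOLD);
        if (inline) {
            inlinedCount++;
            return name; // plain field in the outer class
        }
        int slot = slotsPerType.merge(javaType, 1, Integer::sum) - 1;
        int arrayNo = slot / MUTABLESTATEARRAY_SIZE_LIMIT; // start a new array when one is full
        int index = slot % MUTABLESTATEARRAY_SIZE_LIMIT;
        return "mutableStateArray_" + javaType.replaceAll("\\W", "_") + "_" + arrayNo
            + "[" + index + "]";
    }

    public static void main(String[] args) {
        MutableStateSketch ctx = new MutableStateSketch();
        System.out.println(ctx.addMutableState("int", "value_0", false));      // inlined
        System.out.println(ctx.addMutableState("boolean", "isNull_0", false)); // inlined
        System.out.println(ctx.addMutableState("int", "value_1", false));      // compacted
        System.out.println(ctx.addMutableState("UTF8String", "str", false));   // compacted
    }
}
```

Because the decision is made at the moment each state is added, the first primitives win the inlined slots; later ones fall back to arrays even if they are accessed more often.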

Much of the code was ported from #19518, which took a lot of effort. I think this PR should credit @bdrillard.

With this PR, the following code is generated:

/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean isNull_0;
/* 010 */   private boolean isNull_1;
/* 011 */   private boolean isNull_2;
/* 012 */   private int value_2;
/* 013 */   private boolean isNull_3;
...
/* 10006 */   private int value_4999;
/* 10007 */   private boolean isNull_5000;
/* 10008 */   private int value_5000;
/* 10009 */   private InternalRow[] mutableStateArray = new InternalRow[2];
/* 10010 */   private boolean[] mutableStateArray1 = new boolean[7001];
/* 10011 */   private int[] mutableStateArray2 = new int[1001];
/* 10012 */   private UTF8String[] mutableStateArray3 = new UTF8String[6000];
/* 10013 */
...
/* 107956 */     private void init_176() {
/* 107957 */       isNull_4986 = true;
/* 107958 */       value_4986 = -1;
...
/* 108004 */     }
...

How was this patch tested?

Added a new test case to GeneratedProjectionSuite

SparkQA · 2017-11-24T10:25:58Z

Test build #84154 has finished for PR 19811 at commit a019ff2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-24T14:46:34Z

Test build #84163 has finished for PR 19811 at commit 61c8268.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-24T15:32:17Z

Test build #84167 has finished for PR 19811 at commit b131265.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-24T17:13:53Z

Test build #84169 has finished for PR 19811 at commit 197e326.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-11-24T23:07:11Z

Jenkins, retest this please

SparkQA · 2017-11-25T02:01:09Z

Test build #84179 has finished for PR 19811 at commit 197e326.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-25T17:34:43Z

Test build #84185 has finished for PR 19811 at commit d8a9f9e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-25T20:22:35Z

Test build #84186 has finished for PR 19811 at commit d01fcb1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-26T15:24:11Z

Test build #84193 has finished for PR 19811 at commit ca178da.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-26T17:55:30Z

Test build #84194 has finished for PR 19811 at commit 5ad41fa.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-11-26T20:56:50Z

cc @rednaxelafx @hvanhovell @cloud-fan

SparkQA · 2017-11-27T15:25:16Z

Test build #84214 has finished for PR 19811 at commit 006b2fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-11-27T15:32:42Z

cc @maropu @viirya @mgaido91

mgaido91 · 2017-11-27T16:50:47Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+        // identify multi-dimensional array or no simply-assigned object
+        !isPrimitiveType(javaType) &&
+          (javaType.contains("[][]") ||
+           !initCode.matches("(^[\\w_]+\\d+\\s*=\\s*null;|"


I am a bit wary of relying on regexps like this. May I ask why they are needed?

Let me borrow the explanation from #19518. These regexps try to detect the following cases:

Object state of like-type initialized to null

Object state of like-type initialized to the type's base (no-argument) constructor

Thanks for the explanation. Still, I can't understand the reason for this. Isn't it enough that the init code used is always the same?

This is a conservative guard. I think it is intended to avoid unexpected behavior caused by moving the initialization from the (implicit) constructor to another place. cc @bil
For example, if an initialization statement refers to a variable, we have to guarantee that the variable's value does not change between the original place and the new place. That seems hard to do.
WDYT? cc @bdrillard

P.S. I noticed that such a guard is required for primitive value cases, too.

@mgaido91, could you describe what you mean by "Isn't it enough that the init code used is always the same?" There are definitely some complicated init codes used throughout the codebase where, I think as @kiszk was saying, the initcode makes use of a previously defined variable.

Really it would be nice if we had a way of knowing whether an initialization was simple (assigned to a default for primitives, or null or the 0-parameter constructor for objects). Maybe we could define an abstract InitCode holding a single code field and then extend that with Simple and NonSimple case classes, then we could pattern match on the additional type information rather than trying to regex match the code itself. That solution might be safer and more concrete, but I don't know if it saves us any of the messiness.
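
The proposal in the previous paragraph could look roughly like the sketch below. It is written in Java for illustration (the comment suggests Scala case classes), it was only an idea raised in review, and every name here (InitCodeSketch, canCompact) is hypothetical apart from InitCode, Simple, and NonSimple, which are taken from the comment.

```java
final class InitCodeSketch {
    /** Base type holding the single code field mentioned in the comment. */
    static abstract class InitCode {
        final String code;
        InitCode(String code) { this.code = code; }
    }

    /** Default value for primitives, null, or a 0-parameter constructor for objects. */
    static final class Simple extends InitCode {
        Simple(String code) { super(code); }
    }

    /** Anything else, e.g. code that reads a previously defined variable. */
    static final class NonSimple extends InitCode {
        NonSimple(String code) { super(code); }
    }

    /** The type carries the information, so no regex over the code string is needed. */
    static boolean canCompact(InitCode init) {
        return init instanceof Simple;
    }

    public static void main(String[] args) {
        System.out.println(canCompact(new Simple("value = null;")));
        System.out.println(canCompact(new NonSimple("value = inputs[0];")));
    }
}
```

The trade-off is that every call site has to state which kind of initialization it passes, instead of the generator guessing from the text.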

@bdrillard I mean that in https://github.com/apache/spark/pull/19811/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR248 we use the init code to initialize together the variables that share the same code. If some init codes use previously defined variables, they would differ from all the others (unless the same previously defined variable is used), so I don't see the problem.

What I do see is that this way we might change the initialization order, and that could be a problem. But I think this problem can also be present in the current implementation, since we are actually changing the order in which things are initialized, aren't we?
So I am wondering: why don't we initialize the array elements like all the other variables so far (i.e. as if they weren't arrays, as before this PR, one piece of code after the other, without any for loop) and split the init code to keep it from growing beyond the 64 KB limit?

mgaido91 · 2017-11-27T17:02:44Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+      varName
+    } else {
+      // Create an initialization code agnostic to the actual variable name which we can key by
+      val initCodeKey = initCode.replaceAll(varName, "*VALUE*")


what about codeFunctions("")? It looks safer to me.

I am worried about the side effects of calling codeFunctions(""). codeFunctions may return a different result from the one returned by the first call. Thus, I want to call codeFunctions() only once.
WDYT?

I don't understand your worries. Before the change we were simply passing a string, so I don't see how this function can have side effects. Do you have something specific in mind? Some cases where this might happen?

Sorry for my weak explanation.
There are three cases.

1.
v => {
  nameInGlobal = ctx.fresh("tmp")
  $v = $name + 1;
}

2.
v => {
  val name = ctx.fresh("tmp")
  hashInGlobal += name -> v
  $v = $name + 1;
}

3. other cases

You are right. I have not seen them now, but we cannot guarantee it would not happen.

My question is: what problem do you see in reusing the result of codeFunctions(varName)? Could you share it with us?

I can't understand these cases at the moment and need to read them carefully, but I can answer your question for now: the problem I see is that replacing a string like that may have undesired effects. For instance, if we have two inits like String varName1 = "varName1"; and String varName2 = "varName2"; they would end up with the same initCodeKey even though they shouldn't. This is an extreme and very rare case, but it could happen.
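
The collision described in the last comment can be reproduced with plain string replacement. The sketch below is Java rather than the PR's Scala, and the class and method names are made up for the demo; only the replaceAll(varName, "*VALUE*") call mirrors the quoted diff.

```java
final class InitCodeKeyCollision {
    /** Mirrors initCode.replaceAll(varName, "*VALUE*") from the quoted diff. */
    static String initCodeKey(String initCode, String varName) {
        return initCode.replaceAll(varName, "*VALUE*");
    }

    public static void main(String[] args) {
        String key1 = initCodeKey("String varName1 = \"varName1\";", "varName1");
        String key2 = initCodeKey("String varName2 = \"varName2\";", "varName2");
        System.out.println(key1);              // String *VALUE* = "*VALUE*";
        System.out.println(key2);              // String *VALUE* = "*VALUE*";
        System.out.println(key1.equals(key2)); // true
    }
}
```

Both keys are identical because the replacement also rewrites the string literal, so two states with different initial values would be grouped as if they shared one init code.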

mgaido91 · 2017-11-27T17:03:55Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+        // for type and initialized code. In addition, type, array name, and qualified initialized
+        // code is stored for code generation
+        val arrayName = freshName("mutableStateArray")
+        val qualifiedInitCode = initCode.replaceAll(


here I'd prefer codeFunctions(s"$arrayName[${CodeGenerator.INIT_LOOP_VARIABLE_NAME}]")

I am worried about the side effects of calling codeFunctions(""). codeFunctions may return a different result from the one returned by the first call. Thus, I want to call codeFunctions() only once.
WDYT?

cloud-fan · 2017-11-30T04:39:35Z

do we have to initialize the elements in a loop? Can we just initialize them one by one like

int[] ints = new int[1001];
ints[0] = 3;
ints[1] = 1;
ints[2] = 5;
...

is it also to reduce constant pool entries?

cloud-fan · 2017-11-30T04:42:17Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+      javaType: String,
+      variableName: String,
+      codeFunctions: String => String = _ => "",
+      inline: Boolean = false): String = {


I'm also a bit scared about using regex to match simple assignment. Maybe we should have 2 versions of addMutableState: one provides a simple initial value, one provides arbitrary initializing code.

That is one possible approach. Let me count the number of calls of each type.

if we go this way, what about creating a MutableStateBuilder then, with methods like withSimpleInitialValue taking a string and withComplexInit taking a function or something like that?

I noticed there are only about 15 complex assignment cases. I will specify inline = true for such cases and eliminate the regex-based conditions.

@kiszk thanks! A big 👍 for eliminating the regexps! :) I'd prefer something more explicit about the reason why it is inlined, like the solutions proposed above. I think readability would be better. WDYT?

mgaido91 · 2017-11-30T08:45:50Z

@cloud-fan yes, I have the same opinion and I'd suggest the same thing (#19811 (comment)), at least at the beginning. Then, if we find a smarter way to behave, we can submit another PR later.

kiszk · 2017-11-30T08:47:43Z

I think it is good to have a loop, mainly to reduce the bytecode size and also to reduce constant pool entries.

int[] ints = new int[32768];
ints[0] = 1;
ints[1] = 1;
...
ints[32767] = 1;

will not consume constant pool entries for the index constants (0 - 32767).
However, this code generates a lot of Java bytecode for 32768 assignments (i.e. aload, getfield, iconst, sipush, iastore per assignment). The assignments will be split into multiple methods, and calling each method requires a few constant pool entries on the caller side.

I would like to use a loop wherever possible. WDYT?
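
The reason indices up to 32767 avoid the constant pool follows from the JVM instruction set: small int constants are encoded directly in the instruction stream, and only larger ones need an ldc from the constant pool. The Java sketch below classifies an int the way a typical Java compiler would pick the push instruction; the class and method names are hypothetical, and the exact choice made by a particular compiler is an assumption.

```java
final class IntConstantEncoding {
    /** Names the instruction family a compiler would typically use to push this int. */
    static String pushInstruction(int value) {
        if (value >= -1 && value <= 5) {
            return "iconst"; // iconst_m1 .. iconst_5, no operand at all
        }
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
            return "bipush"; // 1-byte immediate operand
        }
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
            return "sipush"; // 2-byte immediate operand
        }
        return "ldc";        // loaded from the constant pool
    }

    static boolean needsConstantPoolEntry(int value) {
        return pushInstruction(value).equals("ldc");
    }

    public static void main(String[] args) {
        System.out.println(pushInstruction(32767)); // sipush
        System.out.println(pushInstruction(32768)); // ldc
    }
}
```

An array of length 32768 has a maximum index of 32767, which is the largest value sipush can encode, so every element access stays out of the constant pool.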

viirya · 2017-11-30T08:32:56Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

@@ -168,6 +166,21 @@ class CodegenContext {
  val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
    mutable.ArrayBuffer.empty[(String, String, String)]

+  // An array keyed by the tuple of mutable states' types and initialization code, holds the
+  // current max index of the array
+  var mutableStateArrayIdx: mutable.Map[(String, String), Int] =


From the below code, looks like this is keyed by (javaType, arrayName), instead of (javaType, initCode)?

Good catch, old comment still exists there. Thanks

viirya · 2017-11-30T08:43:43Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+
+  // An array keyed by the tuple of mutable states' types, array names and initialization code,
+  // holds the code that will initialize the mutableStateArray when initialized in loops
+  var mutableStateArrayInitCodes: mutable.ArrayBuffer[(String, String, String)] =


This is also keyed by (javaType, arrayName).

It is ok to use an array since this is not looked up. This is used only for generating code.

viirya · 2017-11-30T08:51:24Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+        val qualifiedInitCode = initCode.replaceAll(
+          varName, s"$arrayName[${CodeGenerator.INIT_LOOP_VARIABLE_NAME}]")
+        mutableStateArrayCurrentNames += (javaType, initCodeKey) -> arrayName
+        mutableStateArrayInitCodes += ((javaType, arrayName, qualifiedInitCode))


Do all variables in the same array use the same init code?

If they are in the same entry of mutableStateArrayInitCodes, yes. arrayName is assigned based on initCodeKey.

viirya · 2017-11-30T08:52:34Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+          val loopIdxVar = CodeGenerator.INIT_LOOP_VARIABLE_NAME
+          s"""
+             for (int $loopIdxVar = 0; $loopIdxVar < $arrayName.length; $loopIdxVar++) {
+               $qualifiedInitCode


If we have two variables in the same array and they have different init codes, how does this loop work?

Oh. I see. You put the variables with same init code in the same array.
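
Since every element of one array shares the same init code, a single loop can initialize the whole array. A minimal Java sketch of how such a loop could be emitted from a name-agnostic init code is shown below. The helper name initLoop and the loop variable name "i" are assumptions (the diff refers to CodeGenerator.INIT_LOOP_VARIABLE_NAME without showing its value); the "*VALUE*" placeholder comes from the diff quoted earlier in the thread.

```java
final class InitLoopSketch {
    // Hypothetical loop variable name, standing in for CodeGenerator.INIT_LOOP_VARIABLE_NAME.
    static final String LOOP_VAR = "i";

    /** Emits one loop that applies the shared init code to every element of the array. */
    static String initLoop(String arrayName, String initCodeKey) {
        String element = arrayName + "[" + LOOP_VAR + "]";
        // Literal replacement of the placeholder, no regex involved.
        String body = initCodeKey.replace("*VALUE*", element);
        return "for (int " + LOOP_VAR + " = 0; " + LOOP_VAR + " < " + arrayName + ".length; "
            + LOOP_VAR + "++) {\n  " + body + "\n}";
    }

    public static void main(String[] args) {
        System.out.println(initLoop("mutableStateArray2", "*VALUE* = -1;"));
    }
}
```

The emitted loop has the same size regardless of how many states were compacted into the array, which is the bytecode saving argued for in the thread.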

viirya · 2017-11-30T08:55:18Z

I'm afraid this style of initialization

int[] ints = new int[1001];
ints[0] = 3;
ints[1] = 1;
ints[2] = 5;
...

produces too much bytecode.

cloud-fan · 2017-11-30T10:18:32Z

I think we should do optimization incrementally. From

private int a;
private int b;
private int c;
...

a = 3;
b = 1;
c = 5;
...

to

int[] ints = new int[1001];
...
ints[0] = 3;
ints[1] = 1;
ints[2] = 5;
...

already achieves our goal of reducing the constant pool size; reducing the bytecode size of the constructor is another story.

cloud-fan · 2017-12-15T09:21:05Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

-      addMutableState(javaType(expr.dataType), value,
-        s"$value = ${defaultValue(expr.dataType)};")
+      addMutableState(JAVA_BOOLEAN, isNull, forceInline = true, useFreshName = false)
+      addMutableState(javaType(expr.dataType), value, forceInline = true, useFreshName = false)


can we do

val isNull = addMutableState(JAVA_BOOLEAN, "subExprIsNull")
val value = addMutableState(javaType(expr.dataType), "subExprValue")
val fn = ...

at the beginning?

SparkQA · 2017-12-15T16:38:11Z

Test build #84962 has finished for PR 19811 at commit 31914c0.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-12-15T16:57:49Z

I think this failure is due to an R package issue that we have seen before.

* building 'SparkR_2.3.0.tar.gz'

+ find pkg/vignettes/. -not -name . -not -name '*.Rmd' -not -name '*.md' -not -name '*.pdf' -not -name '*.html' -delete
++ grep Version /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/DESCRIPTION
++ awk '{print $NF}'
+ VERSION=2.3.0
+ CRAN_CHECK_OPTIONS=--as-cran
+ '[' -n 1 ']'
+ CRAN_CHECK_OPTIONS='--as-cran --no-tests'
+ '[' -n 1 ']'
+ CRAN_CHECK_OPTIONS='--as-cran --no-tests --no-manual --no-vignettes'
+ echo 'Running CRAN check with --as-cran --no-tests --no-manual --no-vignettes options'
Running CRAN check with --as-cran --no-tests --no-manual --no-vignettes options
+ '[' -n 1 ']'
+ '[' -n 1 ']'
+ /usr/bin/R CMD check --as-cran --no-tests --no-manual --no-vignettes SparkR_2.3.0.tar.gz
* using log directory '/home/jenkins/workspace/SparkPullRequestBuilder/R/SparkR.Rcheck'
* using R version 3.1.1 (2014-07-10)
* using platform: x86_64-redhat-linux-gnu (64-bit)
* using session charset: ASCII
* using options '--no-tests --no-vignettes'
* checking for file 'SparkR/DESCRIPTION' ... OK
* checking extension type ... Package
* this is package 'SparkR' version '2.3.0'
* checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : 
  dims [product 22] do not match the length of object [0]
Execution halted
Loading required package: methods

Attaching package: 'SparkR'
...

kiszk · 2017-12-15T17:22:59Z

Another PR also hits the same failure, even though that PR just added new tests.

gatorsmile · 2017-12-15T17:55:33Z

I have to revert that PR again. e58f275

I think it is caused by that PR, so I reverted it again, although I am not sure what the root cause is. I will try to investigate more.

kiszk · 2017-12-18T04:57:54Z

ping @cloud-fan

cloud-fan · 2017-12-18T06:25:37Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   *                    compacted. Please set `true` into forceInline, if you want to access the
+   *                    status fast (e.g. frequently accessed) or if you want to use the original
+   *                    variable name
+   * @param useFreshName If this is false and forceInline is true, the name is not changed


more accurate: If this is false and the mutable state ends up inlining in the outer class, the name is not changed

cloud-fan · 2017-12-18T06:29:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

@@ -217,7 +217,7 @@ case class Stack(children: Seq[Expression]) extends Generator {
    ctx.addMutableState(
      s"$wrapperClass<InternalRow>",
      ev.value,
-      s"${ev.value} = $wrapperClass$$.MODULE$$.make($rowData);")


We can localize the global variable ev.value here to save one global variable slot.

I will do it in another PR after merging this.

cloud-fan · 2017-12-18T06:33:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

-        ctx.addMutableState(patternClass, pattern,
-          s"""$pattern = ${patternClass}.compile("$regexStr");""")
+        val pattern = ctx.addMutableState(patternClass, "patternLike",
+          v => s"""$v = ${patternClass}.compile("$regexStr");""", forceInline = true)


Do we have a clear rule when a global variable should be inlined for better performance? e.g. a microbenchmark showing noteworthy difference definitely proves we should inline.

Now, we have three rules for applying inlining:

1. Have to use the original name

2. Frequently ~~used~~ referenced, but generated once, in the hot spot

3. Not expected to be frequently generated (proposed by @viirya)

Currently, we have no clear rule for 2. I will try to run a microbenchmark for 2. Would it be better to add these benchmarks to the benchmark directory?

yes, if it's not a lot of them...

My only concern is about point 2. I think it is a dangerous thing to do. What if we generate a lot of frequently used variables? I think it is safer at the moment to consider only 1 and 3 when deciding whether to inline. In the future, with a different codegen method, we might define a threshold above which we generate an array for the given class and otherwise use plain variables, which IMHO would be the best option, but at the moment it is not feasible...

Sorry for confusing you about 2. I updated the statement: it is frequently referenced, but generated once. Therefore, if inlining gives a performance advantage, I think it is safe, since rule 3 would also apply inlining.

If 3 is a precondition for 2, then it is ok. Thanks for the explanation.

Based on the comments, I simplified the rules:

1. Use the original name

2. Expected to be not frequently used

For the latter, I put a comment explaining the reason at each site.

cloud-fan · 2017-12-18T06:37:36Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

+    }
+    assert(ctx1.inlinedMutableStates.size == CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD)
+    // When the number of primitive type mutable states is over the threshold, others are
+    // allocated into an array


Some notes: It would be better if we could collect all mutable states before deciding which ones should be inlined. However, that is impossible with the current string-based codegen framework, where we need to decide whether to inline immediately. We can revisit this in the future when we have an AST-based codegen framework.

Yeah, I agree. In the future, we hope we have an AST based codegen framework.

cloud-fan · 2017-12-18T06:49:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala

+    val metrics = ctx.addMutableState(classOf[TaskMetrics].getName, "metrics",
+      v => s"$v = org.apache.spark.TaskContext.get().taskMetrics();", forceInline = true)
+    val sortedIterator = ctx.addMutableState("scala.collection.Iterator<UnsafeRow>", "sortedIter",
+      forceInline = true)


this looks reasonable as it's very unlikely we have a lot of sort operators in one stage. We have to inline it manually as we don't have the ability to find this out automatically yet. Same as https://github.com/apache/spark/pull/19811/files#r157408804

One question: are there any other places like this? Do you have a list?

e.g. https://github.com/apache/spark/pull/19811/files#diff-2eb948516b5beaeb746aadac27fbd5b4R613 ?

cloud-fan · 2017-12-18T06:54:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

    // Right now, Range is only used when there is one upstream.
-    ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
+    val input = ctx.addMutableState("scala.collection.Iterator", "input",


seems it's never used

cloud-fan · 2017-12-18T06:56:50Z

LGTM except some minor comments, great job!

viirya · 2017-12-18T07:54:23Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   *            is less than `CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`
+   *         3. its type is multi-dimensional array
+   *         A primitive type variable will be inlined into outer class when the total number of
+   *         When a variable is compacted into an array, the max size of the array for compaction


The sentence looks broken? I.e., "...total number of"

Actually, this line, "A primitive type variable will be inlined into outer class when the total number of", looks redundant.

mgaido91 · 2017-12-18T15:42:09Z

LGTM too, thanks @kiszk and @bdrillard! This is a very important PR IMHO

viirya · 2017-12-18T23:55:01Z

LGTM with one minor comment.

SparkQA · 2017-12-19T15:14:49Z

Test build #85107 has finished for PR 19811 at commit 0e45c19.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-12-19T15:55:57Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+
+    /**
+     * Returns the reference of next available slot in current compacted array. The size of each
+     * compacted array is controlled by the config `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.


nit: CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT is not a config

cloud-fan · 2017-12-19T16:05:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala


    if (right.foldable) {
      val rVal = right.eval()
      if (rVal != null) {
        val regexStr =
          StringEscapeUtils.escapeJava(escape(rVal.asInstanceOf[UTF8String].toString()))
-        ctx.addMutableState(patternClass, pattern,
-          s"""$pattern = ${patternClass}.compile("$regexStr");""")
+        // inline mutable state since not many Like operations in a task


I'm not very sure about this, since Like is an expression and can appear many times, like other expressions.

cloud-fan · 2017-12-19T16:05:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala


    if (right.foldable) {
      val rVal = right.eval()
      if (rVal != null) {
        val regexStr =
          StringEscapeUtils.escapeJava(rVal.asInstanceOf[UTF8String].toString())
-        ctx.addMutableState(patternClass, pattern,
-          s"""$pattern = ${patternClass}.compile("$regexStr");""")
+        // inline mutable state since not many RLike operations in a task


cloud-fan · 2017-12-19T16:06:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala

    // Right now, InputAdapter is only used when there is one input RDD.
-    ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
+    // inline mutable state since an inputAdaptor in a task


cloud-fan · 2017-12-19T16:07:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala

-        ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
-          s"$fastHashMapTerm = new $fastHashMapClassName();")
-        ctx.addMutableState(s"java.util.Iterator<InternalRow>", iterTermForFastHashMap)
+        fastHashMapTerm = ctx.addMutableState(fastHashMapClassName, "vectorizedHastHashMap",


shall we force inline it too?

cloud-fan · 2017-12-19T16:08:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala

-    sorterTerm = ctx.freshName("sorter")
-    ctx.addMutableState(classOf[UnsafeKVExternalSorter].getName, sorterTerm)
+    hashMapTerm = ctx.addMutableState(hashMapClassName, "hashMap",
+      v => s"$v = $thisPlan.createHashMap();")


cloud-fan · 2017-12-19T16:09:45Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

-    ctx.addMutableState(clsName, matches,
-      s"$matches = new $clsName($inMemoryThreshold, $spillThreshold);")
+    val matches = ctx.addMutableState(clsName, "matches",
+      v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);")


cloud-fan · 2017-12-19T16:11:27Z

great, merging to master, thank you all!

We can address other minor comments in follow-ups

kiszk · 2017-12-19T16:15:44Z

Thank you very much for your support and merging this. I will open a follow-up PR soon.

…reduce entries for mutable state

## What changes were proposed in this pull request?

This PR addresses additional review comments in apache#19811

## How was this patch tested?

Existing test suites

Author: Kazuaki Ishizaki

Closes apache#20036 from kiszk/SPARK-18066-followup.

kiszk changed the title ~~[WIP][SQL][SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state~~ [WIP][SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state Nov 24, 2017

kiszk force-pushed the SPARK-18016 branch from a019ff2 to 61c8268 Compare November 24, 2017 12:19

kiszk force-pushed the SPARK-18016 branch from 61c8268 to b131265 Compare November 24, 2017 15:26

kiszk force-pushed the SPARK-18016 branch from b131265 to 197e326 Compare November 24, 2017 15:45

kiszk mentioned this pull request Nov 24, 2017

[SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - State Compaction #19518

Closed

kiszk force-pushed the SPARK-18016 branch from 197e326 to d8a9f9e Compare November 25, 2017 14:44

kiszk changed the title ~~[WIP][SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state~~ [SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state Nov 26, 2017

mgaido91 reviewed Nov 27, 2017

cloud-fan reviewed Nov 30, 2017

viirya reviewed Nov 30, 2017

cloud-fan reviewed Dec 15, 2017

address review comments

31914c0

cloud-fan reviewed Dec 18, 2017

viirya reviewed Dec 18, 2017

address review comments

0e45c19

cloud-fan approved these changes Dec 19, 2017

asfgit closed this in ee56fc3 Dec 19, 2017

This was referenced Dec 20, 2017

[SPARK-22848][SQL] Eliminate mutable state from Stack #20035

Closed

[SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant Pool Limit - reduce entries for mutable state #20036

Closed

cloud-fan mentioned this pull request Feb 13, 2018

[SPARK-23407][SQL] add a config to try to inline all mutable states during codegen #20599

Closed
