[SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state #19811

Closed
wants to merge 26 commits into from

Member

@kiszk kiszk commented Nov 24, 2017

What changes were proposed in this pull request?

This PR is a follow-on to #19518. It tries to reduce the number of constant pool entries used for accessing mutable state.
There are two directions:

  1. Primitive-type variables should be allocated in the outer class for better performance. Otherwise, this PR allocates the state in an array.
  2. The length of each allocated array is at most 32768, so that an element access (e.g. mutableStateArray[32767]) does not need a constant pool entry for its index.
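
To illustrate the second direction: the JVM can push small int constants straight from the instruction stream (the iconst family, bipush, sipush), while larger values are loaded from the constant pool with ldc. The sketch below uses a hypothetical helper, not code from this PR, to classify an array index the way a Java compiler typically would. Indices 0 to 32767 stay within sipush range, which is the reason for the 32768 cap.

```java
final class IndexPush {
    /** Returns the instruction family a Java compiler typically uses to push {@code value}. */
    static String instructionFor(int value) {
        if (value >= -1 && value <= 5) return "iconst";   // iconst_m1 .. iconst_5
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) return "bipush";
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) return "sipush";
        return "ldc"; // requires an integer entry in the constant pool
    }

    public static void main(String[] args) {
        for (int index : new int[] {0, 5, 6, 127, 128, 32767, 32768}) {
            System.out.println(index + " -> " + instructionFor(index));
        }
    }
}
```

Only the last index in the loop, 32768, falls through to ldc.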

These directions came out of the following discussions:

  1. [1], [2], [3], [4], [5]
  2. [6], [7], [8]

This PR modifies the addMutableState function in the CodeGenerator to check whether the declared state can be compacted into an array. We identify three types of state that cannot be compacted:

  • Primitive-type state (ints, booleans, etc.), if the number of such states does not exceed a threshold
  • Multi-dimensional array types
  • State declared with inline = true

When useFreshName = false, the given name is used unchanged.
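
The three rules above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the PR's implementation: the class, the method name, and the `threshold` parameter are hypothetical, and the real threshold value is not shown in this description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CompactionRule {
    private static final Set<String> PRIMITIVE_TYPES = new HashSet<>(Arrays.asList(
        "boolean", "byte", "short", "int", "long", "float", "double"));

    /**
     * Returns true if the state should remain a plain field of the outer class.
     * inlinedPrimitives: number of primitive fields already inlined (assumed bookkeeping).
     * threshold: hypothetical limit on inlined primitive fields.
     */
    static boolean keepAsField(String javaType, boolean inline,
                               int inlinedPrimitives, int threshold) {
        if (inline) return true;                      // caller explicitly asked for inlining
        if (javaType.contains("[][]")) return true;   // multi-dimensional array type
        return PRIMITIVE_TYPES.contains(javaType) && inlinedPrimitives < threshold;
    }

    public static void main(String[] args) {
        System.out.println(keepAsField("int", false, 0, 2));        // primitive, below threshold
        System.out.println(keepAsField("int", false, 2, 2));        // primitive, threshold reached
        System.out.println(keepAsField("UTF8String", false, 0, 2)); // object type, compacted
    }
}
```

Everything for which the predicate returns false would be placed in a compacted array.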

Much of the code was ported from #19518, and a lot of effort went into that work. I think this PR should give credit to @bdrillard.

With this PR, the following code is generated:

/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean isNull_0;
/* 010 */   private boolean isNull_1;
/* 011 */   private boolean isNull_2;
/* 012 */   private int value_2;
/* 013 */   private boolean isNull_3;
...
/* 10006 */   private int value_4999;
/* 10007 */   private boolean isNull_5000;
/* 10008 */   private int value_5000;
/* 10009 */   private InternalRow[] mutableStateArray = new InternalRow[2];
/* 10010 */   private boolean[] mutableStateArray1 = new boolean[7001];
/* 10011 */   private int[] mutableStateArray2 = new int[1001];
/* 10012 */   private UTF8String[] mutableStateArray3 = new UTF8String[6000];
/* 10013 */
...
/* 107956 */     private void init_176() {
/* 107957 */       isNull_4986 = true;
/* 107958 */       value_4986 = -1;
...
/* 108004 */     }
...

How was this patch tested?

Added a new test case to GeneratedProjectionSuite

@kiszk kiszk changed the title from "[WIP][SQL][SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state" to "[WIP][SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state" on Nov 24, 2017

SparkQA commented Nov 24, 2017

Test build #84154 has finished for PR 19811 at commit a019ff2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2017

Test build #84163 has finished for PR 19811 at commit 61c8268.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2017

Test build #84167 has finished for PR 19811 at commit b131265.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2017

Test build #84169 has finished for PR 19811 at commit 197e326.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

kiszk commented Nov 24, 2017

Jenkins, retest this please

SparkQA commented Nov 25, 2017

Test build #84179 has finished for PR 19811 at commit 197e326.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 25, 2017

Test build #84185 has finished for PR 19811 at commit d8a9f9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 25, 2017

Test build #84186 has finished for PR 19811 at commit d01fcb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 26, 2017

Test build #84193 has finished for PR 19811 at commit ca178da.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk changed the title from "[WIP][SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state" to "[SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state" on Nov 26, 2017

SparkQA commented Nov 26, 2017

Test build #84194 has finished for PR 19811 at commit 5ad41fa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

SparkQA commented Nov 27, 2017

Test build #84214 has finished for PR 19811 at commit 006b2fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

kiszk commented Nov 27, 2017

cc @maropu @viirya @mgaido91

// identify multi-dimensional array or no simply-assigned object
!isPrimitiveType(javaType) &&
(javaType.contains("[][]") ||
!initCode.matches("(^[\\w_]+\\d+\\s*=\\s*null;|"
Contributor

I am a bit scared of relying on such regexps. May I ask why they are needed?

Member Author

Let me borrow the explanation from #19518. These regexps try to detect the following cases:

  • Object state of like-type initialized to null
  • Object state of like-type initialized to the type's base (no-argument) constructor
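
As a rough illustration of the kind of check being discussed, the sketch below tests init code against two patterns. The first is the alternative visible in the diff excerpt above; the second is a hypothetical stand-in for the no-argument-constructor case, because the rest of the PR's regex is cut off in the excerpt. The class and method names are also assumptions.

```java
import java.util.regex.Pattern;

final class SimpleInitCheck {
    // First alternative as quoted in the diff excerpt.
    static final Pattern NULL_INIT =
        Pattern.compile("^[\\w_]+\\d+\\s*=\\s*null;");
    // Hypothetical pattern for "name = new Type();"; not the PR's actual regex.
    static final Pattern CTOR_INIT =
        Pattern.compile("^[\\w_]+\\d+\\s*=\\s*new\\s+[\\w.]+\\(\\);");

    /** Returns true if the init code is a null assignment or a no-argument constructor call. */
    static boolean isSimple(String initCode) {
        return NULL_INIT.matcher(initCode).matches()
            || CTOR_INIT.matcher(initCode).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSimple("row_3 = null;"));
        System.out.println(isSimple("buf_1 = new java.lang.StringBuilder();"));
        System.out.println(isSimple("sorter_0 = new Sorter(limit_1);"));
    }
}
```

An initializer that refers to another variable, such as the last one, is rejected, which is the conservative behavior the guard aims for.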

Contributor

Thanks for the explanation. Still, I can't understand the reason for this. Isn't it enough that the init code used is always the same?

Member Author

@kiszk kiszk Nov 27, 2017

This is a conservative guard. I think it is intended to avoid unexpected behavior caused by moving the initialization from the (implicit) constructor to another place. cc @bil
For example, if an initialization statement refers to a variable, we have to guarantee that the variable does not change between the original place and the new place. That seems hard to do.
WDYT? cc @bdrillard

P.S. I noticed that such a guard is required for primitive values, too.

@bdrillard bdrillard Nov 28, 2017

@mgaido91, could you describe what you mean by "Isn't it enough that the init code used is always the same?" There are definitely some complicated init codes used throughout the codebase where, I think as @kiszk was saying, the initcode makes use of a previously defined variable.

Really it would be nice if we had a way of knowing whether an initialization was simple (assigned to a default for primitives, or null or the 0-parameter constructor for objects). Maybe we could define an abstract InitCode holding a single code field and then extend that with Simple and NonSimple case classes, then we could pattern match on the additional type information rather than trying to regex match the code itself. That solution might be safer and more concrete, but I don't know if it saves us any of the messiness.

Contributor

@bdrillard I mean that in https://github.com/apache/spark/pull/19811/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR248 we use the init code to group variables that share the same code and initialize them together. If some init codes use previously defined variables, they would differ from all the others (unless the same previously defined variable is used), so I don't see the problem.

What I do see as a problem is that this way we might change the initialization order. But I think this problem is also present in the current implementation, since we are already changing the order in which things are initialized, aren't we?
So I am wondering: why don't we initialize the array elements like all the other variables so far (i.e. as if they weren't arrays, as before this PR, one piece of code after the other, without any for loop) and split the init code to keep it from growing beyond the 64 KB limit?

varName
} else {
// Create an initialization code agnostic to the actual variable name which we can key by
val initCodeKey = initCode.replaceAll(varName, "*VALUE*")
Contributor

what about codeFunctions("")? It looks safer to me.

Member Author

@kiszk kiszk Nov 30, 2017

I am worried about the side effects of calling codeFunctions(""). A second call to codeFunctions may return a different result from the first call. Thus, I want to call codeFunctions() only once.
WDYT?

Contributor

I don't understand your worries. Before the change we were simply passing a string, so I don't see how this function can have side effects. Do you have something specific in mind? Some cases where this might happen?

Member Author

@kiszk kiszk Nov 30, 2017

Sorry for my weak explanation.
There are three cases.

  1.

v => {
  nameInGlobal = ctx.fresh("tmp")
  $v = $name + 1;
}

  2.

v => {
  val name = ctx.fresh("tmp")
  hashInGlobal += name -> v
  $v = $name + 1;
}

  3. other cases

You are right. I have not seen such cases so far, but we cannot guarantee that they will never happen.

My question is: what problem do you see in reusing the result of codeFunctions(varName)? Could you share it with us?
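
The concern about calling the function twice can be shown with a small sketch. It assumes a hypothetical `freshName` helper that returns a new name on every call, standing in for the context's name generator; none of these names come from the PR.

```java
import java.util.function.Function;

final class CodeFunctionSideEffect {
    private static int counter = 0;

    // Hypothetical stand-in for a fresh-name helper: every call returns a new name.
    static String freshName(String prefix) {
        return prefix + "_" + (counter++);
    }

    // An init-code function that allocates a temporary name as a side effect.
    static Function<String, String> newCodeFunction() {
        return v -> v + " = " + freshName("tmp") + " + 1;";
    }

    public static void main(String[] args) {
        Function<String, String> codeFunction = newCodeFunction();
        String first = codeFunction.apply("value_0");
        String second = codeFunction.apply("value_0");
        System.out.println(first);
        System.out.println(second);
        System.out.println(first.equals(second)); // the two results differ
    }
}
```

With such a function, a key computed from one call and code generated from another call would no longer agree.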

Contributor

I can't understand these cases at the moment; I need to read them carefully. But I can answer your question for now: the problem I see is that replacing a string like that may have undesired effects. For instance, if we have two inits like String varName1 = "varName1"; and String varName2 = "varName2";, they would end up with the same initCodeKey even though they shouldn't. This is an extreme and very rare case, but it could happen.
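
The collision described here is easy to reproduce. The sketch below mirrors the key computation from the diff with a hypothetical helper; it uses the literal `String.replace` rather than `replaceAll`, which treats its first argument as a regex, but the effect on these inputs is the same.

```java
final class InitCodeKey {
    /** Replaces every occurrence of the variable name with a placeholder. */
    static String keyFor(String initCode, String varName) {
        return initCode.replace(varName, "*VALUE*");
    }

    public static void main(String[] args) {
        String a = keyFor("String varName1 = \"varName1\";", "varName1");
        String b = keyFor("String varName2 = \"varName2\";", "varName2");
        System.out.println(a);
        System.out.println(b);
        System.out.println(a.equals(b)); // true: both inits share one key
    }
}
```

Both inits collapse to the same key because the variable name also occurs inside the string literal.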

// for type and initialized code. In addition, type, array name, and qualified initialized
// code is stored for code generation
val arrayName = freshName("mutableStateArray")
val qualifiedInitCode = initCode.replaceAll(
Contributor

here I'd prefer codeFunctions(s"$arrayName[${CodeGenerator.INIT_LOOP_VARIABLE_NAME}]")

Member Author

I am worried about the side effects of calling codeFunctions(""). A second call to codeFunctions may return a different result from the first call. Thus, I want to call codeFunctions() only once.
WDYT?

@cloud-fan
Contributor

do we have to initialize the elements in a loop? Can we just initialize them one by one like

int[] ints = new int[1001];
ints[0] = 3;
ints[1] = 1;
ints[2] = 5;
...

is it also to reduce constant pool entries?

javaType: String,
variableName: String,
codeFunctions: String => String = _ => "",
inline: Boolean = false): String = {
Contributor

I'm also a bit scared about using regex to match simple assignment. Maybe we should have 2 versions of addMutableState: one provides a simple initial value, one provides arbitrary initializing code.

Member Author

That is one possible approach. Let me count the number of calls of each type.

Contributor

if we go this way, what about creating a MutableStateBuilder then, with methods like withSimpleInitialValue taking a string and withComplexInit taking a function or something like that?

Member Author

I noticed there are only about 15 complex assignment cases. I will specify inline = true for such cases and eliminate the conditions that use regex.

Contributor

@kiszk thanks! A big 👍 for eliminating the regexps! :) I'd prefer something more explicit about the reason why it is inlined, like the solutions proposed above. I think readability would be better. WDYT?

@mgaido91
Contributor

@cloud-fan yes, I have the same opinion and I'd suggest the same thing (#19811 (comment)), at least at the beginning. Then, if we find a smarter way to do it, we can submit another PR later.

Member Author

kiszk commented Nov 30, 2017

I think it is good to have a loop, mainly to reduce the bytecode size and to reduce constant pool entries.

int[] ints = new int[32768];
ints[0] = 1;
ints[1] = 1;
...
ints[32767] = 1;

will not consume constant pool entries for the index constants (0 - 32767).
However, this code generates a lot of Java bytecode for the 32768 assignments (i.e. aload, getfield, iconst, sipush, iastore per assignment). The assignments will be split into multiple methods, and calling a method requires a few constant pool entries on the caller side.

I would like to use a loop wherever possible. WDYT?
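
To make the trade-off concrete, the sketch below generates both shapes of initialization source for one array, using hypothetical helper names. Source length is only a rough proxy for bytecode size, but it shows that the unrolled form grows with the array length while the loop stays constant.

```java
final class InitCodeShape {
    /** One assignment statement per element. */
    static String unrolled(String array, int length, String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(array).append('[').append(i).append("] = ").append(value).append(";\n");
        }
        return sb.toString();
    }

    /** A single loop that assigns the same value to every element. */
    static String looped(String array, String value) {
        return "for (int i = 0; i < " + array + ".length; i++) {\n"
             + "  " + array + "[i] = " + value + ";\n"
             + "}\n";
    }

    public static void main(String[] args) {
        System.out.println(unrolled("ints", 3, "1"));
        System.out.println(looped("ints", "1"));
        System.out.println("unrolled chars for 32768 elements: "
            + unrolled("ints", 32768, "1").length());
        System.out.println("looped chars: " + looped("ints", "1").length());
    }
}
```

The loop form only works when every element shares the same initializer, which is why the PR groups states by init code.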

@@ -168,6 +166,21 @@ class CodegenContext {
val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
mutable.ArrayBuffer.empty[(String, String, String)]

// An array keyed by the tuple of mutable states' types and initialization code, holds the
// current max index of the array
var mutableStateArrayIdx: mutable.Map[(String, String), Int] =
Member

From the below code, looks like this is keyed by (javaType, arrayName), instead of (javaType, initCode)?

Member Author

Good catch, old comment still exists there. Thanks


// An array keyed by the tuple of mutable states' types, array names and initialization code,
// holds the code that will initialize the mutableStateArray when initialized in loops
var mutableStateArrayInitCodes: mutable.ArrayBuffer[(String, String, String)] =
Member

@viirya viirya Nov 30, 2017

This is also keyed by (javaType, arrayName).

Member Author

@kiszk kiszk Nov 30, 2017

It is ok to use an array since this is not looked up. This is used only for generating code.

val qualifiedInitCode = initCode.replaceAll(
varName, s"$arrayName[${CodeGenerator.INIT_LOOP_VARIABLE_NAME}]")
mutableStateArrayCurrentNames += (javaType, initCodeKey) -> arrayName
mutableStateArrayInitCodes += ((javaType, arrayName, qualifiedInitCode))
Member

Do all variables in the same array use the same init code?

Member Author

If they are in the same entry of mutableStateArrayInitCodes, yes. arrayName is assigned based on initCodeKey.

val loopIdxVar = CodeGenerator.INIT_LOOP_VARIABLE_NAME
s"""
for (int $loopIdxVar = 0; $loopIdxVar < $arrayName.length; $loopIdxVar++) {
$qualifiedInitCode
Member

If we have two variables in the same array and they have different init codes, how does this loop work?

Member

Oh. I see. You put the variables with the same init code in the same array.
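
The grouping can be sketched as a small allocator keyed by type and init code. This is an illustration under stated assumptions rather than the PR's code: the class name, the array naming scheme, and the `sizeLimit` parameter are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class MutableStateArrays {
    private final int sizeLimit;                                        // hypothetical cap per array
    private final Map<String, String> currentArray = new HashMap<>();   // (type, init key) -> array name
    private final Map<String, Integer> nextIndex = new HashMap<>();     // array name -> next free slot
    private int arrayCount = 0;

    MutableStateArrays(int sizeLimit) {
        this.sizeLimit = sizeLimit;
    }

    /** Returns an element reference such as "mutableStateArray_0[3]" for a new state. */
    String allocate(String javaType, String initCodeKey) {
        String key = javaType + "|" + initCodeKey;
        String array = currentArray.get(key);
        if (array == null || nextIndex.get(array) >= sizeLimit) {
            array = "mutableStateArray_" + (arrayCount++);   // start a fresh array for this key
            currentArray.put(key, array);
            nextIndex.put(array, 0);
        }
        int slot = nextIndex.get(array);
        nextIndex.put(array, slot + 1);
        return array + "[" + slot + "]";
    }

    public static void main(String[] args) {
        MutableStateArrays arrays = new MutableStateArrays(2);
        System.out.println(arrays.allocate("int", "*VALUE* = -1;"));
        System.out.println(arrays.allocate("int", "*VALUE* = -1;"));
        System.out.println(arrays.allocate("int", "*VALUE* = -1;")); // limit hit, new array
        System.out.println(arrays.allocate("int", "*VALUE* = 0;"));  // different init code, new array
    }
}
```

Because every element of one array shares an init code, a single loop per array is enough to initialize it.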

Member

viirya commented Nov 30, 2017

I'm afraid this style of initialization

int[] ints = new int[1001];
ints[0] = 3;
ints[1] = 1;
ints[2] = 5;
...

produces too much bytecode.

@cloud-fan
Contributor

I think we should do optimization incrementally. From

private int a;
private int b;
private int c;
...

a = 3;
b = 1;
c = 5;
...

to

int[] ints = new int[1001];
...
ints[0] = 3;
ints[1] = 1;
ints[2] = 5;
...

already hits our goal of reducing constant pool size; reducing the bytecode size of the constructor is another story.

addMutableState(javaType(expr.dataType), value,
s"$value = ${defaultValue(expr.dataType)};")
addMutableState(JAVA_BOOLEAN, isNull, forceInline = true, useFreshName = false)
addMutableState(javaType(expr.dataType), value, forceInline = true, useFreshName = false)
Contributor

can we do

val isNull = addMutableState(JAVA_BOOLEAN, "subExprIsNull")
val value = addMutableState(javaType(expr.dataType), "subExprValue")
val fn = ...

at the beginning?

SparkQA commented Dec 15, 2017

Test build #84962 has finished for PR 19811 at commit 31914c0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

kiszk commented Dec 15, 2017

I think this failure is due to the R package issue that we have seen before.

* building 'SparkR_2.3.0.tar.gz'

+ find pkg/vignettes/. -not -name . -not -name '*.Rmd' -not -name '*.md' -not -name '*.pdf' -not -name '*.html' -delete
++ grep Version /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/DESCRIPTION
++ awk '{print $NF}'
+ VERSION=2.3.0
+ CRAN_CHECK_OPTIONS=--as-cran
+ '[' -n 1 ']'
+ CRAN_CHECK_OPTIONS='--as-cran --no-tests'
+ '[' -n 1 ']'
+ CRAN_CHECK_OPTIONS='--as-cran --no-tests --no-manual --no-vignettes'
+ echo 'Running CRAN check with --as-cran --no-tests --no-manual --no-vignettes options'
Running CRAN check with --as-cran --no-tests --no-manual --no-vignettes options
+ '[' -n 1 ']'
+ '[' -n 1 ']'
+ /usr/bin/R CMD check --as-cran --no-tests --no-manual --no-vignettes SparkR_2.3.0.tar.gz
* using log directory '/home/jenkins/workspace/SparkPullRequestBuilder/R/SparkR.Rcheck'
* using R version 3.1.1 (2014-07-10)
* using platform: x86_64-redhat-linux-gnu (64-bit)
* using session charset: ASCII
* using options '--no-tests --no-vignettes'
* checking for file 'SparkR/DESCRIPTION' ... OK
* checking extension type ... Package
* this is package 'SparkR' version '2.3.0'
* checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : 
  dims [product 22] do not match the length of object [0]
Execution halted
Loading required package: methods

Attaching package: 'SparkR'
...

Member Author

kiszk commented Dec 15, 2017

Another PR also hits the same failure, although that PR only added new tests.

@gatorsmile
Member

I have to revert that PR again. e58f275

I think it is caused by that PR, so I reverted it again, although I am not sure what the root cause is. I will try to investigate more.

Member Author

kiszk commented Dec 18, 2017

ping @cloud-fan

* compacted. Please set `true` into forceInline, if you want to access the
* status fast (e.g. frequently accessed) or if you want to use the original
* variable name
* @param useFreshName If this is false and forceInline is true, the name is not changed
Contributor

@cloud-fan cloud-fan Dec 18, 2017

more accurate: If this is false and the mutable state ends up inlined in the outer class, the name is not changed

@@ -217,7 +217,7 @@ case class Stack(children: Seq[Expression]) extends Generator {
ctx.addMutableState(
s"$wrapperClass<InternalRow>",
ev.value,
s"${ev.value} = $wrapperClass$$.MODULE$$.make($rowData);")
Contributor

We can localize the global variable ev.value here to save one global variable slot.

Member Author

@kiszk kiszk Dec 18, 2017

I will do it in another PR after merging this.

ctx.addMutableState(patternClass, pattern,
s"""$pattern = ${patternClass}.compile("$regexStr");""")
val pattern = ctx.addMutableState(patternClass, "patternLike",
v => s"""$v = ${patternClass}.compile("$regexStr");""", forceInline = true)
Contributor

Do we have a clear rule for when a global variable should be inlined for better performance? E.g. a microbenchmark showing a noteworthy difference would definitely prove that we should inline.

Member Author

@kiszk kiszk Dec 18, 2017

Now, we have three rules for applying inlining:

  1. The original name has to be used
  2. Frequently referenced, but generated once, in the hot spot
  3. Not expected to be frequently generated (proposed by @viirya)

Currently, we have no clear rule for 2. I will try to run a microbenchmark for 2. Would it be better to add these benchmarks to the benchmark directory?

Contributor

yes, if it's not a lot of them...

Contributor

My only concern is about point 2. I think it is a dangerous thing to do. What if we generate a lot of frequently used variables? I think it is safer at the moment to consider only 1 and 3 when deciding whether to inline. In the future, with a different codegen method, we might define a threshold above which we generate an array for the given class and otherwise use plain variables, which IMHO would be the best option, but at the moment it is not feasible...

Member Author

Sorry for the confusion about 2. I updated the statement: it is frequently referenced, but generated once. Therefore, if there is a performance advantage, I think it is safe, since 3 would also apply inlining.

Contributor

If 3 is a precondition for 2, then it is ok. Thanks for the explanation.

Member Author

Based on the comments, I simplified the rules.

  1. The original name has to be used
  2. The state is expected to be used infrequently

For the latter, I put a comment explaining the reason at each site.

}
assert(ctx1.inlinedMutableStates.size == CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD)
// When the number of primitive type mutable states is over the threshold, others are
// allocated into an array
Contributor

Some notes: it would be better if we could collect all mutable states before deciding which ones should be inlined. However, that is impossible with the current string-based codegen framework, where we need to decide immediately whether to inline. We can revisit this in the future when we have an AST-based codegen framework.

Member Author

Yeah, I agree. In the future, we hope we have an AST based codegen framework.

val metrics = ctx.addMutableState(classOf[TaskMetrics].getName, "metrics",
v => s"$v = org.apache.spark.TaskContext.get().taskMetrics();", forceInline = true)
val sortedIterator = ctx.addMutableState("scala.collection.Iterator<UnsafeRow>", "sortedIter",
forceInline = true)
Contributor

this looks reasonable as it's very unlikely we have a lot of sort operators in one stage. We have to inline it manually as we don't have the ability to find this out automatically yet. Same as https://github.com/apache/spark/pull/19811/files#r157408804

Contributor

One question: are there any other places like this? Do you have a list?

Contributor

// Right now, Range is only used when there is one upstream.
ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
val input = ctx.addMutableState("scala.collection.Iterator", "input",
Contributor

seems it's never used

@cloud-fan
Contributor

LGTM except some minor comments, great job!

* is less than `CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`
* 3. its type is multi-dimensional array
* A primitive type variable will be inlined into outer class when the total number of
* When a variable is compacted into an array, the max size of the array for compaction
Member

The sentence looks broken? I.e., "...total number of"

Member

Actually, this line, "A primitive type variable will be inlined into outer class when the total number of", looks redundant.

@mgaido91
Contributor

LGTM too, thanks @kiszk and @bdrillard! This is a very important PR IMHO

Member

viirya commented Dec 18, 2017

LGTM with one minor comment.

SparkQA commented Dec 19, 2017

Test build #85107 has finished for PR 19811 at commit 0e45c19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Returns the reference of next available slot in current compacted array. The size of each
* compacted array is controlled by the config `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
Contributor

nit: CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT is not a config


if (right.foldable) {
val rVal = right.eval()
if (rVal != null) {
val regexStr =
StringEscapeUtils.escapeJava(escape(rVal.asInstanceOf[UTF8String].toString()))
ctx.addMutableState(patternClass, pattern,
s"""$pattern = ${patternClass}.compile("$regexStr");""")
// inline mutable state since not many Like operations in a task
Contributor

I'm not very sure about this, since Like is an expression and can appear many times, like other expressions.

Member Author

Sure


if (right.foldable) {
val rVal = right.eval()
if (rVal != null) {
val regexStr =
StringEscapeUtils.escapeJava(rVal.asInstanceOf[UTF8String].toString())
ctx.addMutableState(patternClass, pattern,
s"""$pattern = ${patternClass}.compile("$regexStr");""")
// inline mutable state since not many RLike operations in a task
Contributor

ditto

// Right now, InputAdapter is only used when there is one input RDD.
ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
// inline mutable state since an inputAdaptor in a task
Contributor

typo

ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
s"$fastHashMapTerm = new $fastHashMapClassName();")
ctx.addMutableState(s"java.util.Iterator<InternalRow>", iterTermForFastHashMap)
fastHashMapTerm = ctx.addMutableState(fastHashMapClassName, "vectorizedHastHashMap",
Contributor

shall we force inline it too?

sorterTerm = ctx.freshName("sorter")
ctx.addMutableState(classOf[UnsafeKVExternalSorter].getName, sorterTerm)
hashMapTerm = ctx.addMutableState(hashMapClassName, "hashMap",
v => s"$v = $thisPlan.createHashMap();")
Contributor

ditto

ctx.addMutableState(clsName, matches,
s"$matches = new $clsName($inMemoryThreshold, $spillThreshold);")
val matches = ctx.addMutableState(clsName, "matches",
v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);")
Contributor

ditto

@cloud-fan
Contributor

great, merging to master, thank you all!

We can address other minor comments in follow-ups

@asfgit asfgit closed this in ee56fc3 Dec 19, 2017
Member Author

kiszk commented Dec 19, 2017

Thank you very much for your support and merging this. I will open a follow-up PR soon.

ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 28, 2017
…reduce entries for mutable state

## What changes were proposed in this pull request?

This PR addresses additional review comments in apache#19811

## How was this patch tested?

Existing test suites

Author: Kazuaki Ishizaki <[email protected]>

Closes apache#20036 from kiszk/SPARK-18066-followup.