Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16845][SQL] GeneratedClass$SpecificOrdering grows beyond 64 KB #15480

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR
* Generates the code for ordering based on the given order.
*/
def genComparisons(ctx: CodegenContext, ordering: Seq[SortOrder]): String = {
val comparisons = ordering.map { order =>
def comparisons(orderingGroup: Seq[SortOrder]) = orderingGroup.map { order =>
val eval = order.child.genCode(ctx)
val asc = order.isAscending
val isNullA = ctx.freshName("isNullA")
Expand Down Expand Up @@ -118,7 +118,42 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR
}
"""
}.mkString("\n")
comparisons

/*
Copy link
Member

@kiszk kiszk Oct 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this implementation, is it possible to use this function by adding a statement for return as a default argument?

Copy link
Contributor Author

@lw-lin lw-lin Oct 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @kiszk for bringing this up!

@ueshin @kiszk any comments please on do we want to expand CodeGenerator#private splitExpressions to:

  • add one more argument to specify how we want to "return" things?
  • add arguments to specify in which manner do we want to break expression? E.g. in this pr's case, we might want to break it according to numberOfComparisonsThreshold = 40 rather than the string length

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @kiszk that we would use the function if possible, but I have no idea to expand the function to apply to this case simply.

Copy link
Member

@kiszk kiszk Oct 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lw-lin , good points.
For the second issue, can we use the string length as a proxy of numberOfComparisonsThreshold? I know this is not the exact estimation.
For the first issue, how about the following approach? In advance, I am sorry that I have not compiled it myself.

// CodeGenerator.scalar
def splitExpressions(expressions: Seq[String], funcName: String, arguments: Seq[(String, String), returns: (String, String] =("void", "")): String = {
  ..
  val code = s"""
    |private $(returns._1) $name(${arguments.map { case (t, name) => s"$t $name" }.mkString(", ")}) {
    |  $body
    |  ${returns._2)
    |}
  """.stripMargin
...
}
// GenerateOrdering.scala 
  val groupedOrderingItr = ordering.grouped(numberOfComparisonsThreshold)
  // var groupedOrderingLength = 0
  val functions = ctx.splitExpressions {
    ..
  }
  val comp = freshName("comp")
  functions.zipWithIndex.map { case(func , i) =>
    val name = s"${comp}_$i"
    s"""
      |int $name = $func(a, b);
      |if ($name != 0) {
      |  return $name;
      |}
    """.stripMargin
  }.mkString

Copy link
Contributor Author

@lw-lin lw-lin Oct 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk @ueshin thanks!
i kind of implemented a prove of concept (lw-lin@d0c1198) on how to extend splitExpressions to rewrite this PR
could you comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lw-lin I commented to the commit. Please look at them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies could you take a quick look here:

this PR tries to break a huge generated method into smaller pieces. @kiszk @ueshin and I were discussing whether we should:

  • (a) just do the breaking-ups case by case, like in this PR
  • or (b) expand CodeGenerator.splitExpressions(...) to generally support this breaking-ups, like in the POC lw-lin@d0c1198

@davies could you advise on this? thanks!

* 40 = 7000 bytes / 170 (around 170 bytes per ordering comparison).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you get the 170?

Copy link
Contributor Author

@lw-lin lw-lin Dec 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry this is vague.

Actually I ran all the 36 test cases in OrderingSuite and logged each of the generated comparisonXXX method's size. Each method:

  • < 4.9 KB when numberOfComparisonsThreshold = 40
  • < 6.4 KB when numberOfComparisonsThreshold = 50
  • < 7.8 KB when numberOfComparisonsThreshold = 60

I thought it'd be safer if we pick 40 (taking minor future changes into account). Thus 170 should be considered as some kind of a safe assumption (or not?).

Would you share your thoughts on this? or anyway we can improve this? thanks.

Copy link
Contributor

@cloud-fan cloud-fan Dec 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test cases are not real world workloads, we can't estimate comparison code size based on test cases. My idea is that, first we generate the comparison code like before, and check the code size, if it exceeds 1024(see #15620 (comment)), go to the splitting branch. In the splitting branch, we can generate method for each ordering expression.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use ctx.splitExpressions() approach, we don't need to calculate the size because the method will split based on the code size of 1024.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @cloud-fan @ueshin for your valuable comments.

@ueshin do you have an on-going work of refactoring ctx.splitExpressions()? If you do, I might update this PR to rely on your work -- do not want to step on your toes :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to refactor ctx.splitExpressions() like 0aedc47 based on @lw-lin's PoC.
Unfortunately, the max number of comparisons we can support declined to around 5000 (current approach can support 10000 or more) because ctx.splitExpressions() splits comparisons into smaller size than current rule.

* The maximum byte code size to be compiled for HotSpot is 8000 bytes.
* We should keep less than 8000 bytes.
*/
val numberOfComparisonsThreshold = 40

if (ordering.size <= numberOfComparisonsThreshold) {
comparisons(ordering)
} else {
val groupedOrderingItr = ordering.grouped(numberOfComparisonsThreshold)
val funcNamePrefix = ctx.freshName("compare")
val funcNames = groupedOrderingItr.zipWithIndex.map { case (orderingGroup, i) =>
val funcName = s"${funcNamePrefix}_$i"
val funcCode =
s"""
|private int $funcName(InternalRow a, InternalRow b) {
| InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated.
| ${comparisons(orderingGroup)}
| return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we make the comparison result a member variable, then we don't need return in the comparison code right?

Copy link
Contributor Author

@lw-lin lw-lin Nov 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that's right -- here we're returning ints because comparisons(ordering/orderingGroup)(L75 of this same file) is returning ints. Should we change that all along?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for performance concerns, we should avoid using member variables. If there is no easy way to reuse splitExpressions, I'm ok with the current approach.

Copy link
Contributor Author

@lw-lin lw-lin Nov 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah then maybe this is ready to go? I recall @ueshin or @kiszk might be interested in refactoring splitExpressions in future PRs.

Thanks for reviewing this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in refactoring approach, which will be useful for more general case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also interested in refactoring. However, it would be better to do in another PR.

|}
""".stripMargin
ctx.addNewFunction(funcName, funcCode)
funcName
}

funcNames.zipWithIndex.map { case (funcName, i) =>
s"""
|int comp_$i = ${funcName}(a, b);
|if (comp_$i != 0) {
| return comp_$i;
|}
""".stripMargin
}.mkString
}
}

protected def create(ordering: Seq[SortOrder]): BaseOrdering = {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -127,4 +127,17 @@ class OrderingSuite extends SparkFunSuite with ExpressionEvalHelper {
}
}
}

test("SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB") {
val sortOrder = Literal("abc").asc

// this is passing prior to SPARK-16845, and it should also be passing after SPARK-16845
GenerateOrdering.generate(Array.fill(40)(sortOrder))

// this is FAILING prior to SPARK-16845, but it should be passing after SPARK-16845
GenerateOrdering.generate(Array.fill(450)(sortOrder))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary, it's covered by the 5000 test case.

Copy link
Contributor Author

@lw-lin lw-lin Jan 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's remove the 450 test case


// verify that we can support up to 10000 ordering comparisons, which should be sufficient
GenerateOrdering.generate(Array.fill(10000)(sortOrder))
}
}