
[SPARK-23486]cache the function name from the external catalog for lookupFunctions #20795

Closed
wants to merge 12 commits

Conversation

@kevinyu98 (Contributor) commented Mar 11, 2018

What changes were proposed in this pull request?

This PR caches the function names from the external catalog for use by lookupFunctions in the analyzer; the cache is built per query plan. The original problem is reported in SPARK-19737.
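
In rough terms, the rule keeps a per-plan set of already-verified function identifiers (a minimal sketch of the idea along the lines discussed below, not the exact committed code):

  object LookupFunctions extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = {
      // The cache lives only for this single plan traversal.
      val cachedNameSet = new mutable.HashSet[FunctionIdentifier]()
      plan.transformAllExpressions {
        // Already verified against the catalog: skip the (possibly remote) lookup.
        case f: UnresolvedFunction if cachedNameSet.contains(f.name) => f
        // First sighting: ask the catalog once, then remember the name.
        case f: UnresolvedFunction if catalog.functionExists(f.name) =>
          cachedNameSet.add(f.name)
          f
        case f: UnresolvedFunction =>
          withPosition(f) {
            throw new NoSuchFunctionException(
              f.name.database.getOrElse(catalog.getCurrentDatabase), f.name.funcName)
          }
      }
    }
  }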

How was this patch tested?

Created a new test file, LookupFunctionsSuite, and added test cases to SessionCatalogSuite.

@hvanhovell (Contributor)

Ok to test

@viirya (Member) left a comment

We should add a test for this too.

throw new NoSuchFunctionException(f.name.database.getOrElse("default"), f.name.funcName)
override def apply(plan: LogicalPlan): LogicalPlan = {
  val catalogFunctionNameSet = new mutable.HashSet[FunctionIdentifier]()
  plan.transformAllExpressions {
Member:

style: indent.

Contributor Author:

Changed, thanks.

@viirya (Member) commented Mar 11, 2018

I've added a test like this:

class LookupFunctionsSuite extends PlanTest {

  test("SPARK-23486: LookupFunctions should not check the same function name more than once") {
    val externalCatalog = new CustomInMemoryCatalog
    val analyzer = {
      val conf = new SQLConf()
      val catalog = new SessionCatalog(externalCatalog, FunctionRegistry.builtin, conf)
      catalog.createDatabase(
        CatalogDatabase("default", "", new URI("loc"), Map.empty),
        ignoreIfExists = false)
      new Analyzer(catalog, conf)
    }

    val unresolvedFunc = UnresolvedFunction("func", Seq.empty, false)
    val plan = Project(
      Seq(Alias(unresolvedFunc, "call1")(), Alias(unresolvedFunc, "call2")()),
      table("TaBlE"))
    analyzer.LookupFunctions.apply(plan)
    assert(externalCatalog.getFunctionExistsCalledTimes == 1)
  }
}

class CustomInMemoryCatalog extends InMemoryCatalog {

  private var functionExistsCalledTimes: Int = 0

  override def functionExists(db: String, funcName: String): Boolean = synchronized {
    functionExistsCalledTimes = functionExistsCalledTimes + 1
    true
  }

  def getFunctionExistsCalledTimes: Int = functionExistsCalledTimes
}

Maybe others have a better idea.

@SparkQA commented Mar 11, 2018

Test build #88161 has finished for PR 20795 at commit 701100c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kevinyu98 (Contributor Author)

@viirya Thanks a lot. I will create a new test file LookupFunctionsSuite under sql/catalyst/analysis.

plan.transformAllExpressions {
  case f: UnresolvedFunction if catalogFunctionNameSet.contains(f.name) => f
  case f: UnresolvedFunction if catalog.functionExists(f.name) =>
    catalogFunctionNameSet.add(f.name)
Member:

Normalize the name before adding it to the cache? This can cover more cases.

Contributor Author:

Sure, I will do that. Thanks.

override def apply(plan: LogicalPlan): LogicalPlan = {
  val catalogFunctionNameSet = new mutable.HashSet[FunctionIdentifier]()
  plan.transformAllExpressions {
    case f: UnresolvedFunction if catalogFunctionNameSet.contains(f.name) => f
Member:

Normalize the FunctionIdentifier when looking it up too?

Contributor Author:

I will normalize the lookup too. Thanks.

}

private def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = {
  FunctionIdentifier(name.funcName.toLowerCase(Locale.ROOT), name.database)
Member:

name.database.getOrElse("default")?

Contributor Author:

The FunctionIdentifier's database field is an Option, not a String. Since this is only used in this local cache, I think it is OK not to convert None to the "default" string. I noticed that when we do registerFunction in FunctionRegistry.scala, we don't put "default" into normalizeFuncName either. What do you think? Thanks.

Contributor Author:

@viirya Hello Simon, I will leave this as it is for now, but if you think it is better to have Option(name.database.getOrElse("default")) in the catalogFunctionNameSet, let me know. Thanks.

Member:

Ah, sorry for replying late. What I thought is that if there are two FunctionIdentifiers, one with the default database name Some("default") and the other with None, they should be equal to each other here.

I actually mean name.database.orElse(Some("default")).
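
Concretely (illustrative only):

  // The same function seen through two identifiers that are not equal as raw cache keys:
  assert(FunctionIdentifier("f1", Some("default")) != FunctionIdentifier("f1", None))
  // name.database.orElse(Some("default")) maps both identifiers to one key:
  assert(FunctionIdentifier("f1", None).database.orElse(Some("default")) == Some("default"))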

Contributor:

@viirya Shouldn't we be using the current database if the database is not specified? I am trying to understand why we should use "default" here.

Member:

@kevinyu98 For built-in functions, we don't need to normalize their database name.

Rethinking it, this is actually not for function resolution, so I think it is OK to leave it as name.database.

Contributor:

@viirya @kevinyu98 We need to check what happens in the following case:

use currentdb;
select currentdb.function1(), function1() from ....

In this case, the 2nd function should be resolved from the local cache if this optimization is to work. If we just use name.database instead of defaulting to the current database, will it still happen?

Contributor Author:

@dilipbiswal @viirya Thanks for pointing this out. If we just use name.database, the cache will store None for the database name, and the 2nd function will not be resolved from the local cache. We need to use catalog.getCurrentDatabase for the database name in the cache.
After running more test cases, I think it is better to cache only the external function names, not the built-in functions. If we all agree on this approach, I can submit the code for review.

Member:

For built-in functions, it may be no big deal if we don't find them in this cache, since querying built-in functions should be very fast. I remember the main issue of this ticket is external function lookup, which means more load on the connection to the metastore.

Contributor:

I agree @viirya

@SparkQA commented Mar 13, 2018

Test build #88188 has finished for PR 20795 at commit 99cc3b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 13, 2018

Test build #88189 has finished for PR 20795 at commit 211abcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class LookupFunctionsSuite extends PlanTest
  • class CustomInMemoryCatalog extends InMemoryCatalog

@SparkQA commented Mar 16, 2018

Test build #88292 has finished for PR 20795 at commit 1e5ba02.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

      f
    case f: UnresolvedFunction =>
      withPosition(f) {
        throw new NoSuchFunctionException(f.name.database.getOrElse("default"),
Member:

Then I think this should be the current database instead of default.

Contributor:

@viirya Yeah..

Contributor Author:

@viirya @dilipbiswal @gatorsmile @liancheng OK, I will change this.

@gatorsmile (Member)

Please ping me if this is ready to review.

@SparkQA commented Mar 19, 2018

Test build #88386 has finished for PR 20795 at commit 93b115e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 21, 2018

Test build #88447 has finished for PR 20795 at commit 17f7e74.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -175,6 +175,8 @@ private[sql] class HiveSessionCatalog(
    super.functionExists(name) || hiveFunctions.contains(name.funcName)
  }

  override def externalFunctionExists(name: FunctionIdentifier): Boolean = functionExists(name)
Contributor:

According to your logic, I think HiveSessionCatalog should override both builtinFunctionExists and externalFunctionExists, like:

override def builtinFunctionExists(name: FunctionIdentifier): Boolean = {
  super.builtinFunctionExists(name) || hiveFunctions.contains(name.funcName)
}
override def externalFunctionExists(name: FunctionIdentifier): Boolean = {
  super.externalFunctionExists(name)
}

Contributor Author:

@WeichenXu123 Thanks very much for reviewing. I am a little confused. HiveSessionCatalog's builtinFunctionExists is essentially the same as its parent's; that is the reason I didn't override it in HiveSessionCatalog. However, the logic to look up an external function is different in HiveSessionCatalog, as we also have to handle the special function "histogram_numeric". That's why I chose to override externalFunctionExists. One clarification: builtinFunctionExists solely looks at the FunctionRegistry to look up a function.

  1. builtinFunctionExists => looks up a function in the FunctionRegistry (same for both SessionCatalog and HiveSessionCatalog).
  2. externalFunctionExists => looks up the external catalog to find the function. HiveSessionCatalog adds one extra piece of semantics to handle the special function histogram_numeric.
  3. As you point out, I need to change the externalFunctionExists override to this:

    override def externalFunctionExists(name: FunctionIdentifier): Boolean = {
      super.externalFunctionExists(name) || hiveFunctions.contains(name.funcName)
    }

Contributor:

Oh, you mean functions like "histogram_numeric" should be regarded as external functions in the Hive context? I am not sure about this, but if that's right, your current code is OK :)

Contributor Author:

Yes, it seems "histogram_numeric" is not natively supported in Spark yet. I think once this JIRA is closed (https://issues.apache.org/jira/browse/SPARK-16280), we won't need this code in HiveSessionCatalog.

  requireDbExists(db)
  externalCatalog.functionExists(db, name.funcName)
}

Contributor:

To avoid code duplication, you should modify functionExists like:

def functionExists(name: FunctionIdentifier): Boolean = {
  builtinFunctionExists(name) || externalFunctionExists(name)
}

Contributor Author:

OK, I will change it.

@WeichenXu123 (Contributor)

I don't think it needs to split into built-in and external existence checks in this case. The following code alone works fine:

  object LookupFunctions extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = {
      val cachedNameSet = new mutable.HashSet[FunctionIdentifier]()
      plan.transformAllExpressions {
        case f: UnresolvedFunction
          if cachedNameSet.contains(normalizeFuncName(f.name)) => f
        case f: UnresolvedFunction if catalog.functionExists(f.name) =>
          cachedNameSet.add(normalizeFuncName(f.name))
          f
        case f: UnresolvedFunction =>
          withPosition(f) {
            throw new NoSuchFunctionException(f.name.database.getOrElse(catalog.getCurrentDatabase),
              f.name.funcName)
          }
      }
    }

    private def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = ...
  }

Isn't it?

@kevinyu98 (Contributor Author)

@WeichenXu123 I didn't split until this discussion (#20795 (comment)). The original JIRA report is about lookups through HiveSessionCatalog, so I thought caching just the external function names would avoid code complexity.

@viirya (Member) commented Mar 22, 2018

I'm also a bit confused about why we need to split built-in and external functions.

@WeichenXu123 (Contributor)

Yeah, I understand the reason to split built-in and external is that you only want to cache external function names. But caching all the function names used in a query does not cost much, so the split may not be worth it.

@kevinyu98 (Contributor Author)

The reason I was thinking of splitting is the scenario below. To avoid caching an external function name twice, as in the scenario Dilip described, we decided to use getCurrentDatabase during normalizeFuncName.

But that fails for Spark's built-in functions. For example:

use currentdb;
select function1(), currentdb.function1() from ...

Suppose function1 is a built-in function, say max, and currentdb does not have a function max.

The first time, max is found by the built-in function check (functionRegistry.functionExists(name)); Spark's built-in check does not use the database name unless you specify one explicitly. So the cache stores the built-in function max as

currentdb.max

The second function, currentdb.max, is then found in the cache, even though currentdb has no max function.

But during ResolveFunctions in the analyzer, currentdb.max can't be resolved, and we get a NoSuchFunctionException for max.
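
A compact restatement of the pitfall (illustrative sketch; currentdb is the hypothetical current database from the example above):

  // 1) `max` passes the registry check (functionRegistry.functionExists), which ignores
  //    the database; normalizing with getCurrentDatabase would cache it as:
  val cached = FunctionIdentifier("max", Some("currentdb"))
  // 2) The explicitly qualified call later normalizes to the same key ...
  assert(cached == FunctionIdentifier("max", Some("currentdb")))
  // ... so LookupFunctions would wave currentdb.max through, yet ResolveFunctions later
  // throws NoSuchFunctionException because no such persistent function exists.
  // Hence: cache only the external (persistent) function names.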

@viirya (Member) commented Mar 22, 2018

Can we just skip caching for built-in functions?

@kevinyu98 (Contributor Author)

@viirya Yes, my latest submitted code caches only the external functions and skips the built-in functions.
@WeichenXu123 I will change this. Let me know if you have any concerns. Thanks.

@SparkQA commented Mar 26, 2018

Test build #88576 has finished for PR 20795 at commit 029ee6c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Mar 26, 2018

Test build #88580 has finished for PR 20795 at commit 029ee6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  builtinFunctionExists(name) || externalFunctionExists(name)
}

def builtinFunctionExists(name: FunctionIdentifier): Boolean = {
Member:

Are all the functions in functionRegistry built-in?

Contributor Author:

@gatorsmile Sorry for the delay. Not all the functions in functionRegistry are built-in; there could be UDFs. I am thinking of changing builtinFunctionExists to registryFunctionExists. What do you think? I also need to change the PR description if we only cache the external function names. Thanks.

Member:

Let us not add these APIs to SessionCatalog. Just create private functions?

Contributor Author:

OK, sure.

Member:

  /**
   * Returns whether this function has been registered in the function registry of the current
   * session. If it has not been, returns false.
   */
  def isRegisteredFunction(name: FunctionIdentifier): Boolean = {
    functionRegistry.functionExists(name)
  }

  /**
   * Returns whether it is a persistent function. If it does not exist, returns false.
   */
  def isPersistentFunction(name: FunctionIdentifier): Boolean = {
    val db = formatDatabaseName(name.database.getOrElse(getCurrentDatabase))
    databaseExists(db) && externalCatalog.functionExists(db, name.funcName)
  }

Member:

Please also add the unit test cases to SessionCatalogSuite.
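
Something along these lines might do (a minimal sketch reusing the constructs already shown in this thread, not the exact committed test):

  test("SPARK-23486: isRegisteredFunction and isPersistentFunction") {
    val catalog = new SessionCatalog(new InMemoryCatalog, FunctionRegistry.builtin, new SQLConf())
    catalog.createDatabase(
      CatalogDatabase("default", "", new URI("loc"), Map.empty), ignoreIfExists = false)
    // Built-in functions live in the function registry of the session.
    assert(catalog.isRegisteredFunction(FunctionIdentifier("max")))
    // An unknown name is neither registered nor persistent.
    assert(!catalog.isRegisteredFunction(FunctionIdentifier("undefined_fn")))
    assert(!catalog.isPersistentFunction(FunctionIdentifier("undefined_fn")))
  }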

Member:

Move them after def isTemporaryFunction(name: FunctionIdentifier): Boolean.

Contributor Author:

@gatorsmile Sorry for the delay; I was working on something else. Thanks very much for the help. I will add these two new APIs to SessionCatalog and add test cases to SessionCatalogSuite.

@kevinyu98 changed the title from "[SPARK-23486]cache the function name from the catalog for lookupFunctions" to "[SPARK-23486]cache the function name from the external catalog for lookupFunctions" on Apr 10, 2018
@SparkQA commented Apr 10, 2018

Test build #89142 has finished for PR 20795 at commit d1ee9cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kevinyu98 (Contributor Author)

Sorry for the delay; I was working on some other projects. I am back and focusing on addressing the comments now.

@HyukjinKwon (Member)

ok to test

@SparkQA commented Jun 26, 2018

Test build #92333 has finished for PR 20795 at commit d1ee9cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

ping @kevinyu98 @dilipbiswal

@kevinyu98 (Contributor Author)

@gatorsmile Hi Sean, I am so sorry for the long delay. I will address the comments today and submit the code for review.

Thanks very much!
Kevin

@SparkQA commented Jul 11, 2018

Test build #92849 has finished for PR 20795 at commit 8dceda9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(analyzer.LookupFunctions.normalizeFuncName(unresolvedFunc.name).database ==
  Some("default"))
assert(catalog.isRegisteredFunction(unresolvedFunc.name) == false)
assert(catalog.isRegisteredFunction(FunctionIdentifier("max")) == true)
Member:

I mean adding another test case to check that LookupFunctions does not resolve the same registered function more than once.

Member:

We do not need to add the assert.

Contributor Author:

I see. I added the test case; can you verify? Thanks a lot.


def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = {
  FunctionIdentifier(name.funcName.toLowerCase(Locale.ROOT),
    name.database.orElse(Some(catalog.getCurrentDatabase)))
Contributor:

@kevinyu98 I have a question. We normalize the funcName here; how about name.database? Is that already normalized by the time we get here?

Contributor:

@kevinyu98 How about taking conf.caseSensitiveAnalysis into consideration?

Contributor Author:

Yes, I will change the code for name.database. Thanks.
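
For reference, a sketch of how normalizeFuncName might honor the flag (assuming the rule can read the session's SQLConf as conf; an illustration, not necessarily the committed code):

  def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = {
    // Fold case only when analysis is case-insensitive.
    val funcName = if (conf.caseSensitiveAnalysis) {
      name.funcName
    } else {
      name.funcName.toLowerCase(Locale.ROOT)
    }
    FunctionIdentifier(funcName, name.database.orElse(Some(catalog.getCurrentDatabase)))
  }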

@SparkQA commented Jul 12, 2018

Test build #92915 has finished for PR 20795 at commit 0db2826.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CustomerFunctionRegistry extends SimpleFunctionRegistry

@dilipbiswal (Contributor)

retest this please

@SparkQA commented Jul 12, 2018

Test build #92919 has finished for PR 20795 at commit 0db2826.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CustomerFunctionRegistry extends SimpleFunctionRegistry

}
}

class CustomerFunctionRegistry extends SimpleFunctionRegistry {
Contributor:

@kevinyu98 Instead of extending FunctionRegistry and the catalog, what do you think of extending SessionCatalog and overriding isRegisteredFunction and isPersistentFunction? Then, after an invocation of LookupFunctions, we get counts of how many times isRegisteredFunction and isPersistentFunction were called. We could also create a single analyzer instance with the extended SessionCatalog and reuse it across tests. Would that be simpler?
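
In other words, something like this (a sketch of the suggestion using only constructs already shown in this thread; CountingSessionCatalog is a hypothetical name):

  class CountingSessionCatalog(external: ExternalCatalog)
    extends SessionCatalog(external, FunctionRegistry.builtin, new SQLConf()) {

    var registeredFunctionCalls = 0
    var persistentFunctionCalls = 0

    // Count registry lookups triggered by LookupFunctions.
    override def isRegisteredFunction(name: FunctionIdentifier): Boolean = {
      registeredFunctionCalls += 1
      super.isRegisteredFunction(name)
    }

    // Count external-catalog lookups triggered by LookupFunctions.
    override def isPersistentFunction(name: FunctionIdentifier): Boolean = {
      persistentFunctionCalls += 1
      super.isPersistentFunction(name)
    }
  }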

Member:

Either is fine with me. The major goal of these test cases is to count the number of invocations of functionExists; that is why the current way is more straightforward for reviewers.

Contributor:

@gatorsmile Sure, Sean.

Contributor Author:

Thanks.

}
}

def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = {
Member:

This is a common utility function. We can refactor the code later.

@gatorsmile (Member)

LGTM

@SparkQA commented Jul 13, 2018

Test build #92954 has finished for PR 20795 at commit 26f2f54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Thanks! Merged to master.

@asfgit closed this in 0ce11d0 on Jul 13, 2018