[SPARK-9078] [SQL] Allow jdbc dialects to override the query used to check the table. #8676
Conversation
…ialect implementations to specify the query.
FWIW, we dropped JVM 1.6 support in Spark 1.5. Would that make this easier?
Even if Spark is running on JDK 1.7, customers using older versions of drivers will run into an AbstractMethodError exception. I think requiring customers to use new drivers that implement the getSchema() function is unnecessary.

After implementing the current approach, I got curious about how the JDBC read functionality finds the metadata, and learned that org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.resolveTable also uses s"SELECT * FROM $table WHERE 1=0" to get column information. An alternative approach is to add getMetadataQuery(table: String) to the JdbcDialect interface, instead of the getTableExistsQuery() implemented in the current pull request: it would determine whether the table exists in the write case, and provide column type information in the read case. It might be a millisecond slower for the write call in dialects that would otherwise specify "select 1 from $table limit 1" instead of "select * from $table limit 1". The advantage is that one method on the interface addresses both cases. Any comments?
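The alternative proposed above could be sketched as follows. This is a hypothetical illustration only: `JdbcDialectAlt`, `getMetadataQuery`, and `LimitDialect` are made-up names, not part of Spark's actual `JdbcDialect`.

```scala
// Hypothetical sketch of the alternative proposal: a single metadata
// query that serves both the write path (table-exists check) and the
// read path (column type resolution). Names are illustrative only.
abstract class JdbcDialectAlt {
  // Default: an empty result set whose metadata still describes all
  // columns, so it works for existence checks and schema reads alike.
  def getMetadataQuery(table: String): String =
    s"SELECT * FROM $table WHERE 1=0"
}

// A dialect preferring LIMIT would have to keep SELECT * so the read
// path can still see every column (hence the slight cost, compared to
// SELECT 1 ... LIMIT 1, for the write-only case noted above).
object LimitDialect extends JdbcDialectAlt {
  override def getMetadataQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 1"
}
```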
```scala
 * @return The SQL query to use for checking the table.
 */
def getTableExistsQuery(table: String): String = {
  s"SELECT * FROM $table WHERE 1=0"
}
```
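A minimal, self-contained sketch of the override mechanism this patch adds. `DialectSketch` is a simplified stand-in for Spark's `JdbcDialect` abstract class, and `MySQLLikeDialect` mirrors the `LIMIT 1` form discussed in this thread; neither name is from the actual codebase.

```scala
// Simplified stand-in for Spark's JdbcDialect abstract class,
// showing how a dialect overrides the table-existence query.
abstract class DialectSketch {
  // Default query: likely to work on most databases, returns no rows.
  def getTableExistsQuery(table: String): String =
    s"SELECT * FROM $table WHERE 1=0"
}

// A MySQL-style dialect can substitute a LIMIT-based probe instead.
object MySQLLikeDialect extends DialectSketch {
  override def getTableExistsQuery(table: String): String =
    s"SELECT 1 FROM $table LIMIT 1"
}
```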
maybe we should quote the table here actually
actually never mind we cannot quote it.
@rxin What's the specific reason the table name cannot be quoted? We happen to have a table with dots and parentheses in its name, and plan to add surrounding backticks before passing it to Spark.
next() will return false because the result set will be empty when the query is where 1!=0. executeQuery() will throw an exception if the table is not found, so the next() call is not really required to find out whether the table exists.
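The point above can be sketched like this. `runQuery` stands in for JDBC's `Statement.executeQuery`, and `fakeExecute` is a purely illustrative stub, not real driver behavior:

```scala
import scala.util.Try

// Existence check as described above: if executeQuery on the probe
// throws, the table is missing; if it returns (even with an empty
// result set), the table exists. No next() call is required.
def tableExists(table: String, runQuery: String => Unit): Boolean =
  Try(runQuery(s"SELECT * FROM $table WHERE 1=0")).isSuccess

// Illustrative stand-in for a JDBC statement that knows one table
// named "people" and throws for anything else (crude check, demo only).
def fakeExecute(sql: String): Unit =
  if (!sql.contains("FROM people "))
    throw new RuntimeException("table or view does not exist")
```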
Typo in my previous comment, I meant when query is where 1=0.
@rxin Thank you for reviewing the patch. Just to make sure, I tested without the next() call on MySQL, Postgres, and DB2, and it worked fine. Updated the pull request.
Thanks. Merging this in master.
(Oops spoke too soon - I will merge after tests pass) |
Test build #1760 has finished for PR 8676 at commit
retest this please |
Test build #42504 has finished for PR 8676 at commit
@vanzin do you know what's going on with the tests?

`[error] Execution of test test.org.apache.spark.sql.JavaApplySchemaSuite failed: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.ExtendedYarnTest`
I've merged this. |
@rxin I reverted the patch that caused those. |
It has been merged to master. |
The current implementation uses a query with a LIMIT clause to find out whether the table already exists. This syntax works only in some database systems. This patch changes the default query to one that is likely to work on most databases, and adds a new method to the JdbcDialect abstract class to allow dialects to override the default query.
I looked at using the JDBC metadata calls; it turns out there is no common way to find the current schema, catalog, etc. There is a new method, Connection.getSchema(), but that is available only starting with JDK 1.7, and existing JDBC drivers may not have implemented it. Another option was to use the JDBC escape syntax clause for LIMIT, but I am not sure how well that is supported across databases either. After looking at all the JDBC metadata options, my conclusion was that the most common approach is to use a simple select query with 'WHERE 1=0', and to allow dialects to customize it as needed.