Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13167][SQL] Include rows with null values for partition column when reading from JDBC datasources. #11063

Conversation

sureshthalamati
Copy link
Contributor

Rows with null values in partition column are not included in the results because none of the partition
where clause specify is null predicate on the partition column. This fix adds is null predicate on the partition column to the first JDBC partition where clause.

Example:
JDBCPartition(THEID < 1 or THEID is null, 0),JDBCPartition(THEID >= 1 AND THEID < 2,1),
JDBCPartition(THEID >= 2, 2)

@@ -213,14 +212,21 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
url: String,
table: String,
parts: Array[Partition],
connectionProperties: Properties): DataFrame = {
connectionProperties: Properties,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually breaks api compatibility
.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing the patch, Reynold.
This particular jdbc method where I made the signature changes is not public. It i defined as private def jdbc ..

@sureshthalamati
Copy link
Contributor Author

@rxin Thank you for reviewing the PR. As I mentioned in my comment, I did not change the public method. Any suggestions to improve this fix ?

@@ -45,7 +47,8 @@ private[sql] object JDBCRelation {
* incorrect values may cause the partitioning to be poor, but no data
* will fail to be represented.
*/
def columnPartition(partitioning: JDBCPartitioningInfo): Array[Partition] = {
def columnPartition(partitioning: JDBCPartitioningInfo,
schema: StructType, url: String): Array[Partition] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some documentation to this function to explain the parameters?

just do them with @param

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also 4 space indent for function params

@sureshthalamati sureshthalamati force-pushed the nullable_jdbc_part_col_spark-13167 branch from f4358bb to 1e6a631 Compare March 1, 2016 23:17
@sureshthalamati
Copy link
Contributor Author

Thanks for input, Reynold . Update the PR to specify the is null clause in the first partition where clause. Please review.

@rxin
Copy link
Contributor

rxin commented Mar 1, 2016

Thanks - can you update the pull request description to reflect the latest change?

@sureshthalamati
Copy link
Contributor Author

sure. Updated the description.

@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #2599 has finished for PR 11063 at commit 1e6a631.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Mar 2, 2016

Thanks - I'm merging this in master.

@asfgit asfgit closed this in e42724b Mar 2, 2016
@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #2600 has finished for PR 11063 at commit 1e6a631.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Copy link
Contributor Author

Thank you.

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
… when reading from JDBC datasources.

Rows with null values in partition column are not included in the results because none of the partition
where clause specify is null predicate on the partition column. This fix adds is null predicate on the partition column  to the first JDBC partition where clause.

Example:
JDBCPartition(THEID < 1 or THEID is null, 0),JDBCPartition(THEID >= 1 AND THEID < 2,1),
JDBCPartition(THEID >= 2, 2)

Author: sureshthalamati <[email protected]>

Closes apache#11063 from sureshthalamati/nullable_jdbc_part_col_spark-13167.
zzcclp added a commit to zzcclp/spark that referenced this pull request Aug 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants