
[SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC #10750

Closed · wants to merge 1 commit

Conversation

huaxingao (Contributor)

push down WHERE clause arithmetic operator to JDBC layer

For an arithmetic operator in a WHERE clause such as

```sql
select * from table where c1 + c2 > 10
```

the predicate `c1 + c2 > 10` is currently evaluated at the Spark layer. This change pushes it down to the JDBC layer so that it is evaluated in the database.
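To illustrate the intent, here is a minimal, self-contained sketch of what such a pushdown amounts to: compiling an arithmetic comparison into a SQL fragment that the database evaluates. All names here are illustrative; this is not the PR's code.

```scala
// Illustrative sketch only, not the PR's code: a tiny expression model and a
// compiler that turns `c1 + c2 > 10` into a WHERE fragment for the database.
object ArithmeticPushdownSketch {
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  def compile(e: Expr): String = e match {
    case Col(n)            => n
    case Lit(v)            => v.toString
    case Add(l, r)         => s"(${compile(l)} + ${compile(r)})"
    case GreaterThan(l, r) => s"${compile(l)} > ${compile(r)}"
  }

  def main(args: Array[String]): Unit = {
    val predicate = GreaterThan(Add(Col("c1"), Col("c2")), Lit(10))
    // Prints: SELECT * FROM table WHERE (c1 + c2) > 10
    println(s"SELECT * FROM table WHERE ${compile(predicate)}")
  }
}
```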

@AmplabJenkins

Can one of the admins verify this patch?

@huaxingao (Contributor, Author)

@viirya I changed the code based on your suggestion. Could you please review it again? Thanks a lot for your help!

```scala
private def translateArithemiticOPFilter(predicate: Expression): Option[Filter] = {
  predicate match {
    case expressions.EqualTo(Add(left, right), Literal(v, t)) =>
      Some(sources.ArithmeticOPEqualTo(Add(left, right), convertToScala(v, t)))
    // ... remaining arithmetic cases elided in this excerpt
    case _ => None
  }
}
```
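For readers following along: `sources.ArithmeticOPEqualTo` is a new public filter this PR introduces; it is not part of the existing `org.apache.spark.sql.sources` filter set. Judging only from the call site above, its shape would be roughly as follows (a guess, not the PR's actual definition):

```scala
// Guessed from the call site above; the PR's actual definition may differ.
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.sources.Filter

case class ArithmeticOPEqualTo(expression: Expression, value: Any) extends Filter
```

Note that the Catalyst `Expression` appears in the public signature, which is exactly the leakage of internal types that the review comments below object to.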
(Member)

As described in SPARK-10195, it looks like the data sources API now exposes Catalyst's internal types through its Filter interfaces. I think these might have to be hidden.

(Member)
I took a look at SPARK-10195. It deals with the issue of exposing internal data types, and it uses convertToScala to convert those internal types to their Scala versions. Since convertToScala is used here only to convert the values, I don't think it is the same problem.
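For reference, `convertToScala` comes from `org.apache.spark.sql.catalyst.CatalystTypeConverters` and maps Catalyst's internal representations back to plain Scala/Java values, for example:

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToScala
import org.apache.spark.sql.types.{IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

object ConvertToScalaExample extends App {
  // Catalyst stores strings internally as UTF8String; convertToScala maps the
  // internal value back to an ordinary java.lang.String.
  println(convertToScala(UTF8String.fromString("spark"), StringType)) // spark

  // Primitive types such as Int pass through unchanged.
  println(convertToScala(10, IntegerType)) // 10
}
```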

(Member)
Hm.. I see, but it still looks to me like expressions._ should be hidden. With this approach, expressions._ becomes accessible at the datasource level. I believe the reason sources._ exists is to hide expressions._, which has been changing rapidly from version to version.

@HyukjinKwon (Member)

Since sources.Filter is shared with Parquet, ORC, and other datasources, I think the arithmetic operators might have to be resolved in DataSourceStrategy itself.

AFAIK, Parquet and ORC do not support arithmetic operators, so if we support this in this way they would anyway have to convert them back on the Spark side in the future. So, for this case, I think the operators might have to be resolved in DataSourceStrategy.

I believe we might better resolve this issue by implementing CatalystScan, as suggested by @liancheng in SPARK-9182 and filed in SPARK-12126.

@HyukjinKwon (Member)

If we keep solving this in DataSourceStrategy in this way, I think we should also resolve the operators for the other datasources in DataSourceStrategy. For that, dealing with Cast (SPARK-9182) might have to be done first.

@HyukjinKwon (Member)

Actually, I suggested an approach similar to this one before, in SPARK-9182. If we keep adding filters in this way, we could end up converting every Catalyst expression to a sources.Filter. That would effectively mean writing a new expression library, which might not be worth the effort.

And this is why I raised all of the issues above in the JIRA issue, SPARK-12506.

@HyukjinKwon (Member)

@huaxingao please change the title so that it does not end with …

@rxin (Contributor) commented on Jan 14, 2016

Yes, I think using the internal expression API makes more sense. We don't want to add too many expressions to the external data source API.

@viirya (Member) commented on Jan 14, 2016

Indeed, continuing to add more filters will be a problem. It would be better if we could pass Catalyst expressions directly to the JDBC datasource.

@HyukjinKwon (Member)

@viirya Yes, I think so. But the reason I did not try that is that expressions._ changes rapidly, which could break datasource code implemented against CatalystScan from version to version. I believe this is also why the Parquet datasource moved from its CatalystScan implementation to another one.

So maybe we should try to find a better solution for this.

@huaxingao changed the title from "[SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC …" to "[SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC" on Jan 14, 2016
@viirya (Member) commented on Jan 27, 2016

I think most of the expressions commonly used in filters (such as >, >=, <, <=, ==, string ops, and arithmetic ops) are relatively stable now. Maybe we can let the JDBC datasource implement CatalystScan and process these expressions.

@huaxingao (Contributor, Author)

@viirya @HyukjinKwon @rxin Thank you all very much for your comments. I will change JDBCRelation to implement CatalystScan, and then access the Catalyst expressions directly in JDBCRDD. I will close this PR and submit a new one.
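For context on that plan: CatalystScan (in org.apache.spark.sql.sources) hands the relation the raw Catalyst expressions, so the JDBC side can translate what it understands and leave the rest to Spark. Below is a rough sketch of that direction; the buildScan signature is CatalystScan's real one, while JdbcCatalystScanSketch, compileExpression, and scanWithWhereClause are hypothetical names invented for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.sources.{BaseRelation, CatalystScan}

// Rough sketch of the direction discussed above: JDBCRelation implementing
// CatalystScan so it receives Catalyst expressions rather than sources.Filter.
trait JdbcCatalystScanSketch extends BaseRelation with CatalystScan {

  override def buildScan(requiredColumns: Seq[Attribute],
                         filters: Seq[Expression]): RDD[Row] = {
    // Push down whatever we can translate; Spark re-evaluates the rest.
    val pushed = filters.flatMap(compileExpression)
    scanWithWhereClause(requiredColumns.map(_.name), pushed)
  }

  // Hypothetical helper: translate a Catalyst expression into a SQL fragment,
  // returning None for anything this datasource cannot handle.
  private def compileExpression(e: Expression): Option[String] = e match {
    case GreaterThan(l, r) =>
      for (ls <- compileExpression(l); rs <- compileExpression(r))
        yield s"$ls > $rs"
    case Add(l, r) =>
      for (ls <- compileExpression(l); rs <- compileExpression(r))
        yield s"($ls + $rs)"
    case Literal(v, _)                     => Some(v.toString)
    case AttributeReference(name, _, _, _) => Some(name)
    case _                                 => None // unsupported: left to Spark
  }

  // Hypothetical helper: run the JDBC query with the pushed-down predicates.
  protected def scanWithWhereClause(columns: Seq[String],
                                    predicates: Seq[String]): RDD[Row]
}
```

With this shape, `c1 + c2 > 10` compiles to `(c1 + c2) > 10` and ends up in the database's WHERE clause, without adding any new classes to the public sources.Filter API.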
