
FEAT: PySpark backend #1913

Merged
merged 18 commits into from
Aug 22, 2019
Merged

Conversation

icexelloss
Contributor

This is a PySpark backend for ibis. It is different from the Spark backend, where the ibis expr is compiled to a SQL string; instead, the PySpark backend compiles the ibis expr to pyspark.DataFrame exprs.
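To illustrate the approach, here is a minimal sketch (hypothetical names, not the PR's actual code) of a type-dispatched translator: each operation type registers a function that produces the target expression directly, rather than emitting a SQL string fragment.

```python
# Hedged sketch of a type-dispatched expression translator. The node
# classes and registry are illustrative stand-ins for ibis operation
# nodes and the backend's PySparkExprTranslator.
import operator


class Literal:
    def __init__(self, value):
        self.value = value


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right


_registry = {}


def compiles(node_type):
    """Register a translation function for a given node type."""
    def decorator(fn):
        _registry[node_type] = fn
        return fn
    return decorator


def translate(node):
    """Dispatch on the node's type and recurse into its arguments."""
    return _registry[type(node)](node)


@compiles(Literal)
def _translate_literal(node):
    return node.value


@compiles(Add)
def _translate_add(node):
    # In the real backend this would build a pyspark Column expression.
    return operator.add(translate(node.left), translate(node.right))
```

For example, `translate(Add(Literal(1), Literal(2)))` walks the expression tree and evaluates to `3`; the real backend would instead build up a lazy `pyspark.sql.Column` or `DataFrame`.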

@pep8speaks

pep8speaks commented Aug 7, 2019

Hello @icexelloss! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-22 18:41:54 UTC

@icexelloss
Copy link
Contributor Author

Currently this implementation passes test_aggregation and test_numeric. @hjoo and I are working on passing the rest of the tests under "all", but I'd like to have this up to see if this approach makes sense.

@icexelloss icexelloss changed the title Pyspark backend prototype PySpark backend Aug 7, 2019
Member

@cpcloud cpcloud left a comment

Basic approach looks good!


return decorator

def translate(self, expr, **kwargs):
Member

Same here, I think PySparkExprTranslator can be a pass through

elif isinstance(selection, types.ColumnExpr):
column_name = selection.get_name()
col_names_in_selection_order.append(column_name)
if column_name not in src_table.columns:
Member

Won't this do the wrong thing for expressions like t[['a', 'b']].mutate(a=lambda t: t.a + 1)? In that example you still need to compile the new expression for a even though src_table still has an a column.
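A toy illustration of the bug being pointed out (plain dicts and lambdas standing in for ibis tables and expressions, not ibis internals): reusing a source column purely because its name already exists skips the new expression that mutate rebinds to that name.

```python
# Stand-in for src_table's existing columns and their current values.
src_table = {"a": 1, "b": 2}

# t[['a', 'b']].mutate(a=lambda t: t.a + 1) rebinds 'a' to a NEW
# expression even though 'a' already exists in src_table.
selections = [("a", lambda tbl: tbl["a"] + 1), ("b", None)]


def compile_naive(table, sels):
    # Buggy: reuses the source column whenever the name already exists,
    # ignoring the attached expression.
    return {
        name: table[name] if name in table else expr(table)
        for name, expr in sels
    }


def compile_correct(table, sels):
    # Correct: always compile the expression when one is attached.
    return {
        name: expr(table) if expr is not None else table[name]
        for name, expr in sels
    }
```

The naive version returns the stale `a == 1`, while the correct version compiles the mutate expression and yields `a == 2`.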

Contributor Author

Yep. This is wrong. Let me fix it.



class PySpark(Backend, RoundAwayFromZero):

Member

You can kill the newline here.

Contributor Author

Removed

op = expr.op()

src_column = t.translate(op.arg)
return F.log(float(op.base.op().value), src_column)
Member

You probably want to translate the first argument here.

Contributor Author

Fixed

op = expr.op()

src_column = t.translate(op.arg)
scale = op.digits.op().value if op.digits is not None else 0
Member

I think you want to translate this instead of pulling out the value from the ibis literal

Contributor Author

Fixed

@cpcloud cpcloud added this to the Next Feature Release milestone Aug 7, 2019
@cpcloud cpcloud added the spark label Aug 7, 2019
return client._session.table(name)


def compile_table_and_cache(t, expr, cache):
Contributor Author

@icexelloss icexelloss Aug 9, 2019

@cpcloud I found that when compiling selections I kept recompiling the same TableExpr over and over again, because it is referenced from multiple places. I ended up adding a cache for expressions that have already been compiled and passing it around.

Is this a reasonable approach? How did you deal with similar problems in the pandas backend?
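The caching idea can be sketched as follows (a simplified, hypothetical sketch: `compile_node`, the counter, and the `id()`-based key are illustrative assumptions, not the PR's actual `compile_table_and_cache`):

```python
# Track how many times the "expensive" compilation runs, to show that
# shared subexpressions are compiled only once.
compile_calls = {"count": 0}


def compile_node(expr):
    # Stand-in for compiling an ibis TableExpr to a Spark DataFrame.
    compile_calls["count"] += 1
    return ("compiled", expr)


def compile_with_cache(expr, cache):
    # Key on object identity: an expression referenced from multiple
    # places in the tree is the same object, so later references hit
    # the cache instead of being recompiled.
    key = id(expr)
    if key not in cache:
        cache[key] = compile_node(expr)
    return cache[key]
```

Compiling the same shared expression twice through the cache then triggers only one real compilation and returns the same compiled object both times.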

@icexelloss icexelloss force-pushed the pyspark-backend-prototype branch 2 times, most recently from 14a9b60 to 4bb17f9, on August 15, 2019 14:49
Member

@cpcloud cpcloud left a comment

One more round, and this should be good to merge.

if context:
return col
else:
return t.translate(expr.op().arg.op().table, scope).select(col)
Member

Hm, what are you trying to pull out here with this chain of op accesses?

Contributor Author

@icexelloss icexelloss Aug 15, 2019

Hmm, this is a bit convoluted:

The expression is something like some_col.max().

What I need to translate that to is df.select(max(some_col)), to make it a lazy Spark expression.

The expr.op().arg.op().table chain is to get to the parent table of the column. I will add a comment.
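The shape of that translation can be sketched in plain Python (strings stand in for Spark columns and DataFrames; `translate_max` is a hypothetical name, not the backend's actual function):

```python
def translate_max(column_name, table_name, context=None):
    # Stand-in for F.max(t.translate(expr.op().arg)) in the real backend.
    col = f"max({column_name})"
    if context:
        # Inside an aggregation/group-by context, the bare aggregate
        # column is what the caller needs.
        return col
    # At top level, wrap the aggregate in a select on the column's
    # parent table (reached via expr.op().arg.op().table in the real
    # code) so the result stays a lazy table-like expression.
    return f"{table_name}.select({col})"
```

So `some_col.max()` on its own becomes the equivalent of `df.select(max(some_col))`, while inside an aggregation context only the `max(some_col)` column expression is produced.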


left_df = t.translate(op.left)
right_df = t.translate(op.right)
# TODO: Handle multiple predicates
Member

I see.


left_df = t.translate(op.left)
right_df = t.translate(op.right)
# TODO: Handle multiple predicates
Member

Can you write an xfailing test that will pass when this is implemented?

@cpcloud
Member

cpcloud commented Aug 15, 2019

Looks like there's a merge conflict as well.

Member

@cpcloud cpcloud left a comment

LGTM. Thanks @icexelloss

@cpcloud
Member

cpcloud commented Aug 20, 2019

Thanks @hjoo!

@codecov

codecov bot commented Aug 22, 2019

Codecov Report

Merging #1913 into master will decrease coverage by 1.47%.
The diff coverage is 96.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1913      +/-   ##
==========================================
- Coverage   87.46%   85.99%   -1.48%     
==========================================
  Files          89       93       +4     
  Lines       16405    16718     +313     
  Branches     2093     2120      +27     
==========================================
+ Hits        14349    14376      +27     
- Misses       1660     1943     +283     
- Partials      396      399       +3
Impacted Files Coverage Δ
ibis/pyspark/operations.py 100% <100%> (ø)
ibis/pyspark/api.py 100% <100%> (ø)
ibis/spark/api.py 88.23% <100%> (ø) ⬆️
ibis/spark/client.py 83.84% <100%> (ø) ⬆️
ibis/pyspark/client.py 92.3% <92.3%> (ø)
ibis/pyspark/compiler.py 96.41% <96.41%> (ø)
ibis/bigquery/client.py 41.1% <0%> (-53.39%) ⬇️
ibis/bigquery/compiler.py 59.92% <0%> (-37.5%) ⬇️
ibis/bigquery/udf/api.py 80.48% <0%> (-14.64%) ⬇️
ibis/impala/compiler.py 91.23% <0%> (-5.2%) ⬇️
... and 11 more

@cpcloud
Member

cpcloud commented Aug 22, 2019

Merging. Thanks @icexelloss!

@cpcloud cpcloud changed the title PySpark backend FEAT: PySpark backend Aug 22, 2019
@cpcloud cpcloud modified the milestones: Next Feature Release, Next Major Release Aug 22, 2019
@cpcloud cpcloud merged commit 99a2f2e into ibis-project:master Aug 22, 2019