
FEAT: PySpark backend #1913

Merged
merged 18 commits into from
Aug 22, 2019
Merged

Conversation

icexelloss
Contributor

This is a PySpark backend for ibis. It is different from the Spark backend, where the ibis expr is compiled to a SQL string; instead, the PySpark backend compiles the ibis expr to pyspark.DataFrame exprs.
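To illustrate the approach, here is a minimal sketch (hypothetical names, not the PR's actual code) of a type-dispatched translator: each operation type registers a function that produces the target expression directly, rather than emitting a SQL string fragment.

```python
# Hedged sketch of a type-dispatched expression translator. The node
# classes and registry are illustrative stand-ins for ibis operation
# nodes and the backend's PySparkExprTranslator.
import operator


class Literal:
    def __init__(self, value):
        self.value = value


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right


_registry = {}


def compiles(node_type):
    """Register a translation function for a given node type."""
    def decorator(fn):
        _registry[node_type] = fn
        return fn
    return decorator


def translate(node):
    """Dispatch on the node's type and recurse into its arguments."""
    return _registry[type(node)](node)


@compiles(Literal)
def _translate_literal(node):
    return node.value


@compiles(Add)
def _translate_add(node):
    # In the real backend this would build a pyspark Column expression.
    return operator.add(translate(node.left), translate(node.right))
```

For example, `translate(Add(Literal(1), Literal(2)))` walks the expression tree and evaluates to `3`; the real backend would instead build up a lazy `pyspark.sql.Column` or `DataFrame`.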

@pep8speaks

pep8speaks commented Aug 7, 2019

Hello @icexelloss! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-22 18:41:54 UTC

@icexelloss
Copy link
Contributor Author

Currently this implementation passes test_aggregation and test_numeric. @hjoo and I are working on passing the rest of the tests under "all", but I'd like to have this up to see if this approach makes sense.

@icexelloss icexelloss changed the title Pyspark backend prototype PySpark backend Aug 7, 2019
Member

@cpcloud cpcloud left a comment

Basic approach looks good!


return decorator

def translate(self, expr, **kwargs):
Member

Same here, I think PySparkExprTranslator can be a pass through

elif isinstance(selection, types.ColumnExpr):
column_name = selection.get_name()
col_names_in_selection_order.append(column_name)
if column_name not in src_table.columns:
Member

Won't this do the wrong thing for expressions like t[['a', 'b']].mutate(a=lambda t: t.a + 1)? In that example you still need to compile the new expression for a even though src_table still has an a column.
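A toy illustration of the bug being pointed out (plain dicts and lambdas standing in for ibis tables and expressions, not ibis internals): reusing a source column purely because its name already exists skips the new expression that mutate rebinds to that name.

```python
# Stand-in for src_table's existing columns and their current values.
src_table = {"a": 1, "b": 2}

# t[['a', 'b']].mutate(a=lambda t: t.a + 1) rebinds 'a' to a NEW
# expression even though 'a' already exists in src_table.
selections = [("a", lambda tbl: tbl["a"] + 1), ("b", None)]


def compile_naive(table, sels):
    # Buggy: reuses the source column whenever the name already exists,
    # ignoring the attached expression.
    return {
        name: table[name] if name in table else expr(table)
        for name, expr in sels
    }


def compile_correct(table, sels):
    # Correct: always compile the expression when one is attached.
    return {
        name: expr(table) if expr is not None else table[name]
        for name, expr in sels
    }
```

The naive version returns the stale `a == 1`, while the correct version compiles the mutate expression and yields `a == 2`.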

Contributor Author

Yep. This is wrong. Let me fix it.



class PySpark(Backend, RoundAwayFromZero):

Member

You can kill the newline here.

Contributor Author

Removed

op = expr.op()

src_column = t.translate(op.arg)
return F.log(float(op.base.op().value), src_column)
Member

You probably want to translate the first argument here.

Contributor Author

Fixed

op = expr.op()

src_column = t.translate(op.arg)
scale = op.digits.op().value if op.digits is not None else 0
Member

I think you want to translate this instead of pulling out the value from the ibis literal

Contributor Author

Fixed

@cpcloud cpcloud added this to the Next Feature Release milestone Aug 7, 2019
@cpcloud cpcloud added the spark label Aug 7, 2019
return client._session.table(name)


def compile_table_and_cache(t, expr, cache):
Contributor Author

@icexelloss icexelloss Aug 9, 2019

@cpcloud I found that when compiling selections I kept recompiling the same TableExpr over and over again, because it is referenced from multiple places. I ended up adding a cache for expressions that have already been compiled and passing it around.

Is this a reasonable approach? How did you deal with similar problems in the pandas backend?
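The caching idea can be sketched as follows (a simplified, hypothetical sketch: `compile_node`, the counter, and the `id()`-based key are illustrative assumptions, not the PR's actual `compile_table_and_cache`):

```python
# Track how many times the "expensive" compilation runs, to show that
# shared subexpressions are compiled only once.
compile_calls = {"count": 0}


def compile_node(expr):
    # Stand-in for compiling an ibis TableExpr to a Spark DataFrame.
    compile_calls["count"] += 1
    return ("compiled", expr)


def compile_with_cache(expr, cache):
    # Key on object identity: an expression referenced from multiple
    # places in the tree is the same object, so later references hit
    # the cache instead of being recompiled.
    key = id(expr)
    if key not in cache:
        cache[key] = compile_node(expr)
    return cache[key]
```

Compiling the same shared expression twice through the cache then triggers only one real compilation and returns the same compiled object both times.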

@icexelloss icexelloss force-pushed the pyspark-backend-prototype branch 2 times, most recently from 14a9b60 to 4bb17f9, on August 15, 2019 14:49
Member

@cpcloud cpcloud left a comment

One more round, and this should be good to merge.

if context:
return col
else:
return t.translate(expr.op().arg.op().table, scope).select(col)
Member

Hm, what are you trying to pull out here with this chain of op accesses?

Contributor Author

@icexelloss icexelloss Aug 15, 2019

Hmm, this is a bit convoluted:

The expression is something like some_col.max().

What I need to translate that to is df.select(max(some_col)), to make it a lazy Spark expression.

The expr.op().arg.op().table chain is to get to the parent table of the column. I will add a comment.
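The shape of that translation can be sketched in plain Python (strings stand in for Spark columns and DataFrames; `translate_max` is a hypothetical name, not the backend's actual function):

```python
def translate_max(column_name, table_name, context=None):
    # Stand-in for F.max(t.translate(expr.op().arg)) in the real backend.
    col = f"max({column_name})"
    if context:
        # Inside an aggregation/group-by context, the bare aggregate
        # column is what the caller needs.
        return col
    # At top level, wrap the aggregate in a select on the column's
    # parent table (reached via expr.op().arg.op().table in the real
    # code) so the result stays a lazy table-like expression.
    return f"{table_name}.select({col})"
```

So `some_col.max()` on its own becomes the equivalent of `df.select(max(some_col))`, while inside an aggregation context only the `max(some_col)` column expression is produced.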


left_df = t.translate(op.left)
right_df = t.translate(op.right)
# TODO: Handle multiple predicates
Member

I see.


left_df = t.translate(op.left)
right_df = t.translate(op.right)
# TODO: Handle multiple predicates
Member

Can you write an xfailing test that will pass when this is implemented?

@cpcloud
Member

cpcloud commented Aug 15, 2019

Looks like there's a merge conflict as well.

Member

@cpcloud cpcloud left a comment

LGTM. Thanks @icexelloss

@cpcloud
Member

cpcloud commented Aug 20, 2019

Thanks @hjoo!

@codecov

codecov bot commented Aug 22, 2019

Codecov Report

Merging #1913 into master will decrease coverage by 1.47%.
The diff coverage is 96.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1913      +/-   ##
==========================================
- Coverage   87.46%   85.99%   -1.48%     
==========================================
  Files          89       93       +4     
  Lines       16405    16718     +313     
  Branches     2093     2120      +27     
==========================================
+ Hits        14349    14376      +27     
- Misses       1660     1943     +283     
- Partials      396      399       +3
Impacted Files Coverage Δ
ibis/pyspark/operations.py 100% <100%> (ø)
ibis/pyspark/api.py 100% <100%> (ø)
ibis/spark/api.py 88.23% <100%> (ø) ⬆️
ibis/spark/client.py 83.84% <100%> (ø) ⬆️
ibis/pyspark/client.py 92.3% <92.3%> (ø)
ibis/pyspark/compiler.py 96.41% <96.41%> (ø)
ibis/bigquery/client.py 41.1% <0%> (-53.39%) ⬇️
ibis/bigquery/compiler.py 59.92% <0%> (-37.5%) ⬇️
ibis/bigquery/udf/api.py 80.48% <0%> (-14.64%) ⬇️
ibis/impala/compiler.py 91.23% <0%> (-5.2%) ⬇️
... and 11 more

@cpcloud
Member

cpcloud commented Aug 22, 2019

Merging. Thanks @icexelloss!

@cpcloud cpcloud changed the title PySpark backend FEAT: PySpark backend Aug 22, 2019
@cpcloud cpcloud modified the milestones: Next Feature Release, Next Major Release Aug 22, 2019
@cpcloud cpcloud merged commit 99a2f2e into ibis-project:master Aug 22, 2019