
Add pyspark dataframe_loader #2790

Merged
merged 5 commits into dagster-io:master on Aug 20, 2020

Conversation

DavidKatz-il
Contributor

Add dataframe_loader to support inputs.

@DavidKatz-il DavidKatz-il changed the title Add dataframe_loader Add pyspark dataframe_loader Aug 7, 2020
@DavidKatz-il DavidKatz-il changed the title Add pyspark dataframe_loader Add pyspark dataframe_loader Aug 7, 2020
@DavidKatz-il DavidKatz-il force-pushed the add-pyspark-dataframe-loader branch 2 times, most recently from 33bc81b to 0ef8287 Compare August 10, 2020 21:22
@natekupp
Contributor

hey @sryza can you take a look at this one too?

@sryza
Contributor

sryza commented Aug 11, 2020

Thanks for your contribution @DavidKatz-il ! I'm impressed with your thoroughness.

An important consideration is that Dagster should be able to support any version of Spark. I'm concerned that wrapping every individual option of every Spark input format in Dagster config will tie us too closely to specific Spark versions: Spark might add or remove an option, and then it becomes unclear which Spark version Dagster's configuration should reflect.

Although it doesn't give us the same advantage of having all the options nicely displayed in the Dagit config editor, I think a safer approach would be to use permissive dictionaries for the set of options within each input format. E.g.

'jdbc': {},
'orc': {},

This allows Spark itself to remain as the arbiter of what config options are available.
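As a rough illustration of that division of labor (the helper and reader interface below are hypothetical stand-ins, not Dagster's actual API): the loader validates only which format was selected, and forwards the option dict to Spark untouched, so Spark stays the arbiter of what is valid.

```python
# Hedged sketch of the "permissive" idea: the config schema checks only
# the format selector; every option inside it is handed to Spark's
# reader verbatim. `load_dataframe` is a made-up name for illustration.

def load_dataframe(reader, config):
    # config looks like {'csv': {'path': '/data/in.csv', 'sep': '|'}}
    ((file_type, options),) = config.items()
    opts = dict(options)
    path = opts.pop('path')  # every format still needs a path
    return reader.format(file_type).options(**opts).load(path)
```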

@DavidKatz-il
Contributor Author

Hey @sryza, thanks for reviewing.
Right now the user can add options that we didn't configure, because we are using Permissive.
So are you suggesting that we use an empty Permissive?
e.g.

'jdbc': Permissive(),
'orc': Permissive(),
...

If so, what about options that are required, e.g. 'path'?
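For what it's worth, the two concerns can coexist: if Permissive accepts a dict of declared fields (e.g. `Permissive({'path': Field(String)})`), the required keys can be spelled out while every extra key still passes through. A plain-Python stand-in for that validation rule, with hypothetical names:

```python
# Stand-in for "Permissive with required keys": unknown options pass
# through untouched, but missing required ones fail fast. The function
# name and rule here are illustrative, not Dagster's actual API.

def check_options(options, required=('path',)):
    missing = [key for key in required if key not in options]
    if missing:
        raise ValueError('missing required option(s): %s' % ', '.join(missing))
    return dict(options)  # extra keys are allowed through unchanged
```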

@sryza
Contributor

sryza commented Aug 12, 2020

Thought about it a little bit more, and I think you're right. As long as we use Permissive, it's helpful to include the detailed config.

I just launched a build for the PR.

@flvndh
Contributor

flvndh commented Aug 12, 2020

Great contribution @DavidKatz-il! What do you think of adding support for Apache Iceberg and Delta Lake as well?

@DavidKatz-il
Contributor Author

Hi @flvndh, does it require packages outside pyspark?

@flvndh
Contributor

flvndh commented Aug 12, 2020

@DavidKatz-il just JAR packages: org.apache.iceberg:iceberg-spark3-runtime:0.9.0 and io.delta:delta-core_2.12:0.7.0.
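For reference, one common way to attach those coordinates is Spark's `spark.jars.packages` setting at session construction; a small sketch (the session line is commented out since it needs pyspark and a running JVM):

```python
# Building the spark.jars.packages value for the two extra formats.
# The coordinates are the ones quoted above; the commented lines show
# where they would be applied when creating the SparkSession.
ICEBERG = "org.apache.iceberg:iceberg-spark3-runtime:0.9.0"
DELTA = "io.delta:delta-core_2.12:0.7.0"
packages = ",".join([ICEBERG, DELTA])
# spark = (SparkSession.builder
#          .config("spark.jars.packages", packages)
#          .getOrCreate())
```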

@DavidKatz-il
Contributor Author

@sryza what do you think?

@sryza
Contributor

sryza commented Aug 12, 2020

This change looks ready to go in. I think adding support for Iceberg and Delta Lake would be cool. @DavidKatz-il - if you're interested in adding those to this PR, then great. If you'd like me to merge this as-is, also great.

@DavidKatz-il
Contributor Author

I added the file_type 'other'.

e.g.

'other': {
    'path': 'path_to_file',
    'format': 'delta',
    ...
}
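A hedged sketch of how such an 'other' entry could map onto Spark's generic reader; the helper name is made up for illustration, and only the `format`/`options`/`load` chain reflects Spark's actual reader API:

```python
# Sketch: the 'other' escape hatch pops 'path' and 'format' out of the
# config entry and hands every remaining key to Spark as a reader
# option. `load_other` is a hypothetical helper, not code from this PR.

def load_other(reader, config):
    opts = dict(config)       # e.g. {'path': ..., 'format': 'delta', ...}
    fmt = opts.pop('format')
    path = opts.pop('path')
    return reader.format(fmt).options(**opts).load(path)
```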

@@ -66,3 +74,28 @@ def return_df(_):
assert result.success
actual = read(temp_path)
assert sorted(df.collect()) == sorted(actual.collect())


@pytest.mark.parametrize(
Contributor

Would you mind adding tests for the "other" cases for input and output?

@sryza
Contributor

sryza commented Aug 18, 2020

@schrockn brought up the question of whether we should avoid the "other" escape hatch in favor of making it easy to compose type materializers: https://dagster.phacility.com/D4188.

@DavidKatz-il - it could take some time for us to reach some resolution on that, but I still think the core of this PR, which adds support for the native Spark formats, is useful. If you want to take out the "other" option for now, I can merge the rest of it. Thanks for bearing with us on the thrash here.

EDIT: it looks like we actually got to a resolution on that issue already ^^, and the escape hatch you added here seems like a reasonable direction. It would still be good to have tests for it.

'compression': 'gzip',
}
file_type: dict(
{'mode': 'overwrite', 'compression': 'gzip',}, **options,
Contributor

I'm noticing a syntax error here in the Python 2.7 tests: https://buildkite.com/dagster/dagster/builds/14227#9a8db1d9-518e-4491-87a8-d247ca90946c

Contributor Author

@DavidKatz-il DavidKatz-il Aug 20, 2020


The comma at the end of the line caused this error on Python 2.7.
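To make the failure concrete: Python 2.7 rejects a trailing comma after `**kwargs` in a call, while the same call without it parses on both 2 and 3. A minimal reproduction of the merge the snippet above performs:

```python
# Python 2.7 raises SyntaxError on a trailing comma after **kwargs:
#     dict({'mode': 'overwrite'}, **options,)   # fails to parse on 2.7
# Dropping the comma is valid on Python 2 and 3 alike:
options = {'sep': '|'}
merged = dict({'mode': 'overwrite', 'compression': 'gzip'}, **options)
# merged now holds the defaults plus the user-supplied option
```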

@sryza
Contributor

sryza commented Aug 20, 2020

This looks good. Merging. Thanks for your contribution @DavidKatz-il !

@sryza sryza merged commit acfcbba into dagster-io:master Aug 20, 2020
@DavidKatz-il DavidKatz-il deleted the add-pyspark-dataframe-loader branch August 20, 2020 16:25
4 participants