[Spark] Support property filter pushdown by utilizing payload file formats #221
Conversation
cc @lixueclaire @acezen. Currently, reading a single property group is easy to push down:

```scala
val property_group = vertex_info.getPropertyGroup("gender")
// test reading a single property chunk
val single_chunk_df = reader.readVertexPropertyChunk(property_group, 0)
assert(single_chunk_df.columns.length == 3)
assert(single_chunk_df.count() == 100)
val cond = "gender = 'female'"
var df_pd = single_chunk_df.select("firstName", "gender").filter(cond)
df_pd.explain()
df_pd.show()
```

```
== Physical Plan ==
*(1) Filter (isnotnull(gender#2) AND (gender#2 = female))
+- *(1) ColumnarToRow
   +- BatchScan[firstName#0, gender#2] GarScan DataFilters: [isnotnull(gender#2), (gender#2 = female)], Format: gar, Location: InMemoryFileIndex(1 paths)[file:/home/simple/code/cpp/GraphAr/spark/src/test/resources/gar-test/l..., PartitionFilters: [], PushedFilters: [IsNotNull(gender), EqualTo(gender,female)], ReadSchema: struct<firstName:string,gender:string> RuntimeFilters: []

+------------+------+
|   firstName|gender|
+------------+------+
|         Eli|female|
|      Joseph|female|
|        Jose|female|
|         Jun|female|
|       A. C.|female|
|       Karim|female|
|        Poul|female|
|       Chipo|female|
|       Dovid|female|
|       Ashin|female|
|         Cam|female|
|        Kurt|female|
|Daouda Malam|female|
|       David|female|
|      Batong|female|
|       Zheng|female|
|     Gabriel|female|
|       Boris|female|
|        Jose|female|
|    Fernando|female|
+------------+------+
```

But it is difficult to push down when reading multiple property groups:

```scala
val vertex_df_with_index = reader.readAllVertexPropertyGroups()
assert(vertex_df_with_index.columns.length == 5)
assert(vertex_df_with_index.count() == 903)
df_pd = vertex_df_with_index.filter(cond).select("firstName", "gender")
df_pd.explain()
df_pd.show()
```

```
== Physical Plan ==
*(1) Project [firstName#196, gender#198]
+- *(1) Filter (isnotnull(gender#198) AND (gender#198 = female))
   +- *(1) Scan ExistingRDD[_graphArVertexIndex#194L,id#195L,firstName#196,lastName#197,gender#198]

+------------+------+
|   firstName|gender|
+------------+------+
|         Eli|female|
|      Joseph|female|
|        Jose|female|
|         Jun|female|
|       A. C.|female|
|       Karim|female|
|        Poul|female|
|       Chipo|female|
|       Dovid|female|
|       Ashin|female|
|         Cam|female|
|        Kurt|female|
|Daouda Malam|female|
|       David|female|
|      Batong|female|
|       Zheng|female|
|     Gabriel|female|
|       Boris|female|
|        Jose|female|
|    Fernando|female|
+------------+------+
```

This is because different property groups are actually stored in different Parquet files.
Hi @Ziy1-Tan, thanks for your proposal. The current state is OK for me. We do not intend to propose a method to support filter pushdown across different Parquet files. In GraphAr's design, properties that are accessed together are encouraged to be placed in the same group, so pushdown is supported in that case as well.
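To illustrate that design guidance with a pure-Scala sketch (hypothetical names, not the GraphAr API): when the columns accessed together live in one property group, a single scan of that group's file can serve both the projection and the filter, so no cross-file join is needed and pushdown applies.

```scala
// Hypothetical sketch: one property group holding both columns that a
// query touches, so one scan covers the select and the filter.
object GroupingSketch {
  // One row per vertex, both properties stored in the same group file.
  final case class Row(firstName: String, gender: String)

  val personGroup: Seq[Row] =
    Seq(Row("Eli", "female"), Row("Bob", "male"), Row("Ann", "female"))

  // Single scan: predicate and projection both evaluated at the source.
  def scan(pred: Row => Boolean): Seq[(String, String)] =
    personGroup.filter(pred).map(r => (r.firstName, r.gender))
}
```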
Got it. I'm going to test the performance improvement based on Spark filter pushdown.
We don't have a large-scale LDBC dataset yet. I can generate a copy of ldbc-sf30 and ldbc-100 and upload it to OSS for the performance test.
Got it. Can't wait to test the performance improvement.
Hi @Ziy1-Tan, the large-scale LDBC dataset has been published here; you can download it to test the performance.
Hi @Ziy1-Tan, the code format for Scala and Java has been merged into main; you can rebase and apply the format with
```scala
assert(property_df.columns.size == 3)
val cond = "gender = 'female'"
var df_pd = single_chunk_df.select("firstName", "gender").filter(cond)
df_pd.explain()
```
Could you please include the resulting physical plan in the comments? This would effectively demonstrate the filter pushdown in a more intuitive manner.
OK, I will add the physical plan to it.
Signed-off-by: Ziy1-Tan <[email protected]>
LGTM~ This is highly appreciated and valuable, thank you for your contribution!
This PR is about C++ SDK for OSPP 2023
Issue number: #98.
You can find more details about this feature here.
Proposed changes
Now we support filter pushdown for Spark.
Types of changes
What types of changes does your code introduce to GraphAr?
Put an `x` in the boxes that apply.

Checklist

Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.