
[Improvement] Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior #92

Merged: 13 commits, Feb 20, 2023

Conversation

@acezen acezen commented Feb 14, 2023

Proposed changes

This PR improves the GraphAr Spark writer's performance with the following changes:

  • Revise the edge-DataFrame-to-edge-chunk partition process:
    • Sort the edges and repartition them with a custom partitioner to split the edge chunks (see the partitioner sketch after this list)
  • Implement custom writer builders for CSV, Parquet, and ORC to bypass Spark's default write process
  • Implement a custom commit protocol class GarCommitProtocol to change the file naming strategy of Spark writes
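
As an illustration of the repartitioning idea, here is a minimal sketch of a custom Spark Partitioner that routes each record to the chunk its global edge index falls into. The class name EdgeChunkPartitioner and the chunkSize parameter are assumptions for the sketch, not the exact GraphAr implementation:

import org.apache.spark.Partitioner

// Route each (edgeIndex, row) pair to the partition corresponding to its
// edge chunk, so one Spark partition ends up holding exactly one chunk.
class EdgeChunkPartitioner(numChunks: Int, chunkSize: Long) extends Partitioner {
  override def numPartitions: Int = numChunks

  override def getPartition(key: Any): Int = {
    val edgeIndex = key.asInstanceOf[Long]
    (edgeIndex / chunkSize).toInt
  }
}

// Usage on an RDD keyed by global edge index: repartition and keep rows
// sorted within each partition in a single shuffle.
// val chunked = indexedRdd.repartitionAndSortWithinPartitions(
//   new EdgeChunkPartitioner(numChunks, chunkSize))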

Types of changes

What types of changes does your code introduce to GraphAr?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

#68

@acezen acezen changed the title from "Improve GraphAr spark writer performance by adding a custom" to "Improve GraphAr spark writer performance by adding a custom CommitProtocal" on Feb 14, 2023
@acezen acezen force-pushed the 68-spark-writer-improve branch from 3058940 to 566e853 on February 15, 2023 17:34
github-actions bot commented Feb 15, 2023

🎊 PR Preview 04e1c3d has been successfully built and deployed to https://alibaba-graphar-build-pr-92.surge.sh

🤖 By surge-preview

@acezen acezen force-pushed the 68-spark-writer-improve branch 2 times, most recently from caaabaf to d5dfbe0 on February 16, 2023 08:00
@acezen acezen marked this pull request as ready for review February 16, 2023 08:00
@acezen acezen changed the title from "Improve GraphAr spark writer performance by adding a custom CommitProtocal" to "Improve GraphAr spark writer performance by implementing WriterBuilder for csv/paruqet/orc" on Feb 16, 2023
@acezen acezen requested a review from lixueclaire February 16, 2023 11:16
@acezen acezen changed the title from "Improve GraphAr spark writer performance by implementing WriterBuilder for csv/paruqet/orc" to "Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior" on Feb 17, 2023
@acezen acezen changed the title from "Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior" to "[Improvement] Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior" on Feb 17, 2023
@@ -26,5 +26,7 @@ public class GeneralParams {
public static final String vertexChunkIndexCol = "_graphArVertexChunkIndex";
public static final String edgeIndexCol = "_graphArEdgeIndex";
public static final String regularSeperator = "_";
public static final String offsetStartChunkIndexKey = "_graphar_offste_start_chunk_index";
Contributor commented:

"offste" -> "offset"

var mid = 0
while (low <= high) {
mid = (high + low) / 2;
if (aggNums(mid) <= key && aggNums(mid + 1) > key) {
Contributor commented:

aggNums(mid + 1) may access an invalid address when low == high == mid == aggNums.length-1

@acezen acezen (Author) replied Feb 17, 2023:

Actually, in our scenario the key always lands on an index < aggNums.length - 1, but we can revise the condition to if (aggNums(mid) <= key && (mid == length - 1 || aggNums(mid + 1) > key)) to be safe.
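
For reference, a self-contained sketch of the guarded lookup described above, assuming aggNums is a sorted array of aggregated counts as in the snippet; the function name binarySearch is illustrative:

def binarySearch(aggNums: Array[Long], key: Long): Int = {
  var low = 0
  var high = aggNums.length - 1
  var mid = 0
  while (low <= high) {
    mid = (high + low) / 2
    // The extra (mid == aggNums.length - 1) guard keeps aggNums(mid + 1)
    // from reading past the end when the key falls in the last bucket.
    if (aggNums(mid) <= key && (mid == aggNums.length - 1 || aggNums(mid + 1) > key)) {
      return mid
    } else if (aggNums(mid) > key) {
      high = mid - 1
    } else {
      low = mid + 1
    }
  }
  mid
}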

val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(
map.asCaseSensitiveMap().asScala.toMap)
shortName() + " " + paths.map(qualifiedPathName(_, hadoopConf)).mkString(",")
}
Contributor commented:

val name = shortName() + " " + paths.map(qualifiedPathName(_, hadoopConf)).mkString(",")
Utils.redact(sparkSession.sessionState.conf.stringRedactionPattern, name)

I'm not sure the redact() function in the original code is required for our case

val edgeNumOfVertexChunks = sortedDfRDD.mapPartitions(iterator => {
iterator.map(row => (row(colIndex).asInstanceOf[Long] / vertexChunkSize, 1))
}).reduceByKey(_ + _).collectAsMap()
val vertexChunkNum = edgeNumOfVertexChunks.size
Contributor commented:

set vertexChunkNum = Max(VertexChunkId) + 1 to handle the case when there are no edges for a vertex chunk
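
As a minimal sketch of that suggestion (assuming edgeNumOfVertexChunks is the map collected in the snippet above):

// Count chunks from the largest observed vertex chunk id, so vertex chunks
// with no edges in the middle of the range are still accounted for.
val vertexChunkNum: Int = (edgeNumOfVertexChunks.keys.max + 1).toInt

Note this still cannot see trailing empty chunks, which is why the eventual fix passes the vertex number instead, as the reply below describes.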

@acezen acezen (Author) replied:

Resolved by passing a vertex number argument.

// generate global edge id for each record of dataframe
val parition_counts = df_rdd
val edgeSchema = edgeDf.schema
val colIndex = edgeSchema.fieldIndex(if (adjListType == AdjListType.ordered_by_source) GeneralParams.srcIndexCol else GeneralParams.dstIndexCol)
Contributor commented:

Is this function repartitionAndSort() only required for AdjListType.ordered_by_source & AdjListType.ordered_by_dest? It seems that it is called for all types.

val vertexChunkNum: Int = ((vertexNumOfPrimaryVertexLabel + vertexChunkSize - 1) / vertexChunkSize).toInt // ceil

// sort by primary key and generate continue edge id for edge records
val sortedDfRDD = edgeDf.sort(GeneralParams.srcIndexCol).rdd
Contributor commented:

"GeneralParams.srcIndexCol" -> "colIndex"

@acezen acezen force-pushed the 68-spark-writer-improve branch from 1f3b0df to 91dca36 on February 20, 2023 02:08
@acezen acezen force-pushed the 68-spark-writer-improve branch from c38bfb7 to 552e5c6 on February 20, 2023 06:22
@lixueclaire lixueclaire (Contributor) left a review:

LGTM.
