
[Improvement] Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior #92

Merged: 13 commits, Feb 20, 2023

Conversation

@acezen acezen commented Feb 14, 2023

Proposed changes

This PR improves the GraphAr Spark writer's performance with the following changes:

  • Revise the edge-DataFrame-to-edge-chunk partition process:
    • Sort the edges and repartition them with a custom partitioner to split the edge chunks (see the partitioner sketch after this list)
  • Implement custom writer builders for CSV, Parquet, and ORC to bypass Spark's default write process
  • Implement a custom commit protocol class GarCommitProtocol to change the file naming strategy of Spark writes
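
As an illustration of the repartitioning idea, here is a minimal sketch of a custom Spark Partitioner that routes each record to the chunk its global edge index falls into. The class name EdgeChunkPartitioner and the chunkSize parameter are assumptions for the sketch, not the exact GraphAr implementation:

import org.apache.spark.Partitioner

// Route each (edgeIndex, row) pair to the partition corresponding to its
// edge chunk, so one Spark partition ends up holding exactly one chunk.
class EdgeChunkPartitioner(numChunks: Int, chunkSize: Long) extends Partitioner {
  override def numPartitions: Int = numChunks

  override def getPartition(key: Any): Int = {
    val edgeIndex = key.asInstanceOf[Long]
    (edgeIndex / chunkSize).toInt
  }
}

// Usage on an RDD keyed by global edge index: repartition and keep rows
// sorted within each partition in a single shuffle.
// val chunked = indexedRdd.repartitionAndSortWithinPartitions(
//   new EdgeChunkPartitioner(numChunks, chunkSize))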

Types of changes

What types of changes does your code introduce to GraphAr?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

#68

@acezen acezen changed the title from "Improve GraphAr spark writer performance by adding a custom" to "Improve GraphAr spark writer performance by adding a custom CommitProtocal" on Feb 14, 2023
@acezen acezen force-pushed the 68-spark-writer-improve branch from 3058940 to 566e853 on February 15, 2023 17:34
github-actions bot commented Feb 15, 2023

🎊 PR Preview 04e1c3d has been successfully built and deployed to https://alibaba-graphar-build-pr-92.surge.sh

🤖 By surge-preview

@acezen acezen force-pushed the 68-spark-writer-improve branch 2 times, most recently from caaabaf to d5dfbe0 on February 16, 2023 08:00
@acezen acezen marked this pull request as ready for review February 16, 2023 08:00
@acezen acezen changed the title from "Improve GraphAr spark writer performance by adding a custom CommitProtocal" to "Improve GraphAr spark writer performance by implementing WriterBuilder for csv/paruqet/orc" on Feb 16, 2023
@acezen acezen requested a review from lixueclaire February 16, 2023 11:16
@acezen acezen changed the title from "Improve GraphAr spark writer performance by implementing WriterBuilder for csv/paruqet/orc" to "Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior" on Feb 17, 2023
@acezen acezen changed the title from "Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior" to "[Improvement] Improve GraphAr spark writer performance and implement custom writer builder to bypass spark's write behavior" on Feb 17, 2023
@@ -26,5 +26,7 @@ public class GeneralParams {
public static final String vertexChunkIndexCol = "_graphArVertexChunkIndex";
public static final String edgeIndexCol = "_graphArEdgeIndex";
public static final String regularSeperator = "_";
public static final String offsetStartChunkIndexKey = "_graphar_offste_start_chunk_index";
Contributor commented:

"offste" -> "offset"

var mid = 0
while (low <= high) {
mid = (high + low) / 2;
if (aggNums(mid) <= key && aggNums(mid + 1) > key) {
Contributor commented:

aggNums(mid + 1) may access an invalid address when low == high == mid == aggNums.length-1

@acezen acezen (Author) replied Feb 17, 2023:

Actually, in our scenario the key always lands on an index < aggNums.length - 1, but we can revise the condition to if (aggNums(mid) <= key && (mid == length - 1 || aggNums(mid + 1) > key)) to be safe.
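
For reference, a self-contained sketch of the guarded lookup described above, assuming aggNums is a sorted array of aggregated counts as in the snippet; the function name binarySearch is illustrative:

def binarySearch(aggNums: Array[Long], key: Long): Int = {
  var low = 0
  var high = aggNums.length - 1
  var mid = 0
  while (low <= high) {
    mid = (high + low) / 2
    // The extra (mid == aggNums.length - 1) guard keeps aggNums(mid + 1)
    // from reading past the end when the key falls in the last bucket.
    if (aggNums(mid) <= key && (mid == aggNums.length - 1 || aggNums(mid + 1) > key)) {
      return mid
    } else if (aggNums(mid) > key) {
      high = mid - 1
    } else {
      low = mid + 1
    }
  }
  mid
}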

val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(
map.asCaseSensitiveMap().asScala.toMap)
shortName() + " " + paths.map(qualifiedPathName(_, hadoopConf)).mkString(",")
}
Contributor commented:

val name = shortName() + " " + paths.map(qualifiedPathName(_, hadoopConf)).mkString(",")
Utils.redact(sparkSession.sessionState.conf.stringRedactionPattern, name)

I'm not sure the redact() function in the original code is required for our case

val edgeNumOfVertexChunks = sortedDfRDD.mapPartitions(iterator => {
iterator.map(row => (row(colIndex).asInstanceOf[Long] / vertexChunkSize, 1))
}).reduceByKey(_ + _).collectAsMap()
val vertexChunkNum = edgeNumOfVertexChunks.size
Contributor commented:

set vertexChunkNum = Max(VertexChunkId) + 1 to handle the case when there are no edges for a vertex chunk
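
As a minimal sketch of that suggestion (assuming edgeNumOfVertexChunks is the map collected in the snippet above):

// Count chunks from the largest observed vertex chunk id, so vertex chunks
// with no edges in the middle of the range are still accounted for.
val vertexChunkNum: Int = (edgeNumOfVertexChunks.keys.max + 1).toInt

Note this still cannot see trailing empty chunks, which is why the eventual fix passes the vertex number instead, as the reply below describes.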

@acezen acezen (Author) replied:

Resolved by passing a vertex number argument.

// generate global edge id for each record of dataframe
val parition_counts = df_rdd
val edgeSchema = edgeDf.schema
val colIndex = edgeSchema.fieldIndex(if (adjListType == AdjListType.ordered_by_source) GeneralParams.srcIndexCol else GeneralParams.dstIndexCol)
Contributor commented:

Is this function repartitionAndSort() only required for AdjListType.ordered_by_source & AdjListType.ordered_by_dest? It seems that it is called for all types.

val vertexChunkNum: Int = ((vertexNumOfPrimaryVertexLabel + vertexChunkSize - 1) / vertexChunkSize).toInt // ceil

// sort by primary key and generate continue edge id for edge records
val sortedDfRDD = edgeDf.sort(GeneralParams.srcIndexCol).rdd
Contributor commented:

"GeneralParams.srcIndexCol" -> "colIndex"

@acezen acezen force-pushed the 68-spark-writer-improve branch from 1f3b0df to 91dca36 on February 20, 2023 02:08
@acezen acezen force-pushed the 68-spark-writer-improve branch from c38bfb7 to 552e5c6 on February 20, 2023 06:22
@lixueclaire lixueclaire (Contributor) left a review:

LGTM.
