[SPARK-3190][GraphX] fix VertexRDD.count exceed on large graph #12835
As [SPARK-3190] and #2106 described, VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes.
That PR attempted to fix the issue by converting the partition sizes to Longs before summing them. However, the bug still reproduces when the number of vertices in a single partition exceeds Integer.MAX_VALUE.
The root cause is that `size` is declared as type `Int` in class `VertexPartitionBase`:

```scala
def size: Int = mask.cardinality()
```
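The overflow described above can be sketched outside Spark. This is a minimal illustration, not GraphX code: the partition sizes below are hypothetical values chosen so their true total exceeds `Int.MaxValue`, showing why summing `Int` sizes wraps while widening to `Long` first does not.

```scala
// Hypothetical per-partition vertex counts whose true total
// (2147483648) does not fit in an Int.
val partitionSizes: Seq[Int] = Seq(Int.MaxValue, 1)

// Summing as Int silently wraps around to a negative value.
val overflowed: Int = partitionSizes.sum

// Widening each size to Long before summing gives the correct count,
// which is the approach the earlier PR took for VertexRDD.count.
val correct: Long = partitionSizes.map(_.toLong).sum
```

This also shows why widening at the summation site is not enough: if a single partition's `size` itself overflows `Int`, the per-partition value is already wrong before the sum begins.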