-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5561] [mllib] Generalized PeriodicCheckpointer for RDDs and Graphs #7728
Conversation
…ubclasses for RDDs and Graphs.
Test build #38729 has finished for PR 7728 at commit
|
* @tparam T Dataset type, such as RDD[Double] | ||
*/ | ||
private[mllib] abstract class PeriodicCheckpointer[T]( | ||
var currentData: T, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need currentData
at construction time? It might be cleaner to let user call update
to add the initial dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only problem with not doing that is that the type parameter have to be given explicitly to the constructor, but that's fine with me. I'll make the change.
Test build #38968 has finished for PR 7728 at commit
|
Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before. |
… the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
Test build #39008 has finished for PR 7728 at commit
|
@mengxr This should be ready for a final pass. Thanks! |
LGTM. Merged into master. Thanks! Btw, it is not necessary to specify the item type of RDD or Graph. Checkpointing doesn't care the item type. Maybe we can try |
@jkbradley thanks, this is actually not affected by the recent checkpointing changes since we keep the old code path. In the future you can switch to calling |
PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs. This PR generalizes it.
For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.
I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684]. It should be useful for other iterative algorithms as well.
Changes I made:
To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites
CCing @andrewor14 in case there are relevant changes to checkpointing.
CCing @feynmanliang in case you're interested in learning about checkpointing.
CCing @mengxr for final OK.
Thanks all!