[SPARK-5561] [mllib] Generalized PeriodicCheckpointer for RDDs and Graphs #7728

Closed
wants to merge 4 commits into from

jkbradley
Member

PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs. This PR generalizes it.

For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.

I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684]. It should be useful for other iterative algorithms as well.

Changes I made:

  • Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
  • Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
  • The subclasses for Graphs and RDDs implement those abstract methods.
  • I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.
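The design in the list above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: the class name `PeriodicCheckpointerSketch`, the stand-in `FakeDataset` type, the window of 3 persisted datasets, and the exact cleanup rule are all assumptions made so that the example runs without Spark.

```scala
import scala.collection.mutable

// Hypothetical, simplified sketch of a periodic checkpointer.
// The basic operations are abstract so each dataset type can supply its own.
abstract class PeriodicCheckpointerSketch[T](val checkpointInterval: Int) {
  private val persistedQueue = mutable.Queue.empty[T]
  private val checkpointQueue = mutable.Queue.empty[T]
  private var updateCount = 0

  protected def persist(data: T): Unit
  protected def unpersist(data: T): Unit
  protected def checkpoint(data: T): Unit
  protected def isCheckpointed(data: T): Boolean
  protected def removeCheckpointFile(data: T): Unit

  /** Register the newest dataset in the lineage and clean up older ones. */
  def update(newData: T): Unit = {
    persist(newData)
    persistedQueue.enqueue(newData)
    // Keep only a small window of persisted datasets (3 is an assumed value).
    while (persistedQueue.size > 3) {
      unpersist(persistedQueue.dequeue())
    }
    updateCount += 1
    if (updateCount % checkpointInterval == 0) {
      checkpoint(newData)
      checkpointQueue.enqueue(newData)
      // Remove an old checkpoint only once the next one has been written.
      var canDelete = true
      while (checkpointQueue.size > 1 && canDelete) {
        if (isCheckpointed(checkpointQueue(1))) {
          removeCheckpointFile(checkpointQueue.dequeue())
        } else {
          canDelete = false
        }
      }
    }
  }
}

// Stand-in for an RDD or Graph; it only records what was done to it.
final class FakeDataset(val id: Int) {
  var persisted = false
  var checkpointed = false
  var checkpointFileExists = false
}

// A concrete subclass only has to implement the basic operations.
class FakeCheckpointer(interval: Int)
    extends PeriodicCheckpointerSketch[FakeDataset](interval) {
  protected def persist(d: FakeDataset): Unit = { d.persisted = true }
  protected def unpersist(d: FakeDataset): Unit = { d.persisted = false }
  protected def checkpoint(d: FakeDataset): Unit = {
    d.checkpointed = true
    d.checkpointFileExists = true
  }
  protected def isCheckpointed(d: FakeDataset): Boolean = d.checkpointed
  protected def removeCheckpointFile(d: FakeDataset): Unit = {
    d.checkpointFileExists = false
  }
}
```

In this sketch, a subclass for RDDs or Graphs would forward each abstract method to the corresponding call on the underlying object, while the bookkeeping in `update` stays in one place.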

To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites

CCing @andrewor14 in case there are relevant changes to checkpointing.
CCing @feynmanliang in case you're interested in learning about checkpointing.
CCing @mengxr for final OK.
Thanks all!

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38729 has finished for PR 7728 at commit 568918c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * @tparam T  Dataset type, such as RDD[Double]
 */
private[mllib] abstract class PeriodicCheckpointer[T](
    var currentData: T,
Contributor

Do we need currentData at construction time? It might be cleaner to let the user call update to add the initial dataset.

Member Author

The only drawback is that the type parameter then has to be given explicitly to the constructor, but that's fine with me. I'll make the change.
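The trade-off discussed here can be illustrated with a small standalone sketch. The class names `WithInitialData` and `WithoutInitialData` are hypothetical and are not part of the PR; they only show how Scala infers the type parameter from a constructor argument when there is one.

```scala
// Hypothetical classes illustrating the two constructor styles.
class WithInitialData[T](var currentData: T)

class WithoutInitialData[T] {
  var current: Option[T] = None
  def update(newData: T): Unit = { current = Some(newData) }
}

object InferenceDemo {
  // T is inferred as Seq[Double] from the constructor argument.
  val a = new WithInitialData(Seq(1.0, 2.0))

  // No argument to infer from, so T has to be written out.
  val b = new WithoutInitialData[Seq[Double]]
  b.update(Seq(1.0, 2.0))

  // Without the explicit type, T would be inferred as Nothing and the
  // following update call would not compile:
  //   val c = new WithoutInitialData
  //   c.update(Seq(1.0, 2.0))
}
```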

@SparkQA

SparkQA commented Jul 30, 2015

Test build #38968 has finished for PR 7728 at commit 32b23b8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Oops, I forgot to call update an extra time in the checkpointer tests after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.

@SparkQA

SparkQA commented Jul 30, 2015

Test build #39008 has finished for PR 7728 at commit d41902c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

@mengxr This should be ready for a final pass. Thanks!

@asfgit asfgit closed this in c581593 Jul 30, 2015
@mengxr
Contributor

mengxr commented Jul 30, 2015

LGTM. Merged into master. Thanks! Btw, it is not necessary to specify the item type of the RDD or Graph. Checkpointing doesn't care about the item type. Maybe we can try RDD[_] and Graph[_, _], which might simplify the code a little bit (if it compiles).
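As a rough illustration of the wildcard idea, here is a sketch that uses a stand-in `FakeRDD` class instead of Spark's `RDD`. `FakeRDD`, `TypedPersister`, and `WildcardPersister` are hypothetical names, and the example does not show whether the actual checkpointer classes would compile with `RDD[_]`.

```scala
// Stand-in for RDD[A]; only the persist flag matters for this sketch.
final class FakeRDD[A](val items: Seq[A]) {
  var persisted = false
  def persist(): Unit = { persisted = true }
}

// Item type threaded through as a type parameter, although it is never used.
class TypedPersister[A] {
  def persist(data: FakeRDD[A]): Unit = data.persist()
}

// Wildcard version: accepts a FakeRDD of any item type.
class WildcardPersister {
  def persist(data: FakeRDD[_]): Unit = data.persist()
}
```

With the wildcard version, a single `WildcardPersister` instance can handle both a `FakeRDD[Int]` and a `FakeRDD[String]`, whereas the typed version needs one instance per item type.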

@andrewor14
Contributor

@jkbradley thanks, this is actually not affected by the recent checkpointing changes since we keep the old code path. In the future you can switch to calling rdd.localCheckpoint() and suddenly everything will be a little faster.

@jkbradley jkbradley deleted the gbt-checkpoint branch December 29, 2016 22:15