Fix org.scala-lang: * inconsistent versions #234

Closed
Wants to merge 160 commits.

Commits on Mar 31, 2014

  1. SPARK-1352: Improve robustness of spark-submit script

    1. Better error messages when required arguments are missing.
    2. Support for unit testing cases where presented arguments are invalid.
    3. Bug fix: Only use environment variables when they are set, since calling methods on an unset variable would cause an NPE. (A minimal sketch of this pattern appears after this commit's details.)
    4. A verbose mode to aid debugging.
    5. Visibility of several variables is set to private.
    6. Deprecation warning for existing scripts.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#271 from pwendell/spark-submit and squashes the following commits:
    
    9146def [Patrick Wendell] SPARK-1352: Improve robustness of spark-submit script
    pwendell committed Mar 31, 2014
    Commit: 841721e
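
    The third point above refers to the following pattern; this is a minimal, hypothetical sketch (the helper name `envOrElse` is invented here, not the actual spark-submit code):

    ```scala
    // Only use environment variables when they are set: wrap System.getenv in Option
    // instead of calling methods on a possibly null String, which would throw an NPE.
    object EnvDefaults {
      def envOrElse(name: String, default: String): String =
        Option(System.getenv(name)).getOrElse(default)   // None when the variable is unset

      def main(args: Array[String]): Unit = {
        val sparkHome = envOrElse("SPARK_HOME", ".")
        println(s"Using Spark home: $sparkHome")
      }
    }
    ```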
  2. [SQL] Rewrite join implementation to allow streaming of one relation.

    Before, we were materializing everything in memory. This also uses the projection interface, so it will be easier to plug in code generation (it's ported from that branch). A minimal sketch of the streamed hash join idea appears after this commit's details.
    
    @rxin @liancheng
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#250 from marmbrus/hashJoin and squashes the following commits:
    
    1ad873e [Michael Armbrust] Change hasNext logic back to the correct version.
    8e6f2a2 [Michael Armbrust] Review comments.
    1e9fb63 [Michael Armbrust] style
    bc0cb84 [Michael Armbrust] Rewrite join implementation to allow streaming of one relation.
    marmbrus authored and rxin committed Mar 31, 2014
    Commit: 5731af5
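
    A minimal sketch of the streamed hash join idea referenced above; the Row type and column handling are hypothetical, not the actual Spark SQL operator:

    ```scala
    // Only the "build" side is materialized in a hash table; the "streamed" side is
    // consumed lazily, one row at a time.
    object HashJoinSketch {
      type Row = Map[String, Any]

      def hashJoin(build: Iterable[Row], streamed: Iterator[Row], key: String): Iterator[Row] = {
        val table: Map[Any, Seq[Row]] = build.toSeq.groupBy(_(key))   // materialize the build side only
        streamed.flatMap { row =>
          table.getOrElse(row(key), Seq.empty).map(matched => matched ++ row)
        }
      }

      def main(args: Array[String]): Unit = {
        val users  = Seq(Map[String, Any]("id" -> 1, "name" -> "a"), Map[String, Any]("id" -> 2, "name" -> "b"))
        val orders = Iterator(Map[String, Any]("id" -> 1, "total" -> 10), Map[String, Any]("id" -> 2, "total" -> 5))
        hashJoin(users, orders, "id").foreach(println)
      }
    }
    ```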
  3. SPARK-1365 [HOTFIX] Fix RateLimitedOutputStream test

    This test needs to be fixed. It currently depends on Thread.sleep() having exact-timing
    semantics, which is not a valid assumption.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#277 from pwendell/rate-limited-stream and squashes the following commits:
    
    6c0ff81 [Patrick Wendell] SPARK-1365: Fix RateLimitedOutputStream test
    pwendell committed Mar 31, 2014
    Commit: 33b3c2a

Commits on Apr 1, 2014

  1. SPARK-1376. In the yarn-cluster submitter, rename "args" option to "arg"

    Author: Sandy Ryza <[email protected]>
    
    Closes apache#279 from sryza/sandy-spark-1376 and squashes the following commits:
    
    d8aebfa [Sandy Ryza] SPARK-1376. In the yarn-cluster submitter, rename "args" option to "arg"
    sryza authored and Mridul Muralidharan committed Apr 1, 2014
    Commit: 564f1c1
  2. [SPARK-1377] Upgrade Jetty to 8.1.14v20131031

    Previous version was 7.6.8v20121106. The only difference between Jetty 7 and Jetty 8 is that the former uses Servlet API 2.5, while the latter uses Servlet API 3.0.
    
    Author: Andrew Or <[email protected]>
    
    Closes apache#280 from andrewor14/jetty-upgrade and squashes the following commits:
    
    dd57104 [Andrew Or] Merge github.com:apache/spark into jetty-upgrade
    e75fa85 [Andrew Or] Upgrade Jetty to 8.1.14v20131031
    andrewor14 authored and pwendell committed Apr 1, 2014
    Commit: 94fe7fd
  3. [Hot Fix apache#42] Persisted RDD disappears on storage page if re-used

    If a previously persisted RDD is re-used, its information disappears from the Storage page.
    
    This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing information about that RDD with a fresh, empty record, even when that RDD is already being tracked. (A hypothetical sketch of the fix pattern appears after this commit's details.)
    
    Author: Andrew Or <[email protected]>
    
    Closes apache#281 from andrewor14/ui-storage-fix and squashes the following commits:
    
    408585a [Andrew Or] Fix storage UI bug
    andrewor14 authored and pwendell committed Apr 1, 2014
    Commit: ada310a
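
    A hypothetical sketch of the fix pattern mentioned above (names are invented, not the actual UI listener code): keep any RDD information that is already tracked instead of overwriting it on stage submit.

    ```scala
    import scala.collection.mutable

    case class RDDInfo(id: Int, var memoryUsed: Long = 0L)

    class StorageTracker {
      private val rddInfos = mutable.Map[Int, RDDInfo]()

      // getOrElseUpdate preserves existing entries; only unseen RDDs get a fresh record,
      // so a re-used persisted RDD no longer vanishes from the Storage page.
      def onStageSubmit(rddIds: Seq[Int]): Unit =
        rddIds.foreach(id => rddInfos.getOrElseUpdate(id, RDDInfo(id)))

      def info(id: Int): Option[RDDInfo] = rddInfos.get(id)
    }
    ```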
  4. [SQL] SPARK-1372 Support for caching and uncaching tables in a SQLContext.
    
    This doesn't yet support different databases in Hive (though you can probably work around this by calling `USE <dbname>`). However, given the time constraints for 1.0, I think it's probably worth including this now and extending the functionality in the next release.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#282 from marmbrus/cacheTables and squashes the following commits:
    
    83785db [Michael Armbrust] Support for caching and uncaching tables in a SQLContext.
    marmbrus authored and rxin committed Apr 1, 2014
    Commit: f5c418d

Commits on Apr 2, 2014

  1. [SPARK-1342] Scala 2.10.4

    Just a Scala version increment
    
    Author: Mark Hamstra <[email protected]>
    
    Closes apache#259 from markhamstra/scala-2.10.4 and squashes the following commits:
    
    fbec547 [Mark Hamstra] [SPARK-1342] Bumped Scala version to 2.10.4
    markhamstra authored and mateiz committed Apr 2, 2014
    Commit: 764353d
  2. [Spark-1134] only call ipython if no arguments are given; remove IPYTHONOPTS from call
    
    See comments on pull request apache#38.
    (I couldn't figure out how to modify an existing pull request, so I'm hoping I can withdraw that one and replace it with this one.)
    
    Author: Diana Carroll <[email protected]>
    
    Closes apache#227 from dianacarroll/spark-1134 and squashes the following commits:
    
    ffe47f2 [Diana Carroll] [spark-1134] remove ipythonopts from ipython command
    b673bf7 [Diana Carroll] Merge branch 'master' of github.com:apache/spark
    0309cf9 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
    Diana Carroll authored and mateiz committed Apr 2, 2014
    Commit: afb5ea6
  3. Revert "[Spark-1134] only call ipython if no arguments are given; remove IPYTHONOPTS from call"
    
    This reverts commit afb5ea6.
    mateiz committed Apr 2, 2014
    Commit: 45df912
  4. MLI-1 Decision Trees

    Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.
    
    Key features:
    + Supports binary classification and regression
    + Supports gini, entropy and variance for information gain calculation (a minimal impurity sketch appears after this commit's details)
    + Supports both continuous and categorical features
    
    The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include:
    
    1. Level-wise training to reduce passes over the entire dataset.
    2. Bin-wise split calculation to reduce computation overhead.
    3. Aggregation over partitions before combining to reduce communication overhead.
    
    Author: Manish Amde <[email protected]>
    Author: manishamde <[email protected]>
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#79 from manishamde/tree and squashes the following commits:
    
    1e8c704 [Manish Amde] remove numBins field in the Strategy class
    7d54b4f [manishamde] Merge pull request apache#4 from mengxr/dtree
    f536ae9 [Xiangrui Meng] another pass on code style
    e1dd86f [Manish Amde] implementing code style suggestions
    62dc723 [Manish Amde] updating javadoc and converting helper methods to package private to allow unit testing
    201702f [Manish Amde] making some more methods private
    f963ef5 [Manish Amde] making methods private
    c487e6a [manishamde] Merge pull request #1 from mengxr/dtree
    24500c5 [Xiangrui Meng] minor style updates
    4576b64 [Manish Amde] documentation and for to while loop conversion
    ff363a7 [Manish Amde] binary search for bins and while loop for categorical feature bins
    632818f [Manish Amde] removing threshold for classification predict method
    2116360 [Manish Amde] removing dummy bin calculation for categorical variables
    6068356 [Manish Amde] ensuring num bins is always greater than max number of categories
    62c2562 [Manish Amde] fixing comment indentation
    ad1fc21 [Manish Amde] incorporated mengxr's code style suggestions
    d1ef4f6 [Manish Amde] more documentation
    794ff4d [Manish Amde] minor improvements to docs and style
    eb8fcbe [Manish Amde] minor code style updates
    cd2c2b4 [Manish Amde] fixing code style based on feedback
    63e786b [Manish Amde] added multiple train methods for java compatability
    d3023b3 [Manish Amde] adding more docs for nested methods
    84f85d6 [Manish Amde] code documentation
    9372779 [Manish Amde] code style: max line lenght <= 100
    dd0c0d7 [Manish Amde] minor: some docs
    0dd7659 [manishamde] basic doc
    5841c28 [Manish Amde] unit tests for categorical features
    f067d68 [Manish Amde] minor cleanup
    c0e522b [Manish Amde] updated predict and split threshold logic
    b09dc98 [Manish Amde] minor refactoring
    6b7de78 [Manish Amde] minor refactoring and tests
    d504eb1 [Manish Amde] more tests for categorical features
    dbb7ac1 [Manish Amde] categorical feature support
    6df35b9 [Manish Amde] regression predict logic
    53108ed [Manish Amde] fixing index for highest bin
    e23c2e5 [Manish Amde] added regression support
    c8f6d60 [Manish Amde] adding enum for feature type
    b0e3e76 [Manish Amde] adding enum for feature type
    154aa77 [Manish Amde] enums for configurations
    733d6dd [Manish Amde] fixed tests
    02c595c [Manish Amde] added command line parsing
    98ec8d5 [Manish Amde] tree building and prediction logic
    b0eb866 [Manish Amde] added logic to handle leaf nodes
    80e8c66 [Manish Amde] working version of multi-level split calculation
    4798aae [Manish Amde] added gain stats class
    dad0afc [Manish Amde] decison stump functionality working
    03f534c [Manish Amde] some more tests
    0012a77 [Manish Amde] basic stump working
    8bca1e2 [Manish Amde] additional code for creating intermediate RDD
    92cedce [Manish Amde] basic building blocks for intermediate RDD calculation. untested.
    cd53eae [Manish Amde] skeletal framework
    manishamde authored and mateiz committed Apr 2, 2014
    Commit: 8b3045c
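
    A minimal sketch of the three impurity measures named in the feature list above; these are the standard textbook formulas, not the MLlib implementation:

    ```scala
    object ImpuritySketch {
      // Gini impurity for classification, from per-label counts.
      def gini(labelCounts: Seq[Double]): Double = {
        val total = labelCounts.sum
        1.0 - labelCounts.map(c => math.pow(c / total, 2)).sum
      }

      // Entropy (in bits) for classification, from per-label counts.
      def entropy(labelCounts: Seq[Double]): Double = {
        val total = labelCounts.sum
        -labelCounts.filter(_ > 0).map { c =>
          val p = c / total
          p * math.log(p) / math.log(2)
        }.sum
      }

      // Variance, used as the impurity measure for regression.
      def variance(values: Seq[Double]): Double = {
        val mean = values.sum / values.size
        values.map(v => math.pow(v - mean, 2)).sum / values.size
      }
    }
    ```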
  5. Remove * from test case golden filename.

    @rxin mentioned this might cause issues on Windows machines.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#297 from marmbrus/noStars and squashes the following commits:
    
    263122a [Michael Armbrust] Remove * from test case golden filename.
    marmbrus authored and rxin committed Apr 2, 2014
    Commit: ea9de65
  6. Renamed stageIdToActiveJob to jobIdToActiveJob.

    This data structure was misused and, as a result, was later given an incorrect name.
    
    This data structure seems to have gotten into this tangled state as a result of @henrydavidge using the stageID instead of the job Id to index into it and later @andrewor14 renaming the data structure to reflect this misunderstanding.
    
    This patch renames it and removes an incorrect indexing into it.  The incorrect indexing into it meant that the code added by @henrydavidge to warn when a task size is too large (added here apache@5757993) was not always executed; this commit fixes that.
    
    Author: Kay Ousterhout <[email protected]>
    
    Closes apache#301 from kayousterhout/fixCancellation and squashes the following commits:
    
    bd3d3a4 [Kay Ousterhout] Renamed stageIdToActiveJob to jobIdToActiveJob.
    kayousterhout authored and pwendell committed Apr 2, 2014
    Commit: 11973a7
  7. [SPARK-1385] Use existing code for JSON de/serialization of BlockId

    `BlockId.scala` offers a way to reconstruct a BlockId from a string through regex matching. `util/JsonProtocol.scala` duplicates this functionality by explicitly matching on the BlockId type.
    With this PR, the de/serialization of BlockIds will go through the first (older) code path.
    
    (Most of the line changes in this PR involve changing `==` to `===` in `JsonProtocolSuite.scala`.) A generic sketch of the name-string approach appears after this commit's details.
    
    Author: Andrew Or <[email protected]>
    
    Closes apache#289 from andrewor14/blockid-json and squashes the following commits:
    
    409d226 [Andrew Or] Simplify JSON de/serialization for BlockId
    andrewor14 authored and aarondav committed Apr 2, 2014
    Commit: de8eefa
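
    A generic sketch of the name-string approach described above: serialize a block id as its name and reconstruct it by regex matching, so the JSON code needs no per-type cases. The types and patterns here are illustrative, not Spark's actual `BlockId` hierarchy.

    ```scala
    sealed trait BlockIdSketch { def name: String }
    case class RDDBlock(rddId: Int, splitIndex: Int) extends BlockIdSketch {
      def name = s"rdd_${rddId}_$splitIndex"
    }
    case class ShuffleBlock(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockIdSketch {
      def name = s"shuffle_${shuffleId}_${mapId}_$reduceId"
    }

    object BlockIdSketch {
      private val Rdd     = "rdd_(\\d+)_(\\d+)".r
      private val Shuffle = "shuffle_(\\d+)_(\\d+)_(\\d+)".r

      // One parser shared by every caller, including JSON de/serialization.
      def apply(name: String): BlockIdSketch = name match {
        case Rdd(rdd, split)  => RDDBlock(rdd.toInt, split.toInt)
        case Shuffle(s, m, r) => ShuffleBlock(s.toInt, m.toInt, r.toInt)
        case other            => throw new IllegalArgumentException(s"Unknown block id: $other")
      }
    }
    ```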
  8. Do not re-use objects in the EdgePartition/EdgeTriplet iterators.

    This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact in my measurements. It also simplifies the code. As far as I can tell, the object re-use was nothing but premature optimization. (A small sketch of the failure mode appears after this commit's details.)
    
    I did actual benchmarks for all the included changes, and there is no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite?
    
    This is an example benchmark I did:
    
    ```scala
    test("benchmark") {
      val builder = new EdgePartitionBuilder[Int]
      for (i <- (1 to 10000000)) {
        builder.add(i.toLong, i.toLong, i)
      }
      val p = builder.toEdgePartition
      p.map(_.attr + 1).iterator.toList
    }
    ```
    
    It ran for 10 seconds both before and after this change.
    
    Author: Daniel Darabos <[email protected]>
    
    Closes apache#276 from darabos/spark-1188 and squashes the following commits:
    
    574302b [Daniel Darabos] Restore "manual" copying in EdgePartition.map(Iterator). Add comment to discourage novices like myself from trying to simplify the code.
    4117a64 [Daniel Darabos] Revert EdgePartitionSuite.
    4955697 [Daniel Darabos] Create a copy of the Edge objects in EdgeRDD.compute(). This avoids exposing the object re-use, while still enables the more efficient behavior for internal code.
    4ec77f8 [Daniel Darabos] Add comments about object re-use to the affected functions.
    2da5e87 [Daniel Darabos] Restore object re-use in EdgePartition.
    0182f2b [Daniel Darabos] Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code.
    c55f52f [Daniel Darabos] Tests that reproduce the problems from SPARK-1188.
    darabos authored and rxin committed Apr 2, 2014
    Commit: 7823633
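
    A small sketch of the failure mode referenced above, with hypothetical types: when an iterator re-uses one mutable object for every element, collecting the iterator yields N references to the same (last) value instead of N distinct edges.

    ```scala
    class EdgeSketch(var srcId: Long = 0L, var attr: Int = 0)

    object ReuseDemo {
      def reusingIterator(n: Int): Iterator[EdgeSketch] = {
        val shared = new EdgeSketch()                       // one object, mutated in place
        (1 to n).iterator.map { i =>
          shared.srcId = i.toLong; shared.attr = i; shared
        }
      }

      def main(args: Array[String]): Unit = {
        val collected = reusingIterator(3).toList           // silently corrupt: every entry aliases the last edge
        println(collected.map(e => (e.srcId, e.attr)))      // List((3,3), (3,3), (3,3))
        // The fix is to return a fresh object per element, or copy before exposing it.
      }
    }
    ```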
  9. [SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage
    
    JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)
    
    (Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)
    
    This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:
    
    *   `CompressionScheme`
    
        Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:
    
        * `RunLengthEncoding`
        * `DictionaryEncoding`
    
        Algorithms to be implemented include:
    
        * `BooleanBitSet`
        * `IntDelta`
        * `LongDelta`
    
    *   `CompressibleColumnBuilder`
    
        A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. The `CompressionScheme` that exhibits the lowest compression ratio is chosen for each column, according to statistical information gathered while elements are appended into the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, no compression will be done for this column, to save CPU time. (A minimal run-length encoding sketch appears after this commit's details.)
    
        Memory layout of the final byte buffer is shown below:
    
        ```
         .--------------------------- Column type ID (4 bytes)
         |   .----------------------- Null count N (4 bytes)
         |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
         |   |   |     .------------- Compression scheme ID (4 bytes)
         |   |   |     |   .--------- Compressed non-null elements
         V   V   V     V   V
        +---+---+-----+---+---------+
        |   |   | ... |   | ... ... |
        +---+---+-----+---+---------+
         \-----------/ \-----------/
            header         body
        ```
    
    *   `CompressibleColumnAccessor`
    
        A stackable `ColumnAccessor` trait used to iterate over a (possibly) compressed data column.
    
    *   `ColumnStats`
    
        Used to collect statistical information while loading data into in-memory columnar table. Optimizations like partition pruning rely on this information.
    
        Strictly speaking, `ColumnStats` related code is not part of the compression support. It's contained in this PR to ensure and validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).
    
    A major refactoring change since PR apache#205 is:
    
    * Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
    
    Author: Cheng Lian <[email protected]>
    
    Closes apache#285 from liancheng/memColumnarCompression and squashes the following commits:
    
    ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
    d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
    5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
    c298b76 [Cheng Lian] Test suites refactored
    2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
    211331c [Cheng Lian] WIP: in-memory columnar compression support
    85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
    liancheng authored and pwendell committed Apr 2, 2014
    Commit: 1faa579
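
    A minimal run-length encoding sketch in the spirit of the Encoder/Decoder split described above (values compressed into (runLength, value) pairs); this is not the actual Spark SQL `CompressionScheme` interface:

    ```scala
    object RunLengthSketch {
      def encode(values: Seq[Int]): Seq[(Int, Int)] =
        values.foldLeft(List.empty[(Int, Int)]) {
          case ((count, last) :: rest, v) if v == last => (count + 1, last) :: rest   // extend the current run
          case (acc, v)                                => (1, v) :: acc               // start a new run
        }.reverse

      def decode(runs: Seq[(Int, Int)]): Seq[Int] =
        runs.flatMap { case (count, value) => Seq.fill(count)(value) }

      def main(args: Array[String]): Unit = {
        val column  = Seq(7, 7, 7, 3, 3, 9)
        val encoded = encode(column)            // List((3,7), (2,3), (1,9))
        assert(decode(encoded) == column)       // round-trips losslessly
        println(encoded)
      }
    }
    ```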
  10. StopAfter / TopK related changes

    1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases.
    2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API.
    3. Avoid breaking lineage in Limit.
    4. Added a bunch of overrides to execution/basicOperators.scala.
    
    @marmbrus @liancheng
    
    Author: Reynold Xin <[email protected]>
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#233 from rxin/limit and squashes the following commits:
    
    13eb12a [Reynold Xin] Merge pull request #1 from marmbrus/limit
    92b9727 [Michael Armbrust] More hacks to make Maps serialize with Kryo.
    4fc8b4e [Reynold Xin] Merge branch 'master' of github.com:apache/spark into limit
    87b7d37 [Reynold Xin] Use the proper serializer in limit.
    9b79246 [Reynold Xin] Updated doc for Limit.
    47d3327 [Reynold Xin] Copy tuples in Limit before shuffle.
    231af3a [Reynold Xin] Limit/TakeOrdered: 1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases. 2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API. 3. Avoid breaking lineage in Limit. 4. Added a bunch of override's to execution/basicOperators.scala.
    rxin authored and pwendell committed Apr 2, 2014
    Commit: ed730c9
  11. [SPARK-1212, Part II] Support sparse data in MLlib

    In PR apache#117, we added a dense/sparse vector data model and updated KMeans to support sparse input. This PR replaces all other `Array[Double]` usage with `Vector` in generalized linear models (GLMs) and Naive Bayes (a short usage sketch appears after this commit's details). Major changes:
    
    1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
    2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
    3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
    4. Add libSVMFile to MLContext.
    5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
    6. Gradient computation no longer creates temp vectors.
    7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
    
    TODO:
    1. ~~Use axpy when possible.~~
    2. ~~Optimize Naive Bayes.~~
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#245 from mengxr/vector and squashes the following commits:
    
    eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
    c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
    11999c7 [Xiangrui Meng] Merge branch 'master' into vector
    f7da54b [Xiangrui Meng] add minSplits to libSVMFile
    da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
    493f26f [Xiangrui Meng] Merge branch 'master' into vector
    7c1bc01 [Xiangrui Meng] add a TODO to NB
    b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
    b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
    4addc50 [Xiangrui Meng] merge master
    4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
    f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
    d088552 [Xiangrui Meng] use static constructor for MLContext
    6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
    3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
    0f8759b [Xiangrui Meng] minor updates to NB
    b11659c [Xiangrui Meng] style update
    78c4671 [Xiangrui Meng] add libSVMFile to MLContext
    f0fe616 [Xiangrui Meng] add a test for sparse linear regression
    44733e1 [Xiangrui Meng] use in-place gradient computation
    e981396 [Xiangrui Meng] use axpy in Updater
    db808a1 [Xiangrui Meng] update JavaLR example
    befa592 [Xiangrui Meng] passed scala/java tests
    75c83a4 [Xiangrui Meng] passed test compile
    1859701 [Xiangrui Meng] passed compile
    834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
    135ab72 [Xiangrui Meng] merge glm
    0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
    d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
    3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
    mengxr authored and mateiz committed Apr 2, 2014
    Commit: 9c65fa7
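
    An illustrative usage sketch of the new data model (labels paired with possibly sparse `Vector`s instead of `Array[Double]`), assuming the MLlib `Vectors`/`LabeledPoint` API of this era; treat it as a sketch rather than a verified snippet:

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    object SparseDataSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "sparse-data-sketch")
        val points = sc.parallelize(Seq(
          // sparse: size 4, non-zeros at indices 0 and 3
          LabeledPoint(1.0, Vectors.sparse(4, Array(0, 3), Array(1.0, -2.0))),
          LabeledPoint(0.0, Vectors.dense(0.5, 0.0, 0.0, 1.5))
        ))
        println(points.count())
        sc.stop()
      }
    }
    ```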

Commits on Apr 3, 2014

  1. [SQL] SPARK-1364 Improve datatype and test coverage for ScalaReflection schema inference.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#293 from marmbrus/reflectTypes and squashes the following commits:
    
    f54e8e8 [Michael Armbrust] Improve datatype and test coverage for ScalaReflection schema inference.
    marmbrus authored and pwendell committed Apr 3, 2014
    Commit: 47ebea5
  2. [SPARK-1398] Removed findbugs jsr305 dependency

    Should be a painless upgrade, and does offer some significant advantages should we want to leverage FindBugs more during the 1.0 lifecycle. http://findbugs.sourceforge.net/findbugs2.html
    
    Author: Mark Hamstra <[email protected]>
    
    Closes apache#307 from markhamstra/findbugs and squashes the following commits:
    
    99f2d09 [Mark Hamstra] Removed unnecessary findbugs jsr305 dependency
    markhamstra authored and pwendell committed Apr 3, 2014
    Commit: 92a86b2
  3. Spark parquet improvements

    A few improvements to the Parquet support for SQL queries:
    - Instead of individual files, a ParquetRelation is now backed by a directory, which simplifies importing data from other sources
    - The InsertIntoParquetTable operation now supports switching between overwriting and appending (at least in HiveQL)
    - Tests now use the new API
    - Parquet logging can be set to WARNING level (the default)
    - Default compression for Parquet files (GZIP, as in parquet-mr)
    
    Author: Andre Schumacher <[email protected]>
    
    Closes apache#195 from AndreSchumacher/spark_parquet_improvements and squashes the following commits:
    
    54df314 [Andre Schumacher] SPARK-1383 [SQL] Improvements to ParquetRelation
    AndreSchumacher authored and rxin committed Apr 3, 2014
    Commit: fbebaed
  4. [SPARK-1360] Add Timestamp Support for SQL

    This PR includes:
    1) Add a new data type, Timestamp
    2) Add more data type casting, based on Hive's rules
    3) Fix a bug: missing data types in both parsers (HiveQl & SQLParser).
    
    Author: Cheng Hao <[email protected]>
    
    Closes apache#275 from chenghao-intel/timestamp and squashes the following commits:
    
    df709e5 [Cheng Hao] Move orc_ends_with_nulls to blacklist
    24b04b0 [Cheng Hao] Put 3 cases into the black lists(describe_pretty,describe_syntax,lateral_view_outer)
    fc512c2 [Cheng Hao] remove the unnecessary data type equality check in data casting
    d0d1919 [Cheng Hao] Add more data type for scala reflection
    3259808 [Cheng Hao] Add the new Golden files
    3823b97 [Cheng Hao] Update the UnitTest cases & add timestamp type for HiveQL
    54a0489 [Cheng Hao] fix bug mapping to 0 (which is supposed to be null) when NumberFormatException occurs
    9cb505c [Cheng Hao] Fix issues according to PR comments
    e529168 [Cheng Hao] Fix bug of converting from String
    6fc8100 [Cheng Hao] Update Unit Test & CodeStyle
    8a1d4d6 [Cheng Hao] Add DataType for SqlParser
    ce4385e [Cheng Hao] Add TimestampType Support
    chenghao-intel authored and rxin committed Apr 3, 2014
    Commit: 5d1feda
  5. Spark 1162 Implemented takeOrdered in pyspark.

    Since Python does not have a max-heap library, and the usual tricks like inverting values do not work for all cases, we have our own implementation of a max heap. (A sketch of the bounded-heap idea appears after this commit's details.)
    
    Author: Prashant Sharma <[email protected]>
    
    Closes apache#97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:
    
    35f86ba [Prashant Sharma] code review
    2b1124d [Prashant Sharma] fixed tests
    e8a08e2 [Prashant Sharma] Code review comments.
    49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
    ScrapCodes authored and mateiz committed Apr 3, 2014
    Commit: c1ea3af
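
    A sketch (in Scala, mirroring the idea rather than the PySpark code) of the bounded-heap approach behind takeOrdered: each partition keeps only its k smallest elements in a priority queue, then the per-partition results are merged.

    ```scala
    import scala.collection.mutable

    object TakeOrderedSketch {
      def takeOrdered(partitions: Seq[Iterator[Int]], k: Int): Seq[Int] = {
        def topKOfPartition(it: Iterator[Int]): Seq[Int] = {
          val heap = mutable.PriorityQueue.empty[Int]      // max-heap: largest of the kept k on top
          it.foreach { v =>
            heap.enqueue(v)
            if (heap.size > k) heap.dequeue()              // evict the largest, keeping the k smallest
          }
          heap.toSeq
        }
        partitions.flatMap(topKOfPartition).sorted.take(k) // merge the partial results
      }

      def main(args: Array[String]): Unit = {
        val parts = Seq(Iterator(9, 1, 7), Iterator(4, 8, 2), Iterator(6, 3, 5))
        println(takeOrdered(parts, 3))                     // List(1, 2, 3)
      }
    }
    ```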
  6. [SQL] SPARK-1333 First draft of java API

    WIP: Some work remains...
     * [x] Hive support
     * [x] Tests
     * [x] Update docs
    
    Feedback welcome!
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#248 from marmbrus/javaSchemaRDD and squashes the following commits:
    
    b393913 [Michael Armbrust] @srowen 's java style suggestions.
    f531eb1 [Michael Armbrust] Address matei's comments.
    33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
    822f626 [Michael Armbrust] improve docs.
    ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
    0b859c8 [Michael Armbrust] First draft of java API.
    marmbrus authored and mateiz committed Apr 3, 2014
    Commit: b8f5341
  7. [SPARK-1134] Fix and document passing of arguments to IPython

    This is based on @dianacarroll's previous pull request apache#227, and @JoshRosen's comments on apache#38. Since we do want to allow passing arguments to IPython, this does the following:
    * It documents that IPython can't be used with standalone jobs for now. (Later versions of IPython will deal with PYTHONSTARTUP properly and enable this, see ipython/ipython#5226, but no released version has that fix.)
    * If you run `pyspark` with `IPYTHON=1`, it passes your command-line arguments to it. This way you can do stuff like `IPYTHON=1 bin/pyspark notebook`.
    * The old `IPYTHON_OPTS` remains, but I've removed it from the documentation. This is in case people read an old tutorial that uses it.
    
    This is not a perfect solution and I'd also be okay with keeping things as they are today (ignoring `$@` for IPython and using IPYTHON_OPTS), and only doing the doc change. With this change though, when IPython fixes ipython/ipython#5226, people will immediately be able to do `IPYTHON=1 bin/pyspark myscript.py` to run a standalone script and get all the benefits of running scripts in IPython (presumably better debugging and such). Without it, there will be no way to run scripts in IPython.
    
    @JoshRosen you should probably take the final call on this.
    
    Author: Diana Carroll <[email protected]>
    
    Closes apache#294 from mateiz/spark-1134 and squashes the following commits:
    
    747bb13 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
    Diana Carroll authored and mateiz committed Apr 3, 2014
    Commit: a599e43
  8. [BUILD FIX] Fix compilation of Spark SQL Java API.

    The Java API and Parquet improvements PRs didn't conflict, but together they broke the build.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#316 from marmbrus/hotFixJavaApi and squashes the following commits:
    
    0b84c2d [Michael Armbrust] Fix compilation of Spark SQL Java API.
    marmbrus authored and mateiz committed Apr 3, 2014
    Commit: d94826b
  9. Fix jenkins from giving the green light to builds that don't compile.

     Adding `| grep` swallows the non-zero return code from sbt failures. See [here](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13735/consoleFull) for a Jenkins run that fails to compile, but still gets a green light.
    
    Note the [BUILD FIX] commit isn't actually part of this PR, but GitHub is out of date.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#317 from marmbrus/fixJenkins and squashes the following commits:
    
    7c77ff9 [Michael Armbrust] Remove output filter that was swallowing non-zero exit codes for test failures.
    marmbrus authored and rxin committed Apr 3, 2014
    Commit: 9231b01

Commits on Apr 4, 2014

  1. Revert "[SPARK-1398] Removed findbugs jsr305 dependency"

    This reverts commit 92a86b2.
    pwendell committed Apr 4, 2014
    Commit: 33e6361
  2. SPARK-1337: Application web UI garbage collects newest stages

    Simple fix...
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#320 from pwendell/stage-clean-up and squashes the following commits:
    
    29be62e [Patrick Wendell] SPARK-1337: Application web UI garbage collects newest stages instead old ones
    pwendell committed Apr 4, 2014
    Commit: ee6e9e7
  3. SPARK-1350. Always use JAVA_HOME to run executor container JVMs.

    Author: Sandy Ryza <[email protected]>
    
    Closes apache#313 from sryza/sandy-spark-1350 and squashes the following commits:
    
    bb6d187 [Sandy Ryza] SPARK-1350. Always use JAVA_HOME to run executor container JVMs.
    sryza authored and tgravescs committed Apr 4, 2014
    Commit: 7f32fd4
  4. SPARK-1404: Always upgrade spark-env.sh vars to environment vars

    This was broken when spark-env.sh was made idempotent, because the idempotence check uses an environment variable, but the variables set in spark-env.sh may not have been exported as environment variables.
    
    Tested in zsh, bash, and sh.
    
    Author: Aaron Davidson <[email protected]>
    
    Closes apache#310 from aarondav/SPARK-1404 and squashes the following commits:
    
    c3406a5 [Aaron Davidson] Add extra export in spark-shell
    6a0e340 [Aaron Davidson] SPARK-1404: Always upgrade spark-env.sh vars to environment vars
    aarondav authored and pwendell committed Apr 4, 2014
    Commit: 01cf4c4
  5. [SPARK-1133] Add whole text files reader in MLlib

    Here is a pointer to the former [PR164](apache#164).
    
    I'm adding this pull request for the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which brings a new files reader API in MLlib. (A small usage sketch appears after this commit's details.)
    
    Author: Xusen Yin <[email protected]>
    
    Closes apache#252 from yinxusen/whole-files-input and squashes the following commits:
    
    7191be6 [Xusen Yin] refine comments
    0af3faf [Xusen Yin] add JavaAPI test
    01745ee [Xusen Yin] fix deletion error
    cc97dca [Xusen Yin] move whole text file API to Spark core
    d792cee [Xusen Yin] remove the typo character "+"
    6bdf2c2 [Xusen Yin] test for small local file system block size
    a1f1e7e [Xusen Yin] add two extra spaces
    28cb0fe [Xusen Yin] add whole text files reader
    yinxusen authored and mateiz committed Apr 4, 2014
    Commit: f1fa617
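
    A small usage sketch of the whole-text-files reader: each element is a (fileName, fileContent) pair rather than a single line. Assumes the `SparkContext.wholeTextFiles` method this commit adds; the path is a placeholder.

    ```scala
    import org.apache.spark.SparkContext

    object WholeTextFilesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "whole-text-files-sketch")
        val files = sc.wholeTextFiles("data/docs")          // RDD of (fileName, fileContent) pairs
        val lineCounts = files.map { case (name, content) =>
          (name, content.count(_ == '\n') + 1)
        }
        lineCounts.collect().foreach { case (name, lines) => println(s"$name: $lines lines") }
        sc.stop()
      }
    }
    ```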
  6. SPARK-1375. Additional spark-submit cleanup

    Author: Sandy Ryza <[email protected]>
    
    Closes apache#278 from sryza/sandy-spark-1375 and squashes the following commits:
    
    5fbf1e9 [Sandy Ryza] SPARK-1375. Additional spark-submit cleanup
    sryza authored and pwendell committed Apr 4, 2014
    Commit: 16b8308
  7. Don't create SparkContext in JobProgressListenerSuite.

    This reduces the time of the test from 11 seconds to 20 milliseconds.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#324 from pwendell/job-test and squashes the following commits:
    
    868d9eb [Patrick Wendell] Don't create SparkContext in JobProgressListenerSuite.
    pwendell authored and rxin committed Apr 4, 2014
    Commit: a02b535

Commits on Apr 5, 2014

  1. [SPARK-1198] Allow pipes tasks to run in different sub-directories

    This works as-is on Linux/Mac/etc. but doesn't cover Windows. Here I use ln -sf for symlinks. Putting this up for comments on that. Do we perhaps want to create some classes for running shell commands (Linux vs. Windows)? Is there some other way we want to do this? I assume we are still supporting JDK 1.6?
    
    Also should I update the Java API for pipes to allow this parameter?
    
    Author: Thomas Graves <[email protected]>
    
    Closes apache#128 from tgravescs/SPARK1198 and squashes the following commits:
    
    abc1289 [Thomas Graves] remove extra tag in pom file
    ba23fc0 [Thomas Graves] Add support for symlink on windows, remove commons-io usage
    da4b221 [Thomas Graves] Merge branch 'master' of https://github.com/tgravescs/spark into SPARK1198
    61be271 [Thomas Graves] Fix file name filter
    6b783bd [Thomas Graves] style fixes
    1ab49ca [Thomas Graves] Add support for running pipe tasks is separate directories
    tgravescs authored and mateiz committed Apr 5, 2014
    Commit: 198892f
  2. [SQL] Minor fixes.

    Author: Michael Armbrust <[email protected]>
    
    Closes apache#315 from marmbrus/minorFixes and squashes the following commits:
    
    b23a15d [Michael Armbrust] fix scaladoc
    11062ac [Michael Armbrust] Fix registering "SELECT *" queries as tables and caching them.  As some tests for this and self-joins.
    3997dc9 [Michael Armbrust] Move Row extractor to catalyst.
    208bf5e [Michael Armbrust] More idiomatic naming of DSL functions. * subquery => as * for join condition => on, i.e., `r.join(s, condition = 'a == 'b)` =>`r.join(s, on = 'a == 'b)`
    87211ce [Michael Armbrust] Correctly handle self joins of in-memory cached tables.
    69e195e [Michael Armbrust] Change != to !== in the DSL since != will always translate to != on Any.
    01f2dd5 [Michael Armbrust] Correctly assign aliases to tables in SqlParser.
    marmbrus authored and rxin committed Apr 5, 2014
    Commit: d956cc2
  3. SPARK-1414. Python API for SparkContext.wholeTextFiles

    Also clarified comment on each file having to fit in memory
    
    Author: Matei Zaharia <[email protected]>
    
    Closes apache#327 from mateiz/py-whole-files and squashes the following commits:
    
    9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
    mateiz committed Apr 5, 2014
    Commit: 60e18ce
  4. Add test utility for generating Jar files with compiled classes.

    This was requested by a few different people and may be generally
    useful, so I'd like to contribute this and not block on a different
    PR for it to get in.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#326 from pwendell/class-loader-test-utils and squashes the following commits:
    
    ff3e88e [Patrick Wendell] Add test utility for generating Jar files with compiled classes.
    pwendell committed Apr 5, 2014
    Commit: 5f3c1bb
  5. [SPARK-1419] Bumped parent POM to apache 14

    Keeping up-to-date with the parent, which includes some bugfixes.
    
    Author: Mark Hamstra <[email protected]>
    
    Closes apache#328 from markhamstra/Apache14 and squashes the following commits:
    
    3f19975 [Mark Hamstra] Bumped parent POM to apache 14
    markhamstra authored and pwendell committed Apr 5, 2014
    Commit: 1347ebd
  6. SPARK-1305: Support persisting RDD's directly to Tachyon

    This moves PR#468 from apache-incubator-spark to apache-spark: "Adding an option to persist Spark RDD blocks into Tachyon." (A usage sketch of the new OFF_HEAP storage level appears after this commit's details.)
    
    Author: Haoyuan Li <[email protected]>
    Author: RongGu <[email protected]>
    
    Closes apache#158 from RongGu/master and squashes the following commits:
    
    72b7768 [Haoyuan Li] merge master
    9f7fa1b [Haoyuan Li] fix code style
    ae7834b [Haoyuan Li] minor cleanup
    a8b3ec6 [Haoyuan Li] merge master branch
    e0f4891 [Haoyuan Li] better check offheap.
    55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
    7cd4600 [RongGu] remove some logic code for tachyonstore's replication
    51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
    8adfcfa [RongGu] address arron's comment on inTachyonSize
    120e48a [RongGu] changed the root-level dir name in Tachyon
    5cc041c [Haoyuan Li] address aaron's comments
    9b97935 [Haoyuan Li] address aaron's comments
    d9a6438 [Haoyuan Li] fix for pspark
    77d2703 [Haoyuan Li] change python api.git status
    3dcace4 [Haoyuan Li] address matei's comments
    91fa09d [Haoyuan Li] address patrick's comments
    589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
    64348b2 [Haoyuan Li] update conf docs.
    ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
    619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
    be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
    49cc724 [Haoyuan Li] update docs with off_headp option
    4572f9f [RongGu] reserving the old apply function API of StorageLevel
    04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
    c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
    76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
    e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
    fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
    939e467 [Haoyuan Li] 0.4.1-thrift from maven central
    86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
    16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
    eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
    bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
    6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
    d827250 [RongGu] fix JsonProtocolSuie test failure
    716e93b [Haoyuan Li] revert the version
    ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
    2825a13 [RongGu] up-merging to the current master branch of the apache spark
    6a22c1a [Haoyuan Li] fix scalastyle
    8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
    77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
    1dcadf9 [Haoyuan Li] typo
    bf278fa [Haoyuan Li] fix python tests
    e82909c [Haoyuan Li] minor cleanup
    776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
    8859371 [Haoyuan Li] various minor fixes and clean up
    e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
    fcaeab2 [Haoyuan Li] address Aaron's comment
    e554b1e [Haoyuan Li] add python code
    47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
    dc8ef24 [Haoyuan Li] add old storelevel constructor
    e01a271 [Haoyuan Li] update tachyon 0.4.1
    8011a96 [RongGu] fix a brought-in mistake in StorageLevel
    70ca182 [RongGu] a bit change in comment
    556978b [RongGu] fix the scalastyle errors
    791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
    haoyuan authored and pwendell committed Apr 5, 2014
    Commit: b50ddfd
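
    A usage sketch of the storage level this PR introduces, assuming `StorageLevel.OFF_HEAP` as described above; configuration of the Tachyon master is omitted.

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    object OffHeapPersistSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "off-heap-sketch")
        val numbers = sc.parallelize(1 to 1000000)
        numbers.persist(StorageLevel.OFF_HEAP)              // blocks stored in Tachyon, not the executor heap
        println(numbers.count())
        sc.stop()
      }
    }
    ```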
  7. [SQL] SPARK-1366 Consistent sql function across different types of SQLContexts
    
    Now users who want to use HiveQL should explicitly say `hiveql` or `hql`. (A short usage sketch appears after this commit's details.)
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#319 from marmbrus/standardizeSqlHql and squashes the following commits:
    
    de68d0e [Michael Armbrust] Fix sampling test.
    fbe4a54 [Michael Armbrust] Make `sql` always use spark sql parser, users of hive context can now use hql or hiveql to run queries using HiveQL instead.
    marmbrus authored and rxin committed Apr 5, 2014
    Commit: 8de038e
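
    A sketch of the split this commit describes, assuming the SQLContext API of this era (`registerAsTable`, and `hql` on a HiveContext): `sql` always uses the Spark SQL parser, while HiveQL has to be requested explicitly.

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)

    object SqlVsHqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "sql-vs-hql-sketch")
        val sqlContext = new SQLContext(sc)
        import sqlContext._                                  // implicits for turning RDDs of case classes into tables

        val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
        records.registerAsTable("records")

        sqlContext.sql("SELECT key FROM records").collect()  // always the Spark SQL parser
        // On a HiveContext, HiveQL is now explicit:
        //   hiveContext.hql("SELECT key FROM records")
        sc.stop()
      }
    }
    ```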
  8. small fix ( proogram -> program )

    Author: Prabeesh K <[email protected]>
    
    Closes apache#331 from prabeesh/patch-3 and squashes the following commits:
    
    9399eb5 [Prabeesh K] small fix(proogram -> program)
    prabeesh authored and rxin committed Apr 5, 2014
    Commit: 0acc7a0
  9. HOTFIX for broken CI, by SPARK-1336

    Learnt that `set -o pipefail` is very useful.
    
    Author: Prashant Sharma <[email protected]>
    Author: Prashant Sharma <[email protected]>
    
    Closes apache#321 from ScrapCodes/hf-SPARK-1336 and squashes the following commits:
    
    9d22bc2 [Prashant Sharma] added comment why echo -e q exists.
    f865951 [Prashant Sharma] made error to match with word boundry so errors does not match. This is there to make sure build fails if provided SparkBuild has compile errors.
    7fffdf2 [Prashant Sharma] Removed a stray line.
    97379d8 [Prashant Sharma] HOTFIX for broken CI, by SPARK-1336
    ScrapCodes authored and pwendell committed Apr 5, 2014
    Commit: 7c18428
  10. Remove the getStageInfo() method from SparkContext.

    This method exposes the Stage objects, which are
    private to Spark and should not be exposed to the
    user.
    
    This method was added in apache@01d77f3; ccing @squito here in case there's a good reason to keep this!
    
    Author: Kay Ousterhout <[email protected]>
    
    Closes apache#308 from kayousterhout/remove_public_method and squashes the following commits:
    
    2e2f009 [Kay Ousterhout] Remove the getStageInfo() method from SparkContext.
    kayousterhout authored and mateiz committed Apr 5, 2014
    Commit: 2d0150c
  11. [SPARK-1371] fix computePreferredLocations signature to not depend on underlying implementation
    
    Changed the signature to use Map and Set, not mutable HashMap and HashSet.
    
    Author: Mridul Muralidharan <[email protected]>
    
    Closes apache#302 from mridulm/master and squashes the following commits:
    
    df747af [Mridul Muralidharan] Address review comments
    17e2907 [Mridul Muralidharan] fix computePreferredLocations signature to not depend on underlying implementation
    Mridul Muralidharan authored and mateiz committed Apr 5, 2014
    Commit: 6e88583

Commits on Apr 6, 2014

  1. Fix for PR apache#195 for Java 6

    Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility. See PR apache#195
    
    Author: Sean Owen <[email protected]>
    
    Closes apache#334 from srowen/FixPR195ForJava6 and squashes the following commits:
    
    f92fbd3 [Sean Owen] Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility
    srowen authored and pwendell committed Apr 6, 2014
    Commit: 890d63b
  2. SPARK-1421. Make MLlib work on Python 2.6

    The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.
    
    Author: Matei Zaharia <[email protected]>
    
    Closes apache#335 from mateiz/mllib-python-2.6 and squashes the following commits:
    
    f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
    a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
    mateiz committed Apr 6, 2014
    Commit: 0b85516
  3. Fix SPARK-1420 The maven build error for Spark Catalyst

    Author: witgo <[email protected]>
    
    Closes apache#333 from witgo/SPARK-1420 and squashes the following commits:
    
    902519e [witgo] add dependency scala-reflect to catalyst
    witgo authored and pwendell committed Apr 6, 2014
    Commit: 7012ffa
  4. [SPARK-1259] Make RDD locally iterable

    Author: Egor Pakhomov <[email protected]>
    
    Closes apache#156 from epahomov/SPARK-1259 and squashes the following commits:
    
    8ec8f24 [Egor Pakhomov] Make to local iterator shorter
    34aa300 [Egor Pakhomov] Fix toLocalIterator docs
    08363ef [Egor Pakhomov] SPARK-1259 from toLocallyIterable to toLocalIterator
    6a994eb [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
    8be3dcf [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
    33ecb17 [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
    epahomov authored and pwendell committed Apr 6, 2014
    Commit: e258e50
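
    A usage sketch of the method added in the commit above: `toLocalIterator` streams an RDD to the driver one partition at a time, so the whole dataset never has to fit in driver memory the way `collect()` requires.

    ```scala
    import org.apache.spark.SparkContext

    object LocalIteratorSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "local-iterator-sketch")
        val rdd = sc.parallelize(1 to 100, 4)               // 4 partitions
        rdd.toLocalIterator.take(5).foreach(println)        // iterate locally without collect()
        sc.stop()
      }
    }
    ```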

Commits on Apr 7, 2014

  1. SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions
    
    Another handful of small build changes to organize and standardize a bit, and avoid warnings:
    
    - Update Maven plugin versions for good measure
    - Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
    - Use variables to define versions across dependencies where they should move in lock step
    - ... and make this consistent between Maven/SBT
    
    OK, I also updated the JIRA URL while I was at it here.
    
    Author: Sean Owen <[email protected]>
    
    Closes apache#291 from srowen/SPARK-1387 and squashes the following commits:
    
    461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one
    c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
    srowen authored and pwendell committed Apr 7, 2014
    Commit: 856c50f
  2. SPARK-1349: spark-shell gets its own command history

    Currently, spark-shell shares its command history with the Scala REPL.
    
    This fix is simply a modification of the default FileBackedHistory file setting:
    https://github.com/scala/scala/blob/master/src/repl/scala/tools/nsc/interpreter/session/FileBackedHistory.scala#L77
    
    Author: Aaron Davidson <[email protected]>
    
    Closes apache#267 from aarondav/repl and squashes the following commits:
    
    f9c62d2 [Aaron Davidson] SPARK-1349: spark-shell gets its own command history separate from scala repl
    aarondav authored and pwendell committed Apr 7, 2014
    Commit: 7ce52c4
  3. SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging

    Previously, we based our decision about including the datanucleus jars on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" was run. This meant that a typical and previously supported pathway would start using Hive jars.
    
    This patch has the following features/bug fixes:
    
    - Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
    - Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
    - assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
    - avoid adding log message in compute-classpath.sh to the classpath :)
    
    Still TODO before mergeable:
    - We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
    - Spark SQL documentation updates.
    
    Author: Aaron Davidson <[email protected]>
    
    Closes apache#237 from aarondav/master and squashes the following commits:
    
    5dc4329 [Aaron Davidson] Typo fixes
    dd4f298 [Aaron Davidson] Doc update
    dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
    a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
    aarondav authored and pwendell committed Apr 7, 2014
    Commit: 4106558
  4. SPARK-1154: Clean up app folders in worker nodes

    This is a fix for [SPARK-1154](https://issues.apache.org/jira/browse/SPARK-1154). The issue is that worker nodes fill up with a huge number of app-* folders after some time. This change adds a periodic cleanup task which asynchronously deletes app directories older than a configurable TTL. (A minimal sketch of the TTL check appears after this commit's details.)
    
    Two new configuration parameters have been introduced:
      spark.worker.cleanup_interval
      spark.worker.app_data_ttl
    
    This change does not include moving the downloads of application jars to a location outside of the work directory.  We will address that if we have time, but that potentially involves caching so it will come either as part of this PR or a separate PR.
    
    Author: Evan Chan <[email protected]>
    Author: Kelvin Chu <[email protected]>
    
    Closes apache#288 from velvia/SPARK-1154-cleanup-app-folders and squashes the following commits:
    
    0689995 [Evan Chan] CR from @aarondav - move config, clarify for standalone mode
    9f10d96 [Evan Chan] CR from @pwendell - rename configs and add cleanup.enabled
    f2f6027 [Evan Chan] CR from @andrewor14
    553d8c2 [Kelvin Chu] change the variable name to currentTimeMillis since it actually tracks in seconds
    8dc9cb5 [Kelvin Chu] Fixed a bug in Utils.findOldFiles() after merge.
    cb52f2b [Kelvin Chu] Change the name of findOldestFiles() to findOldFiles()
    72f7d2d [Kelvin Chu] Fix a bug of Utils.findOldestFiles(). file.lastModified is returned in milliseconds.
    ad99955 [Kelvin Chu] Add unit test for Utils.findOldestFiles()
    dc1a311 [Evan Chan] Don't recompute current time with every new file
    e3c408e [Evan Chan] Document the two new settings
    b92752b [Evan Chan] SPARK-1154: Add a periodic task to clean up app directories
    Evan Chan authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    1440154 View commit details
    Browse the repository at this point in the history
  5. SPARK-1431: Allow merging conflicting pull requests

    Sometimes if there is a small conflict it's nice to be able to just
    manually fix it up rather than have another RTT with the contributor.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#342 from pwendell/merge-conflicts and squashes the following commits:
    
    cdce61a [Patrick Wendell] SPARK-1431: Allow merging conflicting pull requests
    pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    87d0928 View commit details
    Browse the repository at this point in the history
  6. [SQL] SPARK-1371 Hash Aggregation Improvements

    Given:
    ```scala
    case class Data(a: Int, b: Int)
    val rdd =
      sparkContext
        .parallelize(1 to 200)
        .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
    rdd.registerAsTable("data")
    cacheTable("data")
    ```
    Before:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    16795.567ms
    SELECT a, SUM(b) FROM data GROUP BY a
    7536.436ms
    SELECT SUM(b) FROM data
    10954.1ms
    ```
    
    After:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    1372.175ms
    SELECT a, SUM(b) FROM data GROUP BY a
    2070.446ms
    SELECT SUM(b) FROM data
    958.969ms
    ```
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#295 from marmbrus/hashAgg and squashes the following commits:
    
    ec63575 [Michael Armbrust] Add comment.
    d0495a9 [Michael Armbrust] Use scaladoc instead.
    b4a6887 [Michael Armbrust] Address review comments.
    a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
    7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections.  Remove unused local RDD functions implicits.
    5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
    6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
    92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
    1279df2 [Michael Armbrust] Increase default number of partitions.
    marmbrus authored and rxin committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    accd099 View commit details
    Browse the repository at this point in the history
  7. [SQL] SPARK-1427 Fix toString for SchemaRDD NativeCommands.

    Author: Michael Armbrust <[email protected]>
    
    Closes apache#343 from marmbrus/toStringFix and squashes the following commits:
    
    37198fe [Michael Armbrust] Fix toString for SchemaRDD NativeCommands.
    marmbrus authored and rxin committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    b5bae84 View commit details
    Browse the repository at this point in the history
  8. SPARK-1432: Make sure that all metadata fields are properly cleaned

    While working on spark-1337 with @pwendell, we noticed that not all of the metadata maps in JobProgressListener were being properly cleaned. This could lead to a (hypothetical) memory leak issue should a job run long enough. This patch aims to address the issue.
    
    Author: Davis Shepherd <[email protected]>
    
    Closes apache#338 from dgshep/master and squashes the following commits:
    
    a77b65c [Davis Shepherd] In the contex of SPARK-1337: Make sure that all metadata fields are properly cleaned
    Davis Shepherd authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    a3c51c6 View commit details
    Browse the repository at this point in the history
  9. [sql] Rename Expression.apply to eval for better readability.

    Also used this opportunity to add a bunch of override's and made some members private.
    
    Author: Reynold Xin <[email protected]>
    
    Closes apache#340 from rxin/eval and squashes the following commits:
    
    a7c7ca7 [Reynold Xin] Fixed conflicts in merge.
    9069de6 [Reynold Xin] Merge branch 'master' into eval
    3ccc313 [Reynold Xin] Merge branch 'master' into eval
    1a47e10 [Reynold Xin] Renamed apply to eval for generators and added a bunch of override's.
    ea061de [Reynold Xin] Rename Expression.apply to eval for better readability.
    rxin committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    83f2a2f View commit details
    Browse the repository at this point in the history
  10. SPARK-1252. On YARN, use container-log4j.properties for executors

    container-log4j.properties is a file that YARN provides so that containers can have log4j.properties distinct from that of the NodeManagers.
    
    Logs now go to syslog, and stderr and stdout just have the process's standard err and standard out.
    
    I tested this on pseudo-distributed clusters for both yarn (Hadoop 2.2) and yarn-alpha (Hadoop 0.23.7).
    
    Author: Sandy Ryza <[email protected]>
    
    Closes apache#148 from sryza/sandy-spark-1252 and squashes the following commits:
    
    c0043b8 [Sandy Ryza] Put log4j.properties file under common
    55823da [Sandy Ryza] Add license headers to new files
    10934b8 [Sandy Ryza] Add log4j-spark-container.properties and support SPARK_LOG4J_CONF
    e74450b [Sandy Ryza] SPARK-1252. On YARN, use container-log4j.properties for executors
    sryza authored and tgravescs committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    9dd8b91 View commit details
    Browse the repository at this point in the history
  11. HOTFIX: Disable actor input stream test.

    This test makes incorrect assumptions about the behavior of Thread.sleep().
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#347 from pwendell/stream-tests and squashes the following commits:
    
    10e09e0 [Patrick Wendell] HOTFIX: Disable actor input stream.
    pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    2a2ca48 View commit details
    Browse the repository at this point in the history
  12. SPARK-1099: Introduce local[*] mode to infer number of cores

    This is the default mode for running spark-shell and pyspark, intended to allow users running Spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly one core.
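
    A small usage sketch (the object and app names are mine, not from the PR): `local[*]` sizes the local scheduler to the number of logical cores the JVM reports, while plain `local` keeps the old single-core behaviour.
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    object LocalStarDemo {
      def main(args: Array[String]): Unit = {
        // "local[*]" uses as many worker threads as there are logical cores on the machine.
        val conf = new SparkConf().setMaster("local[*]").setAppName("LocalStarDemo")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 1000000).map(_ * 2).count())
        sc.stop()
      }
    }
    ```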
    
    Author: Aaron Davidson <[email protected]>
    
    Closes apache#182 from aarondav/110 and squashes the following commits:
    
    a88294c [Aaron Davidson] Rebased changes for new spark-shell
    a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
    aarondav authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    0307db0 View commit details
    Browse the repository at this point in the history

Commits on Apr 8, 2014

  1. [sql] Rename execution/aggregates.scala Aggregate.scala, and added a …

    …bunch of private[this] to variables.
    
    Author: Reynold Xin <[email protected]>
    
    Closes apache#348 from rxin/aggregate and squashes the following commits:
    
    f4bc36f [Reynold Xin] Rename execution/aggregates.scala Aggregate.scala, and added a bunch of private[this] to variables.
    rxin committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    14c9238 View commit details
    Browse the repository at this point in the history
  2. Removed the default eval implementation from Expression, and added a …

    …bunch of override's in classes I touched.
    
    It is more robust to not provide a default implementation for Expression's.
    
    Author: Reynold Xin <[email protected]>
    
    Closes apache#350 from rxin/eval-default and squashes the following commits:
    
    0a83b8f [Reynold Xin] Removed the default eval implementation from Expression, and added a bunch of override's in classes I touched.
    rxin committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    55dfd5d View commit details
    Browse the repository at this point in the history
  3. Added eval for Rand (without any support for user-defined seed).

    Author: Reynold Xin <[email protected]>
    
    Closes apache#349 from rxin/rand and squashes the following commits:
    
    fd11322 [Reynold Xin] Added eval for Rand (without any support for user-defined seed).
    rxin committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    31e6fff View commit details
    Browse the repository at this point in the history
  4. Change timestamp cast semantics. When cast to numeric types, return t…

    …he unix time in seconds (instead of millis).
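
    An illustrative plain-Scala restatement of the new rule (not the Catalyst cast code itself): a timestamp converted to an integral type now yields whole seconds since the epoch.
    
    ```scala
    import java.sql.Timestamp
    
    // Old behaviour: ts.getTime (milliseconds). New behaviour: seconds since the epoch.
    def timestampToLong(ts: Timestamp): Long = ts.getTime / 1000L
    
    // Doubles presumably keep sub-second precision (cf. the "Fixed precision for double" squash commit).
    def timestampToDouble(ts: Timestamp): Double = ts.getTime / 1000.0
    ```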
    
    @marmbrus @chenghao-intel
    
    Author: Reynold Xin <[email protected]>
    
    Closes apache#352 from rxin/timestamp-cast and squashes the following commits:
    
    18aacd3 [Reynold Xin] Fixed precision for double.
    2adb235 [Reynold Xin] Change timestamp cast semantics. When cast to numeric types, return the unix time in seconds (instead of millis).
    rxin committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    f27e56a View commit details
    Browse the repository at this point in the history
  5. [SPARK-1402] Added 3 more compression schemes

    JIRA issue: [SPARK-1402](https://issues.apache.org/jira/browse/SPARK-1402)
    
    This PR provides 3 more compression schemes for Spark SQL in-memory columnar storage:
    
    * `BooleanBitSet`
    * `IntDelta`
    * `LongDelta`
    
    Now there are 6 compression schemes in total, including the no-op `PassThrough` scheme.
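
    As a rough illustration of the delta idea behind `IntDelta`/`LongDelta` (this is a toy sketch, not the actual columnar code; a real scheme also needs a fallback when a delta does not fit in a byte): store the first value verbatim, then small per-element deltas.
    
    ```scala
    // Toy delta encoder: first value verbatim, then byte-sized deltas.
    def deltaEncode(values: Array[Int]): (Int, Array[Byte]) = {
      require(values.nonEmpty, "need at least one value")
      val deltas = new Array[Byte](values.length - 1)
      var i = 1
      while (i < values.length) {
        val d = values(i) - values(i - 1)
        require(d >= Byte.MinValue && d <= Byte.MaxValue, "delta too large; a real scheme would fall back")
        deltas(i - 1) = d.toByte
        i += 1
      }
      (values.head, deltas)
    }
    
    def deltaDecode(first: Int, deltas: Array[Byte]): Array[Int] =
      deltas.scanLeft(first)((prev, d) => prev + d)
    ```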
    
    Also fixed a bug from PR apache#286: not all compression schemes were added as available schemes when accessing an in-memory column, so when a column was compressed with an unrecognised scheme, `ColumnAccessor` threw an exception.
    
    Author: Cheng Lian <[email protected]>
    
    Closes apache#330 from liancheng/moreCompressionSchemes and squashes the following commits:
    
    1d037b8 [Cheng Lian] Fixed SPARK-1436: in-memory column byte buffer must be able to be accessed multiple times
    d7c0e8f [Cheng Lian] Added test suite for IntegralDelta (IntDelta & LongDelta)
    3c1ad7a [Cheng Lian] Added test suite for BooleanBitSet, refactored other test suites
    44fe4b2 [Cheng Lian] Refactored CompressionScheme, added 3 more compression schemes.
    liancheng authored and rxin committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    0d0493f View commit details
    Browse the repository at this point in the history
  6. [SPARK-1103] Automatic garbage collection of RDD, shuffle and broadca…

    …st data
    
    This PR allows Spark to automatically cleanup metadata and data related to persisted RDDs, shuffles and broadcast variables when the corresponding RDDs, shuffles and broadcast variables fall out of scope from the driver program. This is still a work in progress as broadcast cleanup has not been implemented.
    
    **Implementation Details**
    A new class `ContextCleaner` is responsible for cleaning all of this state. It is instantiated as part of a `SparkContext`. The RDD and ShuffleDependency classes have an overridden `finalize()` function that gets called whenever their instances go out of scope. The `finalize()` function enqueues the object’s identifier (i.e. RDD ID, shuffle ID, etc.) with the `ContextCleaner`, which is a very short and cheap operation and should not significantly affect the garbage collection mechanism. The `ContextCleaner`, on a different thread, performs the cleanup, whose details are given below.
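
    Before the per-component details below, a minimal sketch of that enqueue-and-clean-asynchronously pattern (all names here are made up for illustration; per the squashed commits, the real `ContextCleaner` later moved from `finalize()` to a `ReferenceQueue`):
    
    ```scala
    import java.util.concurrent.LinkedBlockingQueue
    
    sealed trait CleanupTask
    case class CleanRDD(rddId: Int) extends CleanupTask
    case class CleanShuffle(shuffleId: Int) extends CleanupTask
    
    class ToyContextCleaner {
      private val queue = new LinkedBlockingQueue[CleanupTask]()
    
      // Called from finalize()/reference-queue handling: cheap, just enqueues an ID.
      def registerForCleanup(task: CleanupTask): Unit = queue.put(task)
    
      private val cleaningThread = new Thread("toy-context-cleaner") {
        override def run(): Unit = while (true) {
          queue.take() match {
            case CleanRDD(id)     => println(s"would call rdd.unpersist() for RDD $id")
            case CleanShuffle(id) => println(s"would remove shuffle $id via the BlockManagerMaster")
          }
        }
      }
      cleaningThread.setDaemon(true)
      cleaningThread.start()
    }
    ```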
    
    *RDD cleanup:*
    `ContextCleaner` calls `RDD.unpersist()` to clean up persisted RDDs. Regarding metadata, the DAGScheduler automatically cleans up all metadata related to an RDD after all jobs have completed. Only the `SparkContext.persistentRDDs` map keeps strong references to persisted RDDs. The `TimeStampedHashMap` used for that has been replaced by a `TimeStampedWeakValueHashMap` that keeps only weak references to the RDDs, allowing them to be garbage collected.
    
    *Shuffle cleanup:*
    A new BlockManager message, `RemoveShuffle(<shuffle ID>)`, asks the `BlockManagerMaster` and currently active `BlockManager`s to delete all the disk blocks related to the shuffle ID. `ContextCleaner` cleans up shuffle data using this message and also cleans up the metadata in the `MapOutputTracker` of the driver. The `MapOutputTracker` at the workers, which caches the shuffle metadata, maintains a `BoundedHashMap` to limit the shuffle information it caches. Refetching the shuffle information from the driver is not too costly.
    
    *Broadcast cleanup:*
    To be done. [This PR](https://github.com/apache/incubator-spark/pull/543/) adds a mechanism for explicit cleanup of broadcast variables. `Broadcast.finalize()` will enqueue its own ID with the ContextCleaner, and that PR's mechanism will be used to unpersist the broadcast data.
    
    *Other cleanup:*
    `ShuffleMapTask` and `ResultTask` cached tasks and used TTL-based cleanup (using `TimeStampedHashMap`), so nothing got cleaned up if the TTL was not set. Instead, they now use a `BoundedHashMap` to keep a limited amount of map output information. The cost of repopulating the cache when necessary is very small.
    
    **Current state of implementation**
    Implemented RDD and shuffle cleanup. Things left to be done:
    - Cleanup of broadcast variables.
    - Automatic cleanup of keys with empty weak references as values in `TimeStampedWeakValueHashMap`.
    
    Author: Tathagata Das <[email protected]>
    Author: Andrew Or <[email protected]>
    Author: Roman Pastukhov <[email protected]>
    
    Closes apache#126 from tdas/state-cleanup and squashes the following commits:
    
    61b8d6e [Tathagata Das] Fixed issue with Tachyon + new BlockManager methods.
    f489fdc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    d25a86e [Tathagata Das] Fixed stupid typo.
    cff023c [Tathagata Das] Fixed issues based on Andrew's comments.
    4d05314 [Tathagata Das] Scala style fix.
    2b95b5e [Tathagata Das] Added more documentation on Broadcast implementations, specially which blocks are told about to the driver. Also, fixed Broadcast API to hide destroy functionality.
    41c9ece [Tathagata Das] Added more unit tests for BlockManager, DiskBlockManager, and ContextCleaner.
    6222697 [Tathagata Das] Fixed bug and adding unit test for removeBroadcast in BlockManagerSuite.
    104a89a [Tathagata Das] Fixed failing BroadcastSuite unit tests by introducing blocking for removeShuffle and removeBroadcast in BlockManager*
    a430f06 [Tathagata Das] Fixed compilation errors.
    b27f8e8 [Tathagata Das] Merge pull request apache#3 from andrewor14/cleanup
    cd72d19 [Andrew Or] Make automatic cleanup configurable (not documented)
    ada45f0 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup
    a2cc8bc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    c5b1d98 [Andrew Or] Address Patrick's comments
    a6460d4 [Andrew Or] Merge github.com:apache/spark into cleanup
    762a4d8 [Tathagata Das] Merge pull request #1 from andrewor14/cleanup
    f0aabb1 [Andrew Or] Correct semantics for TimeStampedWeakValueHashMap + add tests
    5016375 [Andrew Or] Address TD's comments
    7ed72fb [Andrew Or] Fix style test fail + remove verbose test message regarding broadcast
    634a097 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup
    7edbc98 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into state-cleanup
    8557c12 [Andrew Or] Merge github.com:apache/spark into cleanup
    e442246 [Andrew Or] Merge github.com:apache/spark into cleanup
    88904a3 [Andrew Or] Make TimeStampedWeakValueHashMap a wrapper of TimeStampedHashMap
    fbfeec8 [Andrew Or] Add functionality to query executors for their local BlockStatuses
    34f436f [Andrew Or] Generalize BroadcastBlockId to remove BroadcastHelperBlockId
    0d17060 [Andrew Or] Import, comments, and style fixes (minor)
    c92e4d9 [Andrew Or] Merge github.com:apache/spark into cleanup
    f201a8d [Andrew Or] Test broadcast cleanup in ContextCleanerSuite + remove BoundedHashMap
    e95479c [Andrew Or] Add tests for unpersisting broadcast
    544ac86 [Andrew Or] Clean up broadcast blocks through BlockManager*
    d0edef3 [Andrew Or] Add framework for broadcast cleanup
    ba52e00 [Andrew Or] Refactor broadcast classes
    c7ccef1 [Andrew Or] Merge branch 'bc-unpersist-merge' of github.com:ignatich/incubator-spark into cleanup
    6c9dcf6 [Tathagata Das] Added missing Apache license
    d2f8b97 [Tathagata Das] Removed duplicate unpersistRDD.
    a007307 [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    620eca3 [Tathagata Das] Changes based on PR comments.
    f2881fd [Tathagata Das] Changed ContextCleaner to use ReferenceQueue instead of finalizer
    e1fba5f [Tathagata Das] Style fix
    892b952 [Tathagata Das] Removed use of BoundedHashMap, and made BlockManagerSlaveActor cleanup shuffle metadata in MapOutputTrackerWorker.
    a7260d3 [Tathagata Das] Added try-catch in context cleaner and null value cleaning in TimeStampedWeakValueHashMap.
    e61daa0 [Tathagata Das] Modifications based on the comments on PR 126.
    ae9da88 [Tathagata Das] Removed unncessary TimeStampedHashMap from DAGScheduler, added try-catches in finalize() methods, and replaced ArrayBlockingQueue to LinkedBlockingQueue to avoid blocking in Java's finalizing thread.
    cb0a5a6 [Tathagata Das] Fixed docs and styles.
    a24fefc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    8512612 [Tathagata Das] Changed TimeStampedHashMap to use WrappedJavaHashMap.
    e427a9e [Tathagata Das] Added ContextCleaner to automatically clean RDDs and shuffles when they fall out of scope. Also replaced TimeStampedHashMap to BoundedHashMaps and TimeStampedWeakValueHashMap for the necessary hashmap behavior.
    80dd977 [Roman Pastukhov] Fix for Broadcast unpersist patch.
    1e752f1 [Roman Pastukhov] Added unpersist method to Broadcast.
    tdas authored and pwendell committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    11eabbe View commit details
    Browse the repository at this point in the history
  7. [SPARK-1331] Added graceful shutdown to Spark Streaming

    The current version of StreamingContext.stop() directly kills all the data receivers (NetworkReceiver) without waiting for the data already received to be persisted and processed. This PR provides the fix. Now, when StreamingContext.stop() is called, the following sequence of steps will happen.
    1. The driver will send a stop signal to all the active receivers.
    2. Each receiver, when it gets a stop signal from the driver, first stops receiving more data, then waits for the thread that persists data blocks to the BlockManager to finish persisting all received data, and finally quits.
    3. After all the receivers have stopped, the driver will wait for the Job Generator and Job Scheduler to finish processing all the received data.
    
    It also fixes the semantics of StreamingContext.start and stop. It will throw appropriate errors and warnings if stop() is called before start(), stop() is called twice, etc.
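
    A hedged usage sketch; the `stopGracefully` flag name is taken from later Spark Streaming releases and may not match this exact commit:
    
    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    
    val conf = new SparkConf().setMaster("local[2]").setAppName("GracefulStopDemo")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... register input streams and output operations, then:
    ssc.start()
    // Ask receivers to stop, wait for received data to be processed, then shut down.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    ```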
    
    Author: Tathagata Das <[email protected]>
    
    Closes apache#247 from tdas/graceful-shutdown and squashes the following commits:
    
    61c0016 [Tathagata Das] Updated MIMA binary check excludes.
    ae1d39b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into graceful-shutdown
    6b59cfc [Tathagata Das] Minor changes based on Andrew's comment on PR.
    d0b8d65 [Tathagata Das] Reduced time taken by graceful shutdown unit test.
    f55bc67 [Tathagata Das] Fix scalastyle
    c69b3a7 [Tathagata Das] Updates based on Patrick's comments.
    c43b8ae [Tathagata Das] Added graceful shutdown to Spark Streaming.
    tdas authored and pwendell committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    83ac9a4 View commit details
    Browse the repository at this point in the history
  8. [SPARK-1396] Properly cleanup DAGScheduler on job cancellation.

    Previously, when jobs were cancelled, not all of the state in the
    DAGScheduler was cleaned up, leading to a slow memory leak in the
    DAGScheduler.  As we expose easier ways to cancel jobs, it's more
    important to fix these issues.
    
    This commit also fixes a second and less serious problem, which is that
    previously, when a stage failed, not all of the appropriate stages
    were cancelled.  See the "failure of stage used by two jobs" test
    for an example of this.  This just meant that extra work was done, and is
    not a correctness problem.
    
    This commit adds 3 tests.  “run shuffle with map stage failure” is
    a new test to more thoroughly test this functionality, and passes on
    both the old and new versions of the code.  “trivial job
    cancellation” fails on the old code because all state wasn’t cleaned
    up correctly when jobs were cancelled (we didn’t remove the job from
    resultStageToJob).  “failure of stage used by two jobs” fails on the
    old code because taskScheduler.cancelTasks wasn’t called for one of
    the stages (see test comments).
    
    This should be checked in before apache#246, which makes it easier to
    cancel stages / jobs.
    
    Author: Kay Ousterhout <[email protected]>
    
    Closes apache#305 from kayousterhout/incremental_abort_fix and squashes the following commits:
    
    f33d844 [Kay Ousterhout] Mark review comments
    9217080 [Kay Ousterhout] Properly cleanup DAGScheduler on job cancellation.
    kayousterhout committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    6dc5f58 View commit details
    Browse the repository at this point in the history
  9. Remove extra semicolon in import statement and unused import in Appli…

    …cationMaster
    
    Small nit cleanup to remove an extra semicolon and an unused import in Yarn's stable ApplicationMaster (it bothered me every time I saw it).
    
    Author: Henry Saputra <[email protected]>
    
    Closes apache#358 from hsaputra/nitcleanup_removesemicolon_import_applicationmaster and squashes the following commits:
    
    bffb685 [Henry Saputra] Remove extra semicolon in import statement and unused import in ApplicationMaster.scala
    hsaputra authored and rxin committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    3bc0548 View commit details
    Browse the repository at this point in the history
  10. SPARK-1348 binding Master, Worker, and App Web UI to all interfaces

    Author: Kan Zhang <[email protected]>
    
    Closes apache#318 from kanzhang/SPARK-1348 and squashes the following commits:
    
    e625a5f [Kan Zhang] reverting the changes to startJettyServer()
    7a8084e [Kan Zhang] SPARK-1348 binding Master, Worker, and App Web UI to all interfaces
    kanzhang authored and pwendell committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    a8d86b0 View commit details
    Browse the repository at this point in the history
  11. SPARK-1445: compute-classpath should not print error if lib_managed n…

    …ot found
    
    This guard was added to the check for the assembly jar, but was forgotten for the datanucleus jars.
    
    Author: Aaron Davidson <[email protected]>
    
    Closes apache#361 from aarondav/cc and squashes the following commits:
    
    8facc16 [Aaron Davidson] SPARK-1445: compute-classpath should not print error if lib_managed not found
    aarondav authored and pwendell committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    e25b593 View commit details
    Browse the repository at this point in the history
  12. [SPARK-1397] Notify SparkListeners when stages fail or are cancelled.

    [I wanted to post this for folks to comment but it depends on (and thus includes the changes in) a currently outstanding PR, apache#305.  You can look at just the second commit: kayousterhout@93f08ba to see just the changes relevant to this PR]
    
    Previously, when stages fail or get cancelled, the SparkListener is only notified
    indirectly through the SparkListenerJobEnd, where we sometimes pass in a single
    stage that failed.  This worked before job cancellation, because jobs would only fail
    due to a single stage failure.  However, with job cancellation, multiple running stages
    can fail when a job gets cancelled.  Right now, this is not handled correctly, which
    results in stages that get stuck in the “Running Stages” window in the UI even
    though they’re dead.
    
    This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded
    event, and uses this event to tell SparkListeners when stages fail in addition to when
    they complete successfully.  This change is NOT publicly backward compatible for two
    reasons.  First, it changes the SparkListener interface.  We could alternately add a new event,
    SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted.  However,
    this is less consistent with the listener events for tasks / jobs ending, and will result in some
    code duplication for listeners (because failed and completed stages are handled in similar
    ways).  Note that I haven’t finished updating the JSON code to correctly handle the new event
    because I’m waiting for feedback on whether this is a good or bad idea (hence the “WIP”).
    
    It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed()
    method to no longer include a stage that caused the failure.  I think this change should definitely
    stay, because with cancellation (as described above), a failure isn’t necessarily caused by a
    single stage.
    
    Author: Kay Ousterhout <[email protected]>
    
    Closes apache#309 from kayousterhout/stage_cancellation and squashes the following commits:
    
    5533ecd [Kay Ousterhout] Fixes in response to Mark's review
    320c7c7 [Kay Ousterhout] Notify SparkListeners when stages fail or are cancelled.
    kayousterhout authored and pwendell committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    fac6085 View commit details
    Browse the repository at this point in the history
  13. SPARK-1433: Upgrade Mesos dependency to 0.17.0

    Mesos 0.13.0 was released 6 months ago.
    Upgrade Mesos dependency to 0.17.0
    
    Author: Sandeep <[email protected]>
    
    Closes apache#355 from techaddict/mesos_update and squashes the following commits:
    
    f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
    techaddict authored and pwendell committed Apr 8, 2014
    Configuration menu
    Copy the full SHA
    12c077d View commit details
    Browse the repository at this point in the history

Commits on Apr 9, 2014

  1. Spark 1271: Co-Group and Group-By should pass Iterable[X]

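    The gist of the change, as I read it from the title and the squashed commits: grouping operations now hand back `Iterable` values instead of `Seq` values. A small sketch (names are mine):
    
    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD functions (pre-1.3 import style)
    import org.apache.spark.rdd.RDD
    
    def groupExample(sc: SparkContext): Unit = {
      val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
      // The value side is now Iterable[Int] rather than Seq[Int].
      val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()
      grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.toList}") }
    }
    ```
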
    Author: Holden Karau <[email protected]>
    
    Closes apache#242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:
    
    f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
    77048f8 [Holden Karau] Fix merge up to master
    d3fe909 [Holden Karau] use toSeq instead
    7a092a3 [Holden Karau] switch resultitr to resultiterable
    eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
    c5075aa [Holden Karau] If guava 14 had iterables
    2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
    11e730c [Holden Karau] Fix streaming tests
    66b583d [Holden Karau] Fix the core test suite to compile
    4ed579b [Holden Karau] Refactor from iterator to iterable
    d052c07 [Holden Karau] Python tests now pass with iterator pandas
    3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
    cd1e81c [Holden Karau] Try and make pickling list iterators work
    c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
    88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
    a5ee714 [Holden Karau] oops, was checking wrong iterator
    e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
    ec8cc3e [Holden Karau] Fix test issues\!
    4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
    fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
    ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
    b692868 [Holden Karau] Revert
    7e533f7 [Holden Karau] Fix the bug
    8a5153a [Holden Karau] Revert me, but we have some stuff to debug
    b4e86a9 [Holden Karau] Add a join based on the problem in SVD
    c4510e2 [Holden Karau] Revert this but for now put things in list pandas
    b4e0b1d [Holden Karau] Fix style issues
    71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
    b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
    37888ec [Holden Karau] core/tests now pass
    249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
    6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
    fe992fe [Holden Karau] hmmm try and fix up basic operation suite
    172705c [Holden Karau] Fix Java API suite
    caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
    88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
    4991af6 [Holden Karau] Fix some tests
    be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
    687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
    holdenk authored and pwendell committed Apr 9, 2014
    Configuration menu
    Copy the full SHA
    ce8ec54 View commit details
    Browse the repository at this point in the history
  2. [SPARK-1434] [MLLIB] change labelParser from anonymous function to trait

    This is a patch to address @mateiz 's comment in apache#245
    
    MLUtils#loadLibSVMData uses an anonymous function for the label parser, which Java users won't like. So I made a trait for LabelParser and provided two implementations: binary and multiclass.
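
    A hedged sketch of what such a trait might look like (the threshold and object names are assumptions, not copied from the MLlib source):
    
    ```scala
    trait LabelParser extends Serializable {
      def parse(labelString: String): Double
    }
    
    object BinaryLabelParser extends LabelParser {
      // Treat anything above 0.5 as the positive class.
      override def parse(labelString: String): Double =
        if (labelString.toDouble > 0.5) 1.0 else 0.0
    }
    
    object MulticlassLabelParser extends LabelParser {
      override def parse(labelString: String): Double = labelString.toDouble
    }
    ```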
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#345 from mengxr/label-parser and squashes the following commits:
    
    ac44409 [Xiangrui Meng] use singleton objects for label parsers
    3b1a7c6 [Xiangrui Meng] add tests for label parsers
    c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton
    11c94e0 [Xiangrui Meng] add return types
    7f8eb36 [Xiangrui Meng] change labelParser from annoymous function to trait
    mengxr authored and pwendell committed Apr 9, 2014
    Configuration menu
    Copy the full SHA
    b9e0c93 View commit details
    Browse the repository at this point in the history
  3. Spark-939: allow user jars to take precedence over spark jars

    I still need to do a small bit of refactoring [mostly the one Java file, which I'll switch back to a Scala file and use in both of the class loaders], but comments on other things I should do would be great.
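
    For illustration, a generic child-first `URLClassLoader` sketch, which is the usual way to let user jars win over the framework's own classpath (this is not the class added by this PR):
    
    ```scala
    import java.net.{URL, URLClassLoader}
    
    // Try the user's jars first; only fall back to the framework's loader on a miss.
    class ChildFirstURLClassLoader(urls: Array[URL], fallback: ClassLoader)
        extends URLClassLoader(urls, null) {
      override def loadClass(name: String, resolve: Boolean): Class[_] =
        try {
          super.loadClass(name, resolve)     // look in the user-supplied jars first
        } catch {
          case _: ClassNotFoundException =>
            fallback.loadClass(name)         // then the framework's own classpath
        }
    }
    ```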
    
    Author: Holden Karau <[email protected]>
    
    Closes apache#217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits:
    
    cf0cac9 [Holden Karau] Fix the executorclassloader
    1955232 [Holden Karau] Fix long line in TestUtils
    8f89965 [Holden Karau] Fix tests for new class name
    7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader
    644719f [Holden Karau] User the class generator for the repl class loader tests too
    f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests
    204b199 [Holden Karau] Fix the generated classes
    9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes
    858aba2 [Holden Karau] Remove a bunch of test junk
    261aaee [Holden Karau] simplify executorurlclassloader a bit
    7a7bf5f [Holden Karau] CR feedback
    d4ae848 [Holden Karau] rewrite component into scala
    aa95083 [Holden Karau] CR feedback
    7752594 [Holden Karau] re-add https comment
    a0ef85a [Holden Karau] Fix style issues
    125ea7f [Holden Karau] Easier to just remove those files, we don't need them
    bb8d179 [Holden Karau] Fix issues with the repl class loader
    241b03d [Holden Karau] fix my rat excludes
    a343350 [Holden Karau] Update rat-excludes and remove a useless file
    d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it
    4919bf9 [Holden Karau] Fix parent calling class loader issue
    8a67302 [Holden Karau] Test are good
    9e2d236 [Holden Karau] It works comrade
    691ee00 [Holden Karau] It works ish
    dc4fe44 [Holden Karau] Does not depend on being in my home directory
    47046ff [Holden Karau] Remove bad import'
    22d83cb [Holden Karau] Add a test suite for the executor url class loader suite
    7ef4628 [Holden Karau] Clean up
    792d961 [Holden Karau] Almost works
    16aecd1 [Holden Karau] Doesn't quite work
    8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options
    648b559 [Holden Karau] Both class loaders compile. Now for testing
    e1d9f71 [Holden Karau] One loader workers.
    holdenk authored and pwendell committed Apr 9, 2014
    Configuration menu
    Copy the full SHA
    fa0524f View commit details
    Browse the repository at this point in the history
  4. [SPARK-1390] Refactoring of matrices backed by RDDs

    This is to refactor the interfaces for matrices backed by RDDs. It would be better if we had a clear separation between local matrices and those backed by RDDs. Right now, we have
    
    1. `org.apache.spark.mllib.linalg.SparseMatrix`, which is a wrapper over an RDD of matrix entries, i.e., coordinate list format.
    2. `org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix`, which is a wrapper over RDD[Array[Double]], i.e. row-oriented format.
    
    We will see naming collision when we introduce local `SparseMatrix`, and the name `TallSkinnyDenseMatrix` is not exact if we switch to `RDD[Vector]` from `RDD[Array[Double]]`. It would be better to have "RDD" in the class name to suggest that operations may trigger jobs.
    
    The proposed names are (all under `org.apache.spark.mllib.linalg.rdd`):
    
    1. `RDDMatrix`: trait for matrices backed by one or more RDDs
    2. `CoordinateRDDMatrix`: wrapper of `RDD[(Long, Long, Double)]`
    3. `RowRDDMatrix`: wrapper of `RDD[Vector]` whose rows do not have special ordering
    4. `IndexedRowRDDMatrix`: wrapper of `RDD[(Long, Vector)]` whose rows are associated with indices
    
    The current code also introduces local matrices.
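
    A rough sketch of what one of the proposed wrappers might look like (the `MatrixEntry` shape and method names are assumptions based on the naming discussion above, not the final MLlib API):
    
    ```scala
    import org.apache.spark.rdd.RDD
    
    // Coordinate-list entry: (row index, column index, value).
    case class MatrixEntry(i: Long, j: Long, value: Double)
    
    trait RDDMatrix extends Serializable {
      def numRows(): Long
      def numCols(): Long
    }
    
    class CoordinateRDDMatrix(val entries: RDD[MatrixEntry]) extends RDDMatrix {
      // Computing dimensions triggers Spark jobs, hence the "RDD" in the name.
      override def numRows(): Long = entries.map(_.i).reduce((a, b) => math.max(a, b)) + 1
      override def numCols(): Long = entries.map(_.j).reduce((a, b) => math.max(a, b)) + 1
    }
    ```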
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#296 from mengxr/mat and squashes the following commits:
    
    24d8294 [Xiangrui Meng] fix for groupBy returning Iterable
    bfc2b26 [Xiangrui Meng] merge master
    8e4f1f5 [Xiangrui Meng] Merge branch 'master' into mat
    0135193 [Xiangrui Meng] address Reza's comments
    03cd7e1 [Xiangrui Meng] add pca/gram to IndexedRowMatrix add toBreeze to DistributedMatrix for test simplify tests
    b177ff1 [Xiangrui Meng] address Matei's comments
    be119fe [Xiangrui Meng] rename m/n to numRows/numCols for local matrix add tests for matrices
    b881506 [Xiangrui Meng] rename SparkPCA/SVD to TallSkinnyPCA/SVD
    e7d0d4a [Xiangrui Meng] move IndexedRDDMatrixRow to IndexedRowRDDMatrix
    0d1491c [Xiangrui Meng] fix test errors
    a85262a [Xiangrui Meng] rename RDDMatrixRow to IndexedRDDMatrixRow
    b8b6ac3 [Xiangrui Meng] Remove old code
    4cf679c [Xiangrui Meng] port pca to RowRDDMatrix, and add multiply and covariance
    7836e2f [Xiangrui Meng] initial refactoring of matrices backed by RDDs
    mengxr authored and pwendell committed Apr 9, 2014
    Configuration menu
    Copy the full SHA
    9689b66 View commit details
    Browse the repository at this point in the history
  5. SPARK-1093: Annotate developer and experimental API's

    This patch marks some existing classes as private[spark] and adds two types of API annotations:
    - `EXPERIMENTAL API` = experimental user-facing module
    - `DEVELOPER API - UNSTABLE` = developer-facing API that might change
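
    A small usage sketch, assuming these end up as Scala annotations in `org.apache.spark.annotation` as the squashed commits suggest (the annotated class names below are placeholders):
    
    ```scala
    import org.apache.spark.annotation.{DeveloperApi, Experimental}
    
    @Experimental
    class ApproximateCounts    // experimental, user-facing; semantics may change
    
    @DeveloperApi
    trait MetricsSourceLike    // developer-facing; no stability guarantees
    ```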
    
    There is some discussion of the different mechanisms for doing this here:
    https://issues.apache.org/jira/browse/SPARK-1081
    
    I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility.
    
    A few notes here:
    - In the past we've been inconsistent with the visibility of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly create them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is that users can't subclass them.
    - Noted that compression and serialization formats don't have to be wire compatible across versions.
    - Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly.
    - Metrics sources are made private - users only interact with them through Spark's reflection
    
    Author: Patrick Wendell <[email protected]>
    Author: Andrew Or <[email protected]>
    
    Closes apache#274 from pwendell/private-apis and squashes the following commits:
    
    44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis
    042c803 [Patrick Wendell] spark.annotations -> spark.annotation
    bfe7b52 [Patrick Wendell] Adding experimental for approximate counts
    8d0c873 [Patrick Wendell] Warning in SparkEnv
    99b223a [Patrick Wendell] Cleaning up annotations
    e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations
    982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs
    a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations
    c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi
    0d48908 [Andrew Or] Comments and new lines (minor)
    f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug
    99192ef [Andrew Or] Dynamically add badges based on annotations
    824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs
    037755c [Patrick Wendell] Some changes after working with andrew or
    f7d124f [Patrick Wendell] Small fixes
    c318b24 [Patrick Wendell] Use CSS styles
    e4c76b9 [Patrick Wendell] Logging
    f390b13 [Patrick Wendell] Better visibility for workaround constructors
    d6b0afd [Patrick Wendell] Small chang to existing constructor
    403ba52 [Patrick Wendell] Style fix
    870a7ba [Patrick Wendell] Work around for SI-8479
    7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD
    4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL
    c581dce [Patrick Wendell] Changes after building against Shark.
    8452309 [Patrick Wendell] Style fixes
    1ed27d2 [Patrick Wendell] Formatting and coloring of badges
    cd7a465 [Patrick Wendell] Code review feedback
    2f706f1 [Patrick Wendell] Don't use floats
    542a736 [Patrick Wendell] Small fixes
    cf23ec6 [Patrick Wendell] Marking GraphX as alpha
    d86818e [Patrick Wendell] Another naming change
    5a76ed6 [Patrick Wendell] More visiblity clean-up
    42c1f09 [Patrick Wendell] Using better labels
    9d48cbf [Patrick Wendell] Initial pass
    pwendell committed Apr 9, 2014
    Configuration menu
    Copy the full SHA
    87bd1f9 View commit details
    Browse the repository at this point in the history
  6. [SPARK-1357] [MLLIB] Annotate developer and experimental APIs

    Annotate developer and experimental APIs in MLlib.
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#298 from mengxr/api and squashes the following commits:
    
    13390e8 [Xiangrui Meng] Merge branch 'master' into api
    dc4cbb3 [Xiangrui Meng] mark distribute matrices experimental
    6b9f8e2 [Xiangrui Meng] add Experimental annotation
    8773d0d [Xiangrui Meng] add DeveloperApi annotation
    da31733 [Xiangrui Meng] update developer and experimental tags
    555e0fe [Xiangrui Meng] Merge branch 'master' into api
    ef1a717 [Xiangrui Meng] mark some constructors private add default parameters to JavaDoc
    00ffbcc [Xiangrui Meng] update tree API annotation
    0b674fa [Xiangrui Meng] mark decision tree APIs
    86b9e34 [Xiangrui Meng] one pass over APIs of GLMs, NaiveBayes, and ALS
    f21d862 [Xiangrui Meng] Merge branch 'master' into api
    2b133d6 [Xiangrui Meng] intial annotation of developer and experimental apis
    mengxr authored and pwendell committed Apr 9, 2014
    Configuration menu
    Copy the full SHA
    bde9cc1 View commit details
    Browse the repository at this point in the history
  7. SPARK-1407 drain event queue before stopping event logger

    Author: Kan Zhang <[email protected]>
    
    Closes apache#366 from kanzhang/SPARK-1407 and squashes the following commits:
    
    cd0629f [Kan Zhang] code refactoring and adding test
    b073ee6 [Kan Zhang] SPARK-1407 drain event queue before stopping event logger
    kanzhang authored and pwendell committed Apr 9, 2014
    Configuration menu
    Copy the full SHA
    eb5f2b6 View commit details
    Browse the repository at this point in the history

Commits on Apr 10, 2014

  1. [SPARK-1357 (fix)] remove empty line after :: DeveloperApi/Experiment…

    …al ::
    
    Remove empty line after :: DeveloperApi/Experimental :: in comments to make the original doc show up in the preview of the generated html docs. Thanks @andrewor14 !
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#373 from mengxr/api and squashes the following commits:
    
    9c35bdc [Xiangrui Meng] remove the empty line after :: DeveloperApi/Experimental ::
    mengxr authored and pwendell committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    0adc932 View commit details
    Browse the repository at this point in the history
  2. SPARK-729: Closures not always serialized at capture time

    [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured.  Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created).  There are a few possible approaches to solving this problem and this PR will discuss some of them.  The approach I took has the advantage of being simple, obviously correct, and minimally-invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing.
    
    ## What I did
    
    The basic approach I took depends on the work I did for apache#143, and so this PR is based atop that.  Specifically: apache#143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for an job involving that closure to be scheduled).  Thus non-serializable closure exceptions will be triggered by the line defining the closure rather than triggered where the closure is used.
    
    Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in apache#143 is creating a serialized closure as part of `ClosureCleaner.clean`.  `clean` currently modifies its argument, and the method in `SparkContext` that wraps it returns a value: a reference to the modified-in-place argument.  This branch modifies `ClosureCleaner.clean` so that it returns a value:  if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument; therefore it is returning a closure with an environment captured at cleaning time.  `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument.
    
    I've added tests for this behavior (777a1bc).  The pull request as it stands, given the changes in apache#143, is nearly trivial.  There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important.  I think this is a fine way to solve this problem, but it's not perfect.
    
    ## What we might want to do
    
    The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure.  If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later.  By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches.  Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like:
    
    ```scala
    package org.apache.spark.util
    
    import java.nio.ByteBuffer
    import scala.reflect.ClassManifest
    
    sealed abstract class ClosureBox[T] { def func: T }
    final case class RawClosure[T](func: T) extends ClosureBox[T] {}
    final case class CleanedClosure[T](func: T) extends ClosureBox[T] {}
    final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T] {}
    
    object ClosureBoxImplicits {
      implicit def closureBoxFromFunc[T <: AnyRef](fun: T) = new RawClosure[T](fun)
    }
    ```
    
    With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`.  A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer).  Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time).  Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate.  Tracking this sort of state in the type system seems like the right thing to do to me.
    
    ### Why we might not want to do that
    
    _It's pretty invasive._  Every function type used by every `RDD` subclass would have to change to reflect that they expected a `ClosureBox[T=>U]` instead of a `T=>U`.  This obscures what's going on and is not a little ugly.  Although I really like the idea of using the type system to enforce the clean-or-serialize once discipline, it might not be worth adding another layer of types (even if we could hide some of the extra boilerplate with judicious application of implicit conversions).
    
    _It statically guarantees a property whose absence is unlikely to cause any serious problems as it stands._  It appears that all closures are currently dynamically cleaned once and it's not obvious that repeated closure-cleaning is likely to be a problem in the future.  Furthermore, serializing closures is relatively cheap, so doing it once to check for serialization and once again to actually ship them across the wire doesn't seem like a big deal.
    
    Taken together, these seem like a high price to pay for statically guaranteeing that closures are operated upon only once.
    
    ## Other possibilities
    
    I felt like the serialize-and-deserialize approach was best due to its obvious simplicity.  But it would be possible to do a more sophisticated transformation within `ClosureCleaner.clean`.  It might also be possible for `clean` to modify its argument in a way so that whether or not a given closure had been cleaned would be apparent upon inspection; this would buy us some of the operational benefits of the `ClosureBox` approach but not the static cleanliness.
    
    I'm interested in any feedback or discussion on whether or not the problems with the type-based approach indeed outweigh the advantage, as well as of approaches to this issue and to closure handling in general.
    
    Author: William Benton <[email protected]>
    
    Closes apache#189 from willb/spark-729 and squashes the following commits:
    
    f4cafa0 [William Benton] Stylistic changes and cleanups
    b3d9c86 [William Benton] Fixed style issues in tests
    9b56ce0 [William Benton] Added array-element capture test
    97e9d91 [William Benton] Split closure-serializability failure tests
    12ef6e3 [William Benton] Skip proactive closure capture for runJob
    8ee3ee7 [William Benton] Predictable closure environment capture
    12c63a7 [William Benton] Added tests for variable capture in closures
    d6e8dd6 [William Benton] Don't check serializability of DStream transforms.
    4ecf841 [William Benton] Make proactive serializability checking optional.
    d8df3db [William Benton] Adds proactive closure-serializablilty checking
    21b4b06 [William Benton] Test cases for SPARK-897.
    d5947b3 [William Benton] Ensure assertions in Graph.apply are asserted.
    willb authored and mateiz committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    8ca3b2b View commit details
    Browse the repository at this point in the history
  3. SPARK-1446: Spark examples should not do a System.exit

    Spark examples should exit cleanly using the SparkContext.stop() method, rather than System.exit.
    System.exit can cause issues like the one in SPARK-1407.
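
    A minimal sketch of the recommended pattern (the object name and inline master setting are mine, added only to keep the sketch self-contained): end the example by stopping the SparkContext instead of calling System.exit.
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    object PoliteExitExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("PoliteExitExample"))
        println(sc.parallelize(1 to 100).reduce(_ + _))
        sc.stop()  // clean shutdown; no System.exit needed
      }
    }
    ```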
    
    Author: Sandeep <[email protected]>
    
    Closes apache#370 from techaddict/1446 and squashes the following commits:
    
    e9234cf [Sandeep] SPARK-1446: Spark examples should not do a System.exit Spark examples should exit nice using SparkContext.stop() method, rather than System.exit System.exit can cause issues like in SPARK-1407
    techaddict authored and pwendell committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    e55cc4b View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    e6d4a74 View commit details
    Browse the repository at this point in the history
  5. Fix SPARK-1413: Parquet messes up stdout and stdin when used in Spark…

    … REPL
    
    Author: witgo <[email protected]>
    
    Closes apache#325 from witgo/SPARK-1413 and squashes the following commits:
    
    e57cd8e [witgo] use scala reflection to access and call the SLF4JBridgeHandler  methods
    45c8f40 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    5e35d87 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    0d5f819 [witgo] review commit
    45e5b70 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    fa69dcf [witgo] Merge branch 'master' into SPARK-1413
    3c98dc4 [witgo] Merge branch 'master' into SPARK-1413
    38160cb [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    ba09bcd [witgo] remove set the parquet log level
    a63d574 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    5231ecd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    3feb635 [witgo] parquet logger use parent handler
    fa00d5d [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    8bb6ffd [witgo] enableLogForwarding note fix
    edd9630 [witgo]  move to
    f447f50 [witgo] merging master
    5ad52bd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    76670c1 [witgo] review commit
    70f3c64 [witgo] Fix SPARK-1413
    witgo authored and pwendell committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    a74fbbb View commit details
    Browse the repository at this point in the history
  6. [SPARK-1276] Add a HistoryServer to render persisted UI

    The new feature of event logging, introduced in apache#42, allows the user to persist the details of his/her Spark application to storage, and later replay these events to reconstruct an after-the-fact SparkUI.
    Currently, however, a persisted UI can only be rendered through the standalone Master. This greatly limits the use case of this new feature as many people also run Spark on Yarn / Mesos.
    
    This PR introduces a new entity called the HistoryServer, which, given a log directory, keeps track of all completed applications independently of a Spark Master. Unlike the Master, the HistoryServer need not be running while the application is still running. It is relatively lightweight in that it only maintains static information about applications and performs no scheduling.
    
    To quickly test it out, generate event logs with ```spark.eventLog.enabled=true``` and run ```sbin/start-history-server.sh <log-dir-path>```. Your HistoryServer awaits on port 18080.
    
    Comments and feedback are most welcome.
    
    ---
    
    A few other changes introduced in this PR include refactoring the WebUI interface, which is beginning to have a lot of duplicate code now that we have added more functionality to it. Two new SparkListenerEvents have been introduced (SparkListenerApplicationStart/End) to keep track of application name and start/finish times. This PR also clarifies the semantics of the ReplayListenerBus introduced in apache#42.
    
    A potential TODO in the future (not part of this PR) is to render live applications in addition to just completed applications. This is useful when applications fail, a condition that our current HistoryServer does not handle unless the user manually signals application completion (by creating the APPLICATION_COMPLETION file). Handling live applications becomes significantly more challenging, however, because it is now necessary to render the same SparkUI multiple times. To avoid reading the entire log every time, which is inefficient, we must handle reading the log from where we previously left off, but this becomes fairly complicated because we must deal with the arbitrary behavior of each input stream.
    
    Author: Andrew Or <[email protected]>
    
    Closes apache#204 from andrewor14/master and squashes the following commits:
    
    7b7234c [Andrew Or] Finished -> Completed
    b158d98 [Andrew Or] Address Patrick's comments
    69d1b41 [Andrew Or] Do not block on posting SparkListenerApplicationEnd
    19d5dd0 [Andrew Or] Merge github.com:apache/spark
    f7f5bf0 [Andrew Or] Make history server's web UI port a Spark configuration
    2dfb494 [Andrew Or] Decouple checking for application completion from replaying
    d02dbaa [Andrew Or] Expose Spark version and include it in event logs
    2282300 [Andrew Or] Add documentation for the HistoryServer
    567474a [Andrew Or] Merge github.com:apache/spark
    6edf052 [Andrew Or] Merge github.com:apache/spark
    19e1fb4 [Andrew Or] Address Thomas' comments
    248cb3d [Andrew Or] Limit number of live applications + add configurability
    a3598de [Andrew Or] Do not close file system with ReplayBus + fix bind address
    bc46fc8 [Andrew Or] Merge github.com:apache/spark
    e2f4ff9 [Andrew Or] Merge github.com:apache/spark
    050419e [Andrew Or] Merge github.com:apache/spark
    81b568b [Andrew Or] Fix strange error messages...
    0670743 [Andrew Or] Decouple page rendering from loading files from disk
    1b2f391 [Andrew Or] Minor changes
    a9eae7e [Andrew Or] Merge branch 'master' of github.com:apache/spark
    d5154da [Andrew Or] Styling and comments
    5dbfbb4 [Andrew Or] Merge branch 'master' of github.com:apache/spark
    60bc6d5 [Andrew Or] First complete implementation of HistoryServer (only for finished apps)
    7584418 [Andrew Or] Report application start/end times to HistoryServer
    8aac163 [Andrew Or] Add basic application table
    c086bd5 [Andrew Or] Add HistoryServer and scripts ++ Refactor WebUI interface
    andrewor14 authored and pwendell committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    79820fe View commit details
    Browse the repository at this point in the history
  7. SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 …

    …instead of complaining
    
    Author: Sandeep <[email protected]>
    
    Closes apache#356 from techaddict/1428 and squashes the following commits:
    
    3bdf5f6 [Sandeep] SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
    techaddict authored and mateiz committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    3bd3129 View commit details
    Browse the repository at this point in the history
  8. Revert "SPARK-1433: Upgrade Mesos dependency to 0.17.0"

    This reverts commit 12c077d.
    pwendell committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    7b52b66 View commit details
    Browse the repository at this point in the history
  9. Update tuning.md

    http://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16
    
    Author: Andrew Ash <[email protected]>
    
    Closes apache#384 from ash211/patch-2 and squashes the following commits:
    
    da1b0be [Andrew Ash] Update tuning.md
    ash211 authored and rxin committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    f046662 View commit details
    Browse the repository at this point in the history
  10. Remove Unnecessary Whitespace's

    Stacking these together in one commit; otherwise they show up chunk by chunk in different commits.
    
    Author: Sandeep <[email protected]>
    
    Closes apache#380 from techaddict/white_space and squashes the following commits:
    
    b58f294 [Sandeep] Remove Unnecessary Whitespace's
    techaddict authored and pwendell committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    930b70f View commit details
    Browse the repository at this point in the history
  11. [SQL] Improve column pruning in the optimizer.

    Author: Michael Armbrust <[email protected]>
    
    Closes apache#378 from marmbrus/columnPruning and squashes the following commits:
    
    779da56 [Michael Armbrust] More consistent naming.
    1a4e9ea [Michael Armbrust] More comments.
    2f4e7b9 [Michael Armbrust] Improve column pruning in the optimizer.
    marmbrus authored and rxin committed Apr 10, 2014
    Configuration menu
    Copy the full SHA
    f99401a View commit details
    Browse the repository at this point in the history

Commits on Apr 11, 2014

  1. SPARK-1202 - Add a "cancel" button in the UI for stages

    Author: Sundeep Narravula <[email protected]>
    Author: Sundeep Narravula <[email protected]>
    
    Closes apache#246 from sundeepn/uikilljob and squashes the following commits:
    
    5fdd0e2 [Sundeep Narravula] Fix test string
    f6fdff1 [Sundeep Narravula] Format fix; reduced line size to less than 100 chars
    d1daeb9 [Sundeep Narravula] Incorporating review comments.
    8d97923 [Sundeep Narravula] Ability to kill jobs thru the UI. This behavior can be turned on be settings the following variable: spark.ui.killEnabled=true (default=false) Adding DAGScheduler event StageCancelled and corresponding handlers. Added cancellation reason to handlers.
    Sundeep Narravula authored and pwendell committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    2c55783 View commit details
    Browse the repository at this point in the history
  2. Set spark.executor.uri from environment variable (needed by Mesos)

    The Mesos backend uses this property when setting up a slave process. It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find anything analogous for pyspark.
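    
    For illustration, a sketch in Scala of the same idea (the environment variable name is an assumption here; only the `spark.executor.uri` property comes from the description above):
    
    ```
    import org.apache.spark.SparkConf
    
    // If the executor URI is provided via the environment (as on Mesos clusters),
    // propagate it into the Spark configuration before creating the context.
    val conf = new SparkConf().setAppName("mesos-executor-uri-example")
    sys.env.get("SPARK_EXECUTOR_URI").foreach(uri => conf.set("spark.executor.uri", uri))
    ```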
    
    Author: Ivan Wick <[email protected]>
    
    This patch had conflicts when merged, resolved by
    Committer: Matei Zaharia <[email protected]>
    
    Closes apache#311 from ivanwick/master and squashes the following commits:
    
    da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable (needed by Mesos)
    ivanwick authored and mateiz committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    5cd11d5 View commit details
    Browse the repository at this point in the history
  3. Add Spark v0.9.1 to ec2 launch script and use it as the default

    Mainly ported from branch-0.9.
    
    Author: Harvey Feng <[email protected]>
    
    Closes apache#385 from harveyfeng/0.9.1-ec2 and squashes the following commits:
    
    769ac2f [Harvey Feng] Add Spark v0.9.1 to ec2 launch script and use it as the default
    harveyfeng authored and pwendell committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    7b4203a View commit details
    Browse the repository at this point in the history
  4. SPARK-1202: Improvements to task killing in the UI.

    1. Adds a separate endpoint for the killing logic that is outside of a page.
    2. Narrows the scope of the killingEnabled tracking.
    3. Some style improvements.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#386 from pwendell/kill-link and squashes the following commits:
    
    8efe02b [Patrick Wendell] Improvements to task killing in the UI.
    pwendell committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    44f654e View commit details
    Browse the repository at this point in the history
  5. SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken

    Author: Thomas Graves <[email protected]>
    
    Closes apache#344 from tgravescs/SPARK-1417 and squashes the following commits:
    
    c450b5f [Thomas Graves] fix test
    e1c1d7e [Thomas Graves] add missing $ to appUIAddress
    e982ddb [Thomas Graves] use appUIHostPort in appUIAddress
    0803ec2 [Thomas Graves] Review comment updates - remove extra newline, simplify assert in test
    658a8ec [Thomas Graves] Add a appUIHostPort routine
    0614208 [Thomas Graves] Fix test
    2a6b1b7 [Thomas Graves] SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken
    tgravescs authored and Mridul Muralidharan committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    446bb34 View commit details
    Browse the repository at this point in the history
  6. Some clean up in build/docs

    (a) Deleted an outdated line from the docs
    (b) Removed a workaround that is no longer necessary given the Mesos version bump.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#382 from pwendell/maven-clean and squashes the following commits:
    
    f0447fa [Patrick Wendell] Minor doc clean-up
    pwendell committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    98225a6 View commit details
    Browse the repository at this point in the history
  7. [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificatio…

    …nMetrics
    
    This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from apache#136. It also contains refactoring of apache#160 for binary classification evaluation.
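    
    As a local illustration of the area-under-curve computation (the PR does the equivalent over an RDD using sliding pairs):
    
    ```
    // Trapezoidal area under a curve given as (x, y) points sorted by x.
    def areaUnderCurve(curve: Seq[(Double, Double)]): Double =
      curve.sliding(2).collect {
        case Seq((x1, y1), (x2, y2)) => (y1 + y2) / 2.0 * (x2 - x1)
      }.sum
    
    // The unit triangle under y = x has area 0.5:
    println(areaUnderCurve(Seq((0.0, 0.0), (0.5, 0.5), (1.0, 1.0))))  // 0.5
    ```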
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#364 from mengxr/auc and squashes the following commits:
    
    a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
    3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
    fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
    b1b7dab [Xiangrui Meng] fix code styles
    9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
    ca31da5 [Xiangrui Meng] remove PredictionAndResponse
    3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary
    8f78958 [Xiangrui Meng] add PredictionAndResponse
    dda82d5 [Xiangrui Meng] add confusion matrix
    aa7e278 [Xiangrui Meng] add initial version of binary classification evaluator
    221ebce [Xiangrui Meng] add a new test to sliding
    a920865 [Xiangrui Meng] Merge branch 'sliding' into auc
    a9b250a [Xiangrui Meng] move sliding to mllib
    cab9a52 [Xiangrui Meng] use last for the last element
    db6cb30 [Xiangrui Meng] remove unnecessary toSeq
    9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]]
    284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD
    c1c6c22 [Xiangrui Meng] add AreaUnderCurve
    65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc
    5ee6001 [Xiangrui Meng] add TODO
    d2a600d [Xiangrui Meng] add sliding to rdd
    mengxr authored and mateiz committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    f5ace8d View commit details
    Browse the repository at this point in the history
  8. HOTFIX: Ignore python metastore files in RAT checks.

    This was causing some errors with pull request tests.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#393 from pwendell/hotfix and squashes the following commits:
    
    6201dd3 [Patrick Wendell] HOTFIX: Ignore python metastore files in RAT checks.
    pwendell committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    6a0f8e3 View commit details
    Browse the repository at this point in the history

Commits on Apr 12, 2014

  1. [FIX] make coalesce test deterministic in RDDSuite

    Make coalesce test deterministic by setting pre-defined seeds. (Saw random failures in other PRs.)
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#387 from mengxr/fix-random and squashes the following commits:
    
    59bc16f [Xiangrui Meng] make coalesce test deterministic in RDDSuite
    mengxr authored and pwendell committed Apr 12, 2014
    Configuration menu
    Copy the full SHA
    7038b00 View commit details
    Browse the repository at this point in the history
  2. [WIP] [SPARK-1328] Add vector statistics

    With the new vector system in MLlib, it is useful to add some new APIs to process `RDD[Vector]`. Besides, the former implementation of `computeStat` is not numerically stable: it can lose precision and may produce `NaN` in scientific computing, as described in [SPARK-1328](https://spark-project.atlassian.net/browse/SPARK-1328).
    
    APIs contain:
    
    * rowMeans(): RDD[Double]
    * rowNorm2(): RDD[Double]
    * rowSDs(): RDD[Double]
    * colMeans(): Vector
    * colMeans(size: Int): Vector
    * colNorm2(): Vector
    * colNorm2(size: Int): Vector
    * colSDs(): Vector
    * colSDs(size: Int): Vector
    * maxOption((Vector, Vector) => Boolean): Option[Vector]
    * minOption((Vector, Vector) => Boolean): Option[Vector]
    * rowShrink(): RDD[Vector]
    * colShrink(): RDD[Vector]
    
    This is a work in progress, and more APIs will be added for `LabeledPoint`. Moreover, the implicit declaration will move from `MLUtils` to `MLContext` later.
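    
    To illustrate the column-wise computations listed above, here is a minimal sketch of a colMeans-style aggregation using plain `Array[Double]` rows rather than the MLlib `Vector` type added by this PR:
    
    ```
    import org.apache.spark.rdd.RDD
    
    // Column means over an RDD of dense rows; numCols is assumed known up front.
    def colMeans(rows: RDD[Array[Double]], numCols: Int): Array[Double] = {
      val (sums, count) = rows.aggregate((new Array[Double](numCols), 0L))(
        { case ((acc, n), row) =>
          var i = 0
          while (i < numCols) { acc(i) += row(i); i += 1 }
          (acc, n + 1)
        },
        { case ((a, na), (b, nb)) =>
          var i = 0
          while (i < numCols) { a(i) += b(i); i += 1 }
          (a, na + nb)
        })
      sums.map(_ / count)
    }
    ```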
    
    Author: Xusen Yin <[email protected]>
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#268 from yinxusen/vector-statistics and squashes the following commits:
    
    d61363f [Xusen Yin] rebase to latest master
    16ae684 [Xusen Yin] fix minor error and remove useless method
    10cf5d3 [Xusen Yin] refine some return type
    b064714 [Xusen Yin] remove computeStat in MLUtils
    cbbefdb [Xiangrui Meng] update multivariate statistical summary interface and clean tests
    4eaf28a [Xusen Yin] merge VectorRDDStatistics into RowMatrix
    48ee053 [Xusen Yin] fix minor error
    e624f93 [Xusen Yin] fix scala style error
    1fba230 [Xusen Yin] merge while loop together
    69e1f37 [Xusen Yin] remove lazy eval, and minor memory footprint
    548e9de [Xusen Yin] minor revision
    86522c4 [Xusen Yin] add comments on functions
    dc77e38 [Xusen Yin] test sparse vector RDD
    18cf072 [Xusen Yin] change def to lazy val to make sure that the computations in function be evaluated only once
    f7a3ca2 [Xusen Yin] fix the corner case of maxmin
    967d041 [Xusen Yin] full revision with Aggregator class
    138300c [Xusen Yin] add new Aggregator class
    1376ff4 [Xusen Yin] rename variables and adjust code
    4a5c38d [Xusen Yin] add scala doc, refine code and comments
    036b7a5 [Xusen Yin] fix the bug of Nan occur
    f6e8e9a [Xusen Yin] add sparse vectors test
    4cfbadf [Xusen Yin] fix bug of min max
    4e4fbd1 [Xusen Yin] separate seqop and combop out as independent functions
    a6d5a2e [Xusen Yin] rewrite for only computing non-zero elements
    3980287 [Xusen Yin] rename variables
    62a2c3e [Xusen Yin] use axpy and in-place if possible
    9a75ebd [Xusen Yin] add case class to wrap return values
    d816ac7 [Xusen Yin] remove useless APIs
    c4651bb [Xusen Yin] remove row-wise APIs and refine code
    1338ea1 [Xusen Yin] all-in-one version test passed
    cc65810 [Xusen Yin] add parallel mean and variance
    9af2e95 [Xusen Yin] refine the code style
    ad6c82d [Xusen Yin] add shrink test
    e09d5d2 [Xusen Yin] add scala docs and refine shrink method
    8ef3377 [Xusen Yin] pass all tests
    28cf060 [Xusen Yin] fix error of column means
    54b19ab [Xusen Yin] add new API to shrink RDD[Vector]
    8c6c0e1 [Xusen Yin] add basic statistics
    yinxusen authored and pwendell committed Apr 12, 2014
    Configuration menu
    Copy the full SHA
    fdfb45e View commit details
    Browse the repository at this point in the history
  3. Update WindowedDStream.scala

    Update the exception message thrown when windowDuration is not a multiple of parent.slideDuration.
    
    Author: baishuo(白硕) <[email protected]>
    
    Closes apache#390 from baishuo/windowdstream and squashes the following commits:
    
    533c968 [baishuo(白硕)] Update WindowedDStream.scala
    baishuo authored and pwendell committed Apr 12, 2014
    Configuration menu
    Copy the full SHA
    aa8bb11 View commit details
    Browse the repository at this point in the history
  4. SPARK-1057 (alternative) Remove fastutil

    (This is for discussion at this point -- I'm not suggesting this should be committed.)
    
    This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.
    
    Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.
    
    The rest is using `OpenHashMap` and `OpenHashSet`.  These are now written in terms of more scala-like operations.
    
    `OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:
    
    - It is no longer private
    - The key must be a `ClassTag`
    - Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`
    
    It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.
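    
    To illustrate why the key now needs a `ClassTag`: the map backs its keys with a typed array, and constructing an `Array[K]` at runtime requires `ClassTag` evidence. A stripped-down sketch (not Spark's actual implementation, and without resizing, so capacity must exceed the number of distinct keys):
    
    ```
    import scala.reflect.ClassTag
    
    // Minimal open-addressing map skeleton: the typed key array is what forces ClassTag[K].
    class TinyOpenHashMap[K: ClassTag, V](capacity: Int) {
      private val keys = new Array[K](capacity)       // requires ClassTag[K]
      private val values = new Array[Any](capacity)
      private val occupied = new Array[Boolean](capacity)
    
      private def index(k: K): Int = (k.hashCode & 0x7fffffff) % capacity
    
      def update(k: K, v: V): Unit = {
        var i = index(k)
        while (occupied(i) && keys(i) != k) i = (i + 1) % capacity  // linear probing
        keys(i) = k; values(i) = v; occupied(i) = true
      }
    
      def get(k: K): Option[V] = {
        var i = index(k)
        while (occupied(i)) {
          if (keys(i) == k) return Some(values(i).asInstanceOf[V])
          i = (i + 1) % capacity
        }
        None
      }
    }
    ```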
    
    But what about those last changes?
    
    Author: Sean Owen <[email protected]>
    
    Closes apache#266 from srowen/SPARK-1057-alternate and squashes the following commits:
    
    2601129 [Sean Owen] Fix Map return type error not previously caught
    ec65502 [Sean Owen] Updates from matei's review
    00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes
    srowen authored and pwendell committed Apr 12, 2014
    Configuration menu
    Copy the full SHA
    165e06a View commit details
    Browse the repository at this point in the history
  5. [SPARK-1386] Web UI for Spark Streaming

    When debugging Spark Streaming applications, it is necessary to monitor certain metrics that are not shown in the Spark application UI. For example, what is the average processing time of batches? What is the scheduling delay? Is the system able to process data as fast as it is receiving it? How many records am I receiving through my receivers?
    
    While the StreamingListener interface introduced in 0.9 provided some of this information, it could only be accessed programmatically. A UI that shows information specific to streaming applications is necessary for easier debugging. This PR introduces such a UI. It shows various statistics related to the streaming application. Here is a screenshot of the UI running on my local machine.
    
    http://i.imgur.com/1ooDGhm.png
    
    This UI is integrated into the Spark UI running at 4040.
    
    Author: Tathagata Das <[email protected]>
    Author: Andrew Or <[email protected]>
    
    Closes apache#290 from tdas/streaming-web-ui and squashes the following commits:
    
    fc73ca5 [Tathagata Das] Merge pull request apache#9 from andrewor14/ui-refactor
    642dd88 [Andrew Or] Merge SparkUISuite.scala into UISuite.scala
    eb30517 [Andrew Or] Merge github.com:apache/spark into ui-refactor
    f4f4cbe [Tathagata Das] More minor fixes.
    34bb364 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
    252c566 [Tathagata Das] Merge pull request apache#8 from andrewor14/ui-refactor
    e038b4b [Tathagata Das] Addressed Patrick's comments.
    125a054 [Andrew Or] Disable serving static resources with gzip
    90feb8d [Andrew Or] Address Patrick's comments
    89dae36 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
    72fe256 [Tathagata Das] Merge pull request apache#6 from andrewor14/ui-refactor
    2fc09c8 [Tathagata Das] Added binary check exclusions
    aa396d4 [Andrew Or] Rename tabs and pages (No more IndexPage.scala)
    f8e1053 [Tathagata Das] Added Spark and Streaming UI unit tests.
    caa5e05 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
    585cd65 [Tathagata Das] Merge pull request apache#5 from andrewor14/ui-refactor
    914b8ff [Tathagata Das] Moved utils functions to UIUtils.
    548c98c [Andrew Or] Wide refactoring of WebUI, UITab, and UIPage (see commit message)
    6de06b0 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui
    ee6543f [Tathagata Das] Minor changes based on Andrew's comments.
    fa760fe [Tathagata Das] Fixed long line.
    1c0bcef [Tathagata Das] Refactored streaming UI into two files.
    1af239b [Tathagata Das] Changed streaming UI to attach itself as a tab with the Spark UI.
    827e81a [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
    168fe86 [Tathagata Das] Merge pull request #2 from andrewor14/ui-refactor
    3e986f8 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui
    c78c92d [Andrew Or] Remove outdated comment
    8f7323b [Andrew Or] End of file new lines, indentation, and imports (minor)
    0d61ee8 [Andrew Or] Merge branch 'streaming-web-ui' of github.com:tdas/spark into ui-refactor
    9a48fa1 [Andrew Or] Allow adding tabs to SparkUI dynamically + add example
    61358e3 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-web-ui
    53be2c5 [Tathagata Das] Minor style updates.
    ed25dfc [Andrew Or] Generalize SparkUI header to display tabs dynamically
    a37ad4f [Andrew Or] Comments, imports and formatting (minor)
    cd000b0 [Andrew Or] Merge github.com:apache/spark into ui-refactor
    7d57444 [Andrew Or] Refactoring the UI interface to add flexibility
    aef4dd5 [Tathagata Das] Added Apache licenses.
    db27bad [Tathagata Das] Added last batch processing time to StreamingUI.
    4d86e98 [Tathagata Das] Added basic stats to the StreamingUI and refactored the UI to a Page to make it easier to transition to using SparkUI later.
    93f1c69 [Tathagata Das] Added network receiver information to the Streaming UI.
    56cc7fb [Tathagata Das] First cut implementation of Streaming UI.
    tdas authored and pwendell committed Apr 12, 2014
    Configuration menu
    Copy the full SHA
    6aa08c3 View commit details
    Browse the repository at this point in the history
  6. [Fix apache#204] Update out-dated comments

    This PR is self-explanatory.
    
    Author: Andrew Or <[email protected]>
    
    Closes apache#381 from andrewor14/master and squashes the following commits:
    
    3e8dde2 [Andrew Or] Fix comments for apache#204
    andrewor14 authored and pwendell committed Apr 12, 2014
    Configuration menu
    Copy the full SHA
    c2d160f View commit details
    Browse the repository at this point in the history

Commits on Apr 13, 2014

  1. [SPARK-1403] Move the class loader creation back to where it was in 0…

    ….9.0
    
    [SPARK-1403] I investigated why spark 0.9.0 loads fine on mesos while spark 1.0.0 fails. What I found was that in SparkEnv.scala, while creating the SparkEnv object, the current thread's classloader is null. But in 0.9.0, at the same place, it is set to org.apache.spark.repl.ExecutorClassLoader. I saw that apache@7edbea4 moved it to its current place. I moved it back and saw that 1.0.0 started working fine on mesos.
    
    I just created a minimal patch that allows me to run spark on mesos correctly. It seems like SecurityManager's creation needs to be taken into account for a correct fix. Also moving the creation of the serializer out of SparkEnv might be a part of the right solution. PTAL.
    
    Author: Bharath Bhushan <[email protected]>
    
    Closes apache#322 from manku-timma/spark-1403 and squashes the following commits:
    
    606c2b9 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
    ec8f870 [Bharath Bhushan] revert the logger change for java 6 compatibility as PR 334 is doing it
    728beca [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
    044027d [Bharath Bhushan] fix compile error
    6f260a4 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
    b3a053f [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
    04b9662 [Bharath Bhushan] add missing line
    4803c19 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
    f3c9a14 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
    42d3d6a [Bharath Bhushan] used code fragment from @ueshin to fix the problem in a better way
    89109d7 [Bharath Bhushan] move the class loader creation back to where it was in 0.9.0
    Bharath Bhushan authored and pwendell committed Apr 13, 2014
    Configuration menu
    Copy the full SHA
    ca11919 View commit details
    Browse the repository at this point in the history
  2. SPARK-1480: Clean up use of classloaders

    The Spark codebase is a bit fast-and-loose when accessing classloaders and this has caused a few bugs to surface in master.
    
    This patch defines some utility methods for accessing classloaders. This makes the intention when accessing a classloader much more explicit in the code and fixes a few cases where the wrong one was chosen.
    
    case (a) -> We want the classloader that loaded Spark
    case (b) -> We want the context class loader, or if not present, we want (a)
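    
    A minimal sketch of those two lookups (the helper names are illustrative, not necessarily the utility API added by this patch):
    
    ```
    object ClassLoaderUtils {
      // case (a): the classloader that loaded Spark (here, this utility class)
      def getSparkClassLoader: ClassLoader = getClass.getClassLoader
    
      // case (b): the thread's context classloader if set, otherwise fall back to (a)
      def getContextOrSparkClassLoader: ClassLoader =
        Option(Thread.currentThread().getContextClassLoader).getOrElse(getSparkClassLoader)
    }
    ```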
    
    This patch provides a better fix for SPARK-1403 (https://issues.apache.org/jira/browse/SPARK-1403) than the current work around, which it reverts. It also fixes a previously unreported bug that the `./spark-submit` script did not work for running with `local` master. It didn't work because the executor classloader did not properly delegate to the context class loader (if it is defined) and in local mode the context class loader is set by the `./spark-submit` script. A unit test is added for that case.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#398 from pwendell/class-loaders and squashes the following commits:
    
    b4a1a58 [Patrick Wendell] Minor clean up
    14f1272 [Patrick Wendell] SPARK-1480: Clean up use of classloaders
    pwendell committed Apr 13, 2014
    Configuration menu
    Copy the full SHA
    4bc07ee View commit details
    Browse the repository at this point in the history
  3. [SPARK-1415] Hadoop min split for wholeTextFiles()

    JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-1415).
    
    The new Hadoop `InputFormat` API does not provide the `minSplits` parameter, which makes the APIs of `HadoopRDD` and `NewHadoopRDD` inconsistent. This PR constructs compatible APIs.
    
    Though `minSplits` is deprecated by New Hadoop API, we think it is better to make APIs compatible here.
    
    **Note** that `minSplits` in `wholeTextFiles` can only be treated as a *suggestion*; the actual number of splits may end up smaller than `minSplits` because `isSplitable()` returns false.
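    
    A minimal usage sketch, assuming the hint is passed as the second argument to `SparkContext.wholeTextFiles` (the path below is illustrative):
    
    ```
    import org.apache.spark.SparkContext
    
    val sc = new SparkContext("local[2]", "whole-text-files-example")
    
    // Each record is (filePath, fileContent); the second argument is only a hint,
    // since individual files are never split.
    val files = sc.wholeTextFiles("hdfs:///data/small-docs", 8)
    files.map { case (path, content) => (path, content.length) }
      .collect()
      .foreach(println)
    ```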
    
    Author: Xusen Yin <[email protected]>
    
    Closes apache#376 from yinxusen/hadoop-min-split and squashes the following commits:
    
    76417f6 [Xusen Yin] refine comments
    c10af60 [Xusen Yin] refine comments and rewrite new class for wholeTextFile
    766d05b [Xusen Yin] refine Java API and comments
    4875755 [Xusen Yin] add minSplits for WholeTextFiles
    yinxusen authored and mateiz committed Apr 13, 2014
    Configuration menu
    Copy the full SHA
    037fe4d View commit details
    Browse the repository at this point in the history

Commits on Apr 14, 2014

  1. [BUGFIX] In-memory columnar storage bug fixes

    Fixed several bugs of in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass.
    
    @rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think?
    
    **UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed.
    
    Author: Cheng Lian <[email protected]>
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#374 from liancheng/inMemBugFix and squashes the following commits:
    
    6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite
    5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression
    882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan
    32cc9ce [Cheng Lian] Code style cleanup
    99382bf [Cheng Lian] Enable compression by default
    4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest
    d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway?
    ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax.
    1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows.
    e36cdd0 [Michael Armbrust] Spelling.
    2d0e168 [Michael Armbrust] Run Hive tests in-memory too.
    6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan
    c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan
    9c8fc40 [Cheng Lian] Disable compression by default
    e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize
    8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument
    036cd09 [Cheng Lian] Clean up unused imports
    44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account
    052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values
    95b3301 [Cheng Lian] Fixed bugs in IntegralDelta
    liancheng authored and pwendell committed Apr 14, 2014
    Configuration menu
    Copy the full SHA
    7dbca68 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    268b535 View commit details
    Browse the repository at this point in the history

Commits on Apr 15, 2014

  1. SPARK-1488. Resolve scalac feature warnings during build

    For your consideration: scalac currently notes a number of feature warnings during compilation:
    
    ```
    [warn] there were 65 feature warning(s); re-run with -feature for details
    ```
    
    Warnings are like:
    
    ```
    [warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
    [warn] by making the implicit value scala.language.implicitConversions visible.
    [warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
    [warn] or by setting the compiler option -language:implicitConversions.
    [warn] See the Scala docs for value scala.language.implicitConversions for a discussion
    [warn] why the feature should be explicitly enabled.
    [warn]   implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
    [warn]                ^
    ```
    
    scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used.
    
    This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.
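    
    For example, a file that defines an implicit conversion compiles cleanly once the feature is imported; this is a toy illustration of the pattern scalac suggests, not code from the PR:
    
    ```
    // Explicitly enable the language feature, as the warning suggests.
    import scala.language.implicitConversions
    
    object ImplicitExample {
      // Any implicit conversion in this scope now compiles without a feature warning.
      implicit def intToLabel(i: Int): String = s"item-$i"
    
      def main(args: Array[String]): Unit = {
        val label: String = 42   // uses the implicit conversion
        println(label)           // prints "item-42"
      }
    }
    ```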
    
    Author: Sean Owen <[email protected]>
    
    Closes apache#404 from srowen/SPARK-1488 and squashes the following commits:
    
    8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features.
    39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
    srowen authored and pwendell committed Apr 15, 2014
    Configuration menu
    Copy the full SHA
    0247b5c View commit details
    Browse the repository at this point in the history
  2. SPARK-1374: PySpark API for SparkSQL

    An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.
    
    ```
    from pyspark.context import SQLContext
    sqlCtx = SQLContext(sc)
    rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    srdd = sqlCtx.applySchema(rdd)
    sqlCtx.registerRDDAsTable(srdd, "table1")
    srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
    srdd2.collect()
    ```
    The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```
    
    Author: Ahir Reddy <[email protected]>
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#363 from ahirreddy/pysql and squashes the following commits:
    
    0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
    307d6e0 [Ahir Reddy] Style fix
    6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
    3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
    29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
    f2312c7 [Ahir Reddy] Moved everything into sql.py
    a19afe4 [Ahir Reddy] Doc fixes
    6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
    521ff6d [Ahir Reddy] Trying to get spark to build with hive
    ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
    ded03e7 [Ahir Reddy] Added doc test for HiveContext
    22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
    e4da06c [Ahir Reddy] Display message if hive is not built into spark
    227a0be [Michael Armbrust] Update API links. Fix Hive example.
    58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
    4285340 [Michael Armbrust] Fix building of Hive API Docs.
    38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
    337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
    40491c9 [Ahir Reddy] PR Changes + Method Visibility
    1836944 [Michael Armbrust] Fix comments.
    e00980f [Michael Armbrust] First draft of python sql programming guide.
    b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
    f98a422 [Ahir Reddy] HiveContexts
    79621cf [Ahir Reddy] cleaning up cruft
    b406ba0 [Ahir Reddy] doctest formatting
    20936a5 [Ahir Reddy] Added tests and documentation
    e4d21b4 [Ahir Reddy] Added pyrolite dependency
    79f739d [Ahir Reddy] added more tests
    7515ba0 [Ahir Reddy] added more tests :)
    d26ec5e [Ahir Reddy] added test
    e9f5b8d [Ahir Reddy] adding tests
    906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
    251f99d [Ahir Reddy] for now only allow dictionaries as input
    09b9980 [Ahir Reddy] made jrdd explicitly lazy
    c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
    725c91e [Ahir Reddy] awesome row objects
    55d1c76 [Ahir Reddy] return row objects
    4fe1319 [Ahir Reddy] output dictionaries correctly
    be079de [Ahir Reddy] returning dictionaries works
    cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
    e948bd9 [Ahir Reddy] yippie
    4886052 [Ahir Reddy] even better
    c0fb1c6 [Ahir Reddy] more working
    043ca85 [Ahir Reddy] working
    5496f9f [Ahir Reddy] doesn't crash
    b8b904b [Ahir Reddy] Added schema rdd class
    67ba875 [Ahir Reddy] java to python, and python to java
    bcc0f23 [Ahir Reddy] Java to python
    ab6025d [Ahir Reddy] compiling
    ahirreddy authored and pwendell committed Apr 15, 2014
    Configuration menu
    Copy the full SHA
    c99bcb7 View commit details
    Browse the repository at this point in the history
  3. SPARK-1426: Make MLlib work with NumPy versions older than 1.7

    Currently MLlib requires NumPy 1.7 because it uses the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) to extract data out of an array.
    This replaces it with a fallback for older NumPy versions.
    
    Author: Sandeep <[email protected]>
    
    Closes apache#391 from techaddict/1426 and squashes the following commits:
    
    d365962 [Sandeep] SPARK-1426: Make MLlib work with NumPy versions older than 1.7 Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array. Replace it with a fallback
    techaddict authored and mateiz committed Apr 15, 2014
    Configuration menu
    Copy the full SHA
    df36091 View commit details
    Browse the repository at this point in the history
  4. SPARK-1501: Ensure assertions in Graph.apply are asserted.

    The Graph.apply test in GraphSuite had some assertions in a closure in
    a graph transformation. As a consequence, these assertions never
    actually executed.  Furthermore, these closures had a reference to
    (non-serializable) test harness classes because they called assert(),
    which could be a problem if we proactively check closure serializability
    in the future.
    
    This commit simply changes the Graph.apply test to collect the graph
    triplets so it can assert about each triplet from a map method.
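    
    A sketch of the resulting test pattern, assuming an existing SparkContext `sc` and a toy graph (names and attributes are illustrative):
    
    ```
    import org.apache.spark.graphx.{Edge, Graph}
    
    // Build a tiny graph, then collect the triplets to the driver *before* asserting,
    // so the assertions actually run and the closure shipped to executors never
    // captures the (non-serializable) test harness.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "a->b")))
    val graph = Graph(vertices, edges)
    
    val triplets = graph.triplets
      .map(t => (t.srcId, t.dstId, t.srcAttr, t.dstAttr))
      .collect()
    
    triplets.foreach { case (_, _, srcAttr, dstAttr) =>
      assert(srcAttr == "a" && dstAttr == "b")
    }
    ```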
    
    Author: William Benton <[email protected]>
    
    Closes apache#415 from willb/graphsuite-nop-fix and squashes the following commits:
    
    0b63658 [William Benton] Ensure assertions in Graph.apply are asserted.
    willb authored and rxin committed Apr 15, 2014
    Configuration menu
    Copy the full SHA
    2580a3b View commit details
    Browse the repository at this point in the history
  5. [SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.

    This PR uses Breeze's L-BFGS implementation; the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr!
    
    When used with a regularized updater, we need to compute regVal and regGradient (the gradient of the regularized part of the cost function); with the current updater design, we can compute those two values as follows.
    
    Let's review how the updater works when returning newWeights given the input parameters:
    
        w' = w - thisIterStepSize * (gradient + regGradient(w))   // note that regGradient is a function of w
    
    If we set gradient = 0 and thisIterStepSize = 1, then w' = w - regGradient(w), so
    
        regGradient(w) = w - w'
    
    As a result, for regVal, it can be computed by
    
        val regVal = updater.compute(
          weights,
          new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
    and for regGradient, it can be obtained by
    
          val regGradient = weights.sub(
            updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
    
    The PR includes the tests which compare the result with SGD with/without regularization.
    
    We did a comparison between LBFGS and SGD, and often we saw 10x fewer steps with LBFGS while the cost per step is the same (just computing the gradient).
    
    The following paper from Prof. Ng's group at Stanford compares different optimizers, including LBFGS and SGD. They use them in the context of deep learning, but it is worth reading as a reference.
    http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf
    
    Author: DB Tsai <[email protected]>
    
    Closes apache#353 from dbtsai/dbtsai-LBFGS and squashes the following commits:
    
    984b18e [DB Tsai] L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.
    DB Tsai authored and pwendell committed Apr 15, 2014
    Configuration menu
    Copy the full SHA
    6843d63 View commit details
    Browse the repository at this point in the history
  6. Decision Tree documentation for MLlib programming guide

    Added documentation for user to use the decision tree algorithms for classification and regression in Spark 1.0 release.
    
    Apart from a general review, I need specific input on the following:
    * I had to move a lot of the existing documentation under the *linear methods* umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release.
    * I have not looked closely at pyspark, but I am wondering whether new MLlib algorithms are automatically plugged in or whether we need to do some extra work to call MLlib functions from pyspark. I will add to the pyspark examples based upon the advice I get.
    
    cc: @mengxr, @hirakendu, @etrain, @atalwalkar
    
    Author: Manish Amde <[email protected]>
    
    Closes apache#402 from manishamde/tree_doc and squashes the following commits:
    
    022485a [Manish Amde] more documentation
    865826e [Manish Amde] minor: grammar
    dbb0e5e [Manish Amde] minor improvements to text
    b9ef6c4 [Manish Amde] basic decision tree code examples
    6e297d7 [Manish Amde] added subsections
    f427e84 [Manish Amde] renaming sections
    9c0c4be [Manish Amde] split candidate
    6925275 [Manish Amde] impurity and information gain
    94fd2f9 [Manish Amde] more reorg
    b93125c [Manish Amde] more subsection reorg
    3ecb2ad [Manish Amde] minor text addition
    1537dd3 [Manish Amde] added placeholders and some doc
    d06511d [Manish Amde] basic skeleton
    manishamde authored and pwendell committed Apr 15, 2014
    Configuration menu
    Copy the full SHA
    07d72fe View commit details
    Browse the repository at this point in the history

Commits on Apr 16, 2014

  1. SPARK-1455: Better isolation for unit tests.

    This is a simple first step towards skipping the Hive tests whenever possible.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#420 from pwendell/test-isolation and squashes the following commits:
    
    350c8af [Patrick Wendell] SPARK-1455: Better isolation for unit tests.
    pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    5aaf983 View commit details
    Browse the repository at this point in the history
  2. [FIX] update sbt-idea to version 1.6.0

    I saw `No "scala-library*.jar" in Scala compiler library` error in IDEA. It seems upgrading `sbt-idea` to 1.6.0 fixed the problem.
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#419 from mengxr/idea-plugin and squashes the following commits:
    
    fb3c35f [Xiangrui Meng] update sbt-idea to version 1.6.0
    mengxr authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    8517911 View commit details
    Browse the repository at this point in the history
  3. [WIP] SPARK-1430: Support sparse data in Python MLlib

    This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
    
    On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
    
    Some to-do items left:
    - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
    - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
    - [x] Explain how to use these in the Python MLlib docs.
    
    CC @mengxr, @JoshRosen
    
    Author: Matei Zaharia <[email protected]>
    
    Closes apache#341 from mateiz/py-ml-update and squashes the following commits:
    
    d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
    ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
    b9f97a3 [Matei Zaharia] Fix test
    1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
    88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
    37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
    da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
    c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
    a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
    74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
    889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
    ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
    a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
    0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
    eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
    2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
    154f45d [Matei Zaharia] Update docs, name some magic values
    881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
    mateiz authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    63ca581 View commit details
    Browse the repository at this point in the history
  4. [SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs

    This makes it possible to create tables and insert into them using the DSL and SQL for the scala and java apis.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#354 from marmbrus/insertIntoTable and squashes the following commits:
    
    6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
    f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
    765c506 [Michael Armbrust] Add to JavaAPI.
    77b512c [Michael Armbrust] typos.
    5c3ef95 [Michael Armbrust] use names for boolean args.
    882afdf [Michael Armbrust] Change createTableAs to saveAsTable.  Clean up api annotations.
    d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
    fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well.  Add createTableAs function.
    marmbrus authored and rxin committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    273c2fd View commit details
    Browse the repository at this point in the history
  5. [SPARK-959] Updated SBT from 0.13.1 to 0.13.2

    JIRA issue: [SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959)
    
    SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removed previous workaround.
    
    Author: Cheng Lian <[email protected]>
    
    Closes apache#426 from liancheng/updateSbt and squashes the following commits:
    
    95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959
    liancheng authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    6a10d80 View commit details
    Browse the repository at this point in the history
  6. Make "spark logo" link refer to "/".

    This is not an issue with the driver UI, but when you fire
    up the history server, there's currently no way to go back to
    the app listing page without editing the browser's location
    field (since the logo's link points to the root of the
    application's own UI - i.e. the "stages" tab).
    
    The change just points the logo link to "/", which is the app
    listing for the history server, and the stages tab for the
    driver's UI.
    
    Tested with both history server and live driver.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes apache#408 from vanzin/web-ui-root and squashes the following commits:
    
    1b60cb6 [Marcelo Vanzin] Make "spark logo" link refer to "/".
    Marcelo Vanzin authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    c0273d8 View commit details
    Browse the repository at this point in the history
  7. Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME

    When running Hive tests, the working directory is `$SPARK_HOME/sql/hive`, while when running `sbt hive/console`, it becomes `$SPARK_HOME`, and test tables are not loaded if `HIVE_DEV_HOME` is not defined.
    
    Author: Cheng Lian <[email protected]>
    
    Closes apache#417 from liancheng/loadTestTables and squashes the following commits:
    
    7cea8d6 [Cheng Lian] Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME
    liancheng authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    fec462c View commit details
    Browse the repository at this point in the history
  8. update spark.default.parallelism

    Actually, the default value of 8 is only valid in Mesos fine-grained mode:
    
    ```
    override def defaultParallelism() = sc.conf.getInt("spark.default.parallelism", 8)
    ```
    
    while in coarse-grained mode, including Mesos coarse-grained, the default depends on the number of cores:
    
    ```
    override def defaultParallelism(): Int = {
      conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
    }
    ```
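    
    Setting the property explicitly overrides either backend default described above; a minimal sketch:
    
    ```
    import org.apache.spark.{SparkConf, SparkContext}
    
    // With this set, defaultParallelism is 64 regardless of the scheduler backend.
    val conf = new SparkConf()
      .setAppName("parallelism-example")
      .setMaster("local[4]")
      .set("spark.default.parallelism", "64")
    val sc = new SparkContext(conf)
    println(sc.defaultParallelism)  // 64
    ```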
    
    Author: Chen Chao <[email protected]>
    
    Closes apache#389 from CrazyJvm/patch-2 and squashes the following commits:
    
    84a7fe4 [Chen Chao] miss </li> at the end of every single line
    04a9796 [Chen Chao] change format
    ee0fae0 [Chen Chao] update spark.default.parallelism
    CrazyJvm authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    9edd887 View commit details
    Browse the repository at this point in the history
  9. SPARK-1310: Start adding k-fold cross validation to MLLib [adds kFold…

    … to MLUtils & fixes bug in BernoulliSampler]
    
    Author: Holden Karau <[email protected]>
    
    Closes apache#18 from holdenk/addkfoldcrossvalidation and squashes the following commits:
    
    208db9b [Holden Karau] Fix a bad space
    e84f2fc [Holden Karau] Fix the test, we should be looking at the second element instead
    6ddbf05 [Holden Karau] swap training and validation order
    7157ae9 [Holden Karau] CR feedback
    90896c7 [Holden Karau] New line
    150889c [Holden Karau] Fix up error messages in the MLUtilsSuite
    2cb90b3 [Holden Karau] Fix the names in kFold
    c702a96 [Holden Karau] Fix imports in MLUtils
    e187e35 [Holden Karau] Move { up to same line as whenExecuting(random) in RandomSamplerSuite.scala
    c5b723f [Holden Karau] clean up
    7ebe4d5 [Holden Karau] CR feedback, remove unecessary learners (came back during merge mistake) and insert an empty line
    bb5fa56 [Holden Karau] extra line sadness
    163c5b1 [Holden Karau] code review feedback 1.to -> 1 to and folds -> numFolds
    5a33f1d [Holden Karau] Code review follow up.
    e8741a7 [Holden Karau] CR feedback
    b78804e [Holden Karau] Remove cross validation [TODO in another pull request]
    91eae64 [Holden Karau] Consolidate things in mlutils
    264502a [Holden Karau] Add a test for the bug that was found with BernoulliSampler not copying the complement param
    dd0b737 [Holden Karau] Wrap long lines (oops)
    c0b7fa4 [Holden Karau] Switch FoldedRDD to use BernoulliSampler and PartitionwiseSampledRDD
    08f8e4d [Holden Karau] Fix BernoulliSampler to respect complement
    a751ec6 [Holden Karau] Add k-fold cross validation to MLLib
    holdenk authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    c3527a3 View commit details
    Browse the repository at this point in the history
  10. SPARK-1497. Fix scalastyle warnings in YARN, Hive code

    (I wasn't sure how to automatically set `SPARK_YARN=true` and `SPARK_HIVE=true` when running scalastyle, but these are the errors that turn up.)
    
    Author: Sean Owen <[email protected]>
    
    Closes apache#413 from srowen/SPARK-1497 and squashes the following commits:
    
    f0c9318 [Sean Owen] Fix more scalastyle warnings in yarn
    80bf4c3 [Sean Owen] Add YARN alpha / YARN profile to scalastyle check
    026319c [Sean Owen] Fix scalastyle warnings in YARN, Hive code
    srowen authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    77f8367 View commit details
    Browse the repository at this point in the history
  11. Minor addition to SPARK-1497

    pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    82349fb View commit details
    Browse the repository at this point in the history
  12. SPARK-1469: Scheduler mode should accept lower-case definitions and h…

    …ave...
    
    ... nicer error messages
    
    There are two improvements to Scheduler Mode:
    1. Made the built in ones case insensitive (fair/FAIR, fifo/FIFO).
    2. If an invalid mode is given we should print a better error message.
    
    Author: Sandeep <[email protected]>
    
    Closes apache#388 from techaddict/1469 and squashes the following commits:
    
    a31bbd5 [Sandeep] SPARK-1469: Scheduler mode should accept lower-case definitions and have nicer error messages There are  two improvements to Scheduler Mode: 1. Made the built in ones case insensitive (fair/FAIR, fifo/FIFO). 2. If an invalid mode is given we should print a better error message.
    techaddict authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    e269c24 View commit details
    Browse the repository at this point in the history
  13. SPARK-1465: Spark compilation is broken with the latest hadoop-2.4.0 …

    …release
    
    YARN-1824 changes the APIs (addToEnvironment, setEnvFromInputString) in Apps, which causes the Spark build to break if built against version 2.4.0. To fix this, create Spark's own functions with the same functionality, which do not break compilation against 2.3 and other 2.x versions.
    
    Author: xuan <[email protected]>
    Author: xuan <[email protected]>
    
    Closes apache#396 from xgong/master and squashes the following commits:
    
    42b5984 [xuan] Remove two extra imports
    bc0926f [xuan] Remove usage of org.apache.hadoop.util.Shell
    be89fa7 [xuan] fix Spark compilation is broken with the latest hadoop-2.4.0 release
    xuan authored and tgravescs committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    725925c View commit details
    Browse the repository at this point in the history
  14. [SPARK-1511] use Files.move instead of renameTo in TestUtils.scala

    JIRA issue:[SPARK-1511](https://issues.apache.org/jira/browse/SPARK-1511)
    
    The TestUtils.createCompiledClass method uses renameTo() to move files, which fails when the src and dest files are on different disks or partitions. This PR uses Files.move() instead. The move method tries renameTo() first and then falls back to copy() and delete(), so it should handle this issue.
    
    I didn't find a test suite for this file, so I added a file-existence check after the move.
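    
    For reference, a sketch of what the Guava call looks like (the paths are illustrative):
    
    ```
    import com.google.common.io.Files
    import java.io.File
    
    // Guava's Files.move tries File.renameTo first and falls back to copy + delete,
    // so it also works when source and destination are on different disks or partitions.
    val src  = new File("/tmp/spark-test/Foo.class")
    val dest = new File("/home/user/classes/Foo.class")
    Files.move(src, dest)
    assert(dest.exists, s"Failed to move $src to $dest")
    ```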
    
    Author: Ye Xianjin <[email protected]>
    
    Closes apache#427 from advancedxy/SPARK-1511 and squashes the following commits:
    
    a2b97c7 [Ye Xianjin] Based on @srowen's comment, assert file existence.
    6f95550 [Ye Xianjin] use Files.move instead of renameTo to handle the src and dest files are in different disks or partitions.
    advancedxy authored and pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    10b1c59 View commit details
    Browse the repository at this point in the history
  15. Add clean to build

    pwendell committed Apr 16, 2014
    Configuration menu
    Copy the full SHA
    987760e View commit details
    Browse the repository at this point in the history

Commits on Apr 17, 2014

  1. Rebuild routing table after Graph.reverse

    GraphImpl.reverse used to reverse the edges in each partition of the edge RDD but preserved the routing table and replicated vertex view, since reversing should not affect partitioning.
    
    However, the old routing table would then have incorrect information for srcAttrOnly and dstAttrOnly. These RDDs should be switched.
    
    A simple fix is for Graph.reverse to rebuild the routing table and replicated vertex view.
    
    Thanks to Bogdan Ghidireac for reporting this issue on the [mailing list](http://apache-spark-user-list.1001560.n3.nabble.com/graph-reverse-amp-Pregel-API-td4338.html).
    
    Author: Ankur Dave <[email protected]>
    
    Closes apache#431 from ankurdave/fix-reverse-bug and squashes the following commits:
    
    75d63cb [Ankur Dave] Rebuild routing table after Graph.reverse
    ankurdave authored and rxin committed Apr 17, 2014
    Configuration menu
    Copy the full SHA
    235a47c View commit details
    Browse the repository at this point in the history
  2. SPARK-1329: Create pid2vid with correct number of partitions

    Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition.
    
    GraphX currently creates one entry per *vertex* partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case.
    
    Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug.
    
    Author: Ankur Dave <[email protected]>
    
    Closes apache#368 from ankurdave/fix-pid2vid-size and squashes the following commits:
    
    5a5c52a [Ankur Dave] SPARK-1329: Create pid2vid with correct number of partitions
    ankurdave authored and rxin committed Apr 17, 2014
    17d3234
  3. remove unnecessary brace and semicolon in 'putBlockInfo.synchronize' block
    
    delete semicolon
    
    Author: Chen Chao <[email protected]>
    
    Closes apache#411 from CrazyJvm/patch-5 and squashes the following commits:
    
    72333a3 [Chen Chao] remove unnecessary brace
    de5d9a7 [Chen Chao] style fix
    CrazyJvm authored and rxin committed Apr 17, 2014
    016a877
  4. Fixing a race condition in event listener unit test

    Author: Kan Zhang <[email protected]>
    
    Closes apache#401 from kanzhang/fix-1475 and squashes the following commits:
    
    c6058bd [Kan Zhang] Fixing a race condition in event listener unit test
    kanzhang authored and rxin committed Apr 17, 2014
    38877cc
  5. misleading task number of groupByKey

    "By default, this uses only 8 parallel tasks to do the grouping." is a big misleading. Please refer to apache#389
    
    detail is as following code :
    
      def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
        val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
        for (r <- bySize if r.partitioner.isDefined) {
          return r.partitioner.get
        }
        if (rdd.context.conf.contains("spark.default.parallelism")) {
          new HashPartitioner(rdd.context.defaultParallelism)
        } else {
          new HashPartitioner(bySize.head.partitions.size)
        }
      }
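    
    For illustration only (not part of this patch), a caller can raise the parallelism either per call or globally; the value 64 and the RDD are made up:
    
    ```
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    
    // Per call: pass an explicit partition count to groupByKey.
    def groupWithMoreTasks(pairs: RDD[(String, Int)]): RDD[(String, Iterable[Int])] =
      pairs.groupByKey(numPartitions = 64)
    
    // Globally: set the default picked up by defaultPartitioner.
    val conf = new SparkConf().set("spark.default.parallelism", "64")
    ```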
    
    Author: Chen Chao <[email protected]>
    
    Closes apache#403 from CrazyJvm/patch-4 and squashes the following commits:
    
    42f6c9e [Chen Chao] fix format
    829a995 [Chen Chao] fix format
    1568336 [Chen Chao] misleading task number of groupByKey
    CrazyJvm authored and rxin committed Apr 17, 2014
    9c40b9e
  6. Update ReducedWindowedDStream.scala

    change _slideDuration to _windowDuration
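    
    For context, windowDuration and slideDuration are separate parameters in the windowing API; a hedged sketch with arbitrary values:
    
    ```
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream
    
    // windowDuration: how much history each window covers (30s here);
    // slideDuration: how often a new window is emitted (10s here).
    def windowedCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.reduceByKeyAndWindow(
        (a: Int, b: Int) => a + b,
        windowDuration = Seconds(30),
        slideDuration = Seconds(10))
    ```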
    
    Author: baishuo(白硕) <[email protected]>
    
    Closes apache#425 from baishuo/master and squashes the following commits:
    
    6f09ea1 [baishuo(白硕)] Update ReducedWindowedDStream.scala
    baishuo authored and rxin committed Apr 17, 2014
    07b7ad3
  7. Include stack trace for exceptions thrown by user code.

    It is very confusing when your code throws an exception, but the only stack trace shown is from the DAGScheduler. This is a simple patch to include the stack trace for the actual failure in the error message. Suggestions on formatting are welcome.
    
    Before:
    ```
    scala> sc.parallelize(1 :: Nil).map(_ => sys.error("Ahh!")).collect()
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times (most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!)
    	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
    ...
    ```
    
    After:
    ```
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!
            scala.sys.package$.error(package.scala:27)
            $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
            $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
            scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
            scala.collection.Iterator$class.foreach(Iterator.scala:727)
            scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
            scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
            scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
            scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
            scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
            scala.collection.AbstractIterator.to(Iterator.scala:1157)
            scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
            scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
            scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
            scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
            org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
            org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
            org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
            org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
            org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:110)
            org.apache.spark.scheduler.Task.run(Task.scala:50)
            org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
            org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:46)
            org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            java.lang.Thread.run(Thread.java:744)
    Driver stacktrace:
    	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1037)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
    	at scala.Option.foreach(Option.scala:236)
    	at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:614)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:143)
    	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    ```
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#409 from marmbrus/stacktraces and squashes the following commits:
    
    3e4eb65 [Michael Armbrust] indent. include header for driver stack trace.
    018b06b [Michael Armbrust] Include stack trace for exceptions in user code.
    marmbrus authored and rxin committed Apr 17, 2014
    d4916a8
  8. SPARK-1462: Examples of ML algorithms are using deprecated APIs

    This will also fix SPARK-1464: Update MLLib Examples to Use Breeze.
    
    Author: Sandeep <[email protected]>
    
    Closes apache#416 from techaddict/1462 and squashes the following commits:
    
    a43638e [Sandeep] Some Style Changes
    3ce69c3 [Sandeep] Fix Ordering and Naming of Imports in Examples
    6c7e543 [Sandeep] SPARK-1462: Examples of ML algorithms are using deprecated APIs
    techaddict authored and mateiz committed Apr 17, 2014
    6ad4c54
  9. [python alternative] pyspark require Python2, failing if system default is Py3 from shell.py
    
    Python alternative for apache#392; managed from shell.py
    
    Author: AbhishekKr <[email protected]>
    
    Closes apache#399 from abhishekkr/pyspark_shell and squashes the following commits:
    
    134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py
    abhishekkr authored and rxin committed Apr 17, 2014
    bb76eae
  10. [SPARK-1395] Allow "local:" URIs to work on Yarn.

    This only works for the three paths defined in the environment
    (SPARK_JAR, SPARK_YARN_APP_JAR and SPARK_LOG4J_CONF).
    
    Tested by running SparkPi with local: and file: URIs against Yarn cluster (no "upload" shows up in logs in the local case).
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes apache#303 from vanzin/yarn-local and squashes the following commits:
    
    82219c1 [Marcelo Vanzin] [SPARK-1395] Allow "local:" URIs to work on Yarn.
    Marcelo Vanzin authored and tgravescs committed Apr 17, 2014
    6904750
  11. SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes
    
    Note this is dependent on apache#204 to have a working history server, but there are no code dependencies.
    
    This also fixes SPARK-1288 (yarn stable finishApplicationMaster incomplete). While I was in there, I made sure the diagnostic message is passed through properly.
    
    Author: Thomas Graves <[email protected]>
    
    Closes apache#362 from tgravescs/SPARK-1408 and squashes the following commits:
    
    ec89705 [Thomas Graves] Fix typo.
    446122d [Thomas Graves] Make config yarn specific
    f5d5373 [Thomas Graves] SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes
    tgravescs committed Apr 17, 2014
    0058b5d

Commits on Apr 18, 2014

  1. FIX: Don't build Hive in assembly unless running Hive tests.

    This will make the tests more stable when not running SQL tests.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#439 from pwendell/hive-tests and squashes the following commits:
    
    88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests.
    pwendell committed Apr 18, 2014
    6c746ba
  2. HOTFIX: Ignore streaming UI test

    This is currently causing many builds to hang.
    
    https://issues.apache.org/jira/browse/SPARK-1530
    
    Author: Patrick Wendell <[email protected]>
    
    Closes apache#440 from pwendell/uitest-fix and squashes the following commits:
    
    9a143dc [Patrick Wendell] Ignore streaming UI test
    pwendell committed Apr 18, 2014
    7863ecc
  3. SPARK-1483: Rename minSplits to minPartitions in public APIs

    https://issues.apache.org/jira/browse/SPARK-1483
    
    From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz
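    
    For illustration only, this is what the named parameter looks like at a call site after the rename (the path is made up):
    
    ```
    import org.apache.spark.SparkContext
    
    def readLines(sc: SparkContext): Unit = {
      // "minPartitions" is user-facing because it can be passed by name.
      val lines = sc.textFile("hdfs:///tmp/example.txt", minPartitions = 8)
      println(lines.partitions.length)  // at least 8 input splits requested
    }
    ```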
    
    Author: CodingCat <[email protected]>
    
    Closes apache#430 from CodingCat/SPARK-1483 and squashes the following commits:
    
    4b60541 [CodingCat] deprecate defaultMinSplits
    ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
    CodingCat authored and rxin committed Apr 18, 2014
    e31c8ff
  4. Reuses Row object in ExistingRdd.productToRowRdd()

    Author: Cheng Lian <[email protected]>
    
    Closes apache#432 from liancheng/reuseRow and squashes the following commits:
    
    9e6d083 [Cheng Lian] Simplified code with BufferedIterator
    52acec9 [Cheng Lian] Reuses Row object in ExistingRdd.productToRowRdd()
    liancheng authored and rxin committed Apr 18, 2014
    89f4743
  5. [SPARK-1520] remove fastutil from dependencies

    A quick fix for https://issues.apache.org/jira/browse/SPARK-1520
    
    By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6.
    
    With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven.
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes apache#437 from mengxr/remove-fastutil and squashes the following commits:
    
    00f9beb [Xiangrui Meng] remove fastutil from dependencies
    mengxr authored and rxin committed Apr 18, 2014
    aa17f02
  6. SPARK-1357 (addendum). More Experimental items in MLlib

    Per discussion, this is my suggestion to make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0. See what you think of this much.
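    
    For illustration only, a hedged sketch of how such a class is marked (the Rating fields mirror MLlib's class but are shown here purely as an example):
    
    ```
    import org.apache.spark.annotation.Experimental
    
    /**
     * :: Experimental ::
     * A rating of a product by a user.
     */
    @Experimental
    case class Rating(user: Int, product: Int, rating: Double)
    ```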
    
    Author: Sean Owen <[email protected]>
    
    Closes apache#372 from srowen/SPARK-1357Addendum and squashes the following commits:
    
    17cf1ea [Sean Owen] Remove (another) blank line after ":: Experimental ::"
    6800e4c [Sean Owen] Remove blank line after ":: Experimental ::"
    b3a88d2 [Sean Owen] Make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0
    srowen authored and rxin committed Apr 18, 2014
    8aa1f4c
  7. SPARK-1523: improve the readability of code in AkkaUtil

    This was split out from apache#85, as suggested by @rxin.
    
    compare
    
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L122
    
    and
    
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L117
    
    The first one uses get and then toLong, while the second uses getLong; better to make them consistent.
    
    A very small fix.
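    
    For illustration only, the two styles side by side (the key name is just an example):
    
    ```
    import org.apache.spark.SparkConf
    
    val conf = new SparkConf()
    
    // Style one: fetch the value as a String, then convert it.
    val timeoutA = conf.get("spark.akka.timeout", "100").toLong
    
    // Style two: use the typed accessor with a default.
    val timeoutB = conf.getLong("spark.akka.timeout", 100)
    ```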
    
    Author: CodingCat <[email protected]>
    
    Closes apache#434 from CodingCat/SPARK-1523 and squashes the following commits:
    
    0e86f3f [CodingCat] improve the readability of code in AkkaUtil
    CodingCat authored and rxin committed Apr 18, 2014
    3c7a9ba
  8. Fixed broken pyspark shell.

    Author: Reynold Xin <[email protected]>
    
    Closes apache#444 from rxin/pyspark and squashes the following commits:
    
    fc11356 [Reynold Xin] Made the PySpark shell version checking compatible with Python 2.6.
    571830b [Reynold Xin] Fixed broken pyspark shell.
    rxin committed Apr 18, 2014
    81a152c
  9. SPARK-1456 Remove view bounds on Ordered in favor of a context bound on Ordering.
    
    This doesn't require creating new Ordering objects per row.  Additionally, [view bounds are going to be deprecated](https://issues.scala-lang.org/browse/SI-7629), so we should get rid of them while APIs are still flexible.
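    
    For illustration only (not the Spark method itself), the two signature styles in general:
    
    ```
    // View bound: needs an implicit K => Ordered[K], which can mean wrapping
    // a new Ordered object for every comparison.
    def maxByViewBound[K <% Ordered[K]](xs: Seq[K]): K =
      xs.reduceLeft((a, b) => if (a < b) b else a)
    
    // Context bound: needs a single implicit Ordering[K], no per-row wrapping.
    def maxByOrdering[K: Ordering](xs: Seq[K]): K = {
      val ord = implicitly[Ordering[K]]
      xs.reduceLeft((a, b) => if (ord.lt(a, b)) b else a)
    }
    ```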
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#410 from marmbrus/viewBounds and squashes the following commits:
    
    c574221 [Michael Armbrust] fix example.
    812008e [Michael Armbrust] Update Java API.
    1b9b85c [Michael Armbrust] Update scala doc.
    35798a8 [Michael Armbrust] Remove view bounds on Ordered in favor of a context bound on Ordering.
    marmbrus authored and rxin committed Apr 18, 2014
    c399baa

Commits on Apr 19, 2014

  1. SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
    
    `writer.close` should be put in the `finally` block to avoid potential resource leaks.
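    
    For illustration only, the general pattern (the writer type and file path are made up):
    
    ```
    import java.io.{BufferedWriter, FileWriter}
    
    // Close in `finally` so the handle is released even if write() throws.
    def writeSafely(path: String, lines: Seq[String]): Unit = {
      val writer = new BufferedWriter(new FileWriter(path))
      try {
        lines.foreach { line => writer.write(line); writer.newLine() }
      } finally {
        writer.close()
      }
    }
    ```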
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-1482
    
    Author: zsxwing <[email protected]>
    
    Closes apache#400 from zsxwing/SPARK-1482 and squashes the following commits:
    
    06b197a [zsxwing] SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
    zsxwing authored and mateiz committed Apr 19, 2014
    2089e0e
  2. README update

    Author: Reynold Xin <[email protected]>
    
    Closes apache#443 from rxin/readme and squashes the following commits:
    
    16853de [Reynold Xin] Updated SBT and Scala instructions.
    3ac3ceb [Reynold Xin] README update
    rxin committed Apr 19, 2014
    28238c8
  3. Use scala deprecation instead of java.

    This gets rid of a warning when compiling core (since we were depending on a deprecated interface with a non-deprecated function).  I also tested with javac, and this does the right thing when compiling java code.
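    
    For illustration only, a hedged sketch of the Scala annotation (the method names are made up):
    
    ```
    object Api {
      // Scala's @deprecated carries a message and a version; the compiler
      // surfaces the deprecation at call sites in both Scala and Java code.
      @deprecated("use newApi() instead", "1.0.0")
      def oldApi(): Int = newApi()
    
      def newApi(): Int = 42
    }
    ```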
    
    Author: Michael Armbrust <[email protected]>
    
    Closes apache#452 from marmbrus/scalaDeprecation and squashes the following commits:
    
    f628b4d [Michael Armbrust] Use scala deprecation instead of java.
    marmbrus authored and mateiz committed Apr 19, 2014
    5d0f58b
  4. Add insertInto and saveAsTable to Python API.

    Author: Michael Armbrust <[email protected]>
    
    Closes apache#447 from marmbrus/pythonInsert and squashes the following commits:
    
    c7ab692 [Michael Armbrust] Keep docstrings < 72 chars.
    ff62870 [Michael Armbrust] Add insertInto and saveAsTable to Python API.
    marmbrus authored and mateiz committed Apr 19, 2014
    10d0421
  5. [SPARK-1535] ALS: Avoid the garbage-creating ctor of DoubleMatrix

    `new DoubleMatrix(double[])` creates a garbage `double[]` of the same length as its argument and immediately throws it away.  This pull request avoids that constructor in the ALS code.
    
    Author: Tor Myklebust <[email protected]>
    
    Closes apache#442 from tmyklebu/foo2 and squashes the following commits:
    
    2784fc5 [Tor Myklebust] Mention that this is probably fixed as of jblas 1.2.4; repunctuate.
    a09904f [Tor Myklebust] Helper function for wrapping Array[Double]'s with DoubleMatrix's.
    tmyklebu authored and mateiz committed Apr 19, 2014
    25fc318

Commits on Apr 20, 2014

  1. REPL cleanup.

    Author: Michael Armbrust <[email protected]>
    
    Closes apache#451 from marmbrus/replCleanup and squashes the following commits:
    
    088526a [Michael Armbrust] REPL cleanup.
    marmbrus authored and aarondav committed Apr 20, 2014
    3a390bf

Commits on Apr 21, 2014

  1. 42238b6
  2. remove exclusion scalap

    witgo committed Apr 21, 2014
    b434ec0