
[SPARK-4186] add binaryFiles and binaryRecords in Python #3078

Closed
wants to merge 6 commits

Conversation

davies
Contributor

@davies davies commented Nov 3, 2014

add binaryFiles() and binaryRecords() in Python

binaryFiles(self, path, minPartitions=None):
    :: Developer API ::

    Read a directory of binary files from HDFS, a local file system
    (available on all nodes), or any Hadoop-supported file system URI
    as a byte array. Each file is read as a single record and returned
    as a key-value pair, where the key is the path of the file and the
    value is the content of the file.

    Note: Small files are preferred; large files are also allowed, but
    may cause bad performance.

binaryRecords(self, path, recordLength):
    Load data from a flat binary file, assuming each record is a set of numbers
    with the specified numerical format (see ByteBuffer), and the number of
    bytes per record is constant.

    :param path: Directory containing the input data files
    :param recordLength: The length at which to split the records
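
Below is a minimal usage sketch of the two new methods. The HDFS paths and the 8-byte little-endian record layout are made up for illustration only; they are not taken from this PR.

```
import struct

from pyspark import SparkContext

sc = SparkContext(appName="binary-io-example")

# binaryFiles: one (path, content) pair per file under the directory.
files = sc.binaryFiles("hdfs:///tmp/blobs")           # hypothetical path
sizes = files.mapValues(len)                          # RDD of (path, num_bytes)

# binaryRecords: fixed-length records; here each 8-byte record is decoded
# as one little-endian double, matching recordLength=8.
records = sc.binaryRecords("hdfs:///tmp/doubles.bin", 8)   # hypothetical path
values = records.map(lambda r: struct.unpack("<d", r)[0])

print(sizes.take(3))
print(values.take(3))
```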

@davies
Contributor Author

davies commented Nov 3, 2014

cc @mateiz

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22823 has started for PR 3078 at commit bb22442.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22823 has finished for PR 3078 at commit bb22442.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22823/
Test FAILed.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #509 has started for PR 3078 at commit bb22442.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #509 has finished for PR 3078 at commit bb22442.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22842 has started for PR 3078 at commit 5ceaa8a.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22841/
Test FAILed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #510 has started for PR 3078 at commit 5ceaa8a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22842 has finished for PR 3078 at commit 5ceaa8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22842/
Test PASSed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #510 has finished for PR 3078 at commit 5ceaa8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22868 has started for PR 3078 at commit 24e84b6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22868 timed out for PR 3078 at commit 24e84b6 after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22868/
Test FAILed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #511 has started for PR 3078 at commit 24e84b6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #511 has finished for PR 3078 at commit 24e84b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
    • case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class NullType(PrimitiveType):
    • class UserDefinedType(DataType):
    • // in some cases, such as when a class is enclosed in an object (in which case
    • case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)
    • abstract class UserDefinedType[UserType] extends DataType with Serializable
    • public abstract class UserDefinedType<UserType> extends DataType implements Serializable

@@ -396,6 +396,34 @@ def wholeTextFiles(self, path, minPartitions=None, use_unicode=True):
        return RDD(self._jsc.wholeTextFiles(path, minPartitions), self,
                   PairDeserializer(UTF8Deserializer(use_unicode), UTF8Deserializer(use_unicode)))

    def binaryFiles(self, path, minPartitions=None):
        """
        :: Developer API ::
This shouldn't be DeveloperAPI, I think; it should just be Experimental (and the same for the binaryRecords method). Maybe the labels are wrong in the Java and Scala APIs; if so, do you mind changing them?
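
For reference, one way the Python docstring could carry an experimental marker instead; this is only an illustrative sketch, and the wording actually committed in 3aa349b may differ.

```
# Illustrative only: the ".. note:: Experimental" wording below is an
# assumption, not necessarily the exact text added in this PR.
def binaryFiles(self, path, minPartitions=None):
    """
    .. note:: Experimental

    Read a directory of binary files from HDFS, a local file system
    (available on all nodes), or any Hadoop-supported file system URI
    as a byte array.
    """
```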

@mateiz
Contributor

mateiz commented Nov 5, 2014

Looks good; I just noticed one weird thing in the docs (probably an issue in the Java/Scala docs, but we might as well fix those too).

@SparkQA

SparkQA commented Nov 5, 2014

Test build #22961 has started for PR 3078 at commit cd0bdbd.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #22961 has finished for PR 3078 at commit cd0bdbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22961/
Test PASSed.

@mateiz
Contributor

mateiz commented Nov 6, 2014

Looks good, thanks!

@asfgit asfgit closed this in b41a39e Nov 6, 2014
asfgit pushed a commit that referenced this pull request Nov 6, 2014
add binaryFiles() and binaryRecords() in Python
```
binaryFiles(self, path, minPartitions=None):
    :: Developer API ::

    Read a directory of binary files from HDFS, a local file system
    (available on all nodes), or any Hadoop-supported file system URI
    as a byte array. Each file is read as a single record and returned
    as a key-value pair, where the key is the path of the file and the
    value is the content of the file.

    Note: Small files are preferred; large files are also allowed, but
    may cause bad performance.

binaryRecords(self, path, recordLength):
    Load data from a flat binary file, assuming each record is a set of numbers
    with the specified numerical format (see ByteBuffer), and the number of
    bytes per record is constant.

    :param path: Directory containing the input data files
    :param recordLength: The length at which to split the records
```

Author: Davies Liu <[email protected]>

Closes #3078 from davies/binary and squashes the following commits:

cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
3aa349b [Davies Liu] add experimental notes
24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
1900085 [Davies Liu] bugfix
bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python

(cherry picked from commit b41a39e)
Signed-off-by: Matei Zaharia <[email protected]>