[SPARK-4186] add binaryFiles and binaryRecords in Python #3078
cc @mateiz
Test build #22823 has started for PR 3078 at commit
Test build #22823 has finished for PR 3078 at commit
Test FAILed.
Test build #509 has started for PR 3078 at commit
Test build #509 has finished for PR 3078 at commit
Test build #22842 has started for PR 3078 at commit
Test FAILed.
Test build #510 has started for PR 3078 at commit
Test build #22842 has finished for PR 3078 at commit
Test PASSed.
Test build #510 has finished for PR 3078 at commit
Test build #22868 has started for PR 3078 at commit
Test build #22868 timed out for PR 3078 at commit
Test FAILed.
Test build #511 has started for PR 3078 at commit
Test build #511 has finished for PR 3078 at commit
@@ -396,6 +396,34 @@ def wholeTextFiles(self, path, minPartitions=None, use_unicode=True):
        return RDD(self._jsc.wholeTextFiles(path, minPartitions), self,
                   PairDeserializer(UTF8Deserializer(use_unicode), UTF8Deserializer(use_unicode)))

    def binaryFiles(self, path, minPartitions=None):
        """
        :: Developer API ::
This shouldn't be DeveloperAPI I think, it should just be experimental (and same with the binaryRecords method). Maybe the labels are wrong in the Java and Scala API; if so do you mind changing them?
Looks good, I just noticed one weird thing in the docs (probably an issue in the Java/Scala docs but we might as well fix those too).
Test build #22961 has started for PR 3078 at commit
Test build #22961 has finished for PR 3078 at commit
Test PASSed.
Looks good, thanks!
add binaryFiles() and binaryRecords() in Python

```
binaryFiles(self, path, minPartitions=None):
    :: Developer API ::

    Read a directory of binary files from HDFS, a local file system
    (available on all nodes), or any Hadoop-supported file system URI
    as a byte array. Each file is read as a single record and returned
    in a key-value pair, where the key is the path of each file, the
    value is the content of each file.

    Note: Small files are preferred, large file is also allowable, but
    may cause bad performance.

binaryRecords(self, path, recordLength):
    Load data from a flat binary file, assuming each record is a set of numbers
    with the specified numerical format (see ByteBuffer), and the number of
    bytes per record is constant.

    :param path: Directory to the input data files
    :param recordLength: The length at which to split the records
```

Author: Davies Liu <[email protected]>

Closes #3078 from davies/binary and squashes the following commits:

cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
3aa349b [Davies Liu] add experimental notes
24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
1900085 [Davies Liu] bugfix
bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python

(cherry picked from commit b41a39e)
Signed-off-by: Matei Zaharia <[email protected]>
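To make the `binaryRecords` description above more concrete, here is a minimal sketch in plain Python (no Spark required) of what "a constant number of bytes per record" means and how one such record might be decoded. The helper names `split_records` and `decode_record`, the record layout `"<iid"`, and the choice to raise on a trailing partial record are all assumptions made for illustration; this is not how Spark implements the method.

```python
import struct

# Hypothetical record layout, chosen only for this sketch:
# two little-endian 32-bit ints followed by one 64-bit float.
RECORD_FORMAT = "<iid"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)  # bytes per record


def split_records(data, record_length):
    """Split a flat byte string into fixed-length records.

    Raising on a trailing partial record is a choice of this sketch.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    if len(data) % record_length != 0:
        raise ValueError("data length is not a multiple of record_length")
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]


def decode_record(raw):
    """Decode one fixed-length record into a tuple of numbers."""
    return struct.unpack(RECORD_FORMAT, raw)


# Build a small in-memory "flat binary file" with three records.
sample = b"".join(struct.pack(RECORD_FORMAT, i, i * 10, i / 2.0)
                  for i in range(3))

records = [decode_record(r) for r in split_records(sample, RECORD_LENGTH)]

# With the API added in this PR, the same decoder could presumably be
# applied per record, for example:
#   sc.binaryRecords(path, RECORD_LENGTH).map(decode_record)
```

Each element that `binaryRecords` yields should correspond to one `recordLength`-byte slice like those returned by `split_records`, so a `struct`-based decoder is one straightforward way to turn the raw bytes into numbers.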