-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Experimental support for HDFS #1243
Conversation
ebernhardson
commented
Feb 15, 2018
- Read and write datsets from hdfs.
- Only enabled when cmake is run with -DUSE_HDFS:BOOL=TRUE
- Introduces VirtualFile(Reader|Writer) to asbtract VFS differences
I've forwarded the CLA to legal, typically resolved in a day. |
include/LightGBM/bin.h
Outdated
@@ -98,7 +98,7 @@ class BinMapper { | |||
* \brief Save binary data to file | |||
* \param file File want to write | |||
*/ | |||
void SaveBinaryToFile(FILE* file) const; | |||
void SaveBinaryToFile(std::shared_ptr<VirtualFileWriter> file) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use share_ptr
? we often use const
raw pointers. what are the advantages of share_ptr ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only a personal preference, these objects are very low volume so the overhead should be minimal, and the shared_ptr made thinking about the lifetime much easier. If you prefer though i can re-work this so the base instance gets allocated on the stack and we pass const pointers. Stack allocation should keep the relatively simple lifetime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, got it. The pointer is actual not 'const' inside these functions.
include/LightGBM/utils/file_io.h
Outdated
if (fs_ == NULL) { | ||
auto namenode = std::vector<char>(100); | ||
if (1 != sscanf(filename_.c_str(), "hdfs://%99[^/]/", namenode.data())) { | ||
Log::Warning("Could not parse namenode from uri [%s]", filename_.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log::Fatal, or at least return false
include/LightGBM/utils/file_io.h
Outdated
} | ||
|
||
static inline bool Exists(const char* filename) { | ||
FILE* file_ = fopen(filename, "rb"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing ifdef's for MSC
include/LightGBM/utils/file_io.h
Outdated
} | ||
|
||
virtual bool Init() { | ||
if (file_ == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this init code is pretty much duplicated for Read/Write. Can it be shared?
include/LightGBM/bin.h
Outdated
@@ -98,7 +98,7 @@ class BinMapper { | |||
* \brief Save binary data to file | |||
* \param file File want to write | |||
*/ | |||
void SaveBinaryToFile(FILE* file) const; | |||
void SaveBinaryToFile(std::shared_ptr<VirtualFileWriter> file) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only a personal preference, these objects are very low volume so the overhead should be minimal, and the shared_ptr made thinking about the lifetime much easier. If you prefer though i can re-work this so the base instance gets allocated on the stack and we pass const pointers. Stack allocation should keep the relatively simple lifetime.
include/LightGBM/utils/file_io.h
Outdated
|
||
virtual bool Init()=0; | ||
virtual size_t Write(const void* data, size_t bytes) const=0; | ||
static std::shared_ptr<VirtualFileWriter> Make(const char* filename); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unique_ptr ?
src/io/parser.cpp
Outdated
} | ||
std::string line1, line2; | ||
size_t buffer_size = 32*1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may is not enough for the long lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this so it should work for any length line possible, and added a test on the python package to assert it's working.
@ebernhardson sorry, we are in the Chinese New Year holiday, so the review is slow. @StrikerRUS it seems the check-the-docs test failed again. |
@guolinke Yeah, I see... Sites with Yahoo dataset and MPICH paper are down. SWIG site returns this warning: I think that Yahoo and SWIG will be online in a few days and we should just wait. Not sure about site with paper, maybe it should be replaced with another site. |
@guolinke All sites are alive now, I've re-run Travis. |
* Read and write datsets from hdfs. * Only enabled when cmake is run with -DUSE_HDFS:BOOL=TRUE * Introduces VirtualFile(Reader|Writer) to asbtract VFS differences
@guolinke I've tested this against a hadoop cluster I have available, running CDH 5.5, on input files up to 100GB. Works for reading in text or binary datasets and writing out binary datasets. I haven't explicitly tested the other code paths like predict but see no reason it should vary as long as the local file tests also pass. I've also added a second patch to this PR, but i could split it into a second if preferred. The second patch allows providing a comma separated list of filenames and have them treated a single file concatenated together. This is particularly important for hdfs as it is very typical (and highly performant when writing) to write out a single large file as lots of smaller files of perhaps 256M or 1GB each. Some python tests were added to show this working. |
@ebernhardson thanks very much. |
@ebernhardson |
Ran each of these for 10 iterations against the full higgs dataset (xeon [email protected]). Numbers are all close I would judge the patch as having no performance impact. HDFS speed is also not so bad at 80% of line speed (and the nic is shared).
|
@ebernhardson thanks very much. It is a very great PR. |
LocalFile(const std::string& filename, const std::string& mode) : filename_(filename), mode_(mode) {} | ||
virtual ~LocalFile() { | ||
if (file_ != NULL) { | ||
fclose(file_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set file to NULL after close ? I am not sure will fclose
change FILE to NULL or not.