
Experimental support for HDFS #1243

Merged
merged 1 commit into from
Feb 27, 2018
Conversation

ebernhardson
Contributor

  • Read and write datasets from HDFS.
  • Only enabled when cmake is run with -DUSE_HDFS:BOOL=TRUE
  • Introduces VirtualFile(Reader|Writer) to abstract VFS differences

@msftclas

msftclas commented Feb 15, 2018

CLA assistant check
All CLA requirements met.

@ebernhardson
Contributor Author

ebernhardson commented Feb 15, 2018

I've forwarded the CLA to legal; it's typically resolved in a day.

@@ -98,7 +98,7 @@ class BinMapper {
* \brief Save binary data to file
* \param file File want to write
*/
void SaveBinaryToFile(FILE* file) const;
void SaveBinaryToFile(std::shared_ptr<VirtualFileWriter> file) const;
Collaborator

Why use shared_ptr? We often use const raw pointers. What are the advantages of shared_ptr?

Contributor Author

It's only a personal preference; these objects are very low volume, so the overhead should be minimal, and the shared_ptr made reasoning about the lifetime much easier. If you prefer, though, I can rework this so the base instance gets allocated on the stack and we pass const pointers. Stack allocation should keep the lifetime relatively simple.

Collaborator

Okay, got it. The pointer is actually not 'const' inside these functions.

if (fs_ == NULL) {
auto namenode = std::vector<char>(100);
if (1 != sscanf(filename_.c_str(), "hdfs://%99[^/]/", namenode.data())) {
Log::Warning("Could not parse namenode from uri [%s]", filename_.c_str());
Contributor Author

Log::Fatal, or at least return false

}

static inline bool Exists(const char* filename) {
FILE* file_ = fopen(filename, "rb");
Contributor Author

Missing ifdefs for MSVC.

}

virtual bool Init() {
if (file_ == NULL) {
Contributor Author

This init code is pretty much duplicated for Read/Write. Can it be shared?


virtual bool Init()=0;
virtual size_t Write(const void* data, size_t bytes) const=0;
static std::shared_ptr<VirtualFileWriter> Make(const char* filename);
Collaborator

unique_ptr?

}
std::string line1, line2;
size_t buffer_size = 32*1024;
Collaborator

I think this may not be enough for long lines.

Contributor Author

I've updated this so it should work for any length line possible, and added a test on the python package to assert it's working.

@guolinke
Collaborator

@ebernhardson Sorry, we are on the Chinese New Year holiday, so the review is slow.
Did you test this?

@StrikerRUS it seems the check-the-docs test failed again.

@StrikerRUS
Collaborator

@guolinke Yeah, I see...

The sites with the Yahoo dataset and the MPICH paper are down. The SWIG site returns this warning: Project web is currently offline pending the final migration of its data to our new datacenter.

I think that Yahoo and SWIG will be back online in a few days and we should just wait. Not sure about the site with the paper; maybe it should be replaced with another site.

@StrikerRUS
Collaborator

@guolinke All sites are alive now, I've re-run Travis.

* Read and write datasets from HDFS.
* Only enabled when cmake is run with -DUSE_HDFS:BOOL=TRUE
* Introduces VirtualFile(Reader|Writer) to abstract VFS differences
@ebernhardson
Contributor Author

@guolinke I've tested this against a hadoop cluster I have available, running CDH 5.5, on input files up to 100GB. Works for reading in text or binary datasets and writing out binary datasets. I haven't explicitly tested the other code paths like predict but see no reason it should vary as long as the local file tests also pass.

I've also added a second patch to this PR, but I could split it into a second PR if preferred. The second patch allows providing a comma-separated list of filenames and having them treated as a single concatenated file. This is particularly important for HDFS, where it is very typical (and highly performant when writing) to write out a single large dataset as lots of smaller files of perhaps 256M or 1GB each. Some python tests were added to show this working.

@guolinke
Collaborator

guolinke commented Feb 23, 2018

@ebernhardson thanks very much.
I think using another PR for the multi-file support is better, and we can merge this PR first.

@guolinke
Collaborator

@ebernhardson
Do you have any speed test for the IO time (dataset loading) after this PR?
You can use the higgs dataset (https://github.com/guolinke/boosting_tree_benchmarks/tree/master/data) to have a quick test.

@ebernhardson
Contributor Author

ebernhardson commented Feb 26, 2018

Ran each of these for 10 iterations against the full higgs dataset (xeon [email protected]). The numbers are all close; I would judge the patch as having no performance impact. HDFS speed is also not bad at 80% of line speed (and the NIC is shared).

                                    min    mean   max
master                              11.1s  11.7s  14.7s
hdfs patch, USE_HDFS=0, local file  10.9s  11.4s  13.9s
hdfs patch, USE_HDFS=1, local file  11.2s  11.9s  14.8s
hdfs patch, hdfs file               1m15s  1m18s  1m22s

@guolinke
Collaborator

@ebernhardson Thanks very much. It is a great PR.

@guolinke guolinke merged commit 7e186a5 into microsoft:master Feb 27, 2018
LocalFile(const std::string& filename, const std::string& mode) : filename_(filename), mode_(mode) {}
virtual ~LocalFile() {
if (file_ != NULL) {
fclose(file_);
Collaborator

Set file_ to NULL after close? I am not sure whether fclose changes the FILE* to NULL or not.

@chivee chivee mentioned this pull request Feb 27, 2018
guolinke pushed a commit that referenced this pull request Feb 27, 2018
* minor change

* unix line break
@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020