Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some paramater hints when loading from binary file #4701

Merged
merged 3 commits into from
Oct 25, 2021

Conversation

hzy46
Copy link
Contributor

@hzy46 hzy46 commented Oct 21, 2021

#4657 describes ignore_column will cause an error when loading from binary file. I think here're two related issues in the code:

  1. SetHeader should only be called when loading from text file. Now, it is also called when loading the data from binary file. This is why error mentioned in issue 4657 happens.

SetHeader(filename);

  1. Apart from ignore_column, parameters two_round, header, label_column, weight_column, group_column also have no effect when loading from binary. I changed the doc. Need to confirm whether we should add warning for these parameters.

@StrikerRUS
Copy link
Collaborator

@hzy46 Thanks for your contribution! Parameters.rst shouldn't be touched directly. Instead, make your changes in the config.h file and then run python helpers/parameter_generator.py to update Parameters.rst.
Refer to https://github.com/microsoft/LightGBM/blame/master/docs/Parameters.rst#L1.

@hzy46
Copy link
Contributor Author

hzy46 commented Oct 22, 2021

@hzy46 Thanks for your contribution! Parameters.rst shouldn't be touched directly. Instead, make your changes in the config.h file and then run python helpers/parameter_generator.py to update Parameters.rst. Refer to https://github.com/microsoft/LightGBM/blame/master/docs/Parameters.rst#L1.

OK, got it.

SetHeader(filename);
if (filename != nullptr && CheckCanLoadFromBin(filename) == "") {
// SetHeader should only be called when loading from text file
SetHeader(filename);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testing cases are failing because SetHeader does not only handle cases where input are from files. It also reads categorical feature indices from the config parameters (see the part outside the if (filename != nullptr) { ... }).

Skipping SetHeader directly here will cause errors when we load data from numpy or pandas arrays (where filename == nullptr) and use categorical features.

So I think we should move the the check filename != nullptr && CheckCanLoadFromBin(filename) == "" into SetHeader. That is, we change if (filename != nullptr) { ... } into if (filename != nullptr && CheckCanLoadFromBin(filename) == "") { ... }

@hzy46
Copy link
Contributor Author

hzy46 commented Oct 22, 2021

@shiyu1994 @StrikerRUS Thanks for your comment. I have updated the code. Please check.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Just one question below:

}
if (config_.ignore_column != "") {
Log::Warning("Config ignore_column works only in case of loading data directly from text file. It will be ignored when loading from binary file.");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why there are no similar checks for config_.two_round and config_.header?

Copy link
Contributor Author

@hzy46 hzy46 Oct 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be obvious that two_round and header is not useful when loading from a binary file. What about your opinion?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to agree with you about that header is irrelevant param when loading from binary, but I think two_round isn't so obvious. I'd better list all parameters here for the consistency despite whether it is obvious or not that they are not useful.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this PR now? If there's any other parameters need a similar explanation, maybe we can leave it in another PR. :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
Only header and two_round have notes in their description about that they

works only in case of loading data directly from text file

but providing them for binary file doesn't trigger this warning.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants