Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ensure consistent, reproducible choice when multiple parameter aliases are given #5304

Closed
jameslamb opened this issue Jun 19, 2022 · 3 comments · Fixed by #5338
Closed

ensure consistent, reproducible choice when multiple parameter aliases are given #5304

jameslamb opened this issue Jun 19, 2022 · 3 comments · Fixed by #5338
Assignees

Comments

@jameslamb
Copy link
Collaborator

jameslamb commented Jun 19, 2022

Summary

Many LightGBM interfaces (in R, Python, C++, and others) accept a key-value map params, which can be used to override LightGBM's default configuration. The valid values are documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

For many of those parameters, LightGBM recognizes a "main" parameter name and one or more "aliases" (other names which set the same configuration).

For example, main parameter num_iterations can also be referred to in user code as n_iter, num_round, and more (docs link).

On the C++ side, LightGBM guarantees reproducible behavior whenever multiple of these aliases are provided in the same call, like this:

{
    "num_round": 100,
    "n_iter": 200
}

LightGBM's C++, R, and Python packages should all make identical choices in such situations.

Motivation

Ensuring that LightGBM always chooses the same configuration given a certain content of params eliminates one possible source of the same code producing different results at different times or in different environments. That might save maintainers and users of the project time that would otherwise be lost investigating changes in results.

Description

Currently, the C++ side has some logic to make the choice of alias reproducible.

if (alias_set != tmp_map.end()) { // alias already set
// set priority by length & alphabetically to ensure reproducible behavior
if (alias_set->second.size() < pair.first.size() ||
(alias_set->second.size() == pair.first.size() && alias_set->second < pair.first)) {

Instead of replicating that logic in Python and R code, I believe this feature should be implemented similarly to the approach taken in #4829. The full list of recognized aliases is known at compile time, so it shouldn't necessary to write R and Python code similar to that C++ code which checks name lengths and alphabetic ordering every time params is processed.

I think these could all be kept in sync by:

  • moving that size + alphabetic ordering logic out of ParameterAlias::KeyAliasTransform() and instead having https://github.com/microsoft/LightGBM/blob/master/helpers/parameter_generator.py pre-sort all aliases that way
  • changing ParameterAlias::KeyAliasTransform() in C++ to use the output of Config::DumpAliases() or some other code on Config, iterate over aliases in order, and prefer the first one that it finds (taking advantage of the fact that the aliases have already been sorted)

To avoid the overhead of serializing and deserializing a JSON string, it might also be useful to add an intermediate method for Config::DumpAliases() that returns arrays of names, and re-use that across both Config::DumpAliases() and that alias-resolution code in ParameterAlias::KeyAliasTransform().

References

Created based on #5289 (comment).

The changes in #4829 are highly relevant to this issue, and reading that PR will help those looking to understand this issue more thoroughly.

Initial PR with deterministic aliases resolution method at cpp side: #961.

@StrikerRUS
Copy link
Collaborator

Totally support this!

I remember how we struggled without deterministic alias resolution method: #880 (comment) 😬

@jmoralez
Copy link
Collaborator

I can work on this if that's ok.

@jameslamb
Copy link
Collaborator Author

sure, that would be great!

StrikerRUS pushed a commit that referenced this issue Jul 30, 2022
…5338)

* dump sorted parameter aliases

* update lgb.check.wrapper_param

* update _choose_param_value to look like lgb.check.wrapper_param

* apply suggestions from review

* reduce diff

* move DumpAliases to config

* remove unnecessary check

* restore parameter check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants