[Docs] 4.0.0 parameters missing from model.txt #6010
Thanks for using LightGBM.
I believe this was intentional in #5800. Those parameters don't affect serving a model to generate predictions (similar conversation: #6017 (comment)). Could you explain specifically why you think they should be saved?
I thought about it some more tonight and I support adding these to model files; let's see if other maintainers agree: #6017 (comment). But either way, please share as much detail as you can about why you think they "ought to be saved".
By the way, I modified the "lines 596-622" code link in your initial post to one that's tagged to a specific commit. If you're not familiar with that, see https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files#press-y-to-permalink-to-a-file-in-a-specific-commit.
Please also ensure
Reproducibility and hyperparameter logging in machine learning in general is valuable -- if I go back and look at a
I think this is true for a lot of parameters that currently are printed? If the concern here is file size, maybe there could be a succinct mode that only stores the bare minimum relevant for serving (so no irrelevant hyperparameters, no feature importances, and tree descriptions only contain splits/values, not counts/weights).
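The "succinct mode" suggested above could be prototyped entirely as post-processing of a saved model file. The sketch below is hypothetical, not part of LightGBM; the section markers (`feature_importances:`, `parameters:` ... `end of parameters`) reflect the layout of model.txt files I've seen, but should be treated as assumptions:

```python
def strip_non_serving_sections(model_text: str) -> str:
    """Drop sections of a LightGBM-style model.txt that are not needed for
    prediction. Hypothetical sketch; marker names are assumptions."""
    out_lines = []
    skipping = False
    for line in model_text.splitlines():
        stripped = line.strip()
        if stripped in ("feature_importances:", "parameters:"):
            # start of a section that is irrelevant for serving
            skipping = True
            continue
        if skipping and stripped in ("end of parameters", ""):
            # a blank line or the end marker closes the skipped section
            skipping = False
            continue
        if not skipping:
            out_lines.append(line)
    return "\n".join(out_lines)
```

The tree structure (the bulk of the file) is kept untouched; only metadata sections are removed.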
Thanks for that. I intentionally wanted to ask in an open-ended, non-leading way to see if there were other specific use cases you were thinking of that I hadn't considered. Seems like there are not and your motivation for this request is "seems like it'd be nice". To be clear... I don't disagree! We'll pick this up at some point, thanks for taking the time to start a thread about it here.
LightGBM.net relies on all parameters being present in the model file in
order to validate that the parameters were correctly specified at the input
to training.
Also in 4.0.0 I noticed that at some point the Python code will actually try to parse parameters from the model text file, namely LightGBM/python-package/lightgbm/basic.py lines 3234 to 3254 (at 858eeb5).
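Conceptually, that parsing reads back `[key: value]` entries from the `parameters:` section of a saved model. The following is an unofficial, simplified sketch of that idea, not the actual implementation in basic.py; values are left as strings for illustration:

```python
def parse_model_params(model_text: str) -> dict:
    """Read `[key: value]` entries from the `parameters:` section of a
    LightGBM-style model.txt string. Simplified, unofficial sketch."""
    params = {}
    in_section = False
    for line in model_text.splitlines():
        line = line.strip()
        if line == "parameters:":
            in_section = True
            continue
        if line == "end of parameters":
            break
        if in_section and line.startswith("[") and line.endswith("]"):
            key, _, value = line[1:-1].partition(": ")
            params[key] = value  # real code also converts types
    return params
```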
Though in practice the goal seems to be to override input parameters, see LightGBM/python-package/lightgbm/basic.py lines 3184 to 3186 (at 858eeb5).
So this in practice will end up loading the wrong parameters, e.g.:

```python
import numpy as np
import lightgbm as lgb

assert lgb.__version__ == "4.0.0"

n, p = 10_000, 10
X = np.random.normal(size=(n, p))
y = np.random.normal(size=(n,))
ds = lgb.Dataset(X, label=y, free_raw_data=False)

model = lgb.train({"verbosity": -1, "use_quantized_grad": True}, train_set=ds, num_boost_round=1)
print("use_quantized_grad" in model.params)
# >>> True

model.save_model("tmp.txt")
booster = lgb.Booster({"verbosity": -1, "use_quantized_grad": True}, model_file="tmp.txt")
print("use_quantized_grad" in booster.params)
# >>> False
```

TBH I don't really know what the idea is behind this parameter override when loading a model, but either way this doesn't really seem correct?
Just to check: was the original reason for omitting these output parameters indeed to keep the output file small for inference? Would having a smaller inference-only text file format resolve this?
Not necessarily. The idea of excluding parameters from the model file intentionally originally came from a desire to not include CLI-only parameters to the model file. For example, the CLI supports parameters like these:
Having such parameters in the model config by default is problematic outside of the CLI, like in the R and Python packages, because it can result in side effects like file-writing that users didn't ask for and which happen silently. See this thread: #2589 (comment), which I found from the git blame: https://github.com/microsoft/LightGBM/blame/82033064005d07ae98cd3003190d675992061a61/include/LightGBM/config.h#L9. I suspect that other parameters were then later marked
I don't support adding this at this time. The size of LightGBM text files is already dominated by the tree structure (which would be necessary for inference too), so I don't think a code path generating a model file which omits parameters and other model information just for the sake of file size is worth the added complexity.
I don't understand this comment. What specifically is not "correct" about that behavior you linked from the Python package?
See the code snippet I posted. I set
I understand the snippet, and understand that you mean it to say that you expected parameters passed to `lgb.Booster()` to take precedence.

But the fact that that's what you expected, or that that's what you'd prefer, doesn't mean LightGBM's current approach is not "correct". That behavior was an intentional choice. If you look in the git blame view for the code you linked to (blame link), it'll lead you to the PR that introduced it: #5424. From there, you can find that there was actually quite a lot of conversation about this topic on the PR:
@jmoralez when you have time could you read through those two threads again and give us your opinion on whether parameters passed through

Re-reading them and considering the conversation in this issue, I think we should have params specified in
The example is constructing a booster passing both params and model_file, which I'm pretty sure issues a warning about the params being ignored. I believe the conclusion was that if the user wanted to train more iterations it'd have to be done with train and passing the saved model as init_model.
Oh!! I totally missed that it was a call to the `Booster` constructor.
Just to clarify what my first snippet is supposed to show:

The point is that there is something off, whether my expectations are correct or not. Either the parameters from the [deleted section because I said something dumb]

I experimented a bit and actually the situation with:

```python
import numpy as np
import lightgbm as lgb

assert lgb.__version__ == "4.0.0"

np.random.seed(0)
n, p = 10_000, 10
X, y = np.random.normal(size=(n, p)), np.random.normal(size=(n,))
ds = lgb.Dataset(X, label=y, free_raw_data=False)

model = lgb.train({"verbosity": -1, "use_quantized_grad": True}, train_set=ds, num_boost_round=1)
model.save_model("tmp.txt")

model2 = lgb.train({"verbosity": -1}, train_set=ds, init_model="tmp.txt", num_boost_round=1)
print("use_quantized_grad" in model2.params)
# >>> False

model3 = lgb.train({"verbosity": -1, "use_quantized_grad": False}, train_set=ds, init_model="tmp.txt", num_boost_round=1)
print("use_quantized_grad" in model3.params, model3.params["use_quantized_grad"])
# >>> True False
```

So it seems to me the current situation is:
The last sentence makes perfect sense, because if we don't store
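The parameter-resolution behavior the snippets above appear to show for continued training can be summarized as a simple merge: parameters recovered from the model file form the base, and anything passed explicitly to `train()` wins. This is just my reading of the observed outputs, sketched as plain Python, not LightGBM's actual code:

```python
def resolve_params(loaded_from_file: dict, passed_to_train: dict) -> dict:
    """Sketch of the apparent precedence for continued training (init_model):
    params read back from model.txt are the base; explicit params override.
    Illustration only, not LightGBM's implementation."""
    merged = dict(loaded_from_file)   # params recovered from the saved model
    merged.update(passed_to_train)    # explicitly passed params take precedence
    return merged
```

Under this reading, since `use_quantized_grad` is not written to the model file in 4.0.0, it is absent from `loaded_from_file`, so it only appears in the resolved params if passed again explicitly.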
Just to drive the point home, can you please have a look at this code snippet, guess what it should output, and then try and run it:

```python
import numpy as np
import lightgbm as lgb

assert lgb.__version__ == "4.0.0"

np.random.seed(0)
n, p = 10_000, 10
X, y = np.random.normal(size=(n, p)), np.random.normal(size=(n,))
ds = lgb.Dataset(X, label=y, free_raw_data=False)

params1 = {"verbosity": -1, "use_quantized_grad": True, "bagging_fraction": 0.5}
params2 = {"verbosity": -1, "use_quantized_grad": False, "bagging_fraction": 1.0}

model = lgb.train(params1, train_set=ds, num_boost_round=1)
model.save_model("tmp.txt")

model2 = lgb.Booster(params2, model_file="tmp.txt")
print(f"{model2.params.get('use_quantized_grad', None)}")
print(f"{model2.params.get('bagging_fraction', None)}")
```
This param isn't currently saved in the model file and thus can't be recovered, but it's being addressed in #6077.
This issues the warning about the params argument being ignored, so it doesn't have any effect.
This is because of the two points above.
Yes, since it's not currently saved in the model file there's no way to recover it, however you can override the parameters for continued training with the lgb.train function.
This isn't stated anywhere; those parameters are meant to be used on the Python side to get information on the trained model, such as using the learned categorical features (#5246).
#6077 was just merged, so as of latest
Description
I noticed that when you save a model, some of the new parameters in the latest 4.0.0 release are not stored, e.g. the ones around quantized training. They seem to be explicitly marked as such in the code:
LightGBM/include/LightGBM/config.h
Lines 596 to 622 in 858eeb5
Is this intentional? It seems to me these parameters ought to be saved.