[python] add type hints for custom objective and metric functions in scikit-learn interface #4547

jameslamb · 2021-08-23T03:52:58Z

Created in response to #4544 (review).

Contributes to #3756.

Proposes introducing more specific type hints for custom objective and metric functions in the scikit-learn and Dask interfaces in the Python package.


StrikerRUS · 2021-08-25T17:56:14Z

python-package/lightgbm/sklearn.py

+        [_ArrayLike, _ArrayLike, _GroupType],
+        Tuple[np.ndarray, np.ndarray]


Could you please clarify why you are distinguishing these three types (_ArrayLike, _GroupType, np.ndarray)? They are all documented as array-like.

LightGBM/python-package/lightgbm/sklearn.py

Lines 32 to 49 in bd28a36

y_true : array-like of shape = [n_samples]
    The target values.
y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
    The predicted values.
    Predicted values are returned before any transformation,
    e.g. they are raw margin instead of probability of positive class for binary task.
group : array-like
    Group/query data.
    Only used in the learning-to-rank task.
    sum(group) = n_samples.
    For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
    where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
    The value of the first order derivative (gradient) of the loss
    with respect to the elements of y_pred for each sample point.
hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
    The value of the second order derivative (Hessian) of the loss
    with respect to the elements of y_pred for each sample point.
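To make the documented signature concrete, here is a minimal sketch of a custom objective with the `(y_true, y_pred) -> (grad, hess)` shape described in the docstring above. The function name `squared_error_objective` is hypothetical, and the squared-error loss is only an illustrative choice; neither comes from LightGBM itself.

```python
from typing import Tuple

import numpy as np


def squared_error_objective(
    y_true: np.ndarray, y_pred: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical custom objective for the loss 0.5 * (y_pred - y_true) ** 2."""
    # Coerce defensively so the sketch also tolerates list-like inputs.
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    grad = y_pred - y_true        # first derivative with respect to y_pred
    hess = np.ones_like(y_pred)   # second derivative is the constant 1
    return grad, hess


grad, hess = squared_error_objective(
    np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0])
)
```

Both outputs have one entry per sample, matching the `[n_samples]` shape in the docstring. A callable like this would typically be passed as the `objective` argument of an estimator such as `LGBMRegressor`.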

I didn't realize that y_true and y_pred could be lists, I thought they had to be a pandas Series, numpy array, or scipy matrix.

For grad and hess, it seems that they cannot be scipy matrices or pandas DataFrames / Series (although I didn't realize they could be lists)

LightGBM/python-package/lightgbm/basic.py

Lines 2956 to 2957 in bd28a36

grad, hess = fobj(self.__inner_predict(0), self.train_set)
return self.__boost(grad, hess)

LightGBM/python-package/lightgbm/basic.py

Lines 2972 to 2977 in bd28a36

grad : list or numpy 1-D array
    The value of the first order derivative (gradient) of the loss
    with respect to the elements of score for each sample point.
hess : list or numpy 1-D array
    The value of the second order derivative (Hessian) of the loss
    with respect to the elements of score for each sample point.

To be honest, I'm unsure what "array-like" means in different parts of LightGBM's docs, and when I see it I'm not always sure which of these are supported:

list

numpy array

scipy sparse matrix

h2o datatable

pandas DataFrame

pandas Series

So I took a best guess based on a quick look through the code, but I probably need to test all of those combinations and then update this PR / the docs as appropriate.

Yeah, I absolutely agree that array-like everywhere in the sklearn wrapper looks confusing. I might be wrong, but I think it was written before scikit-learn introduced a formal definition of the array-like term:
https://scikit-learn.org/stable/glossary.html#term-array-like

All these values in custom function signatures are supposed to have exactly 1 dimension, right? I believe it is safe for now to assign them the following types, which we treat as 1-D arrays internally:

LightGBM/python-package/lightgbm/basic.py

Lines 179 to 180 in bd28a36

raise TypeError(f"Wrong type({type(data).__name__}) for {name}.\n"
                "It should be list, numpy 1-D array or pandas Series")

For grad and hess, the function list_to_1d_numpy is applied directly:

LightGBM/python-package/lightgbm/basic.py

Lines 2984 to 2985 in bd28a36

grad = list_to_1d_numpy(grad, name='gradient')
hess = list_to_1d_numpy(hess, name='hessian')
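As an illustration of the kind of coercion being discussed, the sketch below converts a list or a 1-D numpy array to a 1-D array and rejects anything else. It is not LightGBM's `list_to_1d_numpy`: the name `to_1d_float_array`, the default dtype, and the `ValueError` for multi-dimensional input are assumptions, and pandas Series handling is deliberately left out so that only numpy is required.

```python
import numpy as np


def to_1d_float_array(data, name="data", dtype=np.float32):
    """Hypothetical helper: coerce a list or 1-D numpy array to a 1-D array."""
    if isinstance(data, np.ndarray):
        if data.ndim != 1:
            raise ValueError(
                f"{name} must be 1-dimensional, got {data.ndim} dimensions"
            )
        # copy=False avoids a copy when the dtype already matches.
        return data.astype(dtype, copy=False)
    if isinstance(data, list):
        return np.array(data, dtype=dtype)
    # Any other type (tuple, dict, ...) is rejected, mirroring the error above.
    raise TypeError(
        f"Wrong type({type(data).__name__}) for {name}.\n"
        "It should be list or numpy 1-D array"
    )


grad_array = to_1d_float_array([0.5, 0.0, -1.0], name="gradient")
```

Under this sketch, a custom objective can return plain lists for grad and hess and they are still normalized to numpy arrays before use.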

For weight and group only np.ndarray is possible, if I'm not mistaken:

LightGBM/python-package/lightgbm/sklearn.py

Lines 176 to 179 in bd28a36

elif argc == 3:
    return self.func(labels, preds, dataset.get_weight())
elif argc == 4:
    return self.func(labels, preds, dataset.get_weight(), dataset.get_group())
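The excerpt above dispatches on the number of arguments the user's function accepts. A minimal, standard-library-only sketch of that idea follows. The class name `EvalFunctionWrapper`, the use of `inspect.signature` to compute `argc`, the two-argument branch, and the final `TypeError` are assumptions that the excerpt does not show. The example metrics return a `(name, value, is_higher_better)` tuple.

```python
from inspect import signature


class EvalFunctionWrapper:
    """Hypothetical wrapper that dispatches on the user function's arity."""

    def __init__(self, func):
        self.func = func

    def __call__(self, labels, preds, weight=None, group=None):
        # Count the parameters the user-supplied function declares.
        argc = len(signature(self.func).parameters)
        if argc == 2:
            return self.func(labels, preds)
        elif argc == 3:
            return self.func(labels, preds, weight)
        elif argc == 4:
            return self.func(labels, preds, weight, group)
        raise TypeError(f"Custom function should have 2, 3 or 4 arguments, got {argc}")


def mean_abs_error(y_true, y_pred):
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return "mae", total / len(y_true), False


def weighted_mean_abs_error(y_true, y_pred, weight):
    total = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weight))
    return "wmae", total / sum(weight), False


plain = EvalFunctionWrapper(mean_abs_error)([1.0, 2.0], [2.0, 2.0])
weighted = EvalFunctionWrapper(weighted_mean_abs_error)(
    [1.0, 2.0], [2.0, 2.0], weight=[1.0, 3.0]
)
```

Because the wrapper decides what to pass, the user's function only ever sees the types the wrapper hands it, which is why the types returned by `get_weight()` and `get_group()` matter for the hints.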

LightGBM/python-package/lightgbm/basic.py

Lines 2215 to 2225 in bd28a36

def get_weight(self):
    """Get the weight of the Dataset.

    Returns
    -------
    weight : numpy array or None
        Weight for each data point from the Dataset.
    """
    if self.weight is None:
        self.weight = self.get_field('weight')
    return self.weight

LightGBM/python-package/lightgbm/basic.py

Lines 2271 to 2288 in bd28a36

def get_group(self):
    """Get the group of the Dataset.

    Returns
    -------
    group : numpy array or None
        Group/query data.
        Only used in the learning-to-rank task.
        sum(group) = n_samples.
        For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
        where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
    """
    if self.group is None:
        self.group = self.get_field('group')
        if self.group is not None:
            # group data from LightGBM is boundaries data, need to convert to group size
            self.group = np.diff(self.group)
    return self.group
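The comment in `get_group` says the stored field holds boundaries rather than sizes, and `np.diff` converts one to the other. The following sketch works through that round trip for the docstring's own example using only the standard library. It assumes the boundaries are cumulative offsets starting at 0, which the excerpt does not state explicitly.

```python
from itertools import accumulate

# Group sizes from the docstring example: 6 groups covering 100 documents.
group_sizes = [10, 20, 40, 10, 10, 10]

# Cumulative boundaries (assumed to start at 0): one more entry than groups.
boundaries = [0, *accumulate(group_sizes)]

# Pure-Python equivalent of np.diff: differences of consecutive boundaries.
recovered_sizes = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
```

Here `boundaries` is `[0, 10, 30, 70, 80, 90, 100]`, and taking consecutive differences recovers the original sizes, so `sum(group) = n_samples` holds.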

LightGBM/python-package/lightgbm/basic.py

Lines 1507 to 1510 in bd28a36

if weight is not None:
    self.set_weight(weight)
if group is not None:
    self.set_group(group)

LightGBM/python-package/lightgbm/basic.py

Lines 2099 to 2119 in bd28a36

def set_weight(self, weight):
    """Set weight of each instance.

    Parameters
    ----------
    weight : list, numpy 1-D array, pandas Series or None
        Weight to be set for each data point.

    Returns
    -------
    self : Dataset
        Dataset with set weight.
    """
    if weight is not None and np.all(weight == 1):
        weight = None
    self.weight = weight
    if self.handle is not None and weight is not None:
        weight = list_to_1d_numpy(weight, name='weight')
        self.set_field('weight', weight)
        self.weight = self.get_field('weight')  # original values can be modified at cpp side
    return self

LightGBM/python-package/lightgbm/basic.py

Lines 2141 to 2162 in bd28a36

def set_group(self, group):
    """Set group size of Dataset (used for ranking).

    Parameters
    ----------
    group : list, numpy 1-D array, pandas Series or None
        Group/query data.
        Only used in the learning-to-rank task.
        sum(group) = n_samples.
        For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
        where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

    Returns
    -------
    self : Dataset
        Dataset with set group.
    """
    self.group = group
    if self.handle is not None and group is not None:
        group = list_to_1d_numpy(group, np.int32, name='group')
        self.set_field('group', group)
    return self

The same logic applies to y_true as to weight and group:

LightGBM/python-package/lightgbm/sklearn.py

Line 172 in bd28a36

labels = dataset.get_label()

For y_pred only np.ndarray is possible

LightGBM/python-package/lightgbm/basic.py

Line 2956 in bd28a36

grad, hess = fobj(self.__inner_predict(0), self.train_set)

LightGBM/python-package/lightgbm/basic.py

Line 3732 in bd28a36

feval_ret = eval_function(self.__inner_predict(data_idx), cur_data)

LightGBM/python-package/lightgbm/basic.py

Line 3763 in bd28a36

return self.__inner_predict_buffer[data_idx]

LightGBM/python-package/lightgbm/basic.py

Line 3750 in bd28a36

self.__inner_predict_buffer[data_idx] = np.empty(n_preds, dtype=np.float64)

Sorry for taking so long to get back to this one!

I just pushed ea1aada with my best understanding of your comments above, but to be honest I'm still confused about exactly what is allowed.

Here is my interpretation of those comments / links:

eval function

y_true = list, numpy array, or pandas Series

y_pred = numpy array

group = numpy array

weight = numpy array

objective function

y_true = list, numpy array, or pandas Series

y_pred = numpy array

group = numpy array

grad (output) = list, numpy array, or pandas Series

hess (output) = list, numpy array, or pandas Series
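The interpretation listed above can be written down as type aliases roughly as follows. This is a sketch only: the alias names are hypothetical and are not claimed to match what the PR merged, the `(name, value, is_higher_better)` return type for eval functions is an assumption not discussed in this thread, and pandas is referenced through a string annotation so the snippet runs without pandas installed.

```python
from typing import TYPE_CHECKING, Callable, List, Tuple, Union

import numpy as np

if TYPE_CHECKING:
    import pandas as pd  # only needed by static type checkers

# Hypothetical alias for "list, numpy array, or pandas Series".
ArrayLike1D = Union[List[float], np.ndarray, "pd.Series"]

# Objective: (y_true, y_pred[, group]) -> (grad, hess)
CustomObjective = Union[
    Callable[[ArrayLike1D, np.ndarray], Tuple[ArrayLike1D, ArrayLike1D]],
    Callable[[ArrayLike1D, np.ndarray, np.ndarray], Tuple[ArrayLike1D, ArrayLike1D]],
]

# Eval: (y_true, y_pred[, weight[, group]]) -> assumed result tuple
EvalResult = Tuple[str, float, bool]
CustomEval = Union[
    Callable[[ArrayLike1D, np.ndarray], EvalResult],
    Callable[[ArrayLike1D, np.ndarray, np.ndarray], EvalResult],
    Callable[[ArrayLike1D, np.ndarray, np.ndarray, np.ndarray], EvalResult],
]
```

Spelling out one `Callable` per supported arity keeps the hint honest about which argument counts are accepted, at the cost of some repetition.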

Double-checked this and I think your interpretation is fine. Thanks for the detailed investigation.

python-package/lightgbm/sklearn.py

StrikerRUS

I'm sorry, my previous comment was very vague. But you've got almost everything right from it! 😄

python-package/lightgbm/dask.py

python-package/lightgbm/sklearn.py

Co-authored-by: Nikita Titov <[email protected]>

StrikerRUS

Thank you very much for the deep investigation into accepted types!

jameslamb · 2021-11-15T20:06:05Z

Thanks again for the help @StrikerRUS , this one required a lot of investigation 😄

github-actions · 2023-08-23T14:30:39Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

[python] add type hints for custom objective and metric functions in …

a8af94e

…scikit-learn interface

jameslamb added the maintenance label Aug 23, 2021

jameslamb requested a review from StrikerRUS August 23, 2021 03:52

jameslamb requested review from chivee, henry0312 and shiyu1994 as code owners August 23, 2021 03:52

jameslamb added the awaiting review label Aug 25, 2021

StrikerRUS reviewed Aug 25, 2021


StrikerRUS removed the awaiting review label Aug 25, 2021

StrikerRUS mentioned this pull request Aug 28, 2021

[python-package] type hints in python package #3756

Open

12 tasks

merge master

9d00c4c

jameslamb requested review from hzy46, jmoralez and tongwu-sh as code owners November 8, 2021 02:17

jameslamb added 2 commits November 7, 2021 20:36

update type hints

ea1aada

remote unnecessary input

f469ccb

shiyu1994 reviewed Nov 8, 2021


python-package/lightgbm/sklearn.py (outdated, resolved)

shiyu1994 approved these changes Nov 11, 2021


Merge branch 'master' into function-hints

07d2170

StrikerRUS requested changes Nov 14, 2021


python-package/lightgbm/dask.py (outdated, resolved)

python-package/lightgbm/sklearn.py (outdated, resolved)

jameslamb and others added 3 commits November 14, 2021 22:29

Update python-package/lightgbm/sklearn.py

d39c040

Co-authored-by: Nikita Titov <[email protected]>

Merge branch 'master' into function-hints

a2ff6ee

remove type hint on objective being callable

ccca965

jameslamb requested a review from StrikerRUS November 15, 2021 03:33

StrikerRUS approved these changes Nov 15, 2021


jameslamb merged commit 843d380 into master Nov 15, 2021

jameslamb deleted the function-hints branch November 15, 2021 20:05

StrikerRUS mentioned this pull request Nov 20, 2021

[python][docs] fix type hints for custom functions and remove vague array-like wording #4816

Merged

StrikerRUS mentioned this pull request Jan 6, 2022

[DO NOT MERGE] Release 3.3.2 #4930

Closed

13 tasks

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
