Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trainer estimator cleanup for FastTrees and LightGBM #1352

Merged
merged 12 commits into from
Oct 27, 2018

Conversation

sfilipi
Copy link
Member

@sfilipi sfilipi commented Oct 24, 2018

1- Adding the GroupId to the TrainerEstimatorBase class.
2- Adding the static xtensions for LightGM Multiclass and Ranking. Fixes #1314
3- Reorganizing the static and dynamic xtensions for FastTree and LightGBM

Addresses part of #1318

Adding tests, and refactoring catalog and pigsty statics
adding groupid to the trainer estimator base
refactoring the catalog and static extensions for trees
@@ -377,9 +377,10 @@ public static SchemaShape.Column MakeR4VecFeature(string featureColumn)
/// The <see cref="SchemaShape.Column"/> for the weight column.
/// </summary>
/// <param name="weightColumn">name of the weight column</param>
public static SchemaShape.Column MakeR4ScalarWeightColumn(string weightColumn)
/// <param name="isExplicit">whether the column is implicitely, or explicitely defined</param>
Copy link
Contributor

@rauhs rauhs Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling: implicitly & explicitly #Closed

@sfilipi sfilipi self-assigned this Oct 24, 2018
@sfilipi sfilipi added the API Issues pertaining the friendly API label Oct 24, 2018
/// <summary>
/// The optional groupID column that the ranking trainers expects.
/// </summary>
public readonly SchemaShape.Column GroupIdColumn;
Copy link
Contributor

@Zruty0 Zruty0 Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GroupIdColumn [](start = 43, length = 13)

I am not sure I like the idea that every estimator will have the group ID column, but it only makes sense for ranking trainers. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative would be to modify the role mapped data after the estimator constructs it, and inject the groupId, and that feels error-prone. It's just one reference :)


In reply to: 227866406 [](ancestors = 227866406)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have a TrainerEstimatorBase, is there anything that prevents us from having another subclass, if we wanted to have additional roles? We may have to change something below to be a virtual method or have some other extension point, but I feel like this would be helpful anyway.

I am not really a fan of having a single utility base class handle everything, since if it is capable of doing everything I suspect it will become too unwieldly.


In reply to: 227895594 [](ancestors = 227895594,227866406)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk what i like less; the GroupId field here, or making MakeRoles protected virtual ... changing it anyways.


In reply to: 228312151 [](ancestors = 228312151,227895594,227866406)


internal static class LightGbmStaticsUtils {
/// <summary>
/// LightGbm <see cref="RankingContext"/> extension method.
Copy link
Contributor

@Zruty0 Zruty0 Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LightGbm extension method. [](start = 12, length = 55)

Change the summary to look like the below summary #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still here


In reply to: 227866657 [](ancestors = 227866657)


if (advancedSettings != null)
CheckArgsAndAdvancedSettingMismatch(numLeaves, minDataPerLeaf, learningRate, numBoostRound, new LightGbmArguments(), Args);

Copy link
Contributor

@Zruty0 Zruty0 Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there has been a misconception.

It is OK to have the parameters (like numLeaves and learningRate) available in 2 places (inside Args and as a ctor argument.
The correct pattern is

var args = new Arguments();
args.NumLeaves = numLeaves;
args.LearningRate  = learningRate;
//...
advancedSettings?.Invoke(args);

That is, we set the args' fields to the user-provided ctor arguments, and then let the user optionally change them again in the delegate: this is not a bug.

One exception is for column names: we are always ignoring the column names if they are set in the arguments. So I suggested to check (and throw) if the user sets something like FeatureColumn in the args to the value different to featureColumn ctor argument (as we will not respect that change).

Apparently, the latter is not yet done, and it's probably fine for now. But let's not check the other args. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I have been doing, is the exact opposite, and i do prefer that tbh because of two reasons:

1- We do have parms in two places, and i think whenever we can deprecate those params from the arguments, we should do so. I don;t think we should promote the buried arguments in favor of the ones that surfaced in the signature.

2- It keeps the behavior consistent with the column names.

cc @TomFinley for his opinion.


In reply to: 227869641 [](ancestors = 227869641)

Copy link
Contributor

@TomFinley TomFinley Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of enforcing a mismatch, since I could easily imagine situations where someone sets some reasonable arguments, but might thanks to some complicated logic decide to think better of it in the more complex setter.

However there is an addition point I wanted to address, since my first reaction is to disagree with it somewhat.

We do have parms in two places, and i think whenever we can deprecate those params from the arguments, we should do so.

I am strongly not in favor of this. I believe it is essential that the object with the advanced setting also continue to have the basic settings.

The reason is, a delegate for setting the object might be something used alongside the constructor, but in more programmatic, non-direct instantiation settings, especially in situations where several decisions have to be taken, it might be a real honest-to-god separate method living in a class somewhere. If the options argument has omitted the most common options, as suggested here, then we would force this scenario to spread its setting objects across two places, in a way that becomes suddenly rather more complex than just setting a bunch of properties in a single object.

A more minor consideration is that it seems more in-line with the idiom from ASP.NET, which is where @alexdegroot originally suggested we lift the idiom here.

My inclination therefore is to agree with @Zruty0 on this one, sorry @sfilipi. :)


In reply to: 227886248 [](ancestors = 227886248,227869641)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the elaborate and convincing argument @[email protected]. changing it everywhere.


In reply to: 228319239 [](ancestors = 228319239,227886248,227869641)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

/// <param name="columnName">name of the weight column</param>
public static SchemaShape.Column MakeU4ScalarColumn(string columnName)
{
if (columnName == null)
Copy link
Member

@singlis singlis Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check needed? It looks like SchemaShape.Column constructor also checks the name for null or empty string. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, because the check inside the constructor will throw if we pass null.


In reply to: 227989462 [](ancestors = 227989462)

/// </summary>
/// <param name="ctx">The <see cref="BinaryClassificationContext"/>.</param>
/// <param name="label">The label column.</param>
/// <param name="features">The features colum.</param>
Copy link
Member

@singlis singlis Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column, same on the other function comments. #Resolved

float l2Weight,
float optimizationTolerance,
int memorySize,
bool enforceNoNegativity)
: this(env, ArgsInit(featureColumn, labelColumn, weightColumn, advancedSettings), labelColumn,
l1Weight, l2Weight, optimizationTolerance, memorySize, enforceNoNegativity)
: this(env, new TArgs(), labelColumn,
Copy link
Member Author

@sfilipi sfilipi Oct 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TArgs() [](start = 27, length = 8)

this needs to get all the other parameters in #Resolved

if (Args.LearningRates != learningRate)
{
using (var ch = Host.Start($"Setting learning rate to: {learningRate} as supplied in the direct arguments."))
Args.LearningRates = learningRate;
Copy link
Contributor

@Zruty0 Zruty0 Oct 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Args [](start = 16, length = 4)

indenting #Resolved

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@@ -51,12 +51,18 @@ internal RegressionGamTrainer(IHostEnvironment env, Arguments args)
/// <param name="labelColumn">The name of the label column.</param>
/// <param name="featureColumn">The name of the feature column.</param>
/// <param name="weightColumn">The name for the column containing the initial weight.</param>
/// <param name="learningRate">The learning rate.</param>
Copy link
Member

@singlis singlis Oct 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit, the ordering of parameters, minDocumentsInLeafs and learningRate. #Resolved

/// <param name="weightColumn">The name for the column containing the initial weight.</param>
/// <param name="learningRate">The learning rate.</param>
Copy link
Member

@singlis singlis Oct 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordering of parameters here too. #Resolved


internal class FastTreeStaticsUtils
{
internal static void CheckUserValues(PipelineColumn label, Vector<float> features, Scalar<float> weights,
Copy link
Member

@singlis singlis Oct 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent is off here... #Resolved

public LinearClassificationTrainer(IHostEnvironment env,
string featureColumn,
string featureColumn,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent is off here

Copy link
Member

@singlis singlis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I have a few minor nit picks on formatting.

@sfilipi sfilipi merged commit 5123aee into dotnet:master Oct 27, 2018
@sfilipi
Copy link
Member Author

sfilipi commented Oct 27, 2018

thank you for reviewing @singlis and @Zruty0

@sfilipi sfilipi deleted the trainerEstimatorCleanup branch October 27, 2018 06:00
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants