Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More trainer extensions, bug fixes and consistency across trainer extensions #1524

Merged
merged 16 commits into from
Nov 10, 2018

Conversation

sfilipi
Copy link
Member

@sfilipi sfilipi commented Nov 4, 2018

Gam trainers placed in the right context, and removing arguments they weren't using from their signatures.
Fixes #1521 making the order of the parameters consistent.
Addresses part of #1318 by adding the FastForest extensions.

…ments from the extensions of their signatures.

No clusters in advancedSettings now overiddes the number of clusters passed directly in KMeans.
removing the default name for Features and Label colum.
@sfilipi sfilipi added the API Issues pertaining the friendly API label Nov 4, 2018
@sfilipi sfilipi self-assigned this Nov 4, 2018
@@ -27,8 +27,8 @@ public static class TreeExtensions
/// <param name="learningRate">The learning rate.</param>
/// <param name="advancedSettings">Algorithm advanced settings.</param>
public static FastTreeRegressionTrainer FastTree(this RegressionContext.RegressionTrainers ctx,
string label = DefaultColumnNames.Label,
string features = DefaultColumnNames.Features,
string label,
Copy link
Contributor

@artidoro artidoro Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen in other files that features comes before label (SymSgdClassificationTrainer for example). We should pick one order and stick to it, otherwise it is very confusing. I would suggest features, label, optional weight. But either ways could work! #Resolved

Copy link
Contributor

@artidoro artidoro Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think you already made the decision, label, features, optional weight, which is consistent with all other files in this PR. Maybe we should consider opening an issue and updating the order the arguments in other files in another PR?


In reply to: 230870480 [](ancestors = 230870480)

@@ -94,35 +94,39 @@ public class Arguments : UnsupervisedLearnerInputBaseWithWeight
/// <summary>
/// Initializes a new instance of <see cref="KMeansPlusPlusTrainer"/>
/// </summary>
/// <param name="env">The private instance of <see cref="IHostEnvironment"/>.</param>
/// <param name="env">The local instance of <see cref="IHostEnvironment"/>.</param>
Copy link
Contributor

@artidoro artidoro Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local instance of [](start = 30, length = 52)

Maybe update to: The environment to use.
So that it easier to understand. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are using this comment most everywhere.


In reply to: 230873736 [](ancestors = 230873736)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't think so. 'The local instance of IHostEnvironment' is a really bad comment for a public method.


In reply to: 231233831 [](ancestors = 231233831,230873736)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would you suggest we use @Zruty0?


In reply to: 231367726 [](ancestors = 231367726,231233831,230873736)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Artidoro said.


In reply to: 231386902 [](ancestors = 231386902,231367726,231233831,230873736)

Copy link
Member Author

@sfilipi sfilipi Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that doesn't sound more more helpful to me than the current one.. but 2 to 1.
I'll look for what do .net uses in those cases in other frameworks, and use that to update everywhere on my next PR.


In reply to: 231618928 [](ancestors = 231618928,231386902,231367726,231233831,230873736)

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

{
Host.CheckValue(args, nameof(args));

// override with the advanced settings.
if (advancedSettings != null)
Copy link
Contributor

@Zruty0 Zruty0 Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if [](start = 12, length = 2)

advancedSettings?.Invoke(args) #Closed

/// <param name="label">The label column.</param>
/// <param name="features">The features column.</param>
/// <param name="weights">The optional weights column.</param>
/// <param name="numTrees">Total number of decision trees to create in the ensemble.</param>
/// <param name="numLeaves">The maximum number of leaves per decision tree.</param>
/// <param name="minDatapointsInLeafs">The minimal number of datapoints allowed in a leaf of the tree, out of the subsampled data.</param>
Copy link
Contributor

@Zruty0 Zruty0 Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minDatapointsInLeafs [](start = 25, length = 20)

I sense inconsistent naming #Resolved

Copy link
Member Author

@sfilipi sfilipi Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see what you are refering to..


In reply to: 231226770 [](ancestors = 231226770)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MinDocsPerLeaf, MinDocumentsPerLeaf, MinDatapointsInLeafs

Also 'leaves', not 'leafs'


In reply to: 231232309 [](ancestors = 231232309,231226770)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh lol. Apparently it's been like this for a long time.


In reply to: 231367379 [](ancestors = 231367379,231232309,231226770)

@@ -59,7 +59,7 @@ public Arguments()
BasePredictors = new[]
{
ComponentFactoryUtils.CreateFromFunction(
env => new MulticlassLogisticRegression(env, FeatureColumn, LabelColumn))
env => new MulticlassLogisticRegression(env, LabelColumn,FeatureColumn))
Copy link
Contributor

@Zruty0 Zruty0 Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

,F [](start = 80, length = 2)

spacing #Closed

@@ -83,7 +83,9 @@ public sealed class Arguments : ArgumentsBase
/// <param name="memorySize">Memory size for <see cref="LogisticRegression"/>. Lower=faster, less accurate.</param>
/// <param name="optimizationTolerance">Threshold for optimizer convergence.</param>
/// <param name="advancedSettings">A delegate to apply all the advanced arguments to the algorithm.</param>
public MulticlassLogisticRegression(IHostEnvironment env, string featureColumn, string labelColumn,
public MulticlassLogisticRegression(IHostEnvironment env,
string labelColumn,
Copy link
Contributor

@Zruty0 Zruty0 Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

labelColumn [](start = 19, length = 11)

it used to be label and features in other trainers, right? #Resolved

Copy link
Member Author

@sfilipi sfilipi Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this.
I changed them to be label, features, weights per the description in #1318, but labelColumn, featuresColumn and weightColumn seems more correct since they are actually column names?


In reply to: 231231728 [](ancestors = 231231728)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but should we have it consistent?


In reply to: 231244854 [](ancestors = 231244854,231231728)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sure. I changed them to be "label, features, weights". I personally lean more towards "labelColum, featureColumn and weightColumn"


In reply to: 231367774 [](ancestors = 231367774,231244854,231231728)

Copy link
Contributor

@artidoro artidoro Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find labelColumn, featureColumn, weightColumn more appealing. I find them more representative of what the arguments specify (the name of these columns). #Resolved

Copy link
Contributor

@shauheen shauheen Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, labelColumn, featureColumn, weightColumn sound better to me. #Resolved

@@ -31,7 +31,9 @@ public static class SdcaRegressionExtensions
/// if both are present and have different values.
/// The columns names, however need to be provided directly, not through the <paramref name="advancedSettings"/>.</param>
public static SdcaRegressionTrainer StochasticDualCoordinateAscent(this RegressionContext.RegressionTrainers ctx,
string label = DefaultColumnNames.Label, string features = DefaultColumnNames.Features, string weights = null,
Copy link
Contributor

@Zruty0 Zruty0 Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= DefaultColumnNames.Label [](start = 20, length = 27)

why do we remove the defaults? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we talked about them being error prone, better ask the user to define them explicitly.


In reply to: 231231911 [](ancestors = 231231911)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomFinley do you have a preference? I would prefer to keep the defaults. Otherwise, why do we even have DefaultColumnNames in the API?


In reply to: 231234097 [](ancestors = 231234097,231231911)

Copy link
Member Author

@sfilipi sfilipi Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default arguments make sense for the case when the user doesn't supply anything, and the pipeline just works. This is not the case for the column names; if they are not "Label" and "Features" in the datasets the pipelines fail, sometimes with esoteric messages.

Most developers tend to leave defaults to whatever their are set in the first moment of being socialized to the API. Those defaults will backfire often.

I fully endorse the default for the numeric values: they guide the user about ranges, and they just work most of the time.
I am also ok with Default column names for the columns components produce and hard-code like "Score" "PredictedLabel"; but IMO hardcode the inputs is misleading and will hide errors like the ones we have fixed so far (not passing the column names at all, and things still work because our test datasets and the default col names match).


In reply to: 231367619 [](ancestors = 231367619,231234097,231231911)

Copy link
Contributor

@TomFinley TomFinley Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally it is a good policy to be cautious w.r.t. expanding defaults, as @sfilipi suggests. Being too aggressive with your defaults can lead to more confusion, not less.

I think the default arguments make sense for the case when the user doesn't supply anything, the pipeline just works. This is not the case for the column names;

I would argue that it is the case for column names more often than we might think. The text loader when no settings are defined has the default behavior, which has proven to be valuable, that it considers the first field a scalar Label and subsequent fields a vector Features.

Further, while our text parser is built so that it can accommodate a diverse set of inputs (if properly configured), we should not loose sight of the fact that generally file formats for machine learning are far more restrictive than this. Consider formats like SVM-light, the more esoteric Bing formats, (we have loaders for the previous that we haven't yet had the chance to open source) or the VW format. In these formats, "label" and "features" are priviledged concepts, and where we've written loaders for them, we name these things the columns Label and Features. Or the "wizard" case, where we apply heusitics to figure out how to load data, what else will the result be named but Label and Features?

Now consider two populations of people, under two conditions.

Condition A, with defaults!

  1. Someone uses SVM-light loader, or the Bing BIN loader, or the text loader under default settings. (Or, in future, a hypothetical VW loader.) These will provide two columns Label and Features. With defaults, they call Read() then Fit(). Things work.

  2. Now, you are talking about someone that overrides the default behavior of TextLoader (again, that default behavior ) and provides explicit names for their columns. They call Fit on this trainer. Things fail with the message that it could not find a label or feature column named Label or Features.

Condition B, without defaults!

  1. Same actions, but nwo things fail. They need to provide these column names, but how could they reasonably know what the column names are? Indeed, someone just coming to ML.NET for the first time and just fires up a session with SVM-light loader for the first time, how could we expect them to even know what a column is?

  2. Same action, things still fail.

It is far from clear to me that A2 is significantly more confusing than B2 -- beyond that to even hit A2 we need a user that is both (1) smart enough to figure out how to come up with names for columns but (2) not smart enough to figure out that things downstream might want those names. Even if such a person exists, the situation of A2 is recoverable.

We must, I think, all agree that B1 is borderline disastrously confusing. And A1 both work.

Some people have really confusing pipelines with multiple featurizers, custom data, and all this stuff. But some people just have data, already nicely structured, and just want to read and train. IDataView and Columns are sort of one of those concepts you need to know if you're doing complex featurization, but if you don't, you don't really need to know it, and I'd like to keep it that way if we can.

Therefore, my own sense is that @Zruty0 is correct, and we ought to have defaults.


In reply to: 231388508 [](ancestors = 231388508,231367619,231234097,231231911)

Copy link
Member Author

@sfilipi sfilipi Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a side, @TomFinley, appreciate the level of detail. My observation is that overriding the behavior of the TextLoader is actually the norm, and unless your immediate component is a trainerEstimator, you need the names of the columns to transform them further.

Examples of ML.Net issues with non-standard TextLoader:
#420
https://stackoverflow.com/questions/52689363/adding-data-from-new-textloader-into-exisitng-pipeline-ver-0-6
https://elbruno.com/2018/06/19/mlnet-loading-data-in-our-learningpipeline-with-list-lists-for-ever/
https://carlos.mendible.com/2018/06/10/simple-machine-learning-with-dotnet-core-sample/

I will leave this discussion active till tomorrow afternoon, to see if anyone wants to comment further about either side, and if it remains a two-one situation, I'll bitterly :) revert the changes.. #Resolved

Copy link
Member Author

@sfilipi sfilipi Nov 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote more about my experience with the default column names in #1360.

The assumption that the users don't need to specify column names in the TextLoader holds if the dataset is all numeric. If they have a text or categorical column, you'd have to know what it is to transform it further.

Also, I worried that the TextLoader producing the "Label" and "Feature" columns has discoverability issues on its own, and it hurts adoption more than helping it.
Educating the developers through tutorial, and by making the columns required would give them a faster understanding how to evolve the pipeline.
Scikit required the column names, and H2O requires the feature column, it only assumes that the Label column is the first one.

/cc @GalOshri @CESARDELATORRE for opinions #Resolved

Zruty0
Zruty0 previously approved these changes Nov 7, 2018
Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Zruty0 Zruty0 dismissed their stale review November 7, 2018 02:59

revoking review

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@sfilipi sfilipi merged commit 851558d into dotnet:master Nov 10, 2018
@sfilipi sfilipi deleted the moreTrainerXtensions branch March 15, 2019 21:15
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants