More trainer extensions, bug fixes and consistency across trainer extensions #1524

sfilipi · 2018-11-04T16:29:48Z

Gam trainers placed in the right context, and removing arguments they weren't using from their signatures.
Fixes #1521 making the order of the parameters consistent.
Addresses part of #1318 by adding the FastForest extensions.

…ments from the extensions of their signatures. No clusters in advancedSettings now overiddes the number of clusters passed directly in KMeans.

removing the default name for Features and Label colum.

artidoro · 2018-11-05T18:57:29Z

src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs

@@ -27,8 +27,8 @@ public static class TreeExtensions
        /// <param name="learningRate">The learning rate.</param>
        /// <param name="advancedSettings">Algorithm advanced settings.</param>
        public static FastTreeRegressionTrainer FastTree(this RegressionContext.RegressionTrainers ctx,
-            string label = DefaultColumnNames.Label,
-            string features = DefaultColumnNames.Features,
+            string label,


I have seen in other files that features comes before label (SymSgdClassificationTrainer for example). We should pick one order and stick to it, otherwise it is very confusing. I would suggest features, label, optional weight. But either ways could work! #Resolved

Actually I think you already made the decision, label, features, optional weight, which is consistent with all other files in this PR. Maybe we should consider opening an issue and updating the order the arguments in other files in another PR?

In reply to: 230870480 [](ancestors = 230870480)

artidoro · 2018-11-05T19:07:15Z

src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

@@ -94,35 +94,39 @@ public class Arguments : UnsupervisedLearnerInputBaseWithWeight
        /// <summary>
        /// Initializes a new instance of <see cref="KMeansPlusPlusTrainer"/>
        /// </summary>
-        /// <param name="env">The private instance of <see cref="IHostEnvironment"/>.</param>
+        /// <param name="env">The local instance of <see cref="IHostEnvironment"/>.</param>


The local instance of [](start = 30, length = 52)

Maybe update to: The environment to use.
So that it easier to understand. #Resolved

I think we are using this comment most everywhere.

In reply to: 230873736 [](ancestors = 230873736)

Actually, I don't think so. 'The local instance of IHostEnvironment' is a really bad comment for a public method.

In reply to: 231233831 [](ancestors = 231233831,230873736)

what would you suggest we use @Zruty0?

In reply to: 231367726 [](ancestors = 231367726,231233831,230873736)

What Artidoro said.

In reply to: 231386902 [](ancestors = 231386902,231367726,231233831,230873736)

that doesn't sound more more helpful to me than the current one.. but 2 to 1.
I'll look for what do .net uses in those cases in other frameworks, and use that to update everywhere on my next PR.

In reply to: 231618928 [](ancestors = 231618928,231386902,231367726,231233831,230873736)

artidoro

Zruty0 · 2018-11-06T17:52:37Z

src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

        {
            Host.CheckValue(args, nameof(args));

+            // override with the advanced settings.
            if (advancedSettings != null)


if [](start = 12, length = 2)

advancedSettings?.Invoke(args) #Closed

Zruty0 · 2018-11-06T17:53:26Z

src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs

        /// <param name="label">The label column.</param>
        /// <param name="features">The features column.</param>
        /// <param name="weights">The optional weights column.</param>
-        /// <param name="numTrees">Total number of decision trees to create in the ensemble.</param>
-        /// <param name="numLeaves">The maximum number of leaves per decision tree.</param>
        /// <param name="minDatapointsInLeafs">The minimal number of datapoints allowed in a leaf of the tree, out of the subsampled data.</param>


minDatapointsInLeafs [](start = 25, length = 20)

I sense inconsistent naming #Resolved

I can't see what you are refering to..

In reply to: 231226770 [](ancestors = 231226770)

MinDocsPerLeaf, MinDocumentsPerLeaf, MinDatapointsInLeafs

Also 'leaves', not 'leafs'

In reply to: 231232309 [](ancestors = 231232309,231226770)

oh lol. Apparently it's been like this for a long time.

In reply to: 231367379 [](ancestors = 231367379,231232309,231226770)

Zruty0 · 2018-11-06T18:07:02Z

src/Microsoft.ML.Ensemble/Trainer/Multiclass/MulticlassDataPartitionEnsembleTrainer.cs

@@ -59,7 +59,7 @@ public Arguments()
                BasePredictors = new[]
                {
                    ComponentFactoryUtils.CreateFromFunction(
-                        env => new MulticlassLogisticRegression(env, FeatureColumn, LabelColumn))
+                        env => new MulticlassLogisticRegression(env, LabelColumn,FeatureColumn))


,F [](start = 80, length = 2)

spacing #Closed

Zruty0 · 2018-11-06T18:07:57Z

src/Microsoft.ML.StandardLearners/Standard/LogisticRegression/MulticlassLogisticRegression.cs

@@ -83,7 +83,9 @@ public sealed class Arguments : ArgumentsBase
        /// <param name="memorySize">Memory size for <see cref="LogisticRegression"/>. Lower=faster, less accurate.</param>
        /// <param name="optimizationTolerance">Threshold for optimizer convergence.</param>
        /// <param name="advancedSettings">A delegate to apply all the advanced arguments to the algorithm.</param>
-        public MulticlassLogisticRegression(IHostEnvironment env, string featureColumn, string labelColumn,
+        public MulticlassLogisticRegression(IHostEnvironment env,
+            string labelColumn,


labelColumn [](start = 19, length = 11)

it used to be label and features in other trainers, right? #Resolved

Thanks for this.
I changed them to be label, features, weights per the description in #1318, but labelColumn, featuresColumn and weightColumn seems more correct since they are actually column names?

In reply to: 231231728 [](ancestors = 231231728)

but should we have it consistent?

In reply to: 231244854 [](ancestors = 231244854,231231728)

for sure. I changed them to be "label, features, weights". I personally lean more towards "labelColum, featureColumn and weightColumn"

In reply to: 231367774 [](ancestors = 231367774,231244854,231231728)

I also find labelColumn, featureColumn, weightColumn more appealing. I find them more representative of what the arguments specify (the name of these columns). #Resolved

FWIW, labelColumn, featureColumn, weightColumn sound better to me. #Resolved

Zruty0 · 2018-11-06T18:08:29Z

src/Microsoft.ML.StandardLearners/Standard/SdcaCatalog.cs

@@ -31,7 +31,9 @@ public static class SdcaRegressionExtensions
        /// if both are present and have different values.
        /// The columns names, however need to be provided directly, not through the <paramref name="advancedSettings"/>.</param>
        public static SdcaRegressionTrainer StochasticDualCoordinateAscent(this RegressionContext.RegressionTrainers ctx,
-        string label = DefaultColumnNames.Label, string features = DefaultColumnNames.Features, string weights = null,


= DefaultColumnNames.Label [](start = 20, length = 27)

why do we remove the defaults? #Resolved

we talked about them being error prone, better ask the user to define them explicitly.

In reply to: 231231911 [](ancestors = 231231911)

@TomFinley do you have a preference? I would prefer to keep the defaults. Otherwise, why do we even have DefaultColumnNames in the API?

In reply to: 231234097 [](ancestors = 231234097,231231911)

I think the default arguments make sense for the case when the user doesn't supply anything, and the pipeline just works. This is not the case for the column names; if they are not "Label" and "Features" in the datasets the pipelines fail, sometimes with esoteric messages.

Most developers tend to leave defaults to whatever their are set in the first moment of being socialized to the API. Those defaults will backfire often.

I fully endorse the default for the numeric values: they guide the user about ranges, and they just work most of the time.
I am also ok with Default column names for the columns components produce and hard-code like "Score" "PredictedLabel"; but IMO hardcode the inputs is misleading and will hide errors like the ones we have fixed so far (not passing the column names at all, and things still work because our test datasets and the default col names match).

In reply to: 231367619 [](ancestors = 231367619,231234097,231231911)

I think generally it is a good policy to be cautious w.r.t. expanding defaults, as @sfilipi suggests. Being too aggressive with your defaults can lead to more confusion, not less.

I think the default arguments make sense for the case when the user doesn't supply anything, the pipeline just works. This is not the case for the column names;

I would argue that it is the case for column names more often than we might think. The text loader when no settings are defined has the default behavior, which has proven to be valuable, that it considers the first field a scalar Label and subsequent fields a vector Features.

Further, while our text parser is built so that it can accommodate a diverse set of inputs (if properly configured), we should not loose sight of the fact that generally file formats for machine learning are far more restrictive than this. Consider formats like SVM-light, the more esoteric Bing formats, (we have loaders for the previous that we haven't yet had the chance to open source) or the VW format. In these formats, "label" and "features" are priviledged concepts, and where we've written loaders for them, we name these things the columns Label and Features. Or the "wizard" case, where we apply heusitics to figure out how to load data, what else will the result be named but Label and Features?

Now consider two populations of people, under two conditions.

Condition A, with defaults!

Someone uses SVM-light loader, or the Bing BIN loader, or the text loader under default settings. (Or, in future, a hypothetical VW loader.) These will provide two columns Label and Features. With defaults, they call Read() then Fit(). Things work.

Now, you are talking about someone that overrides the default behavior of TextLoader (again, that default behavior ) and provides explicit names for their columns. They call Fit on this trainer. Things fail with the message that it could not find a label or feature column named Label or Features.

Condition B, without defaults!

Same actions, but nwo things fail. They need to provide these column names, but how could they reasonably know what the column names are? Indeed, someone just coming to ML.NET for the first time and just fires up a session with SVM-light loader for the first time, how could we expect them to even know what a column is?

Same action, things still fail.

It is far from clear to me that A2 is significantly more confusing than B2 -- beyond that to even hit A2 we need a user that is both (1) smart enough to figure out how to come up with names for columns but (2) not smart enough to figure out that things downstream might want those names. Even if such a person exists, the situation of A2 is recoverable.

We must, I think, all agree that B1 is borderline disastrously confusing. And A1 both work.

Some people have really confusing pipelines with multiple featurizers, custom data, and all this stuff. But some people just have data, already nicely structured, and just want to read and train. IDataView and Columns are sort of one of those concepts you need to know if you're doing complex featurization, but if you don't, you don't really need to know it, and I'd like to keep it that way if we can.

Therefore, my own sense is that @Zruty0 is correct, and we ought to have defaults.

In reply to: 231388508 [](ancestors = 231388508,231367619,231234097,231231911)

Thanks for taking a side, @TomFinley, appreciate the level of detail. My observation is that overriding the behavior of the TextLoader is actually the norm, and unless your immediate component is a trainerEstimator, you need the names of the columns to transform them further.

Examples of ML.Net issues with non-standard TextLoader:
#420
https://stackoverflow.com/questions/52689363/adding-data-from-new-textloader-into-exisitng-pipeline-ver-0-6
https://elbruno.com/2018/06/19/mlnet-loading-data-in-our-learningpipeline-with-list-lists-for-ever/
https://carlos.mendible.com/2018/06/10/simple-machine-learning-with-dotnet-core-sample/

I will leave this discussion active till tomorrow afternoon, to see if anyone wants to comment further about either side, and if it remains a two-one situation, I'll bitterly :) revert the changes.. #Resolved

I wrote more about my experience with the default column names in #1360.

The assumption that the users don't need to specify column names in the TextLoader holds if the dataset is all numeric. If they have a text or categorical column, you'd have to know what it is to transform it further.

Also, I worried that the TextLoader producing the "Label" and "Feature" columns has discoverability issues on its own, and it hurts adoption more than helping it.
Educating the developers through tutorial, and by making the columns required would give them a faster understanding how to evolve the pipeline.
Scikit required the column names, and H2O requires the feature column, it only assumes that the Label column is the first one.

/cc @GalOshri @CESARDELATORRE for opinions #Resolved

Zruty0

revoking review

updating minDocumentsInleaves

Zruty0

sfilipi added 3 commits November 4, 2018 08:07

fixing GAM trainers extending the right context. Removing unused argu…

f698b96

…ments from the extensions of their signatures. No clusters in advancedSettings now overiddes the number of clusters passed directly in KMeans.

fixing dotnet#1521

d10ea0a

removing the default name for Features and Label colum.

more consistency

011de99

sfilipi added the API Issues pertaining the friendly API label Nov 4, 2018

sfilipi self-assigned this Nov 4, 2018

sfilipi requested review from Ivanidzo4ka, CESARDELATORRE, shauheen, artidoro, Zruty0 and shmoradims November 4, 2018 16:29

Fixing the tests to have the Label and Features columns required.

fb99d51

artidoro reviewed Nov 5, 2018

View reviewed changes

artidoro approved these changes Nov 5, 2018

View reviewed changes

sfilipi added 3 commits November 5, 2018 13:23

Fixing tests and the static SDCA extensions

b776880

Fixing order in tests

d4b9ce7

removing space that causes the docs CI to fail

191f28c

Zruty0 reviewed Nov 6, 2018

View reviewed changes

naming standartization

7d98033

consistent naming in arguments

e2a360b

Zruty0 previously approved these changes Nov 7, 2018

View reviewed changes

sfilipi added 2 commits November 6, 2018 22:10

changes from master

5f2fb8e

updating minDocumentsInleaves

changing the comment

d2aaf1f

artidoro mentioned this pull request Nov 7, 2018

Convert LdaTransform to IEstimator/ITransformer API #1410

Merged

sfilipi added 3 commits November 8, 2018 23:29

The Pandora Box

b18e2f4

Merge branch 'master' into moreTrainerXtensions

27b1842

merging from master

a84daf0

Zruty0 approved these changes Nov 9, 2018

View reviewed changes

sfilipi added 2 commits November 9, 2018 12:33

post merge XML comment change.

b2631b7

Merge branch 'master' into moreTrainerXtensions

1bacc1d

sfilipi merged commit 851558d into dotnet:master Nov 10, 2018

sfilipi mentioned this pull request Dec 7, 2018

MLContext extensions for trainers #1319

Closed

TomFinley mentioned this pull request Jan 7, 2019

Gentle suggestion for missing columns? #2050

Open

sfilipi mentioned this pull request Jan 16, 2019

KMeans and Implicit weight cleanup #2158

Merged

sfilipi deleted the moreTrainerXtensions branch March 15, 2019 21:15

ghost locked as resolved and limited conversation to collaborators Mar 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More trainer extensions, bug fixes and consistency across trainer extensions #1524

More trainer extensions, bug fixes and consistency across trainer extensions #1524

sfilipi commented Nov 4, 2018

artidoro Nov 5, 2018 •

edited by sfilipi

Loading

artidoro Nov 5, 2018 •

edited

Loading

artidoro Nov 5, 2018 •

edited by sfilipi

Loading

sfilipi Nov 6, 2018

Zruty0 Nov 7, 2018

sfilipi Nov 7, 2018

Zruty0 Nov 7, 2018

sfilipi Nov 7, 2018 •

edited

Loading

artidoro left a comment

Zruty0 Nov 6, 2018 •

edited

Loading

Zruty0 Nov 6, 2018 •

edited by sfilipi

Loading

sfilipi Nov 6, 2018 •

edited

Loading

Zruty0 Nov 7, 2018

sfilipi Nov 7, 2018

Zruty0 Nov 6, 2018 •

edited

Loading

Zruty0 Nov 6, 2018 •

edited by sfilipi

Loading

sfilipi Nov 6, 2018 •

edited

Loading

Zruty0 Nov 7, 2018

sfilipi Nov 7, 2018

artidoro Nov 7, 2018 •

edited by sfilipi

Loading

shauheen Nov 7, 2018 •

edited by sfilipi

Loading

Zruty0 Nov 6, 2018 •

edited by sfilipi

Loading

sfilipi Nov 6, 2018

Zruty0 Nov 7, 2018

sfilipi Nov 7, 2018 •

edited

Loading

TomFinley Nov 7, 2018 •

edited

Loading

sfilipi Nov 7, 2018 •

edited

Loading

sfilipi Nov 8, 2018 •

edited

Loading

Zruty0 left a comment

Zruty0 left a comment

More trainer extensions, bug fixes and consistency across trainer extensions #1524

More trainer extensions, bug fixes and consistency across trainer extensions #1524

Conversation

sfilipi commented Nov 4, 2018

artidoro Nov 5, 2018 • edited by sfilipi Loading

Choose a reason for hiding this comment

artidoro Nov 5, 2018 • edited Loading

Choose a reason for hiding this comment

artidoro Nov 5, 2018 • edited by sfilipi Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sfilipi Nov 7, 2018 • edited Loading

Choose a reason for hiding this comment

artidoro left a comment

Choose a reason for hiding this comment

Zruty0 Nov 6, 2018 • edited Loading

Choose a reason for hiding this comment

Zruty0 Nov 6, 2018 • edited by sfilipi Loading

Choose a reason for hiding this comment

sfilipi Nov 6, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Zruty0 Nov 6, 2018 • edited Loading

Choose a reason for hiding this comment

Zruty0 Nov 6, 2018 • edited by sfilipi Loading

Choose a reason for hiding this comment

sfilipi Nov 6, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

artidoro Nov 7, 2018 • edited by sfilipi Loading

Choose a reason for hiding this comment

shauheen Nov 7, 2018 • edited by sfilipi Loading

Choose a reason for hiding this comment

Zruty0 Nov 6, 2018 • edited by sfilipi Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sfilipi Nov 7, 2018 • edited Loading

Choose a reason for hiding this comment

TomFinley Nov 7, 2018 • edited Loading

Choose a reason for hiding this comment

sfilipi Nov 7, 2018 • edited Loading

Choose a reason for hiding this comment

sfilipi Nov 8, 2018 • edited Loading

Choose a reason for hiding this comment

Zruty0 left a comment

Choose a reason for hiding this comment

Zruty0 left a comment

Choose a reason for hiding this comment

artidoro Nov 5, 2018 •

edited by sfilipi

Loading

artidoro Nov 5, 2018 •

edited

Loading

artidoro Nov 5, 2018 •

edited by sfilipi

Loading

sfilipi Nov 7, 2018 •

edited

Loading

Zruty0 Nov 6, 2018 •

edited

Loading

Zruty0 Nov 6, 2018 •

edited by sfilipi

Loading

sfilipi Nov 6, 2018 •

edited

Loading

Zruty0 Nov 6, 2018 •

edited

Loading

Zruty0 Nov 6, 2018 •

edited by sfilipi

Loading

sfilipi Nov 6, 2018 •

edited

Loading

artidoro Nov 7, 2018 •

edited by sfilipi

Loading

shauheen Nov 7, 2018 •

edited by sfilipi

Loading

Zruty0 Nov 6, 2018 •

edited by sfilipi

Loading

sfilipi Nov 7, 2018 •

edited

Loading

TomFinley Nov 7, 2018 •

edited

Loading

sfilipi Nov 7, 2018 •

edited

Loading

sfilipi Nov 8, 2018 •

edited

Loading