Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need to make consistent the parameters across the API #1521

Closed
CESARDELATORRE opened this issue Nov 3, 2018 · 0 comments
Closed

Need to make consistent the parameters across the API #1521

CESARDELATORRE opened this issue Nov 3, 2018 · 0 comments
Assignees

Comments

@CESARDELATORRE
Copy link
Contributor

Issue: Similar classes/estimators methods have different parameters' order for the same concepts.

It looks like a "silly" error, but not straightforward to catch because the column's name provided as parameters are not typed (just strings, logically, since column names could vary) you get an error in execution time but in addition to that because the pipeline works as lazy loading, you find the error later on when fitting the model, so you initially don't know what API is incorrectly configured making it hard to find these kind of "silly" errors which are prone to happen because of different parameter's order in similar APIs (such as Estimators and Transformers).

For instance, when migrating to the estimator by using mlContext, the following "silly" case happened to me:

// This line works
var trainer = new SdcaMultiClassTrainer(mlContext, "Features", "Label");
// This line doesn't work, but you'll find out later on when Fitting the model
var trainer = mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent("Features", "Label");

Error you get when fitting:

System.ArgumentOutOfRangeException
  HResult=0x80131502
  **Message=Training label column 'Features' type is not valid for multi-class: Vec<R4, 69981>. Type must be R4 or R8.**
  Source=Microsoft.ML.Data
//This line DOES work when changing to the right order of parameters accepted by the new API:
var trainer = mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent("Label", "Features");

But you can see that is not consistent in order with this very similar line, the same trainer:

// This line works
var trainer = new SdcaMultiClassTrainer(mlContext, "Features", "Label");

I initially thought that the trainer in the mlcontext catalog was not working for any reason as a very similar API was working by directly creating the class.. But it was simply because the order is vice-versa in the case of the new trainer surfaced on the mlContext. ;)

Improvement

  1. Obvious and already "work in progress": It will help if we are consistent in the order of the parameters across all the APIs. "Label" first, "Features" later, or vice-versa, but consistent.

  2. Not sure if somehow we could use strongly-typed types for the column names or some kind of check/control.. like if for the column names we were able to use the fields in the schema classes and the "Label" field was "marked" as "Label" with an attribute, that could also help, I mean something like the following:

var trainer = mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscentnameof(
                                                        nameof(MyObservation.MyLabel), 
                                                        "Features");

Then, the Label field is "marked" as Label:

    public class MyObservation
    {
        [Column(ordinal: "0")]
        public string ID;

        [Label]
        [Column(ordinal: "1")]
        public string MyLabel; // This is an issue label, for example "area-System.Threading"

        [Column(ordinal: "2")]
        public string Title;

        [Column(ordinal: "3")]
        public string Description;
    }

Not sure though..., this last approach might be too contrived and not realistic because in almost all the cases the Label is also a generated/Dictionarized field, as well as the "Features" column, both created/generated as new columns out of the original observation schema class, so this last attempt might not be feasible...

Any other possible check/control so errors when using strings for the column names could be less prone to happen?...

sfilipi added a commit to sfilipi/machinelearning-1 that referenced this issue Nov 4, 2018
removing the default name for Features and Label colum.
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants