Multicolumn mapping for some estimators #3066

Merged: 4 commits, Mar 25, 2019

@artidoro artidoro commented Mar 22, 2019

Adding multicolumn mapping for some estimators (as per list by @TomFinley and @glebuk):

  • OneHotEncodingEstimator
  • TypeConvertingEstimator
  • KeyToVectorMappingEstimator
  • ValueToKeyMappingEstimator
  • OneHotHashEncodingEstimator
  • MissingValueEstimator
  • FeatureSelectionCatalog.*
  • KeyToValueMappingEstimator

Leaving out:

  • TextFeaturizingEstimator (probably requires column specific settings most of the time)
  • NormalizingEstimator (in the experimental NuGet package)

Let me know if I should add more estimators.

Fixes #3068
Related to #2884

@artidoro artidoro force-pushed the multicolumn branch 3 times, most recently from d08a2f9 to 0896590 Compare March 22, 2019 20:39
@artidoro artidoro changed the title WIP: Multicolumn mapping for some estimators Multicolumn mapping for some estimators Mar 22, 2019
@artidoro artidoro self-assigned this Mar 22, 2019
@artidoro artidoro added this to the 0319 milestone Mar 22, 2019
/// <summary>
/// Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref="OutputColumnName"/> will be used as source.
/// </summary>
public readonly string InputColumnName;
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 22, 2019

public readonly string InputColumnName [](start = 7, length = 39)

I'm slightly confused.
We got rid of ColumnOptions because they were immutable, and now we are adding another immutable class... #Resolved

Contributor Author

I'd invite you to check with Tom about this; I am just implementing what he asked for.


In reply to: 268333577 [](ancestors = 268333577)

/// <summary>
/// Specifies input and output column names for a transformation.
/// </summary>
public sealed class InputOutputColumnPair
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 22, 2019

InputOutputColumnPair [](start = 24, length = 21)

What is the difference between this one and ColumnOptions? #Resolved

Contributor Author

Not much. I would have used the one below if it were up to me, but Tom asked for it to be different, for some reason!


In reply to: 268334077 [](ancestors = 268334077)

Contributor

Yes, a very good reason. ColumnOptions is a struct meant to serve a specific transformer base, and that is involved in the type hierarchy, and in particular something that captured all of the individual settings and state for each mapping. Our goals here were comparatively more modest: we just needed to . Well-designed code does what it is designed to do, and in the simplest possible way. There is no need for this to be part of an elaborate type hierarchy -- this was in fact the mistake that led to the issue #2884 being filed, that we'd conflated two distinct techniques for the extremely bad reasoning that they both had to do with "column." (And, of course, if something deals with the same sort of object, obviously they belong in the same type hierarchy, right?)


In reply to: 268334389 [](ancestors = 268334389,268334077)

Contributor

Just in case: I'm not talking about the ColumnOptions inside an estimator (for example, HashingEstimator.ColumnOptions).
I'm talking about the ColumnOptions in this exact file, a few lines below (line 41).
The only difference I see is that the input/output column names are public in this one instead of private in the other, plus the name of the class.
OK, two differences: the one below has an implicit converter from a tuple.

Can we delete ColumnOptions in this file?


In reply to: 268337035 [](ancestors = 268337035,268334389,268334077)

Contributor

That's fine. What Artidoro and I had discussed was actually somewhat different. (Or we were talking about two separate things without realizing it.)


In reply to: 268339442 [](ancestors = 268339442,268337035,268334389,268334077)

Contributor Author

I guess we were talking about something different, glad we are on the same page.


In reply to: 268760195 [](ancestors = 268760195,268339442,268337035,268334389,268334077)

Contributor

This seems good to me. The ColumnOptions I had been talking about (which is to say, the vast majority of things with that name) are as I said meant to serve a different purpose.

We should probably rename them back to ColumnInfo at some point but this can be delayed as they are internal... ummm except one. Whoops. Opened #3078. :) Aside from that, yeah.


In reply to: 268771510 [](ancestors = 268771510,268760195,268339442,268337035,268334389,268334077)

Contributor Author

artidoro commented Mar 22, 2019

Since @ivanbasov asked for it, the third commit shows what this would look like if we removed the ColumnOptions class and used InputOutputColumnPair instead.

Let me know what looks better! I will either eliminate this commit or keep it.

Notice that even in commit 3 InputOutputColumnPair is never used inside any transform. #Resolved


codecov bot commented Mar 22, 2019

Codecov Report

Merging #3066 into master will decrease coverage by 0.02%.
The diff coverage is 60.11%.

@@            Coverage Diff             @@
##           master    #3066      +/-   ##
==========================================
- Coverage   72.53%   72.51%   -0.03%     
==========================================
  Files         806      806              
  Lines      144282   144642     +360     
  Branches    16183    16197      +14     
==========================================
+ Hits       104661   104889     +228     
- Misses      35217    35342     +125     
- Partials     4404     4411       +7
Flag Coverage Δ
#Debug 72.51% <60.11%> (-0.03%) ⬇️
#production 68.11% <41.59%> (-0.05%) ⬇️
#test 88.8% <98.18%> (+0.04%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs 41.66% <0%> (-3.79%) ⬇️
src/Microsoft.ML.Transforms/NormalizerCatalog.cs 36.36% <0%> (-35.07%) ⬇️
...c/Microsoft.ML.ImageAnalytics/ExtensionsCatalog.cs 11.11% <0%> (-18.89%) ⬇️
...icrosoft.ML.Tests/Transformers/CategoricalTests.cs 100% <100%> (ø) ⬆️
...sts/Transformers/KeyToBinaryVectorEstimatorTest.cs 100% <100%> (ø) ⬆️
...Microsoft.ML.Tests/Transformers/NormalizerTests.cs 100% <100%> (ø) ⬆️
...oft.ML.Tests/Transformers/FeatureSelectionTests.cs 100% <100%> (ø) ⬆️
...ios/IrisPlantClassificationWithStringLabelTests.cs 98.63% <100%> (ø) ⬆️
...icrosoft.ML.Tests/Transformers/NAIndicatorTests.cs 100% <100%> (ø) ⬆️
...crosoft.ML.Tests/Transformers/ValueMappingTests.cs 100% <100%> (ø) ⬆️
... and 30 more

/// <param name="columns">Specifies the names of the columns on which to apply the transformation.</param>
/// <param name="outputKind">The expected kind of the output column.</param>
public static TypeConvertingEstimator ConvertType(this TransformsCatalog.ConversionTransforms catalog,
InputOutputColumnPair[] columns,
Contributor

@glebuk glebuk Mar 25, 2019

columns [](start = 36, length = 7)

Don't you need to null-check the columns argument to avoid a null reference exception? #Resolved

Contributor Author

I added the checks, thanks for pointing that out. I only fixed the public extensions, not the internal ones.


In reply to: 268724517 [](ancestors = 268724517)
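
A minimal sketch of the kind of argument check requested above, for illustration only. `ColumnPairSketch` and `CatalogSketch.ToTuples` are hypothetical stand-ins, not the ML.NET types or helpers touched by this PR, and a plain `ArgumentNullException` is assumed here in place of whatever contract helper the codebase actually uses.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the pair type under review; not the ML.NET source.
public sealed class ColumnPairSketch
{
    public string OutputColumnName { get; }
    public string InputColumnName { get; }

    public ColumnPairSketch(string outputColumnName, string inputColumnName)
    {
        OutputColumnName = outputColumnName;
        InputColumnName = inputColumnName;
    }
}

public static class CatalogSketch
{
    // Hypothetical public entry point: reject a null array up front so the
    // caller gets a named argument error instead of a NullReferenceException
    // from inside the LINQ pipeline.
    public static (string Output, string Input)[] ToTuples(ColumnPairSketch[] columns)
    {
        if (columns == null)
            throw new ArgumentNullException(nameof(columns));

        return columns
            .Select(c => (Output: c.OutputColumnName, Input: c.InputColumnName))
            .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        var tuples = CatalogSketch.ToTuples(new[] { new ColumnPairSketch("out1", "VectorFloat") });
        Console.WriteLine($"{tuples[0].Input} -> {tuples[0].Output}");

        try
        {
            CatalogSketch.ToTuples(null);
        }
        catch (ArgumentNullException e)
        {
            Console.WriteLine($"Rejected null argument: {e.ParamName}");
        }
    }
}
```

Checking only at the public boundary, as the reply describes, keeps internal helpers free to assume validated input.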

=> new KeyToValueMappingEstimator(CatalogUtils.GetEnvironment(catalog), ColumnOptions.ConvertToValueTuples(columns));
/// <param name="catalog">The conversion transform's catalog.</param>
/// <param name="columns">Specifies the names of the columns on which to apply the transformation.</param>
public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, InputOutputColumnPair[] columns)
Contributor

@glebuk glebuk Mar 25, 2019

columns [](start = 140, length = 7)

also check for null #Resolved

/// Instantiates a <see cref="ColumnOptions"/> from a tuple of input and output column names.
/// </summary>
public static implicit operator ColumnOptions((string outputColumnName, string inputColumnName) value)
public InputOutputColumnPair(string outputColumnName, string inputColumnName = null)
Contributor

@glebuk glebuk Mar 25, 2019

public InputOutputColumnPair(string outputColumnName, string inputColumnName = null) [](start = 8, length = 84)

Add another overload for the case when input name = output name. That would make it a lot clearer vs setting one to null. #ByDesign

Contributor Author

Unfortunately, I believe we have the same pattern everywhere in the codebase: string outputColumnName, string inputColumnName = null. This is found in all the mlContext extensions for transforms. I can add it here, but I think it would make more sense to stick to the general pattern.


In reply to: 268726871 [](ancestors = 268726871)

Contributor

We must be consistent. Sorry @glebuk!


In reply to: 268779384 [](ancestors = 268779384,268726871)

Contributor

Ok then.


In reply to: 268819893 [](ancestors = 268819893,268779384,268726871)

/// <summary>
/// Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref="OutputColumnName"/> will be used as source.
/// </summary>
public readonly string InputColumnName;
Contributor

@glebuk glebuk Mar 25, 2019

InputColumnName [](start = 31, length = 15)

Shouldn't it be the reverse: the input is required, and the output is optional and defaults to the input when it is null? #Closed

Contributor Author

As per #2064 we use the outputColumnName as inputColumnName when inputColumnName is null. We are doing this across the code base.


In reply to: 268727730 [](ancestors = 268727730)
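
The convention described above (output name required, input name falling back to it when null) can be sketched as follows. `InputOutputColumnPairSketch` is a hypothetical illustration rather than the class merged in this PR; resolving the fallback in the constructor, the non-empty check on the output name, and the use of get-only properties are all assumptions of the sketch.

```csharp
using System;

// Hypothetical illustration of the output-first naming convention; not the ML.NET source.
public sealed class InputOutputColumnPairSketch
{
    // Get-only properties (an assumption of this sketch) rather than public fields.
    public string InputColumnName { get; }
    public string OutputColumnName { get; }

    public InputOutputColumnPairSketch(string outputColumnName, string inputColumnName = null)
    {
        // The output name is the required half of the pair.
        if (string.IsNullOrEmpty(outputColumnName))
            throw new ArgumentException("Output column name must be non-empty.", nameof(outputColumnName));

        OutputColumnName = outputColumnName;
        // A null input name means "transform the column in place".
        InputColumnName = inputColumnName ?? outputColumnName;
    }
}

public static class Program
{
    public static void Main()
    {
        var inPlace = new InputOutputColumnPairSketch("Features");
        var renamed = new InputOutputColumnPairSketch("out1", "VectorFloat");

        Console.WriteLine($"{inPlace.InputColumnName} -> {inPlace.OutputColumnName}"); // Features -> Features
        Console.WriteLine($"{renamed.InputColumnName} -> {renamed.OutputColumnName}"); // VectorFloat -> out1
    }
}
```

Keeping the output name first mirrors the `string outputColumnName, string inputColumnName = null` signature quoted elsewhere in this thread, so a single-argument call reads as "replace this column".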

ValueToKeyMappingEstimator.KeyOrdinality keyOrdinality = ValueToKeyMappingEstimator.Defaults.Ordinality,
IDataView keyData = null)
{
var columnOptions = columns.Select(x => new OneHotEncodingEstimator.ColumnOptions(x.OutputColumnName, x.InputColumnName, outputKind, maximumNumberOfKeys, keyOrdinality)).ToArray();
Contributor

@glebuk glebuk Mar 25, 2019

columns [](start = 32, length = 7)

check for null here and elsewhere #Resolved

("out1", "VectorFloat"),
("out2", "VectorDouble")
columns: new[] {
new InputOutputColumnPair("out1", "VectorFloat"),
Contributor

@glebuk glebuk Mar 25, 2019

InputOutputColumnPair [](start = 28, length = 21)

Why do we have to use the more verbose initializer here? Ideally we should use the old syntax if possible, as it is a lot more compact. #Resolved

Contributor Author

Because the other one uses tuples. We have decided not to have tuples in the public surface any longer.


In reply to: 268729946 [](ancestors = 268729946)

Contributor

glebuk commented Mar 25, 2019

    public void ValueMappingValueTypeIsVectorWorkout()

Add a test for when InputOutputColumnPair is null.


Refers to: test/Microsoft.ML.Tests/Transformers/ValueMappingTests.cs:523 in 802e4de. [](commit_id = 802e4de, deletion_comment = False)

/// Instantiates a <see cref="ColumnOptions"/> from a tuple of input and output column names.
/// </summary>
public static implicit operator ColumnOptions((string outputColumnName, string inputColumnName) value)
public InputOutputColumnPair(string outputColumnName, string inputColumnName = null)
Contributor

@TomFinley TomFinley Mar 25, 2019

outputColumnName [](start = 44, length = 16)

Probably check that outputColumnName is non-empty. #Resolved

/// <summary>
/// Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref="OutputColumnName"/> will be used as source.
/// </summary>
public readonly string InputColumnName;
Contributor

@TomFinley TomFinley Mar 25, 2019

InputColumnName [](start = 31, length = 15)

Should these be properties? #Resolved

Contributor Author

Ok making these properties!


In reply to: 268761458 [](ancestors = 268761458)

Contributor

@TomFinley TomFinley left a comment

Thank you for working on this @artidoro !! I see the central thing still isn't using properties, but I am not certain that is absolutely essential. Might be nice though if you get to it.

Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

:shipit:

@artidoro
Contributor Author

Thank you for reviewing, I am updating now with the latest changes.

@artidoro artidoro merged commit 5f9be36 into dotnet:master Mar 25, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022