Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creation of components through MLContext and cleanup (text transform) #2394

Merged

Conversation

abgoswam
Copy link
Member

@abgoswam abgoswam commented Feb 3, 2019

Towards #1798 , #1758

The following transform estimators are being addressed:

  • TextFeaturizingEstimator

NOTE:

  • for fixing the text transform API we have to incorporate the API patterns we followed for fixing the learner estimators + the patterns we are following for fixing the transform estimators
  • This is because the text transform was using a Action delegate (like some of the learner estimators).

The changes are as follows :

  1. Two public extension methods, one for simple arguments and the other for advanced options
  2. Internalize constructors of estimators and transformers
  3. Pass Options objects as arguments instead of Action delegate
  4. Rename Settings to Options
  5. Rename Options objects as options (instead of args or advancedSettings used so far)
  6. Internalize Arguments since the public constructor uses Options. Also a few other fields have been made internal

@@ -120,7 +120,7 @@ public sealed class Arguments : TransformInputBase
public TextNormKind VectorNormalizer = TextNormKind.L2;
}

public sealed class Settings
public sealed class Options
Copy link
Contributor

@artidoro artidoro Feb 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a summary tag for each of these settings, and for the Options class? You can take them from the constructor I believe. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


In reply to: 253319985 [](ancestors = 253319985)

@artidoro
Copy link
Contributor

artidoro commented Feb 3, 2019

    public ITransformer Fit(IDataView input)

Can you add a summary tag for Fit? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:290 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@artidoro
Copy link
Contributor

artidoro commented Feb 3, 2019

    public static ITransformer Create(IHostEnvironment env, ModelLoadContext ctx)

Could you make this private? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:465 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@artidoro
Copy link
Contributor

artidoro commented Feb 3, 2019

    public SchemaShape GetOutputSchema(SchemaShape inputSchema)

Can you add a summary comment? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:473 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@artidoro
Copy link
Contributor

artidoro commented Feb 3, 2019

    public const Language DefaultLanguage = Language.English;

Can you add a summary comment, or make this internal?
#Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:257 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@artidoro
Copy link
Contributor

artidoro commented Feb 3, 2019

    public sealed class Column : ManyToOneColumn

I think this is only used in the Options class, can you make it internal? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:62 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@@ -79,7 +79,7 @@ internal bool TryUnparse(StringBuilder sb)
/// <summary>
/// This class exposes <see cref="NgramExtractorTransform"/>/<see cref="NgramHashExtractingTransformer"/> arguments.
/// </summary>
public sealed class Arguments : TransformInputBase
internal sealed class Arguments : TransformInputBase
Copy link
Contributor

@artidoro artidoro Feb 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguments [](start = 30, length = 9)

Can you rename to Options? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you notice in line #126 below, we have a public Options class, which the public API uses. Hence I kept it as is.

DO u prefer calling it something else, e.g. OptionsInternal ? But it would only be cosmetic, since its 'internal'


In reply to: 253320489 [](ancestors = 253320489)

@codecov
Copy link

codecov bot commented Feb 4, 2019

Codecov Report

Merging #2394 into master will decrease coverage by <.01%.
The diff coverage is 60.27%.

@@            Coverage Diff             @@
##           master    #2394      +/-   ##
==========================================
- Coverage   71.22%   71.22%   -0.01%     
==========================================
  Files         785      785              
  Lines      141030   140978      -52     
  Branches    16116    16113       -3     
==========================================
- Hits       100455   100412      -43     
+ Misses      36106    36095      -11     
- Partials     4469     4471       +2
Flag Coverage Δ
#Debug 71.22% <60.27%> (-0.01%) ⬇️
#production 67.58% <94.87%> (ø) ⬆️
#test 85.28% <20.58%> (-0.03%) ⬇️

@artidoro
Copy link
Contributor

artidoro commented Feb 5, 2019

Your PR is also related to #2026. I think you might fix the issue as part of this change.

@@ -136,10 +136,10 @@ public sealed class Settings
#pragma warning restore MSML_NoInstanceInitializers // No initializers on instance fields or properties
}
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole thing requires proper cleaning/documentation. Do you want to do it in this PR, or you prefer leave it for later cleaning? #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current PR is to fix the public API

We are doing some cleaning/doc fixes on an opportunistic basis. But we should have a separate issue / PRs for "proper" cleaning and documentation


In reply to: 254013725 [](ancestors = 254013725)


/// <summary>
/// Transform several text columns into featurized float array that represents counts of ngrams and char-grams.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
/// <param name="inputColumnNames">Name of the columns to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="advancedSettings">Advanced transform settings</param>
/// <param name="options">Advanced transform settings.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// Advanced transform settings. [](start = 7, length = 63)

@sfilipi any chance we have a consensus regarding what to put here?
I can see /// <param name="options">Advanced arguments to the algorithm.</param>
/// <param name="options">Algorithm advanced options.</param>
/// <param name="options">Algorithm advanced settings.</param>
/// <param name="options">Advanced options to the algorithm.</param> #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #2154 to fix this across codebase (once we finalize what to put in here)


In reply to: 254018114 [](ancestors = 254018114)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the object's name is now Options, let;s go with:

/// Advanced options to the algorithm.

if you can search/replace in one of the PR that would be awesome!


In reply to: 254018114 [](ancestors = 254018114)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do. for now i have fixed this one


In reply to: 254356441 [](ancestors = 254356441,254018114)

@abgoswam
Copy link
Member Author

abgoswam commented Feb 5, 2019

    public ITransformer Fit(IDataView input)

fixed


In reply to: 460089568 [](ancestors = 460089568)


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:290 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@abgoswam
Copy link
Member Author

abgoswam commented Feb 5, 2019

    public static ITransformer Create(IHostEnvironment env, ModelLoadContext ctx)

fixed


In reply to: 460090439 [](ancestors = 460090439)


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:465 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@abgoswam
Copy link
Member Author

abgoswam commented Feb 5, 2019

    public SchemaShape GetOutputSchema(SchemaShape inputSchema)

fixed


In reply to: 460090453 [](ancestors = 460090453)


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:473 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@abgoswam
Copy link
Member Author

abgoswam commented Feb 5, 2019

    public sealed class Column : ManyToOneColumn

fixed


In reply to: 460090541 [](ancestors = 460090541)


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:62 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

@abgoswam
Copy link
Member Author

abgoswam commented Feb 5, 2019

    public const Language DefaultLanguage = Language.English;

fixed


In reply to: 460090484 [](ancestors = 460090484)


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:257 in c10ca38. [](commit_id = c10ca38, deletion_comment = False)

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@abgoswam abgoswam requested a review from sfilipi February 5, 2019 23:55

/// <summary>
/// Transform several text columns into featurized float array that represents counts of ngrams and char-grams.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
/// <param name="inputColumnNames">Name of the columns to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="advancedSettings">Advanced transform settings</param>
/// <param name="options">Advanced transform settings.</param>
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

settings [](start = 53, length = 8)

Let's not call the Options "settings" :) #Resolved


/// <summary>
/// Transform several text columns into featurized float array that represents counts of ngrams and char-grams.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
/// <param name="inputColumnNames">Name of the columns to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="advancedSettings">Advanced transform settings</param>
/// <param name="options">Advanced transform settings.</param>
public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog,
string outputColumnName,
IEnumerable<string> inputColumnNames,
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IEnumerable inputColumnNames [](start = 12, length = 36)

I know it is not your change, but can we go with params, rather than an IEnumerable?

Looking at the sample, having to construct a List is overkill. #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created issue #2460 to solicit suggestions on proper API for this.

Would not want to block this PR in the meantime.


In reply to: 254357200 [](ancestors = 254357200)


/// <summary>
/// Transform several text columns into featurized float array that represents counts of ngrams and char-grams.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
/// <param name="inputColumnNames">Name of the columns to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="advancedSettings">Advanced transform settings</param>
/// <param name="options">Advanced transform settings.</param>
public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog,
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FeaturizeText [](start = 47, length = 13)

link the sample to this one, since it illustrates the usage of Options. #Resolved

s.KeepNumbers = false;
s.OutputTokens = true;
s.TextLanguage = TextFeaturizingEstimator.Language.English; // supports English, French, German, Dutch, Italian, Spanish, Japanese
var customized_pipeline = ml.Transforms.Text.FeaturizeText(customizedColumnName, new List<string> { "SentimentText" },
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new List { "SentimentText" } [](start = 93, length = 36)

umm :) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue #2460 to discuss this


In reply to: 254357706 [](ancestors = 254357706)

/// <summary>
/// Advanced settings for the <see cref="TextFeaturizingEstimator"/>.
/// </summary>
public sealed class Options
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options [](start = 28, length = 7)

can you remove this, and use the Arguments above, like for everything else?
The class below is redundant.

make the Factories internal, so they are not visible. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't want to do that.
Main reason why it's bad - is bunch of Factories we don't want to expose.


In reply to: 254358363 [](ancestors = 254358363)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we should do some of this cleanup outside purview of this PR ...

I need to understand properly how the factories are used etc.


In reply to: 254461580 [](ancestors = 254461580,254358363)

UseWordExtractor = false,
}).Fit(loader).Transform(loader);

var trans = mlContext.Transforms.Text.ExtractWordEmbeddings("Features", "WordEmbeddings_TransformedText",
WordEmbeddingsExtractingTransformer.PretrainedModelKind.Sswe).Fit(text).Transform(text);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordEmbeddingsExtractingTransformer [](start = 16, length = 35)

I though you move it to estimator in your other PR, can you merge with master? #Closed

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[FeaturizeText](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/TextTransform.cs)]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextTransform [](start = 101, length = 13)

you removing this example becaue it doesn't use this exact method? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeap. removed it from here, and added to the API below (this was one of Senja's comments)


In reply to: 254802756 [](ancestors = 254802756)

@@ -120,26 +120,59 @@ internal sealed class Arguments : TransformInputBase
public TextNormKind VectorNormalizer = TextNormKind.L2;
}

public sealed class Settings
/// <summary>
/// Advanced settings for the <see cref="TextFeaturizingEstimator"/>.
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

settings [](start = 21, length = 8)

can we call it options? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

am planning to have another PR which fixes this for all the learners/featurizers we converted to Options


In reply to: 254802978 [](ancestors = 254802978)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two small comments, can you address them?
Otherwise looks good!

@abgoswam abgoswam merged commit 834e471 into dotnet:master Feb 7, 2019
@abgoswam abgoswam deleted the abgoswam/transform_estimator_api_texttransform branch February 20, 2019 16:58
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants