Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made 'StopWordsRemover' in TextFeaturizer configurable again. #2962

Merged
merged 6 commits into from
Mar 18, 2019

Conversation

zeahmed
Copy link
Contributor

@zeahmed zeahmed commented Mar 14, 2019

This PR addresses the StopWordsRemover issue in #838

This PR make available two options to remove stop words.

  1. PredefinedStopWordsRemover: removes stops words using built in technique in ML.NET.
  2. UseCustomStopWordsRemover: User provides the list of stop words to remove. Currently string[] is exposed to collect that list.

No support is added to load stop words from file. I think if we make No.2 take list as IEnumerable<string> instead of string[] then user can load list on their own. Let me know if reviewers have any thought.

@codecov
Copy link

codecov bot commented Mar 14, 2019

Codecov Report

Merging #2962 into master will decrease coverage by 0.01%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master    #2962      +/-   ##
==========================================
- Coverage   72.26%   72.25%   -0.02%     
==========================================
  Files         802      800       -2     
  Lines      142901   142894       -7     
  Branches    16134    16137       +3     
==========================================
- Hits       103272   103245      -27     
- Misses      35210    35228      +18     
- Partials     4419     4421       +2
Flag Coverage Δ
#Debug 72.25% <50%> (-0.02%) ⬇️
#production 67.98% <48.83%> (-0.02%) ⬇️
#test 88.48% <100%> (ø) ⬆️
Impacted Files Coverage Δ
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs 98.08% <100%> (ø) ⬆️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 85.78% <100%> (+0.25%) ⬆️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs 83.2% <43.58%> (-7.12%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
...OnnxTransformer.StaticPipe/OnnxStaticExtensions.cs
...r.StaticPipe/DnnImageFeaturizerStaticExtensions.cs
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.9% <0%> (+0.2%) ⬆️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.89% <0%> (+0.62%) ⬆️
... and 1 more

@@ -96,8 +129,42 @@ public sealed class Options : TransformInputBase
[Argument(ArgumentType.AtMostOnce, HelpText = "Dataset language or 'AutoDetect' to detect language per row.", ShortName = "lang", SortOrder = 3)]
public Language Language = DefaultLanguage;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Language [](start = 28, length = 8)

it needs to be part of StopWordsRemovingTransformer #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Language parameter is used by a couple components in TextTransform e.g. it used by tokenizer also. I know currently we are using very naïve whitespace tokenizer which does not use it. However, in future if we develop a sophisticated tokenizer for different language then this parameter will be needed. What do you say?


In reply to: 265808271 [](ancestors = 265808271)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what kind of future lays ahead of us, but for now, this option just look silly.


In reply to: 265812513 [](ancestors = 265812513,265808271)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm...so you think that it should not be here? it should instead be in PredefinedStopWordsRemoverOptions?


In reply to: 266564835 [](ancestors = 266564835,265812513,265808271)

/// <summary>
/// Defines the different type of stop words remover supported.
/// </summary>
public enum StopWordsRemoverType
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StopWordsRemoverType [](start = 20, length = 20)

I think it would be much cleaner if we just introduce IStopWordRemover interface and inherit both stop word removers from it and make it part of public interface. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interface make it difficult to find out different type of StopWordsRemover in the system at API level. When user will use this enum they immediately know that what type of StopWordsRemover are available and how to use them.


In reply to: 265808886 [](ancestors = 265808886)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done what you proposed for the interface thing.


In reply to: 265811781 [](ancestors = 265811781,265808886)

/// <summary>
/// List of stop words. It is used when <see cref="StopWordsRemover"/> is equal to <see cref="StopWordsRemoverType.UseCustomStopWordsRemover"/>.
/// </summary>
public string[] StopWords;
Copy link
Contributor

@artidoro artidoro Mar 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string [](start = 19, length = 6)

Why not IEnumerable<string> ? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that EntryPoint will break because of this. The actual interface is still using string[]

I have mentioned that in the comments in the my PR. I am waiting for the people to comment on it. So that I can decide what to do.


In reply to: 266099423 [](ancestors = 266099423)

@artidoro
Copy link
Contributor

artidoro commented Mar 15, 2019

Could you add a test with a custom list of stop words? #Resolved

@zeahmed
Copy link
Contributor Author

zeahmed commented Mar 15, 2019

I had opened an issue already for adding more tests for TextFeaturizer. The rationale behind this is to test all the options in the text featurizer. So, your concern can be addressed in that issue.
#2967


In reply to: 473395783 [](ancestors = 473395783)

/// <summary>
/// Use stop words remover that can removes language-specific list of stop words (most common words) already defined in the system.
/// </summary>
public sealed class CustomStopWordsRemoverOptions : IStopWordsRemoverOptions
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CustomStopWordsRemoverOptions [](start = 28, length = 29)

Why this interface and class is part of text transform and not part of StopWordRemover itself? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is why it's not implemented same way as word ngram and char ngram implemented?


In reply to: 266565265 [](ancestors = 266565265)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be part of StopWordRemover. I will move it.
Word ngram and char ngram are do different options that can be exclusively applied in Text transform (two parameter for two different options). On the contrary, only one type of StopWordRemover can be applied (one parameter to take two different options). Any idea if it can be done better than this?


In reply to: 266565861 [](ancestors = 266565861,266565265)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@zeahmed zeahmed merged commit 6733515 into dotnet:master Mar 18, 2019
@zeahmed
Copy link
Contributor Author

zeahmed commented Mar 18, 2019

Thanks all for the useful feedback!

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants