I want to convert a text classification sample from ML 0.4 to ML 0.11 and cannot find equivalent code for calling Transforms.Text.FeaturizeText #2994

Closed
bpietroiu opened this issue Mar 18, 2019 · 2 comments

Comments

@bpietroiu

System information

  • OS version/distro: Win10
  • .NET Version (eg., dotnet --info): 4.7.1

Issue

  • What did you do?
    I want to convert a text classification sample from ML 0.4 to ML 0.11

  • What happened?
    I cannot find parameters equivalent to those of TextFeaturizer: the options for Transforms.Text.FeaturizeText in 0.11 have no CharFeatureExtractor or WordFeatureExtractor n-gram extractor settings.

  • What did you expect?

Source code / logs

0.4 version

                new TextFeaturizer("Features", "TextColumn")
                {
                    KeepDiacritics = false,
                    KeepPunctuations = false,
                    OutputTokens = true,
                    Language = TextFeaturizingEstimatorLanguage.English,
                    VectorNormalizer = TextFeaturizingEstimatorTextNormKind.L2,
                    TextCase = TextNormalizingEstimatorCaseNormalizationMode.Lower,
                    CharFeatureExtractor = new NGramNgramExtractor() {NgramLength = 3, AllLengths = false},
                    WordFeatureExtractor = new NGramNgramExtractor() {NgramLength = 3, AllLengths = true}
                }, 

0.11 version

var transform = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "TextColumnFeaturized",
                        options: new TextFeaturizingEstimator.Options()
                        {
                            KeepDiacritics = false,
                            KeepPunctuations = false,
                            OutputTokens = true,
                            TextLanguage = TextFeaturizingEstimator.Language.English,
                            VectorNormalizer = TextFeaturizingEstimator.TextNormKind.L2,
                            TextCase = TextNormalizingEstimator.CaseNormalizationMode.Lower,
                            UseWordExtractor = true,
                            UseCharExtractor = true
                        }, inputColumnNames: new []{ "TextColumn" });

I see that the CharFeatureExtractor and WordFeatureExtractor parameters are gone, and instead we have two boolean properties, UseWordExtractor and UseCharExtractor.

The models trained using the above code and the same training data perform differently; namely, the 0.4 version has better classification performance.

How do I configure a pipeline to achieve the same results as in 0.4 version?

Thank you!

@singlis
Member

singlis commented Mar 18, 2019

Hi @bpietroiu,

You are correct -- there is no way to do this in 0.11. We have since filed a bug #2802, along with a fix that was merged a few days ago (#2911). This will be available in our next release and should provide the same functionality that you are looking for.

@antoniovs1029
Member

Hi, it seems this problem was fixed in a previous release. Please feel free to reopen the issue if you're still having problems with this. Thanks!
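
For anyone landing here later, this is a rough sketch of how the 0.4 settings from the question might map onto the newer options. It assumes the ML.NET 1.x names (WordBagEstimator.Options, CaseMode, Norm, OutputTokensColumnName), which are not quoted anywhere in this thread and may differ in the release that first shipped the fix. The TextRow class, the "Tokens" column name, and the sample strings are made up for illustration.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

// Hypothetical input row; only the column name comes from the question.
public class TextRow
{
    public string TextColumn { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Assumed ML.NET 1.x option names, mirroring the 0.4 configuration.
        var options = new TextFeaturizingEstimator.Options
        {
            KeepDiacritics = false,
            KeepPunctuations = false,
            // 0.4 used OutputTokens = true; here the tokens column is named instead.
            OutputTokensColumnName = "Tokens",
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            Norm = TextFeaturizingEstimator.NormFunction.L2,
            // Character trigrams only (AllLengths = false in 0.4).
            CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 3, UseAllLengths = false },
            // Word n-grams of length 1 to 3 (AllLengths = true in 0.4).
            WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 3, UseAllLengths = true }
            // Language is left out: as far as I can tell, 1.x sets it through
            // StopWordsRemoverOptions, which would also enable stop-word removal.
        };

        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new TextRow { TextColumn = "This is a small sample sentence." },
            new TextRow { TextColumn = "Another short sample for the featurizer." }
        });

        var estimator = mlContext.Transforms.Text.FeaturizeText("Features", options, "TextColumn");
        var transformed = estimator.Fit(data).Transform(data);

        // Print the type of the produced feature column.
        Console.WriteLine(transformed.Schema["Features"].Type);
    }
}
```

Even with matching n-gram settings, results may not be identical to 0.4, since other defaults could have changed between releases.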

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022