TextFeaturizer cannot specify n-grams for words or characters #2802

Closed
rogancarr opened this issue Feb 28, 2019 · 6 comments · Fixed by #2911

@rogancarr
Contributor

rogancarr commented Feb 28, 2019

One of the stated goals of the V1 API was:

  • I can modify settings in the TextFeaturizer to update the number of word-grams and char-grams used along with things like the normalization.

In the current API for TextFeaturizer, it is possible to create n-grams from words and/or characters (UseCharExtractor, UseWordExtractor), but it is not possible to specify which n-grams to create.

Related to #2711

@rogancarr rogancarr added the API Issues pertaining the friendly API label Feb 28, 2019
@rogancarr
Contributor Author

Also, it is not clear if this is even possible using a workaround.

@najeeb-kazmi
Member

Related to #838 ?

@rogancarr
Contributor Author

@najeeb-kazmi Yes, it is. That one proposes a wider set of functionality than we need for the proposed V1 features.

@eerhardt
Member

eerhardt commented Mar 2, 2019

Is this strictly adding new API? Can this be done without a public API breaking change? If so, I think we can remove it from Project 13, and it can be added after v1.0.

But if this requires a public API breaking change, then it can be left in Project 13.

@rogancarr
Contributor Author

@eerhardt The answer is "it depends."

Take a look at the current options. We use binary flags to turn words and chars on and off:

// Create a training pipeline.
// TODO #2802: Update FeaturizeText to allow specifications of word-grams and char-grams.
var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", new string[] { "SentimentText" },
        new TextFeaturizingEstimator.Options
        {
            // Binary flags: each extractor can only be switched on or off.
            UseCharExtractor = true,
            UseWordExtractor = true,
            VectorNormalizer = TextFeaturizingEstimator.TextNormKind.L1
        })
    .AppendCacheCheckpoint(mlContext)
    .Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(
        new SdcaBinaryTrainer.Options { NumThreads = 1 }));

If we want to be able to choose n-grams, then it makes more sense to get rid of these flag variables and replace them with options (e.g. how many n-grams to use, whether to do all n-grams up to a cutoff).
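To make the idea concrete, here is a minimal, self-contained sketch of what options in place of the flags could look like. It does not use ML.NET at all: `NgramOptions`, `NgramSketch`, and their members are hypothetical names chosen for illustration, and the whitespace tokenization and `|` separator are assumptions, not the library's behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical options type: illustrative names only, not the ML.NET API.
public sealed class NgramOptions
{
    // Maximum n-gram length (for example, 2 for bigrams).
    public int NgramLength { get; set; } = 1;

    // If true, emit every length from 1 up to NgramLength;
    // otherwise emit only n-grams of exactly NgramLength.
    public bool UseAllLengths { get; set; } = true;
}

public static class NgramSketch
{
    // Builds n-grams over any token sequence, joining tokens with the separator.
    public static List<string> Extract(
        IReadOnlyList<string> tokens, NgramOptions options, string separator)
    {
        if (options.NgramLength < 1)
            throw new ArgumentOutOfRangeException(
                nameof(options), "NgramLength must be at least 1.");

        int minLength = options.UseAllLengths ? 1 : options.NgramLength;
        var result = new List<string>();
        for (int n = minLength; n <= options.NgramLength; n++)
        {
            // Slide a window of n tokens across the sequence.
            for (int start = 0; start + n <= tokens.Count; start++)
            {
                result.Add(string.Join(separator, tokens.Skip(start).Take(n)));
            }
        }
        return result;
    }

    // Word-grams: split on whitespace (an assumption), join words with "|".
    public static List<string> WordGrams(string text, NgramOptions options)
    {
        string[] words = text.Split(
            new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
        return Extract(words, options, "|");
    }

    // Char-grams: treat each character as its own token.
    public static List<string> CharGrams(string text, NgramOptions options)
    {
        string[] chars = text.Select(c => c.ToString()).ToArray();
        return Extract(chars, options, "");
    }
}
```

With this sketch, `NgramSketch.WordGrams("the quick brown fox", new NgramOptions { NgramLength = 2, UseAllLengths = false })` yields `the|quick`, `quick|brown`, `brown|fox`, and `NgramSketch.CharGrams("text", new NgramOptions { NgramLength = 3, UseAllLengths = false })` yields `tex`, `ext`. Keeping separate options objects for words and characters would let each extractor be configured, or omitted, independently.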

@zeahmed zeahmed assigned zeahmed and unassigned zeahmed Mar 4, 2019
@shauheen shauheen added this to the 0319 milestone Mar 5, 2019
@vinodshanbhag
Member

The TLC text recipe defaults are word bigrams and character trigrams. Is this the default for TextFeaturizer as well?
