TextFeaturizer cannot specify n-grams for words or characters #2802

rogancarr · 2019-02-28T22:37:34Z

One of the stated goals of the V1 API was:

I can modify settings in the TextFeaturizer to update the number of word-grams and char-grams used along with things like the normalization.

In the current API for TextFeaturizer, it is possible to create n-grams from words and/or characters (UseCharExtractor, UseWordExtractor), but it is not possible to specify what sorts of n-grams to make.

Related to #2711

rogancarr · 2019-02-28T22:39:50Z

Also, it is not clear if this is even possible using a workaround.

najeeb-kazmi · 2019-02-28T23:13:09Z

Related to #838 ?

rogancarr · 2019-03-01T17:24:45Z

@najeeb-kazmi Yes, it is. That one proposes a wider set of functionality than we need for the proposed V1 features.

eerhardt · 2019-03-02T20:57:19Z

Is this strictly adding new API? Can this be done without a public API breaking change? If so, I think we can remove it from Project 13, and it can be added after v1.0.

But if this requires a public API breaking change, then it can be left in Project 13.

rogancarr · 2019-03-04T18:41:10Z

@eerhardt The answer is "it depends".

Take a look at the current options. We use binary flags to turn words and chars on and off:

// Create a training pipeline.
// TODO #2802: Update FeaturizeText to allow specifications of word-grams and char-grams.
var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", new string[] { "SentimentText" },
    new TextFeaturizingEstimator.Options
    {
        UseCharExtractor = true,
        UseWordExtractor = true,
        VectorNormalizer = TextFeaturizingEstimator.TextNormKind.L1
    })
    .AppendCacheCheckpoint(mlContext)
    .Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(
        new SdcaBinaryTrainer.Options { NumThreads = 1 }));

If we want to be able to choose n-grams, then it makes more sense to get rid of these flag variables and replace them with options (e.g. how many n-grams to use, whether to do all n-grams up to a cutoff).
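
To make the two proposed settings concrete, here is a minimal, library-independent sketch of what "n-gram length" and "all n-grams up to a cutoff" would mean for a token sequence. The class `NgramSketch`, the method `Extract`, and the parameter names `ngramLength` and `useAllLengths` are hypothetical names chosen for illustration; they are not the ML.NET API and say nothing about how TextFeaturizer is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class NgramSketch
{
    // Returns the n-grams found in `tokens`, each joined with '|'.
    // useAllLengths == false: only n-grams of exactly `ngramLength` tokens.
    // useAllLengths == true:  every length from 1 up to `ngramLength` (the cutoff).
    public static List<string> Extract(
        IReadOnlyList<string> tokens, int ngramLength, bool useAllLengths)
    {
        if (ngramLength < 1)
            throw new ArgumentOutOfRangeException(nameof(ngramLength));

        var result = new List<string>();
        int minLength = useAllLengths ? 1 : ngramLength;

        for (int n = minLength; n <= ngramLength; n++)
        {
            // Slide a window of n tokens across the sequence.
            for (int start = 0; start + n <= tokens.Count; start++)
            {
                result.Add(string.Join("|", tokens.Skip(start).Take(n)));
            }
        }
        return result;
    }
}
```

For word-grams, the tokens would be words, e.g. `NgramSketch.Extract("the cat sat".Split(' '), 2, false)`. For char-grams, the tokens would be single characters, e.g. `NgramSketch.Extract("cat".Select(c => c.ToString()).ToList(), 3, false)`. Replacing the two boolean flags with one such pair of settings per extractor would cover both cases.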

vinodshanbhag · 2019-03-06T02:54:28Z

The TLC text recipe defaults are word bigrams and character trigrams. Is this the default for TextFeaturizer as well?

rogancarr added the API (Issues pertaining the friendly API) label Feb 28, 2019

zeahmed assigned zeahmed and unassigned zeahmed Mar 4, 2019

shauheen added this to the 0319 milestone Mar 5, 2019

This was referenced Mar 6, 2019

Update default n-gram length for Text Transform to match default text recipe #2870

Closed

V1 Scenarios need to be covered by tests #2498

Open

zeahmed mentioned this issue Mar 11, 2019

Exposed ngram extraction options in TextFeaturizer #2911

Merged

zeahmed closed this as completed in #2911 Mar 13, 2019

singlis mentioned this issue Mar 18, 2019

I want to convert a text classification sample from ML 0.4 to ML 0.11 and cannot find equivalent code for calling Transforms.Text.FeaturizeText #2994

Closed

ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
