Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert LdaTransform to IEstimator/ITransformer API #1410

Merged
merged 38 commits into from
Nov 20, 2018

Conversation

abgoswam
Copy link
Member

@abgoswam abgoswam commented Oct 27, 2018

This PR converts LdaTransform to the IEstimator/ITransformer API paradigm (a work item related to #754)

  • LdaTransform is now subclass of OneToOneTransformerBase
  • Deleted the previous wrapped version.
  • Added pigsty extensions
  • Added onFit delegate to return LdaState info to user

Created separate issue #1411 for providing a convenient way of providing LDA topic summary info to the user

@@ -17,12 +17,19 @@
using Microsoft.ML.Runtime.Internal.Utilities;
using Microsoft.ML.Runtime.Model;
using Microsoft.ML.Runtime.TextAnalytics;
using Microsoft.ML.Core.Data;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ctrl+R+G (aka sort using) #Closed


// The type T asociated with the delegate will be the actual value type once #863 goes in.
// However, until such time as #863 goes in, it would be too awkward to attempt to extract the metadata.
// For now construct the useless object then pass it into the delegate.
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment doesn't seems right. #Resolved

"Latent Dirichlet Allocation Transform", LdaTransform.LoaderSignature)]

[assembly: LoadableClass(typeof(IRowMapper), typeof(LdaTransform), null, typeof(SignatureLoadRowMapper),
"Latent Dirichlet Allocation Transform", LdaTransform.LoaderSignature)]

namespace Microsoft.ML.Runtime.TextAnalytics
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime [](start = 23, length = 7)

Can you merge with master, I'm sure Senja changed this namespace to something better. #Resolved

@@ -646,7 +646,7 @@ public void TestLdaTransformEmptyDocumentException()

try
{
var lda = new LdaTransform(Env, args, srcView);
var lda = LdaTransform.Create(Env, args, srcView);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LdaTransform [](start = 26, length = 12)

Can you switch to estimator call?
Basically var lda = LdaEstimator(env, cols).Fit(srcView).Transform(scrView); #Resolved

/// <param name="numSummaryTermPerTopic">The number of words to summarize the topic</param>
/// <param name="numBurninIterations">The number of burn-in iterations</param>
/// <param name="resetRandomGenerator">Reset the random number generator for each document</param>
/// <param name="onFit">Called upon fitting with the learnt enumeration on the dataset.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sprinkle it with dots at the end of sentence. #Resolved

Contracts.Assert(0 <= iinfo && iinfo < _parent.ColumnPairs.Length);
disposer = null;

var test = TestType(_srcTypes[iinfo]);
Copy link
Member Author

@abgoswam abgoswam Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestType [](start = 27, length = 8)

@ivan. Should we put this check here v/s putting it in the Mapper constructor ? Does it make a difference ? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It make difference at what point it will fail. I would prefer to test types in constructor.


In reply to: 230899143 [](ancestors = 230899143)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to have TestType idiom please


In reply to: 232105072 [](ancestors = 232105072,230899143)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if column type in incorrect we now throw ExceptSchemaMismatch


In reply to: 232437787 [](ancestors = 232437787,232105072,230899143)

private struct Config
{
public readonly int NumTopic;
public readonly Single AlphaSum;
Copy link
Member

@wschin wschin Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public readonly Single AlphaSum;
public readonly float AlphaSum;
``` #Resolved

{
public readonly int NumTopic;
public readonly Single AlphaSum;
public readonly Single Beta;
Copy link
Member

@wschin wschin Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public readonly Single Beta;
public readonly float Beta;
``` #Resolved

@@ -22,6 +22,7 @@
using System.Text;
using Xunit;
using Xunit.Abstractions;
using static Microsoft.ML.Transforms.Text.LdaTransform;
Copy link
Contributor

@artidoro artidoro Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort the usings. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Control+R Control+G (remove and sort Using) gives this order itself


In reply to: 231324047 [](ancestors = 231324047)

}

/// <include file='doc.xml' path='doc/members/member[@name="LightLDA"]/*' />
/// <param name="input">The column to apply to.</param>
Copy link
Member

@wschin wschin Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe what this transform is doing briefly? #Resolved

/// <param name="onFit">Called upon fitting with the learnt enumeration on the dataset.</param>
public static Vector<float> ToLdaTopicVector(this Vector<float> input,
int numTopic = LdaEstimator.Defaults.NumTopic,
Single alphaSum = LdaEstimator.Defaults.AlphaSum,
Copy link
Member

@wschin wschin Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Single alphaSum = LdaEstimator.Defaults.AlphaSum,
float alphaSum = LdaEstimator.Defaults.AlphaSum,
``` #Resolved

public static Vector<float> ToLdaTopicVector(this Vector<float> input,
int numTopic = LdaEstimator.Defaults.NumTopic,
Single alphaSum = LdaEstimator.Defaults.AlphaSum,
Single beta = LdaEstimator.Defaults.Beta,
Copy link
Member

@wschin wschin Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Single beta = LdaEstimator.Defaults.Beta,
float beta = LdaEstimator.Defaults.Beta,
``` #Resolved

@artidoro
Copy link
Contributor

artidoro commented Nov 7, 2018

Could you add mlcontext extensions?
You can follow this PR as an example: #1524. The whitening PR also does that. #Resolved

@@ -155,8 +160,10 @@ public bool TryUnparse(StringBuilder sb)
}
}

private sealed class ColInfoEx
public sealed class ColumnInfo
Copy link
Contributor

@artidoro artidoro Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not combine this with transformInfo? You can just add this public constructor and change the name to columnInfo, while keeping the other methods internal/private. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reason why you can't delete TransformInfo class and use ColumnInfo instead.


In reply to: 231632966 [](ancestors = 231632966)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


In reply to: 232104746 [](ancestors = 232104746,231632966)

getSrc(ref src);
lda.Output(in src, ref dst, numBurninIter, reset);
};
return new LdaTransform(_host, input, _columns);
Copy link
Contributor

@artidoro artidoro Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=> #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


In reply to: 231635318 [](ancestors = 231635318)

}
using (var ch = Host.Start("Train"))
{
Train(ch, input, _ldas);
Copy link
Contributor

@artidoro artidoro Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Train [](start = 16, length = 5)

Pete suggested that we make training a private static method in the transform, and that we have a constructor for the transform that will accept the trained parameters as an input.

This should make it easier to allow initialization of the transform using parameters coming from an external training routine. It is not something that we are exposing now, however.

This allows the transform to only keep the fields storing trained parameters. For example _types should no longer be needed as a field in the transform. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented something like this for whitening.


In reply to: 231639657 [](ancestors = 231639657)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. thanks for giving the context about this approach.


In reply to: 231640188 [](ancestors = 231640188,231639657)


NumThread = item.NumThreads ?? args.NumThreads ?? 0;
ectx.CheckUserArg(NumThread >= 0, nameof(item.NumThreads), "Must be positive or zero.");
public sealed class LdaState : IDisposable
Copy link
Contributor

@artidoro artidoro Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public [](start = 8, length = 6)

Can this be made internal/private? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like some 'LDA guts' and should under no circumstances be exposed to the user.


In reply to: 231641222 [](ancestors = 231641222)

Copy link
Member Author

@abgoswam abgoswam Nov 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this internal. Instead added a class LdaTopicSummary to provide topic summary information back to users (when needed)


In reply to: 232437742 [](ancestors = 232437742,231641222)

@abgoswam
Copy link
Member Author

abgoswam commented Nov 15, 2018

@Zruty0 I have addressed your PR comments. Could you kindly take a look ? #Resolved

@shauheen
Copy link
Contributor

shauheen commented Nov 19, 2018

ping @Zruty0 #Resolved

using System;
using System.Collections.Generic;

namespace Microsoft.ML.Transforms.Text
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Nov 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microsoft.ML.Transforms.Text [](start = 10, length = 28)

Pete will ask you to put static extensions into Microsoft.ML.StaticPipe; #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sure will


In reply to: 234771139 [](ancestors = 234771139)

ectx.CheckUserArg(NumBurninIter >= 0, nameof(item.NumBurninIterations), "Must be non-negative.");
Contracts.CheckValue(input, nameof(input));
Contracts.CheckValueOrNull(output);
Contracts.CheckUserArg(numTopic > 0, nameof(numTopic), "Must be positive.");
Copy link
Contributor

@Zruty0 Zruty0 Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckUserArg [](start = 26, length = 12)

Unless parsing command line, use CheckParam instead of CheckUserArg #Resolved

ctx.CheckAtModel(GetVersionInfo());
h.CheckValue(input, nameof(input));
// For each topic, provide information about the (item, score) pairs.
public readonly Dictionary<int, List<Tuple<int, float>>> ItemScoresPerTopic;
Copy link
Contributor

@Zruty0 Zruty0 Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List<Tuple<int, float> [](start = 44, length = 22)

Tuple<int, float> -> (int Item, float Score) (use named tuple instead of tuples)
List -> `ImmutableArray
#Resolved

ctx.CheckAtModel(GetVersionInfo());
h.CheckValue(input, nameof(input));
// For each topic, provide information about the (item, score) pairs.
public readonly Dictionary<int, List<Tuple<int, float>>> ItemScoresPerTopic;
Copy link
Contributor

@Zruty0 Zruty0 Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the int key in this dictionary? Is it a sequential integer? If yes, make it into ImmutableArray. If it's not a sequential integer, describe what it is, and make it into an IReadOnlyDictionary #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made it ImmutableArray since its a sequential integer


In reply to: 234857025 [](ancestors = 234857025)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Zruty0
Copy link
Contributor

Zruty0 commented Nov 20, 2018

Approval conditional on fixing all the remaining open comments :)


/// <summary>
/// Initializes a new instance of <see cref="LatentDirichletAllocationEstimator"/>.
/// </summary>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use summary form method above.
It should be user friendly. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


In reply to: 235162180 [](ancestors = 235162180)


ResetRandomGenerator = item.ResetRandomGenerator ?? args.ResetRandomGenerator;
internal ColumnInfo(Column item, Arguments args)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal ColumnInfo(Column item, Arguments args) [](start = 12, length = 48)

you can just call constructor above, and that way you don't need to duplicate code.
internal ColumnInfo(Column item, Arguments args): this( item.Source, item.Name, args.NumTopic, ....)
#Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


In reply to: 235162967 [](ancestors = 235162967)

}
}

public const string LoaderSignature = "LdaTransform";
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public [](start = 8, length = 6)

internal #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


In reply to: 235163681 [](ancestors = 235163681)


// The following call fails because of the following issue
// https://github.com/dotnet/machinelearning/issues/969
// In this test it manifests because of the WordBagEstimator in the estimator chain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if you uncomment this?
I've recently convert WordBag to estimator so it should work properly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I uncomment this, I get an error

Schema mismatch for input column 'text': expected Text, got R4 Parameter name: inputSchema

The WordBagWorkout test still comments out the TestEstimatorCore (line #186 above) .


In reply to: 235166598 [](ancestors = 235166598)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@shauheen shauheen merged commit 1a9e7aa into dotnet:master Nov 20, 2018
@abgoswam abgoswam deleted the abgoswam/LDA_conversion branch January 13, 2019 18:12
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants