Release 0.7.0 · salesforce/TransmogrifAI

Bug fixes:

Fix flaky ModelInsight tests #407
Remove logging of tokens of text fields #420, #438, #447, #474
Add validation prepare call before model selection when no DAG is passed #424, #429
Fix Days.daysBetween int overflow #471
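
For context on the Days.daysBetween entry above: a day count derived from two epoch-millisecond timestamps can exceed the range of a 32-bit Int when the timestamps are far apart. The sketch below is only an illustration of that failure mode and one way to avoid it, assuming the entry refers to Joda-Time's Days, which stores its value as an Int. It is not the change made in #471. The names DaysBetweenSketch, daysBetween and daysBetweenAsInt are hypothetical.

```scala
object DaysBetweenSketch {
  // Milliseconds in one day, kept as a Long so the arithmetic below never narrows.
  private val MillisPerDay: Long = 24L * 60L * 60L * 1000L

  /** Whole days between two epoch-millisecond timestamps (truncated toward zero), as a Long.
    * Math.subtractExact throws ArithmeticException if the subtraction itself overflows a Long.
    */
  def daysBetween(startMillis: Long, endMillis: Long): Long =
    Math.subtractExact(endMillis, startMillis) / MillisPerDay

  /** Narrow to Int only when the value fits; returns None instead of wrapping or throwing. */
  def daysBetweenAsInt(startMillis: Long, endMillis: Long): Option[Int] =
    scala.util.Try(Math.toIntExact(daysBetween(startMillis, endMillis))).toOption
}
```

With these helpers, a span such as 0L to Long.MaxValue yields a Long day count that is larger than Int.MaxValue, and daysBetweenAsInt reports None for it rather than an incorrect value.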

New features / updates:

Downsample the number of training samples to maxTrainingSample for regression #413 and multi-class classification #414
Refactor InsightLOCOTest #412
Enable more loss types for OpLinearRegression #421
Add property-based tests for regression model selection #427
Add option to calculate LOCO for dates/texts by leaving out their entire vector #418
Add Chinese and Korean examples to TextTokenizerTest #442
Add support for ignoring text that looks like IDs in SmartTextVectorizer #448, #455
Add a unary estimator that detects names in text fields and transforms them to a likely gender #445
Allow result features to be removed by raw feature filter #458
Metadata changes for sensitive feature information #457
Add MinVarianceFilter which checks that computed features have a minimum variance #463, #465
Allow TextStats length distribution to be token-based and refactor for testability #464
Use Spark job grouping to distinguish steps of the machine learning flow #467, #468, #470
Make categorical detection coverage-based in addition to unique-count-based #473
Remove duplicate features using sanity checker feature to feature correlations #476, #479
Lift the upper bound on number of hash features #477
Enable HTML stripping on text-like features #478
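
To illustrate the idea behind the MinVarianceFilter entry above, here is a minimal sketch of a minimum-variance check over computed feature columns. It is not the TransmogrifAI implementation or API: the object MinVarianceSketch, its methods, and the choice of population variance are all assumptions made for the example.

```scala
object MinVarianceSketch {
  /** Population variance of a column of values; an empty column is treated as variance 0. */
  def variance(xs: Seq[Double]): Double =
    if (xs.isEmpty) 0.0
    else {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    }

  /** Names of the columns whose variance is at least minVariance (hypothetical threshold). */
  def keepColumns(columns: Map[String, Seq[Double]], minVariance: Double): Set[String] =
    columns.collect { case (name, values) if variance(values) >= minVariance => name }.toSet
}
```

A constant column such as Seq(1.0, 1.0, 1.0) has zero variance and would be dropped by any positive threshold, while a column that actually varies is kept.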

Dependency updates (#402, #466):

Update Apache Spark version to 2.4.5
Avro is a built-in data source in Spark 2.4, so no longer using the spark-avro package
Avro to 1.8.2
XGBoost to 0.90
MLeap to 0.14.0
json4s to 3.5.3
JUnit to 4.12
chill to 0.9.3
gradle-avro-plugin to 0.16.0

Miscellaneous:
