Merge branch 'master' into update_notebooks
JessicaXYWang authored Jul 13, 2023
2 parents 20faaa6 + 46a56e1 commit 30ddb26
Showing 152 changed files with 23,742 additions and 301 deletions.
32 changes: 16 additions & 16 deletions README.md
@@ -6,15 +6,15 @@ SynapseML (previously known as MMLSpark), is an open-source library that simplif

With SynapseML, you can build scalable and intelligent systems to solve challenges in domains such as anomaly detection, computer vision, deep learning, text analytics, and others. SynapseML can train and evaluate models on single-node, multi-node, and elastically resizable clusters of computers. This lets you scale your work without wasting resources. SynapseML is usable across Python, R, Scala, Java, and .NET. Furthermore, its API abstracts over a wide variety of databases, file systems, and cloud data stores to simplify experiments no matter where data is located.

-SynapseML requires Scala 2.12, Spark 3.2+, and Python 3.8+.
+SynapseML requires Scala 2.12, Spark 3.2+, and Python 3.8+.

| Topics | Links |
| :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Build | [![Build Status](https://msdata.visualstudio.com/A365/_apis/build/status/microsoft.SynapseML?branchName=master)](https://msdata.visualstudio.com/A365/_build/latest?definitionId=17563&branchName=master) [![codecov](https://codecov.io/gh/Microsoft/SynapseML/branch/master/graph/badge.svg)](https://codecov.io/gh/Microsoft/SynapseML) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) |
-| Version | [![Version](https://img.shields.io/badge/version-0.11.1-blue)](https://github.com/Microsoft/SynapseML/releases) [![Release Notes](https://img.shields.io/badge/release-notes-blue)](https://github.com/Microsoft/SynapseML/releases) [![Snapshot Version](https://mmlspark.blob.core.windows.net/icons/badges/master_version3.svg)](#sbt) |
-| Docs | [![Scala Docs](https://img.shields.io/static/v1?label=api%20docs&message=scala&color=blue&logo=scala)](https://mmlspark.blob.core.windows.net/docs/0.11.1/scala/index.html#package) [![PySpark Docs](https://img.shields.io/static/v1?label=api%20docs&message=python&color=blue&logo=python)](https://mmlspark.blob.core.windows.net/docs/0.11.1/pyspark/index.html) [![Academic Paper](https://img.shields.io/badge/academic-paper-7fdcf7)](https://arxiv.org/abs/1810.08744) |
+| Version | [![Version](https://img.shields.io/badge/version-0.11.2-blue)](https://github.com/Microsoft/SynapseML/releases) [![Release Notes](https://img.shields.io/badge/release-notes-blue)](https://github.com/Microsoft/SynapseML/releases) [![Snapshot Version](https://mmlspark.blob.core.windows.net/icons/badges/master_version3.svg)](#sbt) |
+| Docs | [![Scala Docs](https://img.shields.io/static/v1?label=api%20docs&message=scala&color=blue&logo=scala)](https://mmlspark.blob.core.windows.net/docs/0.11.2/scala/index.html#package) [![PySpark Docs](https://img.shields.io/static/v1?label=api%20docs&message=python&color=blue&logo=python)](https://mmlspark.blob.core.windows.net/docs/0.11.2/pyspark/index.html) [![Academic Paper](https://img.shields.io/badge/academic-paper-7fdcf7)](https://arxiv.org/abs/1810.08744) |
| Support | [![Gitter](https://badges.gitter.im/Microsoft/MMLSpark.svg)](https://gitter.im/Microsoft/MMLSpark?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) [![Mail](https://img.shields.io/badge/mail-synapseml--support-brightgreen)](mailto:[email protected]) |
-| Binder | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/microsoft/SynapseML/v0.11.1?labpath=notebooks%2Ffeatures) |
+| Binder | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/microsoft/SynapseML/v0.11.2?labpath=notebooks%2Ffeatures) |
<!-- markdownlint-disable MD033 -->
<details open>
<summary>
@@ -94,7 +94,7 @@ In Azure Synapse notebooks please place the following in the first cell of your
{
"name": "synapseml",
"conf": {
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.1-spark3.3",
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.2-spark3.3",
"spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
"spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
"spark.yarn.user.classpath.first": "true",
@@ -110,7 +110,7 @@ In Azure Synapse notebooks please place the following in the first cell of your
{
"name": "synapseml",
"conf": {
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.3.1",
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.2,org.apache.spark:spark-avro_2.12:3.3.1",
"spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
"spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
"spark.yarn.user.classpath.first": "true",
@@ -130,15 +130,15 @@ cloud](http://community.cloud.databricks.com), create a new [library from Maven
coordinates](https://docs.databricks.com/user-guide/libraries.html#libraries-from-maven-pypi-or-spark-packages)
in your workspace.

-For the coordinates use: `com.microsoft.azure:synapseml_2.12:0.11.1`
+For the coordinates use: `com.microsoft.azure:synapseml_2.12:0.11.2`
with the resolver: `https://mmlspark.azureedge.net/maven`. Ensure this library is
attached to your target cluster(s).

Finally, ensure that your Spark cluster has at least Spark 3.2 and Scala 2.12. If you encounter Netty dependency issues please use DBR 10.1.

You can use SynapseML in both your Scala and PySpark notebooks. To get started with our example notebooks import the following databricks archive:

-`https://mmlspark.blob.core.windows.net/dbcs/SynapseMLExamplesv0.11.1.dbc`
+`https://mmlspark.blob.core.windows.net/dbcs/SynapseMLExamplesv0.11.2.dbc`

### Microsoft Fabric

@@ -151,7 +151,7 @@ In Microsoft Fabric notebooks please place the following in the first cell of yo
{
"name": "synapseml",
"conf": {
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.1-spark3.3",
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.2-spark3.3",
"spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
"spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
"spark.yarn.user.classpath.first": "true",
@@ -167,7 +167,7 @@ In Microsoft Fabric notebooks please place the following in the first cell of yo
{
"name": "synapseml",
"conf": {
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.3.1",
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.2,org.apache.spark:spark-avro_2.12:3.3.1",
"spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
"spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
"spark.yarn.user.classpath.first": "true",
@@ -186,7 +186,7 @@ the above example, or from python:
```python
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.11.1") \
.config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.11.2") \
.getOrCreate()
import synapse.ml
```
@@ -197,9 +197,9 @@ SynapseML can be conveniently installed on existing Spark clusters via the
`--packages` option, examples:

```bash
-spark-shell --packages com.microsoft.azure:synapseml_2.12:0.11.1
-pyspark --packages com.microsoft.azure:synapseml_2.12:0.11.1
-spark-submit --packages com.microsoft.azure:synapseml_2.12:0.11.1 MyApp.jar
+spark-shell --packages com.microsoft.azure:synapseml_2.12:0.11.2
+pyspark --packages com.microsoft.azure:synapseml_2.12:0.11.2
+spark-submit --packages com.microsoft.azure:synapseml_2.12:0.11.2 MyApp.jar
```

### SBT
@@ -208,7 +208,7 @@ If you are building a Spark application in Scala, add the following lines to
your `build.sbt`:

```scala
-libraryDependencies += "com.microsoft.azure" % "synapseml_2.12" % "0.11.1"
+libraryDependencies += "com.microsoft.azure" % "synapseml_2.12" % "0.11.2"
```

### Apache Livy and HDInsight
@@ -222,7 +222,7 @@ Excluding certain packages from the library may be necessary due to current issu
{
"name": "synapseml",
"conf": {
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.1",
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.2",
"spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind"
}
}
2 changes: 1 addition & 1 deletion build.sbt
@@ -220,7 +220,7 @@ publishDotnetBase := {
packDotnetAssemblyCmd(join(dotnetBaseDir, "target").getAbsolutePath, dotnetBaseDir)
val packagePath = join(dotnetBaseDir,
// Update the version whenever there's a new release
"target", s"SynapseML.DotnetBase.${dotnetedVersion("0.11.1")}.nupkg").getAbsolutePath
"target", s"SynapseML.DotnetBase.${dotnetedVersion("0.11.2")}.nupkg").getAbsolutePath
publishDotnetAssemblyCmd(packagePath, genSleetConfig.value)
}

@@ -95,7 +95,8 @@ trait MapsAsyncReply extends HasAsyncReply {
val statusRequest = new HttpGet()
statusRequest.setURI(location)
statusRequest.setHeader("User-Agent", s"synapseml/${BuildInfo.version}${HeaderValues.PlatformInfo}")
-    val resp = convertAndClose(sendWithRetries(client, statusRequest, getBackoffs))
+    val resp = convertAndClose(sendWithRetries(
+      client, statusRequest, getBackoffs, extraCodesToRetry = Set(404))) // scalastyle:off magic.number
statusRequest.releaseConnection()
val status = resp.statusLine.statusCode
if (status == 202) {
@@ -32,9 +32,9 @@ trait FormRecognizerV3Utils extends TestBase {
}

def resultAssert(result: Array[Row], str1: String, str2: String): Unit = {
-    assert(result.head.getString(2).startsWith(str1))
+    assert(result.head.getString(2).contains(str1))
assert(result.head.getSeq(3).head.asInstanceOf[HashMap.HashTrieMap[String, _]]
-      .keys.toSeq.sortWith(_ < _).mkString(",") === str2)
+      .keys.toSeq.sortWith(_ < _).mkString(",").contains(str2))
}

def documentTest(model: AnalyzeDocument, df: DataFrame): DataFrame = {
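For readability, here is the relaxed helper assembled from the `+` lines above — a reconstruction, not a verbatim copy of the file. The switch from `startsWith`/`===` to `contains` presumably keeps these checks stable when the service returns extra or reordered output:

```scala
import org.apache.spark.sql.Row
import scala.collection.immutable.HashMap

// Reconstructed from the + lines above: both checks are now substring
// matches, so additional recognized text or extra field keys no longer
// fail the assertion.
def resultAssert(result: Array[Row], str1: String, str2: String): Unit = {
  assert(result.head.getString(2).contains(str1))
  assert(result.head.getSeq(3).head.asInstanceOf[HashMap.HashTrieMap[String, _]]
    .keys.toSeq.sortWith(_ < _).mkString(",").contains(str2))
}
```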
@@ -136,10 +136,8 @@ class AnalyzeDocumentSuite extends TransformerFuzzing[AnalyzeDocument] with Form
for (result <- Seq(result1, result2)) {
resultAssert(
result,
"USA\nWASHINGTON\n20 1234567XX1101\nDRIVER LICENSE\nFEDERAL LIMITS APPLY\n4d LIC#WDLABCD456DG 9CLASS\n" +
"DONOR\n1 TALBOT\n2 LIAM R.\n3 DOB 01/06/1958 8 123 STREET ADDRESS YOUR CITY WA 99999-1234\n",
"Address,CountryRegion,DateOfBirth,DateOfExpiration,DateOfIssue,DocumentDiscriminator," +
"DocumentNumber,Endorsements,EyeColor,FirstName,Height,LastName,Region,Restrictions,Sex,Weight")
"123 STREET ADDRESS",
"DateOfExpiration")
}
}

@@ -156,9 +154,8 @@
val result2 = modelsTest(bytesAnalyzeBusinessCards, bytesDF3, useBytes = true)
for (result <- Seq(result1, result2)) {
resultAssert(result,
"Dr. Avery Smith Senior Researcher Cloud & Al Department",
"Addresses,CompanyNames,ContactNames," +
"Departments,Emails,Faxes,JobTitles,MobilePhones,Websites,WorkPhones")
"Dr. Avery Smith",
"Addresses")
}
}

@@ -176,10 +173,8 @@
for (result <- Seq(result1, result2)) {
resultAssert(
result,
"Contoso\nAddress:\n1 Redmond way Suite\n6000 Redmond, WA\n99243\n" +
"Invoice For: Microsoft\n1020 Enterprise Way",
"CustomerAddress,CustomerAddressRecipient," +
"CustomerName,DueDate,InvoiceDate,InvoiceId,InvoiceTotal,Items,VendorAddress,VendorName")
"1020 Enterprise Way",
"CustomerAddress")
}
}

@@ -197,9 +192,8 @@
for (result <- Seq(result1, result2)) {
resultAssert(
result,
"Contoso\nContoso\n123 Main Street\nRedmond, WA 98052",
"Items,MerchantAddress,MerchantName,MerchantPhoneNumber," +
"Subtotal,Total,TotalTax,TransactionDate,TransactionTime")
"123 Main Street",
"TransactionDate")
}
}
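Each of these four tests follows the same pattern: exact-transcript and full-field-list assertions give way to spot checks on one stable substring and one stable field key. A generic sketch of that relaxed check, with illustrative names not taken from the repo:

```scala
// Sketch of the relaxed assertion pattern used above: pin one stable
// snippet of the recognized text and one stable extracted-field key,
// rather than the full OCR transcript and the complete key list.
def spotCheck(text: String, fieldKeys: Seq[String],
              expectedSnippet: String, expectedFieldKey: String): Unit = {
  assert(text.contains(expectedSnippet))
  assert(fieldKeys.sorted.mkString(",").contains(expectedFieldKey))
}
```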

2 changes: 1 addition & 1 deletion core/src/main/dotnet/src/dotnetBase.csproj
@@ -7,7 +7,7 @@
<IsPackable>true</IsPackable>

<Description>SynapseML .NET Base</Description>
-    <Version>0.11.1</Version>
+    <Version>0.11.2</Version>
</PropertyGroup>

<ItemGroup>
@@ -53,7 +53,7 @@ object DotnetCodegen {
|
| <ItemGroup>
| <PackageReference Include="Microsoft.Spark" Version="2.1.1" />
| <PackageReference Include="SynapseML.DotnetBase" Version="0.11.1" />
| <PackageReference Include="SynapseML.DotnetBase" Version="0.11.2" />
| <PackageReference Include="IgnoresAccessChecksToGenerator" Version="0.4.0" PrivateAssets="All" />
| $newtonsoftDep
| </ItemGroup>
@@ -86,12 +86,14 @@ object HandlingUtils extends SparkLogging {
//scalastyle:off cyclomatic.complexity
private[ml] def sendWithRetries(client: CloseableHttpClient,
request: HttpRequestBase,
-                                  retriesLeft: Array[Int]): CloseableHttpResponse = {
+                                  retriesLeft: Array[Int],
+                                  extraCodesToRetry: Set[Int] = Set()
+                                 ): CloseableHttpResponse = {
try {
val response = client.execute(request)
val code = response.getStatusLine.getStatusCode
//scalastyle:off magic.number
-      val succeeded = code match {
+      val dontRetry = code match {
case 200 => true
case 201 => true
case 202 => true
@@ -108,20 +110,23 @@
Thread.sleep(h.getValue.toLong * 1000)
}
false
-        case 400 =>
-          true
-        case _ =>
+        case code =>
logWarning(s"got error $code: ${response.getStatusLine.getReasonPhrase} on ${
request match {
case p: HttpPost => p.getURI + " " +
Try(IOUtils.toString(p.getEntity.getContent, "UTF-8")).getOrElse("")
case _ => request.getURI
}
}")
-          false
+
+          if (extraCodesToRetry(code)) {
+            false
+          } else {
+            code.toString.startsWith("4") // Retry only when code isn't a 4XX
+          }
}
//scalastyle:on magic.number
-      if (succeeded || retriesLeft.isEmpty) {
+      if (dontRetry || retriesLeft.isEmpty) {
response
} else {
response.close()
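Distilled from the hunk above, the new retry decision can be read as a small pure function. This is a sketch of the rule only, not the library's code: it omits the `Retry-After` sleep on 429 and the warning log that the real method performs.

```scala
object RetryRuleSketch extends App {
  // Decision only: true means return the response as-is, false means retry
  // (subject to the remaining backoff schedule in retriesLeft).
  def dontRetry(code: Int, extraCodesToRetry: Set[Int] = Set()): Boolean =
    code match {
      case 200 | 201 | 202 => true            // success: stop retrying
      case 429 => false                       // throttled: sleep, then retry
      case c if extraCodesToRetry(c) => false // caller-requested, e.g. Set(404)
      case c => c.toString.startsWith("4")    // other 4xx terminal; 5xx retried
    }

  // With extraCodesToRetry = Set(404), the Azure Maps poller above keeps
  // waiting when the status resource has not materialized yet, instead of
  // treating the 404 as a terminal client error.
  assert(!dontRetry(404, Set(404))) // retried while polling
  assert(dontRetry(404))            // terminal by default
  assert(!dontRetry(503))           // server errors are always retried
}
```

The rename from `succeeded` to `dontRetry` also matches the semantics: a 400 was never a success, but it is a reason to stop retrying.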
@@ -89,7 +89,7 @@ object DotnetTestGen {
| <IncludeAssets>runtime; build; native; contentfiles; analyzers</IncludeAssets>
| </PackageReference>
| <PackageReference Include="Microsoft.Spark" Version="2.1.1" />
| <PackageReference Include="SynapseML.DotnetBase" Version="0.11.1" />
| <PackageReference Include="SynapseML.DotnetBase" Version="0.11.2" />
| <PackageReference Include="SynapseML.DotnetE2ETest" Version="${conf.dotnetVersion}" />
| <PackageReference Include="SynapseML.$curProject" Version="${conf.dotnetVersion}" />
| $referenceCore
