docs: update notebooks - bring back fabric reviewers changes. #1979

Merged
merged 14 commits into from
Jul 17, 2023
@@ -1,27 +1,50 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification - Before and After SynapseML\n",
"\n",
"### 1. Introduction\n",
"\n",
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>\n",
"\n",
"In this tutorial, we perform the same classification task in two\n",
"# Classification - before and after SynapseML"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"hide-synapse-internal"
]
},
"source": [
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this article, you perform the same classification task in two\n",
"different ways: once using plain **`pyspark`** and once using the\n",
"**`synapseml`** library. The two methods yield the same performance,\n",
"but one of the two libraries is drastically simpler to use and iterate\n",
"on (can you guess which one?).\n",
"but highlights the simplicity of using `synapseml` compared to `pyspark`.\n",
"\n",
"The task is simple: Predict whether a user's review of a book sold on\n",
"Amazon is good (rating > 3) or bad based on the text of the review. We\n",
"accomplish this by training LogisticRegression learners with different\n",
"The task is to predict whether a customer's review of a book sold on\n",
"Amazon is good (rating > 3) or bad based on the text of the review. You\n",
"accomplish it by training LogisticRegression learners with different\n",
"hyperparameters and choosing the best model."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"Import necessary Python libraries and get a spark session."
]
},
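The setup cell that follows is collapsed in this diff. As a rough sketch (not the notebook's exact code), the setup usually amounts to importing the needed libraries and getting a Spark session:

```python
from pyspark.sql import SparkSession

# On Synapse/Fabric a session is normally already provided; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()
```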
{
"cell_type": "code",
"execution_count": null,
@@ -35,12 +58,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Read the data\n",
"## Read the data\n",
"\n",
"We download and read in the data. We show a sample below:"
"Download and read in the data."
]
},
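The data-loading cell is also collapsed. A hedged sketch of what reading the book-review data typically looks like; the `wasbs://` path is the public sample location used in other SynapseML examples and is an assumption here:

```python
# Load the Amazon book review sample (path assumed from other SynapseML samples).
rawData = spark.read.parquet(
    "wasbs://publicwasb@mmlspark.blob.core.windows.net/BookReviewsFromAmazon10K.parquet"
)
rawData.show(5)
```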
{
@@ -56,16 +80,16 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Extract more features and process data\n",
"## Extract features and process data\n",
"\n",
"Real data however is more complex than the above dataset. It is common\n",
"for a dataset to have features of multiple types: text, numeric,\n",
"categorical. To illustrate how difficult it is to work with these\n",
"datasets, we add two numerical features to the dataset: the **word\n",
"count** of the review and the **mean word length**."
"Real data is more complex than the above dataset. It's common\n",
"for a dataset to have features of multiple types, such as text, numeric, and\n",
"categorical. To illustrate how difficult it's to work with these\n",
"datasets, add two numerical features to the dataset: the **word count** of the review and the **mean word length**."
]
},
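One way the two extra columns could be computed with plain Spark SQL functions; the column name `text` and the exact implementation in the notebook are assumptions:

```python
from pyspark.sql import functions as F

# wordCount: number of whitespace-separated tokens in the review.
# wordLength: mean word length, i.e. non-space characters divided by the word count.
data = rawData.withColumn(
    "wordCount", F.size(F.split(F.col("text"), " "))
).withColumn(
    "wordLength",
    F.length(F.regexp_replace(F.col("text"), " ", "")) / F.col("wordCount"),
)
```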
{
@@ -142,25 +166,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4a. Classify using pyspark\n",
"## Classify using pyspark\n",
"\n",
"To choose the best LogisticRegression classifier using the `pyspark`\n",
"library, need to *explicitly* perform the following steps:\n",
"library, you need to *explicitly* perform the following steps:\n",
"\n",
"1. Process the features:\n",
" * Tokenize the text column\n",
" * Hash the tokenized column into a vector using hashing\n",
" * Merge the numeric features with the vector in the step above\n",
" * Merge the numeric features with the vector\n",
"2. Process the label column: cast it into the proper type.\n",
"3. Train multiple LogisticRegression algorithms on the `train` dataset\n",
" with different hyperparameters\n",
"4. Compute the area under the ROC curve for each of the trained models\n",
" and select the model with the highest metric as computed on the\n",
" `test` dataset\n",
"5. Evaluate the best model on the `validation` set\n",
"\n",
"As you can see below, there is a lot of work involved and a lot of\n",
"steps where something can go wrong!"
"5. Evaluate the best model on the `validation` set"
]
},
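A condensed sketch of those five steps in plain `pyspark`. The column names (`text`, `rating`, `wordCount`, `wordLength`), the `train`/`test`/`validation` splits, and the hyperparameter grid are illustrative assumptions, not the notebook's exact code:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql import functions as F

# 1. Featurize: tokenize the text, hash the tokens, merge with the numeric columns.
featurize = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="textFeatures", numFeatures=1 << 18),
    VectorAssembler(inputCols=["textFeatures", "wordCount", "wordLength"], outputCol="features"),
]).fit(train)

# 2. Cast the label to a numeric type (good review = rating > 3).
def prep(df):
    return featurize.transform(df).withColumn("label", (F.col("rating") > 3).cast("double"))

ptrain, ptest, pvalidation = prep(train), prep(test), prep(validation)

# 3-4. Train several LogisticRegression models and keep the one with the best test AUC.
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
models = [LogisticRegression(regParam=reg).fit(ptrain) for reg in [0.1, 0.3, 0.5]]
best = max(models, key=lambda m: evaluator.evaluate(m.transform(ptest)))

# 5. Evaluate the chosen model on the validation set.
print("validation AUC:", evaluator.evaluate(best.transform(pvalidation)))
```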
{
@@ -235,16 +256,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Classify using synapseml\n",
"## Classify using SynapseML\n",
"\n",
"Life is a lot simpler when using `synapseml`!\n",
"The pipeline can be simplified by using SynapseML:\n",
"\n",
"1. The **`TrainClassifier`** Estimator featurizes the data internally,\n",
" as long as the columns selected in the `train`, `test`, `validation`\n",
" dataset represent the features\n",
"\n",
"2. The **`FindBestModel`** Estimator finds the best model from a pool of\n",
" trained models by finding the model which performs best on the `test`\n",
" trained models by finding the model that performs best on the `test`\n",
" dataset given the specified metric\n",
"\n",
"3. The **`ComputeModelStatistics`** Transformer computes the different\n",
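For contrast, a sketch of the same workflow with the three SynapseML stages named above. It assumes the splits already contain a numeric `label` column; hyperparameters are illustrative:

```python
from pyspark.ml.classification import LogisticRegression
from synapse.ml.train import TrainClassifier, ComputeModelStatistics
from synapse.ml.automl import FindBestModel

# TrainClassifier featurizes the selected columns internally before fitting each learner.
models = [
    TrainClassifier(model=LogisticRegression(regParam=reg), labelCol="label").fit(train)
    for reg in [0.1, 0.3, 0.5]
]

# Keep the model with the best AUC on the test set, then summarize it on the validation set.
best = FindBestModel(evaluationMetric="AUC", models=models).fit(test)
display(ComputeModelStatistics().transform(best.transform(validation)))
```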
@@ -11,7 +11,17 @@
}
},
"source": [
"# A 5-minute tour of SynapseML"
"# Build your first SynapseML model\n",
"This tutorial provides a brief introduction to SynapseML. In particular, we use SynapseML to create two different pipelines for sentiment analysis. The first pipeline combines a text featurization stage with LightGBM regression to predict ratings based on review text from a dataset containing book reviews from Amazon. The second pipeline shows how to use prebuilt models through the Azure Cognitive Services to solve this problem without training data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up the environment\n",
"Import SynapseML libraries and initialize your Spark session."
]
},
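The collapsed setup cell is not shown here; as a minimal sketch, it typically reduces to getting a Spark session and importing `find_secret`, which the Cognitive Services cell later in this diff relies on:

```python
from pyspark.sql import SparkSession
from synapse.ml.core.platform import find_secret  # used below for the Cognitive Services key

spark = SparkSession.builder.getOrCreate()
```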
{
@@ -39,6 +49,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
@@ -48,7 +59,8 @@
}
},
"source": [
"# Step 1: Load our Dataset"
"## Load a dataset\n",
"Load your dataset and split it into train and test sets."
]
},
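A sketch of the collapsed load-and-split step; the DataFrame name `data`, the row limit, and the split fractions are assumptions:

```python
# Hold out part of the book-review sample for testing.
train, test = data.limit(1000).cache().randomSplit([0.85, 0.15], seed=0)
```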
{
@@ -77,6 +89,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -87,7 +100,8 @@
}
},
"source": [
"# Step 2: Make our Model"
"## Create the training pipeline\n",
"Create a pipeline that featurizes data using `TextFeaturizer` from the `synapse.ml.featurize.text` library and derives a rating using the `LightGBMRegressor` function."
]
},
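A sketch of that pipeline; the column names (`text`, `rating`) and LightGBM parameters are assumptions:

```python
from pyspark.ml import Pipeline
from synapse.ml.featurize.text import TextFeaturizer
from synapse.ml.lightgbm import LightGBMRegressor

# Featurize the review text, then regress the star rating on those features.
model = Pipeline(stages=[
    TextFeaturizer(inputCol="text", outputCol="features"),
    LightGBMRegressor(featuresCol="features", labelCol="rating", numIterations=100),
]).fit(train)
```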
{
@@ -116,6 +130,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -126,7 +141,8 @@
}
},
"source": [
"# Step 3: Predict!"
"## Predict the output of the test data\n",
"Call the `transform` function on the model to predict and display the output of the test data as a dataframe."
]
},
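Scoring the held-out data is then a single call (a sketch, reusing the fitted `model` from the previous step):

```python
# Score the test set and show the predictions alongside the inputs.
display(model.transform(test))
```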
{
@@ -146,6 +162,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -156,7 +173,8 @@
}
},
"source": [
"# Alternate route: Let the Cognitive Services handle it"
"## Use Cognitive Services to transform data in one step\n",
"Alternatively, for these kinds of tasks that have a prebuilt solution, you can use SynapseML's integration with Cognitive Services to transform your data in one step."
]
},
{
@@ -181,7 +199,9 @@
"model = TextSentiment(\n",
" textCol=\"text\",\n",
" outputCol=\"sentiment\",\n",
" subscriptionKey=find_secret(\"cognitive-api-key\"),\n",
" subscriptionKey=find_secret(\n",
" \"cognitive-api-key\"\n",
" ), # Replace the call to find_secret with your key as a python string.\n",
").setLocation(\"eastus\")\n",
"\n",
"display(model.transform(test))"