Commit
docs: update notebooks - bring back fabric reviewers changes. (microsoft#1979)

* update doc for fabric

* use previous multivariate anomaly detection notebook

* revert change

* bring back reviewers changes

* use master isolationForestNotebook

* format and doc bug fix

* fix Multivariate Anomaly Detection doc version

* Update notebooks/features/cognitive_services/CognitiveServices - Multivariate Anomaly Detection.ipynb

Co-authored-by: Mark Hamilton <[email protected]>

* Update notebooks/features/lightgbm/LightGBM - Overview.ipynb

Co-authored-by: Mark Hamilton <[email protected]>

* Update notebooks/features/classification/Classification - Before and After SynapseML.ipynb

* Update notebooks/features/responsible_ai/Interpretability - Tabular SHAP explainer.ipynb

---------
JessicaXYWang committed Sep 14, 2023
1 parent e17756e commit c2ce58e
Showing 11 changed files with 615 additions and 955 deletions.
@@ -1,27 +1,50 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification - Before and After SynapseML\n",
"\n",
"### 1. Introduction\n",
"\n",
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>\n",
"\n",
"In this tutorial, we perform the same classification task in two\n",
"# Classification - before and after SynapseML"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"hide-synapse-internal"
]
},
"source": [
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this article, you perform the same classification task in two\n",
"different ways: once using plain **`pyspark`** and once using the\n",
"**`synapseml`** library. The two methods yield the same performance,\n",
"but one of the two libraries is drastically simpler to use and iterate\n",
"on (can you guess which one?).\n",
"but the comparison highlights the simplicity of using `synapseml` compared to `pyspark`.\n",
"\n",
"The task is simple: Predict whether a user's review of a book sold on\n",
"Amazon is good (rating > 3) or bad based on the text of the review. We\n",
"accomplish this by training LogisticRegression learners with different\n",
"The task is to predict whether a customer's review of a book sold on\n",
"Amazon is good (rating > 3) or bad based on the text of the review. You\n",
"accomplish it by training LogisticRegression learners with different\n",
"hyperparameters and choosing the best model."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"Import necessary Python libraries and get a spark session."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -35,12 +58,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Read the data\n",
"## Read the data\n",
"\n",
"We download and read in the data. We show a sample below:"
"Download and read in the data."
]
},
{
@@ -56,16 +80,16 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Extract more features and process data\n",
"## Extract features and process data\n",
"\n",
"Real data however is more complex than the above dataset. It is common\n",
"for a dataset to have features of multiple types: text, numeric,\n",
"categorical. To illustrate how difficult it is to work with these\n",
"datasets, we add two numerical features to the dataset: the **word\n",
"count** of the review and the **mean word length**."
"Real data is more complex than the above dataset. It's common\n",
"for a dataset to have features of multiple types, such as text, numeric, and\n",
"categorical. To illustrate how difficult it is to work with these\n",
"datasets, add two numerical features to the dataset: the **word count** of the review and the **mean word length**."
]
},
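The two hand-crafted features described in this cell can be sketched in plain Python. This is an illustrative stand-in, not the notebook's own code (which presumably computes the features with Spark UDFs); the function names are assumptions:

```python
# Sketch of the two numerical features added to the review dataset:
# the word count of a review and the mean word length.

def word_count(text):
    # Number of whitespace-separated tokens in the review.
    return len(text.split())

def mean_word_length(text):
    # Average length of the tokens; 0.0 for an empty review.
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

review = "great book highly recommended"
print(word_count(review))        # 4
print(mean_word_length(review))  # 6.5
```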
{
@@ -142,25 +166,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4a. Classify using pyspark\n",
"## Classify using pyspark\n",
"\n",
"To choose the best LogisticRegression classifier using the `pyspark`\n",
"library, need to *explicitly* perform the following steps:\n",
"library, you need to *explicitly* perform the following steps:\n",
"\n",
"1. Process the features:\n",
" * Tokenize the text column\n",
" * Hash the tokenized column into a vector using hashing\n",
" * Merge the numeric features with the vector in the step above\n",
" * Merge the numeric features with the vector\n",
"2. Process the label column: cast it into the proper type.\n",
"3. Train multiple LogisticRegression algorithms on the `train` dataset\n",
" with different hyperparameters\n",
"4. Compute the area under the ROC curve for each of the trained models\n",
" and select the model with the highest metric as computed on the\n",
" `test` dataset\n",
"5. Evaluate the best model on the `validation` set\n",
"\n",
"As you can see below, there is a lot of work involved and a lot of\n",
"steps where something can go wrong!"
"5. Evaluate the best model on the `validation` set"
]
},
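The five steps above boil down to featurize, then train many models and keep the best. A minimal plain-Python sketch of that logic; everything here is a hypothetical stand-in, not pyspark API (`hash_features` mimics what Tokenizer plus HashingTF do in step 1, and `train`/`auc` stand in for fitting LogisticRegression and computing area under the ROC curve):

```python
# Sketch of the pyspark workflow described above, with stand-ins.

def hash_features(text, num_buckets=20):
    # Step 1: tokenize, then hash tokens into a fixed-size count vector
    # (the "hashing trick"); 20 buckets is an illustrative choice.
    vec = [0] * num_buckets
    for token in text.lower().split():
        vec[hash(token) % num_buckets] += 1
    return vec

def train(reg_param, rows):
    # Stand-in for fitting a LogisticRegression with this hyperparameter.
    return {"regParam": reg_param}

def auc(model, rows):
    # Stand-in metric: pretend lighter regularization scores higher.
    return 1.0 / (1.0 + model["regParam"])

train_rows = [hash_features(t) for t in ["good book", "bad story"]]
models = [train(p, train_rows) for p in [0.05, 0.1, 0.2, 0.4]]  # step 3
best = max(models, key=lambda m: auc(m, "test"))                # step 4
print(best["regParam"])  # step 5 would rescore `best` on validation
```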
{
@@ -235,16 +256,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Classify using synapseml\n",
"## Classify using SynapseML\n",
"\n",
"Life is a lot simpler when using `synapseml`!\n",
"The pipeline can be simplified by using SynapseML:\n",
"\n",
"1. The **`TrainClassifier`** Estimator featurizes the data internally,\n",
" as long as the columns selected in the `train`, `test`, `validation`\n",
" dataset represent the features\n",
"\n",
"2. The **`FindBestModel`** Estimator finds the best model from a pool of\n",
" trained models by finding the model which performs best on the `test`\n",
" trained models by finding the model that performs best on the `test`\n",
" dataset given the specified metric\n",
"\n",
"3. The **`ComputeModelStatistics`** Transformer computes the different\n",
@@ -11,7 +11,17 @@
}
},
"source": [
"# A 5-minute tour of SynapseML"
"# Build your first SynapseML model\n",
"This tutorial provides a brief introduction to SynapseML. In particular, we use SynapseML to create two different pipelines for sentiment analysis. The first pipeline combines a text featurization stage with LightGBM regression to predict ratings based on review text from a dataset containing book reviews from Amazon. The second pipeline shows how to use prebuilt models through the Azure Cognitive Services to solve this problem without training data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up the environment\n",
"Import SynapseML libraries and initialize your Spark session."
]
},
{
@@ -39,6 +49,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
@@ -48,7 +59,8 @@
}
},
"source": [
"# Step 1: Load our Dataset"
"## Load a dataset\n",
"Load your dataset and split it into train and test sets."
]
},
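Spark notebooks typically split with `DataFrame.randomSplit`; a plain-Python analog of that weighted split is sketched below. The 85/15 weights and the seed are illustrative assumptions, not values taken from the notebook:

```python
import random

def random_split(rows, weights, seed=1):
    # Assign each row to exactly one bucket, chosen in proportion
    # to the normalized weights (analogous to randomSplit).
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    buckets = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                buckets[i].append(row)
                break
    return buckets

train, test = random_split(list(range(1000)), [0.85, 0.15])
print(len(train), len(test))  # roughly 850 / 150
```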
{
@@ -77,6 +89,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -87,7 +100,8 @@
}
},
"source": [
"# Step 2: Make our Model"
"## Create the training pipeline\n",
"Create a pipeline that featurizes data using `TextFeaturizer` from the `synapse.ml.featurize.text` library and derives a rating using the `LightGBMRegressor` function."
]
},
{
@@ -116,6 +130,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -126,7 +141,8 @@
}
},
"source": [
"# Step 3: Predict!"
"## Predict the output of the test data\n",
"Call the `transform` function on the model to predict and display the output of the test data as a dataframe."
]
},
{
@@ -146,6 +162,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -156,7 +173,8 @@
}
},
"source": [
"# Alternate route: Let the Cognitive Services handle it"
"## Use Cognitive Services to transform data in one step\n",
"Alternatively, for these kinds of tasks that have a prebuilt solution, you can use SynapseML's integration with Cognitive Services to transform your data in one step."
]
},
{
@@ -181,7 +199,9 @@
"model = TextSentiment(\n",
" textCol=\"text\",\n",
" outputCol=\"sentiment\",\n",
" subscriptionKey=find_secret(\"cognitive-api-key\"),\n",
" subscriptionKey=find_secret(\n",
" \"cognitive-api-key\"\n",
" ), # Replace the call to find_secret with your key as a python string.\n",
").setLocation(\"eastus\")\n",
"\n",
"display(model.transform(test))"