docs: update notebooks - bring back fabric reviewers changes. #1979

Merged
merged 14 commits into from
Jul 17, 2023
Changes from 12 commits
@@ -1,27 +1,50 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification - Before and After SynapseML\n",
"\n",
"### 1. Introduction\n",
"\n",
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>\n",
"\n",
"In this tutorial, we perform the same classification task in two\n",
"# Classification - before and after SynapseML"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"hide-synapse-internal"
]
},
"source": [
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this article, you perform the same classification task in two\n",
"different ways: once using plain **`pyspark`** and once using the\n",
"**`synapseml`** library. The two methods yield the same performance,\n",
"but one of the two libraries is drastically simpler to use and iterate\n",
"on (can you guess which one?).\n",
"but the comparison highlights the simplicity of using `synapseml` compared to `pyspark`.\n",
"\n",
"The task is simple: Predict whether a user's review of a book sold on\n",
"Amazon is good (rating > 3) or bad based on the text of the review. We\n",
"accomplish this by training LogisticRegression learners with different\n",
"The task is to predict whether a customer's review of a book sold on\n",
"Amazon is good (rating > 3) or bad based on the text of the review. You\n",
"accomplish it by training LogisticRegression learners with different\n",
"hyperparameters and choosing the best model."
]
},
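As a plain-Python illustration of the labeling rule stated in this cell (a review is good when its rating is greater than 3), the following minimal sketch may help. The function name `is_good_review` and the sample ratings are hypothetical and do not come from the notebook, whose own code cells are collapsed in this diff.

```python
def is_good_review(rating: float) -> bool:
    """Return True when a review counts as good, i.e. rating > 3."""
    return rating > 3


# Hypothetical sample ratings, for illustration only.
ratings = [1, 3, 4, 5]
labels = [is_good_review(r) for r in ratings]
```

Note that a rating of exactly 3 falls on the "bad" side of the threshold.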
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"Import necessary Python libraries and get a Spark session."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -35,12 +58,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Read the data\n",
"## Read the data\n",
"\n",
"We download and read in the data. We show a sample below:"
"Download and read in the data."
]
},
{
@@ -56,16 +80,16 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Extract more features and process data\n",
"## Extract features and process data\n",
"\n",
"Real data however is more complex than the above dataset. It is common\n",
"for a dataset to have features of multiple types: text, numeric,\n",
"categorical. To illustrate how difficult it is to work with these\n",
"datasets, we add two numerical features to the dataset: the **word\n",
"count** of the review and the **mean word length**."
"Real data is more complex than the above dataset. It's common\n",
"for a dataset to have features of multiple types, such as text, numeric, and\n",
"categorical. To illustrate how difficult it is to work with these\n",
"datasets, add two numerical features to the dataset: the **word count** of the review and the **mean word length**."
]
},
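The two numeric features named in this cell, word count and mean word length, can be sketched in plain Python as follows. This is only an illustration under the assumption that words are separated by whitespace; the helper names are hypothetical, and the notebook's actual Spark code is collapsed in this diff and may tokenize differently.

```python
def word_count(text: str) -> int:
    """Number of whitespace-separated words in the review text."""
    return len(text.split())


def mean_word_length(text: str) -> float:
    """Average number of characters per word; 0.0 for empty text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


# Hypothetical review text, for illustration only.
review = "A great book"
features = (word_count(review), mean_word_length(review))
```

Returning 0.0 for empty text is a design choice made here to avoid a division by zero; a real pipeline might instead drop or flag empty reviews.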
{
@@ -142,25 +166,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4a. Classify using pyspark\n",
"## Classify using pyspark\n",
"\n",
"To choose the best LogisticRegression classifier using the `pyspark`\n",
"library, need to *explicitly* perform the following steps:\n",
"library, you need to *explicitly* perform the following steps:\n",
"\n",
"1. Process the features:\n",
" * Tokenize the text column\n",
" * Hash the tokenized column into a vector using hashing\n",
" * Merge the numeric features with the vector in the step above\n",
" * Merge the numeric features with the vector\n",
"2. Process the label column: cast it into the proper type.\n",
"3. Train multiple LogisticRegression algorithms on the `train` dataset\n",
" with different hyperparameters\n",
"4. Compute the area under the ROC curve for each of the trained models\n",
" and select the model with the highest metric as computed on the\n",
" `test` dataset\n",
"5. Evaluate the best model on the `validation` set\n",
"\n",
"As you can see below, there is a lot of work involved and a lot of\n",
"steps where something can go wrong!"
"5. Evaluate the best model on the `validation` set"
]
},
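Steps 4 and 5 in the list above (score each trained model by area under the ROC curve on the `test` set, then keep the best one) can be sketched without Spark. This is a minimal pure-Python illustration, not the notebook's implementation: the helper names, the model names `model_a` and `model_b`, and the scores are all hypothetical. The AUC is computed from its pairwise definition, the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def select_best(candidates, test_labels):
    """Return the name of the candidate with the highest AUC on the test set."""
    return max(candidates, key=lambda name: roc_auc(test_labels, candidates[name]))


# Hypothetical test labels and per-model predicted scores.
test_labels = [1, 0, 1, 0]
candidates = {
    "model_a": [0.9, 0.2, 0.8, 0.1],
    "model_b": [0.6, 0.7, 0.4, 0.3],
}
best = select_best(candidates, test_labels)
```

The pairwise formula is quadratic in the number of examples, which is fine for a small illustration; on real data a library evaluator would normally be used instead.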
{
@@ -235,16 +256,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Classify using synapseml\n",
"## Classify using SynapseML\n",
"\n",
"Life is a lot simpler when using `synapseml`!\n",
"The steps needed with `synapseml` are simpler:\n",
"\n",
"1. The **`TrainClassifier`** Estimator featurizes the data internally,\n",
" as long as the columns selected in the `train`, `test`, `validation`\n",
" dataset represent the features\n",
"\n",
"2. The **`FindBestModel`** Estimator finds the best model from a pool of\n",
" trained models by finding the model which performs best on the `test`\n",
" trained models by finding the model that performs best on the `test`\n",
" dataset given the specified metric\n",
"\n",
"3. The **`ComputeModelStatistics`** Transformer computes the different\n",
@@ -11,7 +11,17 @@
}
},
"source": [
"# A 5-minute tour of SynapseML"
"# Build your first SynapseML model\n",
"This tutorial provides a brief introduction to SynapseML. In particular, we use SynapseML to create two different pipelines for sentiment analysis. The first pipeline combines a text featurization stage with LightGBM regression to predict ratings based on review text from a dataset containing book reviews from Amazon. The second pipeline shows how to use prebuilt models through the Azure Cognitive Services to solve this problem without training data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up the environment\n",
"Import SynapseML libraries and initialize your Spark session."
]
},
{
@@ -39,6 +49,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
@@ -48,7 +59,8 @@
}
},
"source": [
"# Step 1: Load our Dataset"
"## Load a dataset\n",
"Load your dataset and split it into train and test sets."
]
},
{
@@ -77,6 +89,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -87,7 +100,8 @@
}
},
"source": [
"# Step 2: Make our Model"
"## Create the training pipeline\n",
"Create a pipeline that featurizes data using `TextFeaturizer` from the `synapse.ml.featurize.text` library and predicts a rating using the `LightGBMRegressor` estimator."
]
},
{
@@ -116,6 +130,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -126,7 +141,8 @@
}
},
"source": [
"# Step 3: Predict!"
"## Predict the output of the test data\n",
"Call the `transform` function on the model to predict and display the output of the test data as a dataframe."
]
},
{
@@ -146,6 +162,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -156,7 +173,8 @@
}
},
"source": [
"# Alternate route: Let the Cognitive Services handle it"
"## Use Cognitive Services to transform data in one step\n",
"Alternatively, for these kinds of tasks that have a prebuilt solution, you can use SynapseML's integration with Cognitive Services to transform your data in one step."
]
},
{
@@ -181,7 +199,9 @@
"model = TextSentiment(\n",
" textCol=\"text\",\n",
" outputCol=\"sentiment\",\n",
" subscriptionKey=find_secret(\"cognitive-api-key\"),\n",
" subscriptionKey=find_secret(\n",
" \"cognitive-api-key\"\n",
" ), # Replace the call to find_secret with your key as a python string.\n",
").setLocation(\"eastus\")\n",
"\n",
"display(model.transform(test))"