diff --git a/docs/notebooks/balance_data_with_conditional_data_generation.ipynb b/docs/notebooks/balance_data_with_conditional_data_generation.ipynb deleted file mode 100644 index 95d24765..00000000 --- a/docs/notebooks/balance_data_with_conditional_data_generation.ipynb +++ /dev/null @@ -1,461 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UTRxpSlaczHY" - }, - "source": [ - "# Balancing datasets with conditional data generation\n", - "\n", - "Imbalanced datasets are a common problem in machine learning. There are several scenarios where an imbalanced dataset can lead to a suboptimal model. One scenario is when you're training a multi-class classifier and one or more of the classes have fewer training examples than the others. This can lead to a model that appears to perform well overall, when in reality its accuracy on the underrepresented classes is inferior to its accuracy on the well-represented classes.\n", - "\n", - "Another scenario is when the training data contains imbalanced demographic data. Part of what the Fair AI movement is about is ensuring that AI models perform equally well across all demographic slices.\n", - "\n", - "One approach to correcting representational bias in data is to condition Gretel's synthetic data model to generate more examples of the underrepresented classes.\n", - "\n", - "You can use this approach to replace the original data with a balanced synthetic dataset, or you can use it to augment the existing dataset, producing just enough synthetic data that, when added back into the original data, the imbalance is resolved.\n", - "\n", - "In this notebook, we're going to step you through how to use Gretel synthetics to resolve demographic bias in a dataset by adding a conditional `seed` task to the model configuration and generating records from an equal mix of demographic seed values. 
We will be creating a new synthetic dataset that can be used in place of the original one.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "An3JaXtu_15j" - }, - "source": [ - "## Begin by authenticating\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VEM6kjRsczHd" - }, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install -U gretel-client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "ZQ-TmAdwczHd", - "outputId": "4a8c2b52-950a-4c07-d9ee-b80293238f43" - }, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "\n", - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dDfOuvA5_15n" - }, - "source": [ - "## Load and view the dataset\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 424 - }, - "id": "YRTunFZ2_15n", - "outputId": "dc403944-03f8-4007-f47a-1d38eb1e81e9" - }, - "outputs": [], - "source": [ - "a = pd.read_csv(\n", - " \"https://gretel-public-website.s3.amazonaws.com/datasets/experiments/healthcare_dataset_a.csv\"\n", - ")\n", - "\n", - "a\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sLkVPQlh_15o" - }, - "source": [ - "## Isolate the fields that require balancing\n", - "\n", - "- We'll balance \"RACE\", \"ETHNICITY\", and \"GENDER\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "XN-KytoT_15p", - "outputId": "8d40c38d-80b7-4613-c206-e3d889c8cf69" - }, - "outputs": [], - "source": [ - "a[\"RACE\"].value_counts()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "sqpSM_EU_15q", - "outputId": "aba9a196-68ec-403d-b47f-9f4a358dc669" - }, - "outputs": [], - "source": [ - "a[\"ETHNICITY\"].value_counts()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "xZZ7v8Hf_15q", - "outputId": "3358425a-5d46-43a4-ad51-0f7915f463cb" - }, - "outputs": [], - "source": [ - "a[\"GENDER\"].value_counts()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1Eisd9JU_15r" - }, - "source": [ - "## Create a seed file\n", - "\n", - "- Create a csv with one column for each balance field and one record for each combination of the balance field values.\n", - "- Replicate the seeds to reach the desired synthetic data size.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "iOi2i3qr_15s" - }, - "outputs": [], - "source": [ - "import itertools\n", - "\n", - "# Choose your balance columns\n", - "balance_columns = [\"GENDER\", \"ETHNICITY\", \"RACE\"]\n", - "\n", - "# How many total synthetic records do you want\n", - "gen_lines = len(a)\n", - "\n", - "# Get the list of values for each seed field and the\n", - "# overall percent we'll need for each seed value combination\n", - "categ_val_lists = []\n", - "seed_percent = 1\n", - "for field in balance_columns:\n", - " values = set(pd.Series(a[field].dropna()))\n", - " category_cnt = len(values)\n", - 
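" # Each pass through this loop divides seed_percent by the field's\n", - " # category count, so after the loop seed_percent equals 1 / (number of\n", - " # distinct seed-value combinations); seed_gen_cnt below then gives each\n", - " # combination an equal share of the gen_lines records.\n", - 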
" categ_val_lists.append(list(values))\n", - " seed_percent = seed_percent * 1 / category_cnt\n", - "seed_gen_cnt = seed_percent * gen_lines\n", - "\n", - "# Get the combo seeds we'll need. This is all combinations of all\n", - "# seed field values\n", - "seed_fields = []\n", - "for combo in itertools.product(*categ_val_lists):\n", - " seed_dict = {}\n", - " i = 0\n", - " for field in balance_columns:\n", - " seed_dict[field] = combo[i]\n", - " i += 1\n", - " seed = {}\n", - " seed[\"seed\"] = seed_dict\n", - " seed[\"cnt\"] = seed_gen_cnt\n", - " seed_fields.append(seed)\n", - "\n", - "# Create a dataframe with the seed values used to condition the synthetic model\n", - "gender_all = []\n", - "ethnicity_all = []\n", - "race_all = []\n", - "for seed in seed_fields:\n", - " gender = seed[\"seed\"][\"GENDER\"]\n", - " ethnicity = seed[\"seed\"][\"ETHNICITY\"]\n", - " race = seed[\"seed\"][\"RACE\"]\n", - " cnt = seed[\"cnt\"]\n", - " for i in range(int(cnt)):\n", - " gender_all.append(gender)\n", - " ethnicity_all.append(ethnicity)\n", - " race_all.append(race)\n", - "\n", - "df_seed = pd.DataFrame(\n", - " {\"GENDER\": gender_all, \"ETHNICITY\": ethnicity_all, \"RACE\": race_all}\n", - ")\n", - "\n", - "# Save the seed dataframe to a file\n", - "seedfile = \"/tmp/balance_seeds.csv\"\n", - "df_seed.to_csv(seedfile, index=False, header=True)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VVaGfSFc_15t" - }, - "source": [ - "## Create a synthetic config file\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "BInkOazF_15u" - }, - "outputs": [], - "source": [ - "# Grab the default Synthetic Config file\n", - "from gretel_client.projects.models import read_model_config\n", - "\n", - "config = read_model_config(\"synthetics/default\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Z3hDdxFn_15u" - }, - "outputs": [], - "source": [ - "# Adjust the desired number of synthetic records to generated\n", - "\n", - "config[\"models\"][0][\"synthetics\"][\"generate\"][\"num_records\"] = len(a)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "uneHBVfN_15v" - }, - "outputs": [], - "source": [ - "# Adjust params for complex dataset\n", - "\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"data_upsample_limit\"] = 10000\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RR0AHEBR_15v" - }, - "source": [ - "## Include a seeding task in the config\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Qq-wkWq0_15v" - }, - "outputs": [], - "source": [ - "task = {\"type\": \"seed\", \"attrs\": {\"fields\": balance_columns}}\n", - "config[\"models\"][0][\"synthetics\"][\"task\"] = task\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IbDnimMH_15w" - }, - "source": [ - "## Train a synthetic model\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Yvf9EI85_15w", - "outputId": "bcbed207-3a60-478a-9e65-88d54a45c9b2" - }, - "outputs": [], - "source": [ - "from gretel_client import projects\n", - "from gretel_client.helpers import poll\n", - "\n", - "training_path = \"training_data.csv\"\n", - "a.to_csv(training_path)\n", - "\n", - "project = projects.create_or_get_unique_project(name=\"balancing-data-example\")\n", - "model = project.create_model_obj(model_config=config, data_source=training_path)\n", 
- "\n", - "model.submit_cloud()\n", - "poll(model)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "X--V8DHl_15w" - }, - "source": [ - "## Generate data using the balance seeds\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "PeZPWdXT_15x", - "outputId": "ec54477f-a64d-4686-f7ce-9a4b355ed53f" - }, - "outputs": [], - "source": [ - "rh = model.create_record_handler_obj(\n", - " data_source=seedfile, params={\"num_records\": len(df_seed)}\n", - ")\n", - "rh.submit_cloud()\n", - "poll(rh)\n", - "synth_df = pd.read_csv(rh.get_artifact_link(\"data\"), compression=\"gzip\")\n", - "synth_df.head()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GFoJ8niJ_15x" - }, - "source": [ - "## Validate the balanced demographic data\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "CXdorzf1_15x", - "outputId": "6732a6b0-b72f-48e0-db74-b7b0cdc40ff4" - }, - "outputs": [], - "source": [ - "synth_df[\"GENDER\"].value_counts()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "yxrQujl0_15x", - "outputId": "69ef1869-865e-4cff-e51e-c3447778619c" - }, - "outputs": [], - "source": [ - "synth_df[\"ETHNICITY\"].value_counts()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Ghc2mEQg_15y", - "outputId": "710efabf-b480-4dbb-f145-2b717c6a5a11" - }, - "outputs": [], - "source": [ - "synth_df[\"RACE\"].value_counts()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5152iEX1_15y" - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "name": "Gretel - Balancing datasets with conditional data generation", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/notebooks/balance_uci_heart_disease.ipynb b/docs/notebooks/balance_uci_heart_disease.ipynb deleted file mode 100644 index d763d7c8..00000000 --- a/docs/notebooks/balance_uci_heart_disease.ipynb +++ /dev/null @@ -1,419 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BujHsjP2zY6m" - }, - "source": [ - "This notebook demonstrates using Gretel.ai's conditional sampling to balance the gender attributes in a popular healthcare dataset, resulting in both better ML model accuracy, and potentially a more ethically fair training set.\n", - "\n", - "The Heart Disease dataset published by University of California Irvine is one of the top 5 datasets on the data science competition site Kaggle, with 9 data science tasks listed and 1,014+ notebook kernels created by data scientists. 
It consists of 14 health attributes and is labeled with whether or not the patient had heart disease, making it a great dataset for prediction.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hbBXoBVyvkZ4" - }, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install gretel_client xgboost" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "PR_EA4Z-v8WM", - "outputId": "89e66d2d-a793-4ba0-9c83-0ff8e67fe79e" - }, - "outputs": [], - "source": [ - "from gretel_client import configure_session\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 423 - }, - "id": "YMg9nX6SczHe", - "outputId": "0be46d67-6f51-47f2-8ed3-ca380744c280" - }, - "outputs": [], - "source": [ - "# Load and preview dataset\n", - "\n", - "import pandas as pd\n", - "\n", - "# Created from the Kaggle dataset using a 70/30 train/test split.\n", - "train = pd.read_csv(\n", - " \"https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/uci-heart-disease/heart_train.csv\"\n", - ")\n", - "test = pd.read_csv(\n", - " \"https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/uci-heart-disease/heart_test.csv\"\n", - ")\n", - "\n", - "train\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 560 - }, - "id": "BTeNPvgKvkZ6", - "outputId": "d5c4c979-918c-4a48-d959-f8d47d937706" - }, - "outputs": [], - "source": [ - "# Plot distributions in real world data\n", - "\n", - "pd.options.plotting.backend = \"plotly\"\n", - "\n", - "df = train.sex.copy()\n", - "df = df.replace(0, \"female\").replace(1, \"male\")\n", - "\n", - "print(\n", - " f\"We will need to augment the training set with an additional {train.sex.value_counts()[1] - train.sex.value_counts()[0]} records to balance the gender class\"\n", - ")\n", - "df.value_counts().sort_values().plot(kind=\"barh\", title=\"Real world distribution\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "tvKsT56cjOFO", - "outputId": "b0ed60db-3f8d-419f-f32f-32b680164fdd" - }, - "outputs": [], - "source": [ - "# Train a synthetic model on the training set\n", - "\n", - "from gretel_client import projects\n", - "from gretel_client.projects.models import read_model_config\n", - "from gretel_client.helpers import poll\n", - "\n", - "# Create a project and model configuration.\n", - "project = projects.create_or_get_unique_project(name=\"uci-heart-disease\")\n", - "\n", - "config = read_model_config(\"synthetics/default\")\n", - "\n", - "# Here we prepare an object to specify the conditional data generation task.\n", - "fields = [\"sex\"]\n", - "task = {\"type\": \"seed\", \"attrs\": {\"fields\": fields}}\n", - "config[\"models\"][0][\"synthetics\"][\"task\"] = task\n", - "config[\"models\"][0][\"synthetics\"][\"generate\"] = {\"num_records\": 500}\n", - "config[\"models\"][0][\"synthetics\"][\"privacy_filters\"] = {\n", - " \"similarity\": None,\n", - " \"outliers\": None,\n", - "}\n", - "\n", - "\n", - "# Fit the model on the training set\n", - "model = project.create_model_obj(model_config=config)\n", - "train.to_csv(\"train.csv\", index=False)\n", - "model.data_source = 
\"train.csv\"\n", - "model.submit_cloud()\n", - "\n", - "poll(model)\n", - "\n", - "synthetic = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "synthetic\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "VJMSsKsJj52c", - "outputId": "9a29ff2f-660e-4569-d2d7-3130192581e4" - }, - "outputs": [], - "source": [ - "# Conditionaly sample records from the synthetic data model using `seeds`\n", - "# to augment the real world training data\n", - "\n", - "\n", - "num_rows = 5000\n", - "seeds = pd.DataFrame(index=range(num_rows), columns=[\"sex\"]).fillna(0)\n", - "delta = train.sex.value_counts()[1] - train.sex.value_counts()[0]\n", - "seeds[\"sex\"][int((num_rows + delta) / 2) :] = 1\n", - "seeds.sample(frac=1).to_csv(\"seeds.csv\", index=False)\n", - "\n", - "rh = model.create_record_handler_obj(\n", - " data_source=\"seeds.csv\", params={\"num_records\": len(seeds)}\n", - ")\n", - "rh.submit_cloud()\n", - "\n", - "poll(rh)\n", - "\n", - "synthetic = pd.read_csv(rh.get_artifact_link(\"data\"), compression=\"gzip\")\n", - "augmented = pd.concat([synthetic, train])\n", - "augmented\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 560 - }, - "id": "ZG3TEyfxvkZ8", - "outputId": "8689cafd-019f-4880-bb0f-b260895af564" - }, - "outputs": [], - "source": [ - "# Plot distributions in the synthetic data\n", - "\n", - "\n", - "print(\n", - " f\"Augmented synthetic dataset with an additional {delta} records to balance gender class\"\n", - ")\n", - "df = augmented.sex.copy()\n", - "df = df.replace(0, \"female\").replace(1, \"male\")\n", - "df.value_counts().sort_values().plot(\n", - " kind=\"barh\", title=\"Augmented dataset distribution\"\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 756 - }, - "id": "f-nDGh46vkZ8", - "outputId": "5716d609-e1c4-46f5-9add-a8d6910ef556" - }, - "outputs": [], - "source": [ - "# Compare real world vs. 
synthetic accuracies using popular classifiers\n", - "\n", - "import matplotlib.pyplot as plt\n", - "\n", - "from sklearn.ensemble import RandomForestClassifier\n", - "from sklearn.neighbors import KNeighborsClassifier\n", - "from sklearn.tree import DecisionTreeClassifier\n", - "from sklearn.svm import SVC\n", - "from xgboost import XGBClassifier\n", - "\n", - "import plotly.express as px\n", - "\n", - "\n", - "def classification_accuracy(data_type, dataset, test) -> dict:\n", - "\n", - " accuracies = []\n", - " x_cols = [\n", - " \"age\",\n", - " \"sex\",\n", - " \"cp\",\n", - " \"trestbps\",\n", - " \"chol\",\n", - " \"fbs\",\n", - " \"restecg\",\n", - " \"thalach\",\n", - " \"exang\",\n", - " \"oldpeak\",\n", - " \"slope\",\n", - " \"ca\",\n", - " \"thal\",\n", - " ]\n", - " y_col = \"target\"\n", - "\n", - " rf = RandomForestClassifier(n_estimators=1000, random_state=1)\n", - " rf.fit(dataset[x_cols], dataset[y_col])\n", - " acc = rf.score(test[x_cols], test[y_col]) * 100\n", - " accuracies.append([data_type, \"RandomForest\", acc])\n", - " print(\" -- Random Forest: {:.2f}%\".format(acc))\n", - "\n", - " svm = SVC(random_state=1)\n", - " svm.fit(dataset[x_cols], dataset[y_col])\n", - " acc = svm.score(test[x_cols], test[y_col]) * 100\n", - " accuracies.append([data_type, \"SVM\", acc])\n", - " print(\" -- SVM: {:.2f}%\".format(acc))\n", - "\n", - " knn = KNeighborsClassifier(n_neighbors=2) # n_neighbors means k\n", - " knn.fit(dataset[x_cols], dataset[y_col])\n", - " acc = knn.score(test[x_cols], test[y_col]) * 100\n", - " accuracies.append([data_type, \"KNN\", acc])\n", - " print(\" -- KNN: {:.2f}%\".format(acc))\n", - "\n", - " dtc = DecisionTreeClassifier()\n", - " dtc.fit(dataset[x_cols], dataset[y_col])\n", - " acc = dtc.score(test[x_cols], test[y_col]) * 100\n", - " accuracies.append([data_type, \"DecisionTree\", acc])\n", - " print(\" -- Decision Tree Test Accuracy {:.2f}%\".format(acc))\n", - "\n", - " xgb = XGBClassifier(use_label_encoder=False, eval_metric=\"error\")\n", - " xgb.fit(dataset[x_cols], dataset[y_col])\n", - " acc = xgb.score(test[x_cols], test[y_col]) * 100\n", - " accuracies.append([data_type, \"XGBoost\", acc])\n", - " print(\" -- XGBoostClassifier: {:.2f}%\".format(acc))\n", - "\n", - " return accuracies\n", - "\n", - "\n", - "print(\"Calculating real world accuracies\")\n", - "realworld_acc = classification_accuracy(\"real world\", train, test)\n", - "print(\"Calculating synthetic accuracies\")\n", - "synthetic_acc = classification_accuracy(\"synthetic\", augmented, test)\n", - "\n", - "comparison = pd.DataFrame(\n", - " realworld_acc + synthetic_acc, columns=[\"data_type\", \"algorithm\", \"acc\"]\n", - ")\n", - "colours = {\n", - " \"synthetic\": \"#3EC1CD\",\n", - " \"synthetic1\": \"#FCB94D\",\n", - " \"real world\": \"#9ee0e6\",\n", - " \"real world1\": \"#fddba5\",\n", - "}\n", - "\n", - "fig = px.bar(\n", - " comparison,\n", - " x=\"algorithm\",\n", - " y=\"acc\",\n", - " color=\"data_type\",\n", - " color_discrete_map=colours,\n", - " barmode=\"group\",\n", - " text_auto=\".4s\",\n", - " title=\"Real World vs. Synthetic Data for all classes\",\n", - ")\n", - "fig.update_layout(legend_title_text=\"Real world v. 
Synthetic\")\n", - "fig.show()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "z8XG1abginmY", - "outputId": "5d1ae12f-6cdc-45d7-9198-ef8abee12e46" - }, - "outputs": [], - "source": [ - "print(\"Calculating real world class accuracies\")\n", - "realworld_male = classification_accuracy(\n", - " \"realworld_male\", train, test.loc[test[\"sex\"] == 1]\n", - ")\n", - "realworld_female = classification_accuracy(\n", - " \"realworld_female\", train, test.loc[test[\"sex\"] == 0]\n", - ")\n", - "print(\"Calculating synthetic class accuracies\")\n", - "synthetic_male = classification_accuracy(\n", - " \"synthetic_male\", augmented, test.loc[test[\"sex\"] == 1]\n", - ")\n", - "synthetic_female = classification_accuracy(\n", - " \"synthetic_female\", augmented, test.loc[test[\"sex\"] == 0]\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 542 - }, - "id": "5xky1T471Gec", - "outputId": "7def9d19-34e4-4df4-e7c3-9dd9e9f6b8bb" - }, - "outputs": [], - "source": [ - "# Plot male (majority class) heart disease detection accuracies (real world vs. synthetic)\n", - "colours = {\n", - " \"synthetic_male\": \"#3EC1CD\",\n", - " \"synthetic_female\": \"#FCB94D\",\n", - " \"realworld_male\": \"#9ee0e6\",\n", - " \"realworld_female\": \"#fddba5\",\n", - "}\n", - "\n", - "comparison = pd.DataFrame(\n", - " realworld_male + synthetic_male + realworld_female + synthetic_female,\n", - " columns=[\"data_type\", \"algorithm\", \"acc\"],\n", - ")\n", - "fig = px.bar(\n", - " comparison,\n", - " x=\"algorithm\",\n", - " y=\"acc\",\n", - " color=\"data_type\",\n", - " color_discrete_map=colours,\n", - " barmode=\"group\",\n", - " text_auto=\".4s\",\n", - " title=\"Real World vs. Synthetic Accuracy for Male and Female Heart Disease Detection\",\n", - ")\n", - "fig.update_layout(legend_title_text=\"Real world v. 
Synthetic\")\n", - "fig.show()\n" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "include_colab_link": true, - "name": "balance_uci_heart_disease", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/notebooks/boost_massively_imbalanced_set.ipynb b/docs/notebooks/boost_massively_imbalanced_set.ipynb deleted file mode 100644 index db276dad..00000000 --- a/docs/notebooks/boost_massively_imbalanced_set.ipynb +++ /dev/null @@ -1,468 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "6f65414e", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2a2362aa", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install pyyaml numpy pandas sklearn smart_open xgboost\n", - "!pip install -U gretel-client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "47ce9e55", - "metadata": {}, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "\n", - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "31028201", - "metadata": {}, - "outputs": [], - "source": [ - "# Create imbalanced train and test data\n", - "# We will use sklearn's make_classification to create a test dataset.\n", - "# Or, load your own dataset as a Pandas DataFrame.\n", - "\n", - "CLASS_COLUMN = \"Class\" # the labeled classification column\n", - "CLASS_VALUE = 1 # the minority classification label to boost\n", - "MAX_NEIGHBORS = 5 # number of KNN neighbors to use per positive datapoint\n", - "SYNTHETIC_PERCENT = 10 # generate SYNTHETIC_PERCENT records vs. 
source data\n", - "\n", - "# Create imbalanced test dataset\n", - "import pandas as pd\n", - "from sklearn.datasets import make_classification\n", - "from sklearn.model_selection import train_test_split\n", - "\n", - "\n", - "n_features = 15\n", - "n_recs = 10000\n", - "\n", - "\n", - "def create_dataset(n_features: int) -> pd.DataFrame:\n", - " \"\"\"Use sklearn to create a massively imbalanced dataset\"\"\"\n", - " X, y = make_classification(\n", - " n_samples=n_recs,\n", - " n_features=n_features,\n", - " n_informative=10,\n", - " n_classes=2,\n", - " weights=[0.95],\n", - " flip_y=0.0,\n", - " random_state=42,\n", - " )\n", - "\n", - " df = pd.DataFrame(X, columns=[f\"feature_{x}\" for x in range(n_features)])\n", - " df = df.round(6)\n", - " df[CLASS_COLUMN] = y\n", - " return df\n", - "\n", - "\n", - "dataset = create_dataset(n_features=n_features)\n", - "train, test = train_test_split(dataset, test_size=0.2)\n", - "\n", - "train.head()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c69188ab", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "from sklearn.neighbors import NearestNeighbors\n", - "\n", - "# Split positive and negative datasets\n", - "positive = train[train[CLASS_COLUMN] == CLASS_VALUE]\n", - "print(f\"Positive records shape (rows, columns): {positive.shape}\")\n", - "\n", - "# Train a nearest neighbor model on the negative dataset\n", - "neighbors = NearestNeighbors(n_neighbors=MAX_NEIGHBORS, algorithm=\"ball_tree\")\n", - "neighbors.fit(train)\n", - "\n", - "# Locate the nearest neighbors to the positive (minority) set,\n", - "# and add to the training set.\n", - "nn = neighbors.kneighbors(positive, MAX_NEIGHBORS, return_distance=False)\n", - "nn_idx = list(set([item for sublist in nn for item in sublist]))\n", - "nearest_neighbors = train.iloc[nn_idx, :]\n", - "\n", - "oversample = pd.concat([positive] * 5)\n", - "training_set = pd.concat([oversample, nearest_neighbors]).sample(frac=1)\n", - "\n", - "training_set.head()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dabde917", - "metadata": {}, - "outputs": [], - "source": [ - "from smart_open import open\n", - "import yaml\n", - "\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "from gretel_client.helpers import poll\n", - "\n", - "# Create a project and model configuration.\n", - "project = create_or_get_unique_project(name=\"boost-imbalanced-synthetic\")\n", - "\n", - "# If you want to use a different config or modify it before creating the model,\n", - "# try something like this (yes, we have other stock configs in that repo)\n", - "# from gretel_client.projects.models import read_model_config\n", - "# config = read_model_config(\"synthetics/default\")\n", - "\n", - "# Get a csv to work with, just dump out the training_set.\n", - "training_set.to_csv(\"train.csv\", index=False)\n", - "\n", - "# Here we just use a shortcut to specify the default synthetics config.\n", - "# Yes, you can use other shortcuts to point at some of the other stock configs.\n", - "model = project.create_model_obj(\n", - " model_config=\"synthetics/default\", data_source=\"train.csv\"\n", - ")\n", - "\n", - "\n", - "# Upload the training data. 
Train the model.\n", - "model.submit_cloud()\n", - "poll(model)\n", - "\n", - "recs_to_generate = int(len(dataset.values) * (SYNTHETIC_PERCENT / 100.0))\n", - "\n", - "# Use the model to generate synthetic data.\n", - "record_handler = model.create_record_handler_obj(\n", - " params={\"num_records\": recs_to_generate, \"max_invalid\": recs_to_generate}\n", - ")\n", - "record_handler.submit_cloud()\n", - "\n", - "poll(record_handler)\n", - "\n", - "synthetic_df = pd.read_csv(record_handler.get_artifact_link(\"data\"), compression=\"gzip\")\n", - "synthetic = synthetic_df[\n", - " synthetic_df[CLASS_COLUMN] == CLASS_VALUE\n", - "] # Keep only positive examples\n", - "\n", - "synthetic.head()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "28cdc48b", - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.concat(\n", - " [\n", - " train.assign(Type=\"train\"),\n", - " test.assign(Type=\"test\"),\n", - " synthetic.assign(Type=\"synthetic\"),\n", - " ]\n", - ")\n", - "df.reset_index(inplace=True)\n", - "df.to_csv(\"combined-boosted-df.csv\")\n", - "project.upload_artifact(\"combined-boosted-df.csv\")\n", - "\n", - "# Save to local CSV\n", - "synthetic.to_csv(\"boosted-synthetic.csv\", index=False)\n", - "project.upload_artifact(\"boosted-synthetic.csv\")\n", - "\n", - "print(f\"View this project at: {project.get_console_url()}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a2a811d5", - "metadata": {}, - "outputs": [], - "source": [ - "# Visualize distribution of positive and negative examples in our\n", - "# normal vs. boosted datasets\n", - "\n", - "%matplotlib inline\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "\n", - "\n", - "def visualize_distributions(test: pd.DataFrame, train: pd.DataFrame, synthetic: pd.DataFrame):\n", - " \"\"\"Plot the distribution of positive (e.g. fraud) vs. negative\n", - " (e.g. non-fraud) examples. 
\n", - " \"\"\"\n", - " fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))\n", - " fig = plt.figure(1, figsize=(12, 9))\n", - "\n", - " dataframes = {\n", - " \"test\": test,\n", - " \"train\": train,\n", - " \"boosted\": pd.concat([train, synthetic])\n", - " }\n", - "\n", - " idx = 0\n", - " for name, df in dataframes.items():\n", - " df.Class.value_counts().plot.bar(ax=axes[idx], title=name)\n", - " idx+=1\n", - "\n", - "visualize_distributions(test, train, synthetic)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ba4dcd2b", - "metadata": {}, - "outputs": [], - "source": [ - "## Use PCA to visualize highly dimensional data\n", - "\n", - "# We will label each data class as:\n", - "# * Training negative: 0\n", - "# * Training positive: 1\n", - "# * Synthetic positive: 2 (our synthetic data points used to boost training data)\n", - "# * Test positive: 3 (not cheating here, we already trained the classifier)\n", - "\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.decomposition import PCA\n", - "\n", - "\n", - "def create_visualization_dataframe(train: pd.DataFrame) -> pd.DataFrame:\n", - " # Build a new visualization dataframe from our training data\n", - " train_vis = train\n", - "\n", - " # Add in positive synthetic results\n", - " train_vis = pd.merge(train, synthetic, indicator=True, how=\"outer\")\n", - " train_vis.loc[(train_vis._merge == \"right_only\"), \"Class\"] = 2\n", - " train_vis = train_vis.drop(columns=[\"_merge\"])\n", - "\n", - " # Add in positive results from the test set\n", - " train_vis = pd.merge(\n", - " train_vis, test[test[\"Class\"] == 1], indicator=True, how=\"outer\"\n", - " )\n", - " train_vis.loc[\n", - " (train_vis._merge == \"right_only\") | (train_vis._merge == \"both\"), \"Class\"\n", - " ] = 3\n", - " train_vis = train_vis.drop(columns=[\"_merge\"])\n", - " return train_vis\n", - "\n", - "\n", - "def visualize_pca_2d(train_vis: pd.DataFrame):\n", - " X = train_vis.iloc[:, :-1]\n", - " y = train_vis[\"Class\"]\n", - "\n", - " fig = plt.figure(1, figsize=(12, 9))\n", - " plt.clf()\n", - " plt.cla()\n", - "\n", - " pca = PCA(n_components=2)\n", - " x_std = StandardScaler().fit_transform(X)\n", - " projected = pca.fit_transform(x_std)\n", - "\n", - " labels = [\"Train Negative\", \"Train Positive\", \"Synthetic Positive\", \"Test Positive\"]\n", - " size_map = {0: 25, 1: 50, 2: 75, 3: 50}\n", - " sizes = [size_map[x] for x in y]\n", - "\n", - " scatter = plt.scatter(\n", - " projected[:, 0], projected[:, 1], c=y, s=sizes, cmap=plt.cm.plasma, alpha=0.8\n", - " )\n", - " plt.title = f\"PCA plot of {n_features}-dimension classification dataset\"\n", - " plt.legend(handles=scatter.legend_elements()[0], labels=labels)\n", - " plt.show()\n", - "\n", - "\n", - "# Visualize PCA distribution in 2D\n", - "train_vis = create_visualization_dataframe(train)\n", - "visualize_pca_2d(train_vis)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a020a8d7", - "metadata": {}, - "outputs": [], - "source": [ - "# Plot PCA scatter in 3 dimensions\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "from mpl_toolkits.mplot3d import Axes3D\n", - "from sklearn import datasets\n", - "from sklearn.decomposition import PCA\n", - "from sklearn import decomposition\n", - "from sklearn import datasets\n", - "\n", - "\n", - "def visualize_pca_3d(train_vis: pd.DataFrame):\n", - " X = train_vis.iloc[:, :-1]\n", - " y = train_vis[\"Class\"]\n", - "\n", - " np.random.seed(5)\n", - "\n", - " 
fig = plt.figure(1, figsize=(12, 9))\n", - " plt.clf()\n", - " ax = Axes3D(fig, rect=[0, 0, 0.95, 1], elev=48, azim=134)\n", - " plt.cla()\n", - " pca = decomposition.PCA(n_components=3)\n", - " labels = [\"Train Negative\", \"Train Positive\", \"Synthetic Positive\", \"Test Positive\"]\n", - " size_map = {0: 25, 1: 50, 2: 75, 3: 50}\n", - " sizes = [size_map[x] for x in y]\n", - "\n", - " pca.fit(X)\n", - " X = pca.transform(X)\n", - "\n", - " scatter = ax.scatter(\n", - " X[:, 0], X[:, 1], X[:, 2], c=y, s=sizes, cmap=plt.cm.plasma, alpha=1.0\n", - " )\n", - "\n", - " plt.legend(handles=scatter.legend_elements()[0], labels=labels)\n", - " plt.show()\n", - "\n", - "\n", - "# Visualize PCA distribution in 3D\n", - "visualize_pca_3d(train_vis)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bb88db63", - "metadata": {}, - "outputs": [], - "source": [ - "# Train an XGBoost model and compare accuracies on the original (normal)\n", - "# vs. augmented training data (train + synthetic) datasets.\n", - "\n", - "from sklearn.metrics import accuracy_score\n", - "from xgboost import XGBClassifier\n", - "\n", - "\n", - "def train_classifier(name: str, train: pd.DataFrame, test: pd.DataFrame):\n", - " \"\"\"Train our predictor with XGBoost\"\"\"\n", - "\n", - " # Encode labels and categorical variables before training prediction model\n", - " X_train = train.iloc[:, :-1]\n", - " y_train = train[\"Class\"]\n", - " X_test = test.iloc[:, :-1]\n", - " y_test = test[\"Class\"]\n", - "\n", - " model = XGBClassifier()\n", - " model.fit(X_train, y_train)\n", - " y_pred = model.predict(X_test)\n", - " accuracy = accuracy_score(y_test, y_pred)\n", - " np.set_printoptions(precision=2)\n", - " print(\"%s : XGBoost Model prediction accuracy: %.2f%%\" % (name, accuracy * 100.0))\n", - " return model, y_pred\n", - "\n", - "\n", - "# Train models on normal and augmented data\n", - "model_normal, y_pred = train_classifier(\"normal\", train, test)\n", - "model_boosted, y_pred = train_classifier(\"boosted\", pd.concat([train, synthetic]), test)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6d82610f", - "metadata": {}, - "outputs": [], - "source": [ - "# A confusion matrix gives better insight into per-class performance\n", - "# than overall model accuracy.\n", - "\n", - "# As a thought experiment, consider creating a model to predict whether\n", - "# an account will submit an insurance claim. Our goal is to maximize\n", - "# accuracy at predicting the minority (positive) set, above those who\n", - "# will not submit a claim. 
Try to maximize the diagonal (TP) elements of the\n", - "# confusion matrix, particularly the bottom right.\n", - "\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "from sklearn.metrics import plot_confusion_matrix\n", - "\n", - "\n", - "def print_confusion_matrix(name: str, model: pd.DataFrame, test: pd.DataFrame):\n", - " \"\"\"Plot normalized and non-normalized confusion matrices\"\"\"\n", - " print(\"\")\n", - " print(\"\")\n", - " print(f\"Plotting confusion matrices for: {name} model\")\n", - " fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))\n", - " fig = plt.figure(1, figsize=(12, 9))\n", - " X_test = test.iloc[:, :-1]\n", - " y_test = test[\"Class\"]\n", - "\n", - " titles_options = [\n", - " (f\"{name} : Confusion matrix, without normalization\", None),\n", - " (f\"{name} : Normalized confusion matrix\", \"true\"),\n", - " ]\n", - "\n", - " idx = 0\n", - " for title, normalize in titles_options:\n", - " disp = plot_confusion_matrix(\n", - " model,\n", - " X_test,\n", - " y_test,\n", - " display_labels=[\"Negative\", \"Positive\"],\n", - " cmap=plt.cm.Blues,\n", - " normalize=normalize,\n", - " ax=axes[idx],\n", - " )\n", - " disp.ax_.set_title(title)\n", - " idx += 1\n", - "\n", - " plt.show()\n", - "\n", - "\n", - "print_confusion_matrix(\"normal\", model_normal, test)\n", - "print_confusion_matrix(\"boosted\", model_boosted, test)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/notebooks/create_synthetic_data_from_time_series.ipynb b/docs/notebooks/create_synthetic_data_from_time_series.ipynb deleted file mode 100644 index 9c72e8a1..00000000 --- a/docs/notebooks/create_synthetic_data_from_time_series.ipynb +++ /dev/null @@ -1,264 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GUnwBU_zYz2D" - }, - "source": [ - "# Synthesize Time Series data from your own DataFrame\n", - "\n", - "This Blueprint demonstrates how to create synthetic time series data with Gretel. We assume that within the dataset\n", - "there is at least:\n", - "\n", - "1. A specific column holding time data points\n", - "\n", - "2. 
One or more columns that contain measurements or numerical observations for each point in time.\n", - "\n", - "For this Blueprint, we will generate a very simple sine wave as our time series data.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "b4-JFrb-Yz2G" - }, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "!pip install numpy matplotlib pandas\n", - "!pip install -U gretel-client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pHShf3MdYz2I" - }, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "\n", - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "moLu6jA3Yz2I" - }, - "outputs": [], - "source": [ - "# Create a simple timeseries with a sine and cosine wave\n", - "\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "\n", - "day = 24 * 60 * 60\n", - "year = 365.2425 * day\n", - "\n", - "\n", - "def load_dataframe() -> pd.DataFrame:\n", - " \"\"\"Create a time series x sin wave dataframe.\"\"\"\n", - " df = pd.DataFrame(columns=[\"date\", \"sin\", \"cos\", \"const\"])\n", - "\n", - " df.date = pd.date_range(start=\"2017-01-01\", end=\"2021-07-01\", freq=\"4h\")\n", - " df.sin = 1 + np.sin(df.date.astype(\"int64\") // 1e9 * (2 * np.pi / year))\n", - " df.sin = (df.sin * 100).round(2)\n", - "\n", - " df.cos = 1 + np.cos(df.date.astype(\"int64\") // 1e9 * (2 * np.pi / year))\n", - " df.cos = (df.cos * 100).round(2)\n", - "\n", - " df.date = df.date.apply(lambda d: d.strftime(\"%Y-%m-%d\"))\n", - "\n", - " df.const = \"abcxyz\"\n", - "\n", - " return df\n", - "\n", - "\n", - "train_df = load_dataframe()\n", - "train_df.set_index(\"date\").plot(figsize=(12, 8))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "p7IlPWWPb38C" - }, - "source": [ - "# Fine-tuning hyperparameters for time series\n", - "\n", - "In this cell, we define the `date` field as the time_field for our task, and `sin` and `cos` as trend fields where we wish to model the differences between each time step.\n", - "\n", - "## Hyperparameters\n", - "\n", - "- `vocab_size` is set to 0 to use character-based tokenization vs. sentencepiece\n", - "- `predict_batch_size` is set to 1, which reduces generation speed but maximizes the model's ability to replay long-term dependencies from the training sequences\n", - "- `validation_split` is set to False, as randomly sampled time-series records will have an information leakage problem between the train and test sets.\n", - "- `learning_rate` is set to 0.001, which increases training time but gives the model additional time to learn.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "F1q3ighmYz2J" - }, - "outputs": [], - "source": [ - "from gretel_client.projects import create_or_get_unique_project\n", - "from gretel_client.helpers import poll\n", - "from gretel_client.projects.models import read_model_config\n", - "\n", - "\n", - "# Create a project and model configuration.\n", - "project = create_or_get_unique_project(name=\"time-series-synthetic\")\n", - "\n", - "# Pull down the default synthetic config. 
We will modify it slightly.\n", - "config = read_model_config(\"synthetics/default\")\n", - "\n", - "\n", - "# Here we create an object to specify the timeseries task.\n", - "time_field = \"date\"\n", - "trend_fields = [\"sin\", \"cos\"]\n", - "\n", - "task = {\n", - " \"type\": \"time_series\",\n", - " \"attrs\": {\"time_field\": time_field, \"trend_fields\": trend_fields},\n", - "}\n", - "\n", - "config[\"models\"][0][\"synthetics\"][\"task\"] = task\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"epochs\"] = 100\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"vocab_size\"] = 0\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"learning_rate\"] = 1e-3\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"predict_batch_size\"] = 1\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"validation_split\"] = False\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"reset_states\"] = True\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"overwrite\"] = True\n", - "config[\"models\"][0][\"synthetics\"][\"generate\"][\"num_records\"] = train_df.shape[0]\n", - "config[\"models\"][0][\"synthetics\"][\"generate\"][\"max_invalid\"] = train_df.shape[0]\n", - "\n", - "# Get a csv to work with, just dump out the train_df.\n", - "train_df.to_csv(\"train.csv\", index=False)\n", - "\n", - "model = project.create_model_obj(model_config=config, data_source=\"train.csv\")\n", - "\n", - "# Upload the training data. Train the model.\n", - "model.submit_cloud()\n", - "poll(model)\n", - "\n", - "synthetic = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "synthetic\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DoT24lMpYz2K" - }, - "outputs": [], - "source": [ - "# Does the synthetic data look similar? Yep!\n", - "fig, axs = plt.subplots(1, 2, figsize=(20, 6))\n", - "for k, v in enumerate(trend_fields):\n", - " train_df[[\"date\", v]].set_index(\"date\").plot(ax=axs[k], ls=\"--\")\n", - " synthetic[[\"date\", v]].set_index(\"date\").plot(ax=axs[k], alpha=0.7)\n", - " axs[k].legend([\"training\", \"synthetic\"], loc=\"lower right\")\n", - " axs[k].set_title(v)\n", - "plt.show()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "zfe_3m68ajwn" - }, - "outputs": [], - "source": [ - "# For time series data we dump out the date column to seed the record handler.\n", - "train_df[\"date\"].to_csv(\"date_seeds.csv\", index=False)\n", - "\n", - "# Use the model to generate more synthetic data.\n", - "record_handler = model.create_record_handler_obj(\n", - " params={\"num_records\": 5000, \"max_invalid\": 5000},\n", - " data_source=\"date_seeds.csv\",\n", - ")\n", - "\n", - "record_handler.submit_cloud()\n", - "\n", - "poll(record_handler)\n", - "\n", - "# Create a second synthetic dataframe\n", - "synthetic_2 = pd.read_csv(record_handler.get_artifact_link(\"data\"), compression=\"gzip\")\n", - "synthetic_2\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wZxrdBOdaxxk" - }, - "outputs": [], - "source": [ - "# Does the synthetic data look similar? 
Yep!\n", - "fig, axs = plt.subplots(1, 2, figsize=(20, 6))\n", - "for k, v in enumerate(trend_fields):\n", - " train_df[[\"date\", v]].set_index(\"date\").plot(ax=axs[k], ls=\"--\")\n", - " synthetic[[\"date\", v]].set_index(\"date\").plot(ax=axs[k], alpha=0.7)\n", - " synthetic_2[[\"date\", v]].set_index(\"date\").plot(ax=axs[k], alpha=0.7)\n", - " axs[k].legend([\"training\", \"synthetic\", \"synthetic_2\"], loc=\"lower right\")\n", - " axs[k].set_title(v)\n", - "plt.show()\n" - ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "name": "create_synthetic_data_from_time_series.ipynb", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/docs/notebooks/local_classify.ipynb b/docs/notebooks/local_classify.ipynb deleted file mode 100644 index 9ebd6155..00000000 --- a/docs/notebooks/local_classify.ipynb +++ /dev/null @@ -1,207 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Classify and label content locally\n", - "\n", - "This notebook walks through training a classification model and labeling PII locally in your environment.\n", - "\n", - "Follow the instructions here to set up your local environment: https://docs.gretel.ai/environment-setup\n", - "\n", - "Prerequisites:\n", - "\n", - "- Python 3.9+ (`python --version`).\n", - "- Ensure that Docker is running (`docker info`).\n", - "- The Gretel client SDK is installed and configured (`pip install -U gretel-client; gretel configure`).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZLAlOI5f_zh2" - }, - "outputs": [], - "source": [ - "import json\n", - "\n", - "import yaml\n", - "from smart_open import open\n", - "import pandas as pd\n", - "\n", - "from gretel_client import submit_docker_local\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "\n", - "data_source = \"https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/example-datasets/bike-customer-orders.csv\"\n", - "\n", - "# Policy to search for sensitive data\n", - "# including a custom regular expression based search\n", - "config = \"\"\"\n", - "schema_version: 1.0\n", - "models:\n", - " - classify:\n", - " data_source: \"_\"\n", - " labels:\n", - " - person_name\n", - " - location\n", - " - phone_number\n", - " - date_time\n", - " - birthdate\n", - " - gender\n", - " - acme/*\n", - " \n", - "label_predictors:\n", - " namespace: acme\n", - " regex:\n", - " user_id:\n", - " patterns:\n", - " - score: high\n", - " regex: ^user_[\\d]{5}$\n", - "\"\"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load and preview the DataFrame to train the classification model on.\n", - "\n", - "df = pd.read_csv(data_source, nrows=500)\n", - "df.to_csv(\"training_data.csv\", index=False)\n", - "df\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 582 - }, - "id": 
"xq2zj-6h_zh5", - "outputId": "0587ddc8-ccb6-455b-f961-9392b4736d69" - }, - "outputs": [], - "source": [ - "project = create_or_get_unique_project(name=\"local-classify\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nvOhfvS4_zh5" - }, - "outputs": [], - "source": [ - "# the following cell will create the classification model and\n", - "# run a sample of the data set through the model. this sample\n", - "# can be used to ensure the model is functioning correctly\n", - "# before continuing.\n", - "classify = project.create_model_obj(\n", - " model_config=yaml.safe_load(config), data_source=\"training_data.csv\"\n", - ")\n", - "\n", - "run = submit_docker_local(classify, output_dir=\"tmp/\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "EAZLMwmG_zh6" - }, - "outputs": [], - "source": [ - "# review the sampled classification report\n", - "report = json.loads(open(\"tmp/report_json.json.gz\").read())\n", - "pd.DataFrame(report[\"metadata\"][\"fields\"])\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hL0COKZo_zh6" - }, - "outputs": [], - "source": [ - "# next let's classify the remaining records using the model\n", - "# that was just created.\n", - "classify_records = classify.create_record_handler_obj(data_source=\"training_data.csv\")\n", - "\n", - "run = submit_docker_local(\n", - " classify_records, model_path=\"tmp/model.tar.gz\", output_dir=\"tmp/\"\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "eVPQySOg_zh6" - }, - "outputs": [], - "source": [ - "report = json.loads(open(\"tmp/report_json.json.gz\").read())\n", - "pd.DataFrame(report[\"metadata\"][\"fields\"])\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load results\n", - "results = pd.read_json(\"tmp/data.gz\", lines=True)\n", - "\n", - "# Examine labels found in the first record\n", - "results.iloc[0].to_dict()\n" - ] - } - ], - "metadata": { - "colab": { - "name": "local_jobs.ipynb", - "provenance": [] - }, - "interpreter": { - "hash": "9bba8f6ed2feafdad698ed6a1926c15a7650a75eedae60d223f34187f1656d66" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.3" - } - }, - "nbformat": 4, - "nbformat_minor": 1 -} diff --git a/docs/notebooks/local_synthetics.ipynb b/docs/notebooks/local_synthetics.ipynb deleted file mode 100644 index 8890b0f3..00000000 --- a/docs/notebooks/local_synthetics.ipynb +++ /dev/null @@ -1,166 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train a Gretel.ai synthetic data model locally\n", - "\n", - "This notebook walks through training a model and generating synthetic data locally in your environment.\n", - "\n", - "Follow the instructions here to set up your local environment and GPU: https://docs.gretel.ai/environment-setup\n", - "\n", - "Prerequisites:\n", - "\n", - "- Python 3.9+ (`python --version`).\n", - "- GPU with CUDA configured highly recommended (`nvidia-smi`).\n", - "- Ensure 
that Docker is running (`docker info`).\n", - "- The Gretel client SDK is installed and configured (`pip install -U gretel-client; gretel configure`).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "\n", - "from smart_open import open\n", - "import pandas as pd\n", - "\n", - "from gretel_client import submit_docker_local\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "\n", - "data_source = \"https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load and preview the DataFrame to train the synthetic model on.\n", - "\n", - "df = pd.read_csv(data_source)\n", - "df.to_csv(\"training_data.csv\", index=False)\n", - "df\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load config and set training parameters\n", - "from gretel_client.projects.models import read_model_config\n", - "\n", - "config = read_model_config(\"synthetics/default\")\n", - "\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"epochs\"] = 50\n", - "config[\"models\"][0][\"synthetics\"][\"data_source\"] = \"training_data.csv\"\n", - "\n", - "print(json.dumps(config, indent=2))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create a project and train the synthetic data model\n", - "\n", - "project = create_or_get_unique_project(name=\"synthetic-data-local\")\n", - "model = project.create_model_obj(model_config=config)\n", - "run = submit_docker_local(model, output_dir=\"tmp/\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# View the generated synthetic data\n", - "\n", - "synthetic_df = pd.read_csv(\"tmp/data_preview.gz\", compression=\"gzip\")\n", - "synthetic_df\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# View the report that shows the statistical similarity between the training and synthetic data\n", - "\n", - "import IPython\n", - "\n", - "IPython.display.HTML(data=open(\"tmp/report.html.gz\").read(), metadata=dict(isolated=True))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Use the trained model to create additional synthetic data\n", - "\n", - "record_handler = model.create_record_handler_obj(params={\"num_records\": 100})\n", - "\n", - "run = submit_docker_local(\n", - " record_handler, model_path=\"tmp/model.tar.gz\", output_dir=\"tmp/\"\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "synthetic_df_new = pd.read_csv(\"tmp/data.gz\", compression=\"gzip\")\n", - "synthetic_df_new\n" - ] - } - ], - "metadata": { - "interpreter": { - "hash": "9bba8f6ed2feafdad698ed6a1926c15a7650a75eedae60d223f34187f1656d66" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git 
a/docs/notebooks/local_transform.ipynb b/docs/notebooks/local_transform.ipynb deleted file mode 100644 index 131f11a7..00000000 --- a/docs/notebooks/local_transform.ipynb +++ /dev/null @@ -1,237 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Label and Transform content locally\n", - "\n", - "This notebook walks through training a transformation model and redacting PII locally in your environment.\n", - "\n", - "Follow the instructions here to set up your local environment: https://docs.gretel.ai/environment-setup\n", - "\n", - "Prerequisites:\n", - "\n", - "- Python 3.9+ (`python --version`).\n", - "- Ensure that Docker is running (`docker info`).\n", - "- The Gretel client SDK is installed and configured (`pip install -U gretel-client; gretel configure`).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZLAlOI5f_zh2" - }, - "outputs": [], - "source": [ - "import json\n", - "\n", - "import yaml\n", - "from smart_open import open\n", - "import pandas as pd\n", - "\n", - "from gretel_client import submit_docker_local\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "\n", - "data_source = \"https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/example-datasets/bike-customer-orders.csv\"\n", - "\n", - "# Simple policy to redact PII types with a character.\n", - "# Dates are shifted +/- 20 days based on the CustomerID field.\n", - "# Income is bucketized into increments of 5000.\n", - "\n", - "config = \"\"\"\n", - "schema_version: 1.0\n", - "models:\n", - " - transforms:\n", - " data_source: \"_\"\n", - " policies:\n", - " - name: remove_pii\n", - " rules:\n", - " - name: fake_or_redact_pii\n", - " conditions:\n", - " value_label:\n", - " - person_name\n", - " - phone_number\n", - " - gender\n", - " - birth_date\n", - " transforms:\n", - " - type: redact_with_char\n", - " attrs:\n", - " char: X\n", - " - name: dateshifter\n", - " conditions:\n", - " field_label:\n", - " - date\n", - " - datetime\n", - " - birth_date\n", - " transforms:\n", - " - type: dateshift\n", - " attrs:\n", - " min: 20\n", - " max: 20\n", - " formats: \"%Y-%m-%d\"\n", - " field_name: \"CustomerID\"\n", - " - name: bucketize-income\n", - " conditions:\n", - " field_name:\n", - " - YearlyIncome\n", - " transforms:\n", - " - type: numberbucket\n", - " attrs:\n", - " min: 0\n", - " max: 1000000\n", - " nearest: 5000\n", - "\"\"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load and preview the DataFrame to train the transform model on.\n", - "\n", - "df = pd.read_csv(data_source, nrows=500)\n", - "df.to_csv(\"training_data.csv\", index=False)\n", - "df.head(5)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 582 - }, - "id": "xq2zj-6h_zh5", - "outputId": "0587ddc8-ccb6-455b-f961-9392b4736d69" - }, - "outputs": [], - "source": [ - "project = create_or_get_unique_project(name=\"local-transform\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nvOhfvS4_zh5" - }, - "outputs": [], - "source": [ - "# The following cell will create the transform model and\n", - "# run a sample of the data set through the model. 
This sample\n", - "# can be used to ensure the model is functioning correctly\n", - "# before continuing.\n", - "transform = project.create_model_obj(\n", - " model_config=yaml.safe_load(config), data_source=\"training_data.csv\"\n", - ")\n", - "\n", - "run = submit_docker_local(transform, output_dir=\"tmp/\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "EAZLMwmG_zh6" - }, - "outputs": [], - "source": [ - "# Review the sampled classification report\n", - "# to get an overview of detected data types\n", - "report = json.loads(open(\"tmp/report_json.json.gz\").read())\n", - "pd.DataFrame(report[\"metadata\"][\"fields\"])\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hL0COKZo_zh6" - }, - "outputs": [], - "source": [ - "# Next, let's transform the remaining records using the transformation\n", - "# policy and model that was just created.\n", - "transform_records = transform.create_record_handler_obj(data_source=\"training_data.csv\")\n", - "\n", - "run = submit_docker_local(\n", - " transform_records, model_path=\"tmp/model.tar.gz\", output_dir=\"tmp/\"\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "eVPQySOg_zh6" - }, - "outputs": [], - "source": [ - "# View the transformation report\n", - "report = json.loads(open(\"tmp/report_json.json.gz\").read())\n", - "pd.DataFrame(report[\"metadata\"][\"fields\"])\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# View the transformed data\n", - "results = pd.read_csv(\"tmp/data.gz\")\n", - "results.head(5)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "colab": { - "name": "local_jobs.ipynb", - "provenance": [] - }, - "interpreter": { - "hash": "9bba8f6ed2feafdad698ed6a1926c15a7650a75eedae60d223f34187f1656d66" - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/notebooks/minimal-synthetic-data.ipynb b/docs/notebooks/minimal-synthetic-data.ipynb deleted file mode 100644 index ce7584cf..00000000 --- a/docs/notebooks/minimal-synthetic-data.ipynb +++ /dev/null @@ -1,137 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "id": "iovURYt3d_pa" - }, - "outputs": [], - "source": [ - "!pip install -U gretel-client pandas" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "id": "PryXC9MZd_pb" - }, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "id": "94NFYFbEd_pc" - }, - "outputs": [], - "source": [ - "# Create a project and set model configuration\n", - "from 
gretel_client.projects import create_or_get_unique_project\n", - "project = create_or_get_unique_project(name=\"mlworld\")\n", - "\n", - "from gretel_client.projects.models import read_model_config\n", - "config = read_model_config(\"synthetics/default\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "id": "gK2B5viId_pc" - }, - "outputs": [], - "source": [ - "# Load and preview the DataFrame to train the synthetic model on.\n", - "import pandas as pd\n", - "\n", - "dataset_path = \"https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv\"\n", - "df = pd.read_csv(dataset_path)\n", - "df.to_csv(\"training_data.csv\", index=False)\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "id": "z1Ff1N3xd_pc" - }, - "outputs": [], - "source": [ - "from gretel_client.helpers import poll\n", - "\n", - "model = project.create_model_obj(model_config=config, data_source=\"training_data.csv\")\n", - "model.submit_cloud()\n", - "\n", - "poll(model)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "lDNP0xAid_pd" - }, - "outputs": [], - "source": [ - "# View the synthetic data\n", - "\n", - "synthetic_df = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "\n", - "synthetic_df" - ] - } - ], - "metadata": { - "interpreter": { - "hash": "71cb0d5e65981f6fa5659bfbb000a9cb81b1de06a40d22b09746b990f4d79987" - }, - "kernelspec": { - "display_name": "Python 3.9.10 ('gretel': venv)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - }, - "colab": { - "name": "mlopsworld.ipynb", - "provenance": [] - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/docs/notebooks/numerics_and_downstream_ml.ipynb b/docs/notebooks/numerics_and_downstream_ml.ipynb deleted file mode 100644 index 91f26ea7..00000000 --- a/docs/notebooks/numerics_and_downstream_ml.ipynb +++ /dev/null @@ -1,322 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Installation and instructions\n", - "\n", - "This notebook walks through using a Gretel synthetic model to generate synthetic data and the open source PyCaret library to evaluate the quality of your synthetic data for machine learning use cases. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# %%capture\n", - "!pip install numpy pandas matplotlib pycaret\n", - "!pip install -U gretel-client" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Log in to Gretel using your API key" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", validate=True, clear=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Load data\n", - "\n", - "We're going to explore using synthetic data as input to a downstream classification task. 
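The evaluation pattern used throughout this notebook is often called "train on synthetic, test on real" (TSTR): fit a classifier on synthetic rows and score it on held-out real rows. Below is a minimal sketch of the idea with scikit-learn; the names `synthetic_df`, `real_holdout_df`, and `target` are hypothetical placeholders, and the notebook itself uses PyCaret for this step.

```python
# Sketch of "train on synthetic, test on real" (TSTR) evaluation.
# synthetic_df, real_holdout_df, and target are hypothetical placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train on synthetic rows only.
X_syn, y_syn = synthetic_df.drop(columns=[target]), synthetic_df[target]
# Score on real rows the synthetic model never saw.
X_real, y_real = real_holdout_df.drop(columns=[target]), real_holdout_df[target]

clf = RandomForestClassifier(random_state=0).fit(X_syn, y_syn)
print("TSTR accuracy:", accuracy_score(y_real, clf.predict(X_real)))
```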
" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "from sklearn.model_selection import train_test_split\n", - "\n", - "df = pd.read_csv(\"https://gretel-blueprints-pub.s3.us-west-2.amazonaws.com/rdb/grocery_orders.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Since we are going to train both a synthetic data generating model and a downstream classification model, we need to hold out a small validation set that doesn't get seen by the synthetic model or the classification model to test the eventual classification performance of a classification model trained purely on synthetic data and validated on unseen real data" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "train_df, valid_df = train_test_split(df, test_size=0.05)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train a synthetic model and look at the generated data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from gretel_client.projects import create_or_get_unique_project\n", - "from gretel_client.helpers import poll\n", - "from gretel_client.projects.models import read_model_config\n", - "\n", - "\n", - "# Create a project and model configuration.\n", - "project = create_or_get_unique_project(name=\"downstream-ML\")\n", - "\n", - "# Choose high-dimensionality config since we have 100+ columns\n", - "config = read_model_config(\"synthetics/tabular-actgan\")\n", - "\n", - "# Get a csv to work with, just dump out the train_df.\n", - "train_df.to_csv(\"train.csv\", index=False)\n", - "\n", - "model = project.create_model_obj(model_config=config, data_source=\"train.csv\")\n", - "\n", - "# Upload the training data. Train the model.\n", - "model.submit_cloud()\n", - "poll(model)\n", - "\n", - "synthetic = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "synthetic.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "from gretel_client.evaluation import QualityReport" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "synthetic.to_csv(\"synthetic.csv\", index=False)\n", - "report = QualityReport(data_source=\"synthetic.csv\", ref_data=\"train.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "report.run()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(report.peek())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Downstream usecase\n", - "\n", - "One huge benefit of synthetic data, outside of privacy preservation, is utility. The data isn't fake, it has all the same correlations as the original data - which means it can be used as input to a machine learning model. 
We train several classifiers and observe performance on various folds of the data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pycaret.classification import setup, compare_models, evaluate_model, predict_model, create_model, plot_model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "synthetic_df = synthetic.drop(['order_id'], axis=1)\n", - "train_df = train_df.drop(['order_id'], axis=1)\n", - "valid_df = valid_df.drop(['order_id'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "synthetic_train_data, synthetic_test_data = train_test_split(synthetic_df, test_size=0.2)\n", - "original_train_data, original_test_data = train_test_split(train_df, test_size=0.2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We want to predict whether a customer will buy frozen pizza (and how many). This turns into a multi-class classification problem. We use the PyCaret library to test a huge number of hypothesis classes. This will take a few minutes to fit many different models on a variety of folds." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "s = setup(synthetic_train_data, target='frozen pizza')\n", - "best = compare_models()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We then see how our \"best\" classification model, trained on the synthetic data, performs on the original data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "test_predictions = predict_model(best, data=original_test_data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "valid_predictions = predict_model(best, data=valid_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "synthetic_predictions = predict_model(best, data=synthetic_test_data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "s = setup(original_train_data, target='frozen pizza')\n", - "best = compare_models()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "test_predictions = predict_model(best, data=original_test_data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "valid_predictions = predict_model(best, data=valid_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "synthetic_predictions = predict_model(best, data=synthetic_test_data)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.0" - }, - "orig_nbformat": 4, - "vscode": { - "interpreter": { - "hash": "1264641a2296bed54b65447ff0d3f452674f070f0748798274bc429fe6ce8efd" - } - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/docs/notebooks/rapid_data_generation_with_amplify.ipynb 
b/docs/notebooks/rapid_data_generation_with_amplify.ipynb deleted file mode 100644 index b50c37cd..00000000 --- a/docs/notebooks/rapid_data_generation_with_amplify.ipynb +++ /dev/null @@ -1,281 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "sugXH-2KDYdE" - }, - "source": [ - "# Generate high volumes of data rapidly with Gretel Amplify\n", - "\n", - "* This notebook demonstrates how to **generate lots of data fast** using Gretel Amplify\n", - "* To run this notebook, you will need an API key from the [Gretel console](https://console.gretel.cloud/dashboard).\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yOYfJXYREOSI" - }, - "source": [ - "## Getting Started\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VEM6kjRsczHd" - }, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install -U gretel-client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kQYlGEMbDEBv" - }, - "outputs": [], - "source": [ - "# Imports\n", - "import json\n", - "import pandas as pd\n", - "from re import findall\n", - "\n", - "from gretel_client import configure_session\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "from gretel_client.projects.models import read_model_config\n", - "from gretel_client.helpers import poll" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "HWg6t3ko-I2-" - }, - "outputs": [], - "source": [ - "# @title\n", - "from re import findall\n", - "\n", - "\n", - "def get_output_stats(logs):\n", - "    # Walk the logs backwards until we find the last entry with a non-empty ctx.\n", - "    i = len(logs)-1\n", - "    output_recs = 0\n", - "    while True:\n", - "        ctx = len(logs[i]['ctx'])\n", - "        if ctx != 0:\n", - "            output_recs = int(findall(r'\\d*\\.?\\d+', logs[i]['msg'])[0])\n", - "            output_size = logs[i]['ctx']['final_size_mb']\n", - "            gen_time = logs[i]['ctx']['amplify_time_min']*60\n", - "            throughput_MBps = logs[i]['ctx']['throughput_mbps']\n", - "\n", - "            return (output_recs, output_size, gen_time, throughput_MBps)\n", - "        i -= 1\n", - "\n", - "\n", - "def stats(model):\n", - "\n", - "    # Statistics\n", - "\n", - "    stats = get_output_stats(model.logs)\n", - "\n", - "    target_size = model.model_config['models'][0]['amplify']['params']['target_size_mb']\n", - "    output_recs = stats[0]\n", - "    output_size = stats[1]\n", - "    time = model.billing_details['total_time_seconds']\n", - "    recs_per_sec = output_recs/time\n", - "    total_MBps = output_size/time\n", - "    gen_time = stats[2]\n", - "    gen_recs_per_sec = output_recs/gen_time\n", - "    throughput_MBps = stats[3]\n", - "\n", - "    print('\\033[1m' + \"Statistics\" + '\\033[0m')\n", - "    print(\"Target Size: \\t\\t{} MB\".format(target_size))\n", - "    print(\"Output Rows: \\t\\t{} records\".format(output_recs))\n", - "    print(\"Output Size: \\t\\t{:.2f} MB\".format(output_size))\n", - "    print(\"Total Time: \\t\\t{:.2f} seconds\".format(time))\n", - "    print(\"Total Speed: \\t\\t{:.2f} records/s\".format(recs_per_sec))\n", - "    print(\"Total Speed: \\t\\t{:.2f} MBps\".format(total_MBps))\n", - "    print(\"Generation Time: \\t{:.2f} seconds\".format(gen_time))\n", - "    print(\"Generation Speed: \\t{:.2f} records/s\".format(gen_recs_per_sec))\n", - "    print(\"Generation Speed: \\t{:.2f} MBps\".format(throughput_MBps))\n" - ] - },
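The `get_output_stats` helper above walks `model.logs` backwards until it finds the last entry whose `ctx` is non-empty, then reads the run statistics out of it. Roughly, it expects a final log entry shaped like the following sketch, inferred purely from the fields the code reads; the values are made up.

```python
# Illustrative only: the final log entry shape that get_output_stats() expects.
example_log_entry = {
    "msg": "Amplify generated 1000000 records",  # record count is parsed from here
    "ctx": {
        "final_size_mb": 5000.0,   # becomes output_size
        "amplify_time_min": 12.5,  # converted to seconds as gen_time
        "throughput_mbps": 6.7,    # becomes throughput_MBps
    },
}
```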
- { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true, - "id": "rjBbbGyNO2PO" - }, - "outputs": [], - "source": [ - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "# Specify your Gretel API Key\n", - "configure_session(api_key=\"prompt\", cache=\"no\", validate=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2mXcFk2Cy0lC" - }, - "source": [ - "## Load and preview data\n", - "\n", - "For this demo, we'll use a [US Census dataset](https://github.com/gretelai/gretel-blueprints/blob/main/sample_data/us-adult-income.csv) as our input data. This dataset contains 14,000 records, 15 fields, and is about 1.68 MB in size.\n", - "\n", - "If you want to use another dataset, just replace the URL. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true, - "id": "Rgx85TgkPJsY" - }, - "outputs": [], - "source": [ - "url = 'https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv'\n", - "df = pd.read_csv(url)\n", - "print('\\033[1m' + \"Input Data - US Adult Income\" + '\\033[0m')\n", - "print('Number of records: {}'.format(len(df)))\n", - "print('Size: {:.2f} MB'.format(df.memory_usage(index=True).sum()/1e6))\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2kKGDsEezMVY" - }, - "source": [ - "## Set target output size\n", - "\n", - "There are two ways to indicate the amount of data you want to generate with Amplify. You can use the `num_records` config parameter to specify the number of records to produce. Or, you can use the `target_size_mb` parameter to designate the desired output size in megabytes. The maximum value for `target_size_mb` is 5000 (5GB). Only one parameter can be specified. To read more about the Amplify config, you can check out our docs [here](https://docs.gretel.ai/gretel.ai/synthetics/models/amplify).\n", - "\n", - "In this example, we want to generate 5GB of data so we'll set the `target_size_mb` parameter to be `5000`."
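If you would rather cap the output by row count instead of size, the same config edit shown in the next cell applies to the `num_records` parameter. A sketch, assuming `config` has already been loaded via `read_model_config` as below; the record count here is arbitrary, and since only one sizing parameter may be set, `target_size_mb` is removed rather than assigned:

```python
# Alternative sizing (sketch): request a fixed number of rows instead of a
# target size. Only one of num_records / target_size_mb may be specified.
config["models"][0]["amplify"]["params"]["num_records"] = 1_000_000  # arbitrary
config["models"][0]["amplify"]["params"].pop("target_size_mb", None)
```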
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cpfJzWa8pENd" - }, - "outputs": [], - "source": [ - "# Pull Amplify model config \n", - "config = read_model_config(\"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/amplify.yml\")\n", - "\n", - "# Set config parameters\n", - "\n", - "config['models'][0]['amplify']['params']['target_size_mb'] = 5000 # 5 GB\n", - "config['name'] = \"amplify-demo\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "X19N2FOTxpEv" - }, - "source": [ - "## Create and run model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true, - "id": "GOIbGmCXtGS5" - }, - "outputs": [], - "source": [ - "# Designate project\n", - "project = create_or_get_unique_project(name=\"amplify\")\n", - "\n", - "# Create and submit model \n", - "model = project.create_model_obj(model_config=config, data_source=df)\n", - "model.submit_cloud()\n", - "poll(model)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XdRDFW1izjuR" - }, - "source": [ - "## View results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "govCEdQ2VxU-" - }, - "outputs": [], - "source": [ - "# Generation statistics\n", - "stats(model)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "CWUvcfzXvptx" - }, - "outputs": [], - "source": [ - "# Output data\n", - "amp = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "amp" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# SQS Report\n", - "import IPython\n", - "from smart_open import open\n", - "\n", - "IPython.display.HTML(data=open(model.get_artifact_link(\"report\")).read(), metadata=dict(isolated=True))" - ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "private_outputs": true, - "provenance": [] - }, - "gpuClass": "standard", - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file diff --git a/docs/notebooks/redact_pii.ipynb b/docs/notebooks/redact_pii.ipynb deleted file mode 100644 index 92758954..00000000 --- a/docs/notebooks/redact_pii.ipynb +++ /dev/null @@ -1,198 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UTRxpSlaczHY" - }, - "source": [ - "# Redact PII\n", - "\n", - "In this blueprint, we will create a transform policy to identify and redact or replace PII with fake values. 
We will then use the SDK to transform a dataset and examine the results.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VEM6kjRsczHd" - }, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "!pip install pyyaml Faker pandas\n", - "!pip install -U gretel-client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZQ-TmAdwczHd" - }, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "\n", - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create our configuration with our Transforms Policies and Rules.\n", - "config = \"\"\"schema_version: \"1.0\"\n", - "name: \"Redact PII\"\n", - "models:\n", - "  - transforms:\n", - "      data_source: \"_\"\n", - "      policies:\n", - "        - name: remove_pii\n", - "          rules:\n", - "            - name: fake_or_redact_pii\n", - "              conditions:\n", - "                value_label:\n", - "                  - person_name\n", - "                  - credit_card_number\n", - "                  - phone_number\n", - "                  - us_social_security_number\n", - "                  - email_address\n", - "                  - custom/*\n", - "              transforms:\n", - "                - type: fake\n", - "                - type: redact_with_char\n", - "                  attrs:\n", - "                    char: X\n", - "label_predictors:\n", - "  namespace: custom\n", - "  regex:\n", - "    user_id:\n", - "      patterns:\n", - "        - score: high\n", - "          regex: 'user_[\\d]{5}'\n", - "\"\"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from faker import Faker\n", - "\n", - "# Use Faker to make training and test data.\n", - "def fake_pii_csv(filename, lines=100):\n", - "    fake = Faker()\n", - "    with open(filename, \"w\") as f:\n", - "        f.write(\"id,name,email,phone,visa,ssn,user_id\\n\")\n", - "        for i in range(lines):\n", - "            _name = fake.name()\n", - "            _email = fake.email()\n", - "            _phone = fake.phone_number()\n", - "            _cc = fake.credit_card_number()\n", - "            _ssn = fake.ssn()\n", - "            _id = f'user_{fake.numerify(text=\"#####\")}'\n", - "            f.write(f\"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\\n\")\n", - "\n", - "\n", - "fake_pii_csv(\"train.csv\")\n", - "fake_pii_csv(\"test.csv\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import yaml\n", - "\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "from gretel_client.helpers import poll\n", - "\n", - "# Create a project and model configuration.\n", - "project = create_or_get_unique_project(name=\"redact-pii-transform\")\n", - "\n", - "model = project.create_model_obj(\n", - "    model_config=yaml.safe_load(config), data_source=\"train.csv\"\n", - ")\n", - "\n", - "# Upload the training data. Train the model.\n", - "model.submit_cloud()\n", - "\n", - "poll(model)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Use the model to transform the test data.\n", - "record_handler = model.create_record_handler_obj(data_source=\"test.csv\")\n", - "\n", - "record_handler.submit_cloud()\n", - "\n", - "poll(record_handler)\n", - "\n", - "# Compare results. 
Here is our \"before.\"\n", - "train_df = pd.read_csv(\"test.csv\")\n", - "print(\"test.csv head, before redaction\")\n", - "print(train_df.head())\n", - "\n", - "# And here is our \"after.\"\n", - "transformed = pd.read_csv(record_handler.get_artifact_link(\"data\"), compression=\"gzip\")\n", - "print(\"test.csv head, after redaction\")\n", - "transformed.head()\n" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "name": "smart-seed-blueprint", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/notebooks/retain_values_with_conditional_data_generation.ipynb b/docs/notebooks/retain_values_with_conditional_data_generation.ipynb deleted file mode 100644 index d7a25335..00000000 --- a/docs/notebooks/retain_values_with_conditional_data_generation.ipynb +++ /dev/null @@ -1,198 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UTRxpSlaczHY" - }, - "source": [ - "# Retaining primary keys and field values with conditional data generation\n", - "\n", - "Gretel supports a feature known as model conditioning (seeding) that will generate rows based on partial values from your training data. This is useful when you want to manually specify certain field values in the synthetic data, and let Gretel synthesize the rest of the row for you.\n", - "\n", - "Use Cases for conditional data generation with Gretel:\n", - "\n", - "- Create synthetic data that has the same number of rows as the training data\n", - "- You want to preserve some of the original row data (primary keys, dates, important categorical data).\n", - "\n", - "When using conditional generation with Gretel's \"seed\" task, the model will generate one sample for each row of the seed dataframe, sorted in the same order.\n", - "\n", - "In the example below, we'll use a combination of a primary key `client_id` and categorical fields `age` and `gender` as conditional inputs to the synthetic model, generating a new dataframe with the same primary key and categorical fields, but with the rest of the dataframe containing synthetically generated values.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VEM6kjRsczHd" - }, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "!pip install pyyaml smart_open pandas\n", - "!pip install -U gretel-client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZQ-TmAdwczHd" - }, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "\n", - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "YMg9nX6SczHe" - }, - "outputs": [], - "source": [ - "# Load and preview dataset\n", - "\n", - "import pandas as pd\n", - "\n", - "dataset_path = 
\"https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer_finance_data.csv\"\n", - "\n", - "# We will pull down the training data to drop an ID column. This will help give us a better model.\n", - "training_df = pd.read_csv(dataset_path)\n", - "\n", - "try:\n", - " training_df.drop(\"disp_id\", axis=\"columns\", inplace=True)\n", - "except KeyError:\n", - " pass # incase we already dropped it\n", - "\n", - "training_df\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tvKsT56cjOFO" - }, - "outputs": [], - "source": [ - "from gretel_client.projects.models import read_model_config\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "from gretel_client.helpers import poll\n", - "\n", - "\n", - "# Create a project and model configuration.\n", - "project = create_or_get_unique_project(name=\"conditional-data-example\")\n", - "\n", - "# Pull down the default synthetic config. We will modify it slightly.\n", - "config = read_model_config(\"synthetics/default\")\n", - "\n", - "# Here we prepare an object to specify the conditional data generation task.\n", - "# In this example, we will retain the values for the seed fields below,\n", - "# use their values as inputs to the synthetic model.\n", - "fields = [\"client_id\", \"age\", \"gender\"]\n", - "task = {\"type\": \"seed\", \"attrs\": {\"fields\": fields}}\n", - "config[\"models\"][0][\"synthetics\"][\"task\"] = task\n", - "config[\"models\"][0][\"synthetics\"][\"generate\"] = {\"num_records\": len(training_df)}\n", - "\n", - "\n", - "# Fit the model on the training set\n", - "training_df.to_csv(\"train.csv\", index=False)\n", - "model = project.create_model_obj(model_config=config, data_source=\"train.csv\")\n", - "\n", - "model.submit_cloud()\n", - "\n", - "poll(model)\n", - "\n", - "synthetic = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "synthetic.head()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "He82umP5jOFP" - }, - "outputs": [], - "source": [ - "# Generate report that shows the statistical performance between the training and synthetic data\n", - "\n", - "import IPython\n", - "from smart_open import open\n", - "\n", - "IPython.display.HTML(data=open(model.get_artifact_link(\"report\")).read(), metadata=dict(isolated=True))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VJMSsKsJj52c" - }, - "outputs": [], - "source": [ - "# Use the model to generate additional synthetic data.\n", - "\n", - "seeds = training_df[fields]\n", - "seeds.to_csv(\"seeds.csv\", index=False)\n", - "\n", - "rh = model.create_record_handler_obj(\n", - " data_source=\"seeds.csv\", params={\"num_records\": len(seeds)}\n", - ")\n", - "rh.submit_cloud()\n", - "\n", - "poll(rh)\n", - "\n", - "synthetic_next = pd.read_csv(rh.get_artifact_link(\"data\"), compression=\"gzip\")\n", - "synthetic_next\n" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "name": "Gretel - Retaining primary keys and field values with conditional data generation", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - 
} - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/notebooks/synthetic_data_walkthrough.ipynb b/docs/notebooks/synthetic_data_walkthrough.ipynb deleted file mode 100644 index a8b83fc5..00000000 --- a/docs/notebooks/synthetic_data_walkthrough.ipynb +++ /dev/null @@ -1,319 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UTRxpSlaczHY" - }, - "source": [ - "# Create synthetic data with the Python SDK\n", - "\n", - "This notebook utilizes Gretel's SDK and APIs to create a synthetic version of a popular machine learning financial dataset.\n", - "\n", - "To run this notebook, you will need an API key from the Gretel console, at https://console.gretel.cloud.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VEM6kjRsczHd" - }, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install pyyaml smart_open pandas\n", - "!pip install -U gretel-client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "ZQ-TmAdwczHd", - "outputId": "03aa9c40-01f8-4711-a80b-52322721ee4c" - }, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "\n", - "import pandas as pd\n", - "from gretel_client import configure_session\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "fmHDICI1oPS5" - }, - "outputs": [], - "source": [ - "# Create a project\n", - "\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "\n", - "project = create_or_get_unique_project(name=\"walkthrough-synthetic\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4PD5B0U06ALs" - }, - "source": [ - "## Create the synthetic data configuration\n", - "\n", - "Load the default configuration template. This template will work well for most datasets. View other templates at https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "uIu3hkzoCzGz", - "outputId": "94c32679-4a9c-4af3-95d2-1fbda2e617ed" - }, - "outputs": [], - "source": [ - "import json\n", - "from gretel_client.projects.models import read_model_config\n", - "\n", - "config = read_model_config(\"synthetics/default\")\n", - "\n", - "# Set the model epochs to 50\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"epochs\"] = 50\n", - "\n", - "print(json.dumps(config, indent=2))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "s9LTh7GO6VIu" - }, - "source": [ - "## Load and preview the source dataset\n", - "\n", - "Specify a data source to train the model on. 
This can be a local file, web location, or HDFS file.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 571 - }, - "id": "YMg9nX6SczHe", - "outputId": "18d0a1f8-07cd-4811-a385-9c159a58b26a" - }, - "outputs": [], - "source": [ - "# Load and preview dataset to train the synthetic model on.\n", - "import pandas as pd\n", - "\n", - "model = project.create_model_obj(\n", - " model_config=config,\n", - " data_source=\"https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv\",\n", - ")\n", - "\n", - "pd.read_csv(model.data_source)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WxnH8th-65Dh" - }, - "source": [ - "## Train the synthetic model\n", - "\n", - "In this step, we will task the worker running in the Gretel cloud, or locally, to train a synthetic model on the source dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "O4-E_F0qczHe", - "outputId": "6b82092d-ded1-43f0-f1ac-115dd8992956" - }, - "outputs": [], - "source": [ - "from gretel_client.helpers import poll\n", - "\n", - "model.submit_cloud()\n", - "\n", - "poll(model)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2bgWKArX7QGf" - }, - "source": [ - "# View the generated synthetic data\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 538 - }, - "id": "sPM-gaU6czHf", - "outputId": "e29e9b29-06f2-40a3-de4d-5f9e6d41b621" - }, - "outputs": [], - "source": [ - "# View the synthetic data\n", - "\n", - "synthetic_df = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "\n", - "synthetic_df.head()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "69XYfU9k7fq4" - }, - "source": [ - "# View the synthetic data quality report\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "zX8qsizqczHg", - "jupyter": { - "outputs_hidden": true - }, - "outputId": "2daf44a8-13f5-4e2c-cccc-b26a5a59d461", - "tags": [] - }, - "outputs": [], - "source": [ - "# Generate report that shows the statistical performance between the training and synthetic data\n", - "\n", - "import IPython\n", - "from smart_open import open\n", - "\n", - "IPython.display.HTML(data=open(model.get_artifact_link(\"report\")).read(), metadata=dict(isolated=True))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6IkWOnVQ7oo1" - }, - "source": [ - "# Generate unlimited synthetic data\n", - "\n", - "You can now use the trained synthetic model to generate as much synthetic data as you like.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "X0bI0OpI6W3Y", - "outputId": "7faf358b-e3af-4e3f-8368-aeb940d19c42" - }, - "outputs": [], - "source": [ - "# Generate more records from the model\n", - "\n", - "record_handler = model.create_record_handler_obj(\n", - " params={\"num_records\": 100, \"max_invalid\": 500}\n", - ")\n", - "\n", - "record_handler.submit_cloud()\n", - "\n", - "poll(record_handler)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 554 - }, - 
"id": "uUIErjQ7CzGy", - "outputId": "4d1518e2-ee5f-4f00-cab5-81c75b54e9ca" - }, - "outputs": [], - "source": [ - "synthetic_df = pd.read_csv(record_handler.get_artifact_link(\"data\"), compression=\"gzip\")\n", - "\n", - "synthetic_df.head()\n" - ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "include_colab_link": true, - "name": "Synthetic Data Walkthrough", - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/notebooks/time_series_generation_poc.ipynb b/docs/notebooks/time_series_generation_poc.ipynb deleted file mode 100644 index 3346634d..00000000 --- a/docs/notebooks/time_series_generation_poc.ipynb +++ /dev/null @@ -1,504 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Xbv1HhS1dXQq" - }, - "source": [ - "# Time Series Proof of of Concept\n", - "\n", - "This blueprint demonstrates a full proof of concept for creating a synthetic financial time-series dataset and evaluating its privacy and accuracy for a predictive task\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QXBi_RW5dXQs" - }, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "!pip install -U gretel-client\n", - "!pip install numpy pandas statsmodels matplotlib seaborn\n", - "!pip install -U scikit-learn\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "W3eKIatM1mo4", - "outputId": "56320388-d8b7-405f-f8c0-b8e5d1c4742e" - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "import statsmodels as sm\n", - "from statsmodels.tsa.statespace import sarimax\n", - "from sklearn.metrics import mean_squared_error\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "import time\n", - "\n", - "from typing import List, Dict\n", - "from gretel_client import configure_session\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "-Kyza7XJdXQt", - "outputId": "b87e0d03-9120-4aaf-ad21-dd211a960cca" - }, - "outputs": [], - "source": [ - "# Specify your Gretel API key\n", - "\n", - "pd.set_option(\"max_colwidth\", None)\n", - "\n", - "configure_session(api_key=\"prompt\", cache=\"yes\", validate=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 424 - }, - "id": "73ciMrkldXQu", - "outputId": "e0d39781-e93d-4d08-fa88-139e70e4b662" - }, - "outputs": [], - "source": [ - "# Load timeseries example to a dataframe\n", - "\n", - "data_source = \"https://gretel-public-website.s3.amazonaws.com/datasets/credit-timeseries-dataset.csv\"\n", - "original_df = pd.read_csv(data_source)\n", - "original_df.to_csv(\"original.csv\", index=False)\n", - "original_df\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": 
"thFAMQaDuE8X" - }, - "outputs": [], - "source": [ - "# Gretel Transforms Configuration\n", - "config = \"\"\"\n", - "schema_version: \"1.0\"\n", - "models:\n", - " - transforms:\n", - " data_source: \"__tmp__\"\n", - " policies:\n", - " - name: shiftnumbers\n", - " rules:\n", - " - name: shiftnumbers\n", - " conditions:\n", - " field_name:\n", - " - account_balance\n", - " - credit_amt\n", - " - debit_amt\n", - " - net_amt\n", - " transforms:\n", - " - type: numbershift\n", - " attrs:\n", - " min: 1\n", - " max: 100\n", - " field_name:\n", - " - date\n", - " - district_id\n", - "\"\"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "GgPpzZP9uKPx", - "outputId": "e2bfa43f-4a6a-4f91-ccc5-6604007a1eea" - }, - "outputs": [], - "source": [ - "# De-identify the original dataset using the policy above\n", - "import yaml\n", - "\n", - "from gretel_client.projects import create_or_get_unique_project\n", - "from gretel_client.helpers import poll\n", - "\n", - "# Create a project and model configuration.\n", - "project = create_or_get_unique_project(name=\"numbershift-transform\")\n", - "\n", - "model = project.create_model_obj(\n", - " model_config=yaml.safe_load(config), data_source=data_source\n", - ")\n", - "\n", - "# Upload the training data. Train the model.\n", - "model.submit_cloud()\n", - "poll(model)\n", - "\n", - "record_handler = model.create_record_handler_obj(data_source=data_source)\n", - "record_handler.submit_cloud()\n", - "poll(record_handler)\n", - "\n", - "deid_df = pd.read_csv(record_handler.get_artifact_link(\"data\"), compression=\"gzip\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 386 - }, - "id": "xFtDkVV_yYjU", - "outputId": "21dfaa6b-899c-4d0a-cbbb-2ca8585716b4" - }, - "outputs": [], - "source": [ - "# View the transformation report\n", - "\n", - "import json\n", - "from smart_open import open\n", - "\n", - "report = json.loads(open(model.get_artifact_link(\"report_json\")).read())\n", - "pd.DataFrame(report[\"metadata\"][\"fields\"])\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 424 - }, - "id": "VnCiJT43wc1p", - "outputId": "a1d2fad7-563a-4df3-cec4-92fa937dd14c" - }, - "outputs": [], - "source": [ - "# Here we sort and remove \"net_amt\" as it's a derived column,\n", - "# We will add back in after the data is synthesized\n", - "train_df = deid_df.copy()\n", - "\n", - "train_df.sort_values(\"date\", inplace=True)\n", - "train_cols = list(train_df.columns)\n", - "train_cols.remove(\"net_amt\")\n", - "train_df = train_df.filter(train_cols)\n", - "\n", - "# Here we noticed that some number have extremely long precision,\n", - "# so we round the data\n", - "train_df = train_df.round(1)\n", - "train_df.to_csv(\"train.csv\", index=False)\n", - "train_df\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "3tBtsrRQiawq", - "outputId": "52121882-aa72-41a1-d34b-62f3cb71147b" - }, - "outputs": [], - "source": [ - "from gretel_client.projects.models import read_model_config\n", - "\n", - "# Create a project and model configuration.\n", - "project = create_or_get_unique_project(name=\"ts-5544-regular-seed\")\n", - "\n", - "# Pull down the default synthetic config. 
We will modify it slightly.\n", - "config = read_model_config(\"synthetics/default\")\n", - "\n", - "# Set up the seed fields\n", - "seed_fields = [\"date\", \"district_id\"]\n", - "\n", - "task = {\n", - "    \"type\": \"seed\",\n", - "    \"attrs\": {\n", - "        \"fields\": seed_fields,\n", - "    },\n", - "}\n", - "\n", - "# Fine-tune model parameters. These are the parameters we found to work best. This is \"Run 20\" in the document.\n", - "config[\"models\"][0][\"synthetics\"][\"task\"] = task\n", - "\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"vocab_size\"] = 20\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"learning_rate\"] = 0.005\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"epochs\"] = 100\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"gen_temp\"] = 0.8\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"reset_states\"] = True\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"dropout_rate\"] = 0.5\n", - "config[\"models\"][0][\"synthetics\"][\"params\"][\"early_stopping\"] = True\n", - "config[\"models\"][0][\"synthetics\"][\"privacy_filters\"][\"similarity\"] = None\n", - "config[\"models\"][0][\"synthetics\"][\"privacy_filters\"][\"outliers\"] = None\n", - "config[\"models\"][0][\"synthetics\"][\"generate\"][\"num_records\"] = train_df.shape[0]\n", - "\n", - "# Get a csv to work with, just dump out the train_df.\n", - "train_df.to_csv(\"train.csv\", index=False)\n", - "\n", - "# Initiate a new model with the chosen config\n", - "model = project.create_model_obj(model_config=config, data_source=\"train.csv\")\n", - "\n", - "# Upload the training data. Train the model.\n", - "model.submit_cloud()\n", - "poll(model)\n", - "\n", - "synthetic = pd.read_csv(model.get_artifact_link(\"data_preview\"), compression=\"gzip\")\n", - "synthetic\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "4GyCx1wuyB0n" - }, - "outputs": [], - "source": [ - "# Add back in the derived column \"net_amt\"\n", - "net_amt = synthetic[\"credit_amt\"] - synthetic[\"debit_amt\"]\n", - "synthetic[\"net_amt\"] = net_amt\n", - "\n", - "# Save off the new synthetic data\n", - "synthetic.to_csv(\"synthetic.csv\", index=False, header=True)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "JS0hXcJ-Y7Oo", - "outputId": "1ea4100d-99c7-4164-dfa5-27d2394e8c53" - }, - "outputs": [], - "source": [ - "# View the Synthetic Performance Report\n", - "import IPython\n", - "from smart_open import open\n", - "\n", - "IPython.display.HTML(data=open(model.get_artifact_link(\"report\")).read(), metadata=dict(isolated=True))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "gi4d2NuuKQGV", - "outputId": "547d3129-676c-4bda-8e44-aea19f38453b" - }, - "outputs": [], - "source": [ - "import matplotlib\n", - "import matplotlib.pyplot as plt\n", - "\n", - "\n", - "def plot_district_averages(\n", - "    synthetic: pd.DataFrame, training: pd.DataFrame, district_id: int\n", - ") -> pd.DataFrame:\n", - "\n", - "    synthetic_data = synthetic.loc[synthetic[\"district_id\"] == district_id]\n", - "    synthetic_data = synthetic_data.set_index(\"date\")\n", - "\n", - "    training_data = training.loc[training[\"district_id\"] == district_id]\n", - "    training_data 
= training_data.set_index(\"date\")\n", - "\n", - "    combined = synthetic_data.join(\n", - "        training_data, lsuffix=\"_synthetic\", rsuffix=\"_original\"\n", - "    )\n", - "    plt.suptitle(\"District #\" + str(district_id))\n", - "\n", - "    for col in [\"credit_amt\", \"debit_amt\", \"account_balance\"]:\n", - "        fig = combined.plot(y=[f\"{col}_synthetic\", f\"{col}_original\"], figsize=(12, 8))\n", - "        plt.title(\"Time Series for District #\" + str(district_id))\n", - "\n", - "    return combined\n", - "\n", - "\n", - "combined = plot_district_averages(synthetic, train_df, 13)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "24fAgRdLomsn" - }, - "outputs": [], - "source": [ - "import warnings\n", - "\n", - "warnings.filterwarnings(\"ignore\")\n", - "\n", - "\n", - "def ARIMA_run(\n", - "    data_paths: List[str],\n", - "    targets: List[str] = None,\n", - "    entity_column: str = \"district_id\",\n", - "    entities: List = None,\n", - "    date_column: str = \"date\",\n", - "    date_threshold: str = None,\n", - ") -> Dict[str, List[float]]:\n", - "    \"\"\"\n", - "    The purpose of this function is to automate the running and scoring of SARIMAX models, so we can benchmark results against different synthetic data configurations.\n", - "    The data paths are passed in, and then the entire run, from loading in and sorting the data to creating a model and scoring it, is done via this function.\n", - "    The outputs are the target scores for each variable on each dataset's model. This gets used to create bar charts of the RMSE.\n", - "    With some fine-tuning, this function can serve as a general-purpose SARIMAX benchmark for a variety of datasets.\n", - "\n", - "    Args:\n", - "        data_paths: a list of paths to the data you want to create models and score with. These can be either local paths or ones from public buckets.\n", - "        targets: Which columns in the data will be your target variables?\n", - "        entity_column: This is purely used for datasets that have multiple time series data points from multiple places. Since this function was built with that in mind, it assumes that you will\n", - "        give a column that denotes those different places/entities. There is currently no handler for the case where None is provided.\n", - "        entities: This should be a list of the set of entities within the entity column.\n", - "        date_column: This should be something we can use to sort the data, so that the time series is read appropriately.\n", - "        date_threshold: This is to split the data into train and test. 
Specify the date at which to split the data into train and test sets.\n", - "\n", - "    Outputs:\n", - "        target_scores: This will be a dictionary of RMSE scores for each target variable on each dataset.\n", - "    \"\"\"\n", - "    target_scores = {}\n", - "    for target in targets:\n", - "        target_scores[target] = []\n", - "    for path in data_paths:\n", - "        sorted_data = pd.read_csv(path)\n", - "        sorted_data.sort_values(date_column, inplace=True)\n", - "        sorted_data.drop_duplicates(subset=[date_column, entity_column], inplace=True)\n", - "\n", - "        print(\"Path: {}\".format(path))\n", - "        for entity in entities:\n", - "            print(\"Entity: {}\".format(entity))\n", - "            for target in targets:\n", - "                # Combine the entity and date filters into one boolean mask\n", - "                # (avoids chained indexing with misaligned masks).\n", - "                train_data = sorted_data[\n", - "                    (sorted_data[entity_column] == entity)\n", - "                    & (sorted_data[date_column] < date_threshold)\n", - "                ]\n", - "                test_data = sorted_data[\n", - "                    (sorted_data[entity_column] == entity)\n", - "                    & (sorted_data[date_column] >= date_threshold)\n", - "                ]\n", - "\n", - "                model = sarimax.SARIMAX(\n", - "                    train_data[target], order=(0, 1, 1), seasonal_order=(1, 1, 0, 12)\n", - "                )\n", - "                res = model.fit()\n", - "\n", - "                preds = res.forecast(len(test_data[target]))\n", - "                rmse = mean_squared_error(test_data[target], preds, squared=False)\n", - "                target_scores[target].append(rmse)\n", - "                print(\"Target: {}\".format(target))\n", - "                print(\"RMSE: {}\".format(rmse))\n", - "\n", - "    return target_scores\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "WsK5p3YB204I", - "outputId": "3fbc4baa-6599-4c0f-e5ce-45bcfcecd00e" - }, - "outputs": [], - "source": [ - "target_scores = ARIMA_run(\n", - "    [\"synthetic.csv\", \"original.csv\"],\n", - "    targets=[\"net_amt\", \"account_balance\", \"credit_amt\", \"debit_amt\"],\n", - "    entities=[13],\n", - "    date_threshold=\"1998-01-01\",\n", - ")\n", - "target_scores\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 352 - }, - "id": "d9BewEn_3B6x", - "outputId": "7d6946ae-0825-4c4c-abd9-c98106ae1c80" - }, - "outputs": [], - "source": [ - "results = pd.DataFrame.from_dict(target_scores)\n", - "results[\"method\"] = [\"synthetic\", \"real world\"]\n", - "results.plot.bar(x=\"method\", title=\"RMSE per field and run in synthetic timeseries\")\n" - ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "name": "Time Series Generation POC - Gretel and Global Financial Institution", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}