diff --git a/notebooks/experimental/semantic_operators.ipynb b/notebooks/experimental/semantic_operators.ipynb index d3c3682402..815e14284f 100644 --- a/notebooks/experimental/semantic_operators.ipynb +++ b/notebooks/experimental/semantic_operators.ipynb @@ -25,24 +25,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# BigFrames Semantic Operator Demo" + "# BigQuery DataFrames Semantic Operator Demo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We implemented the semantics operators based on the idea in the \"Lotus\" paper: https://arxiv.org/pdf/2407.11418.\n", + "The BigQuery DataFrames team implements semantic operators as described in the \"Lotus\" paper: https://arxiv.org/pdf/2407.11418.\n", "\n", - "This notebook gives you a hands-on preview of semantic operator APIs powered by LLM. The demonstration is devided into two sections: \n", + "This notebook gives you a hands-on preview of semantic operator APIs powered by LLMs. You can open this notebook on Google Colab [here](https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/experimental/semantic_operators.ipynb). \n", "\n", - "The first section introduces the API syntax with some simple examples. We aim to get you familiar with how BigFrames semantic operators work. \n", - "\n", - "The second section talks about applying semantic operators on real-world large datasets. The examples are designed to benchmark the performance of the operators, and to (maybe) spark some ideas for your next application scenarios.\n", - "\n", - "You can open this notebook on Google Colab [here](https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/experimental/semantic_operators.ipynb).\n", - "\n", - "Without further ado, let's get started." + "The notebook has two sections. The first section introduces the API syntax through examples, to familiarize you with how semantic operators work. 
The second section applies semantic operators to a large real-world dataset. You will also find some performance statistics there." ] }, { @@ -56,7 +50,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, let's import BigFrames packages." + "First, import the BigQuery DataFrames modules." ] }, { @@ -91,7 +85,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Turn on the semantic operator experiment. You will see a warning sign saying that these operators are still under experiments. This is a necessary step. Otherwise you will see `NotImplementedError` when calling these operators." + "Turn on the semantic operator experiment. You will see a warning saying that these operators are still experimental. If you don't turn on the experiment before using the operators, you will get a `NotImplementedError`." ] }, { @@ -132,7 +126,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's also create some LLM instances for these operators. They will be passed in as paramters in each method call." + "Create LLM instances. They will be passed as parameters to each semantic operator." ] }, { @@ -150,8 +144,8 @@ } ], "source": [ - "import bigframes.ml.llm as llm\n", - "gemini_model = llm.GeminiTextGenerator(model_name=llm._GEMINI_1P5_FLASH_001_ENDPOINT)\n", + "from bigframes.ml import llm\n", + "gemini_model = llm.GeminiTextGenerator(model_name=\"gemini-1.5-flash-001\")\n", "text_embedding_model = llm.TextEmbeddingGenerator(model_name=\"text-embedding-005\")" ] }, @@ -166,7 +160,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this section we will go through the semantic operator APIs with small examples." + "You will learn about each semantic operator by trying some examples." ] }, { @@ -180,7 +174,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Semantic filtering allows you to filter your dataframe based on the instruction (i.e. prompt) you provided. 
Let's first create a small dataframe:" + "Semantic filtering allows you to filter your dataframe based on the instruction (i.e. prompt) you provide. \n", + "\n", + "First, create a dataframe:" ] }, { @@ -257,12 +253,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, let's filter this dataframe by keeping only the rows where the value in `city` column is the capital of the value in `country` column. The column references could be \"escaped\" by using a pair of braces in your instruction. In this example, our instruction should be like this:\n", + "Now, filter this dataframe by keeping only the rows where the value in the `city` column is the capital of the value in the `country` column. Column references can be \"escaped\" with a pair of braces in your instruction. In this example, your instruction should look like this:\n", "```\n", "The {city} is the capital of the {country}.\n", "```\n", "\n", - "Note that this is not a Python f-string, so you shouldn't prefix your instruction with an `f`. Let's give it a try:" + "Note that this is not a Python f-string, so you shouldn't prefix your instruction with an `f`." ] }, { @@ -348,7 +344,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Semantic mapping allows to you to combine values from multiple columns into a single output based your instruction. To demonstrate this, let's create an example dataframe:" + "Semantic mapping allows you to combine values from multiple columns into a single output based on your instruction. \n", + "\n", + "Here is an example:" ] }, { @@ -428,7 +426,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, let's ask LLM what kind of food can be made from the two ingredients in each row. The column reference syntax in your instruction stays the same. In addition, you need to specify the column name by setting the `output_column` parameter to hold the mapping results." + "Now, ask the LLM what kind of food can be made from the two ingredients in each row. 
The column reference syntax in your instruction stays the same. In addition, you need to specify the column name by setting the `output_column` parameter to hold the mapping results." ] }, { @@ -515,13 +513,6 @@ "df.semantics.map(\"What is the food made from {ingredient_1} and {ingredient_2}? One word only.\", output_column=\"food\", model=gemini_model)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The mechanism behind semantic mapping is very similar with semantic filtering. The one major difference: instead of asking LLM to reply true or false to each row, the operator lets LLM reply free-form strings and attach them as a new column to the dataframe." - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -533,7 +524,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Semantic joining can join two dataframes based on the instruction you provided. First, let's prepare two dataframes." + "Semantic joining can join two dataframes based on the instruction you provided. \n", + "\n", + "First, you prepare two dataframes:" ] }, { @@ -550,7 +543,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We want to join the `cities` with `continents` to form a new dataframe such that, in each row the city from the `cities` data frame is in the continent from the `continents` dataframe. We could re-use the aforementioned column reference syntax:" + "You want to join the `cities` with `continents` to form a new dataframe such that, in each row the city from the `cities` data frame is in the continent from the `continents` dataframe. You could re-use the aforementioned column reference syntax:" ] }, { @@ -640,7 +633,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "!! **Important:** Semantic join can trigger probihitively expensitve operations! This operation first cross joins two dataframes, then invokes semantic filter on each row. 
That means if you have two dataframes of sizes `M` and `N`, the total amount of queries sent to the LLM is on the scale of `M * N`. Therefore, we have added a parameter `max_rows`, a threshold that guards against unexpected expensive calls. With this parameter, the operator first calculates the size of your cross-joined data, and compares it with the threshold. If the size exceeds your threshold, the fuction will abort early with a `ValueError`. You can manually set the value of `max_rows` to raise or lower the threshold." + "!! **Important:** Semantic join can trigger prohibitively expensive operations! This operation first cross joins two dataframes, then invokes semantic filter on each row. That means if you have two dataframes of sizes `M` and `N`, the total number of queries sent to the LLM is on the scale of `M * N`. Therefore, our team has added a parameter `max_rows`, a threshold that guards against unexpectedly expensive calls. With this parameter, the operator first calculates the size of your cross-joined data, and compares it with the threshold. If the size exceeds your threshold, the function will abort early with a `ValueError`. You can manually set the value of `max_rows` to raise or lower the threshold." ] }, { @@ -654,9 +647,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We use a self-join example to demonstrate a special case: what happens when the joining columns exist in both data frames? It turns out that you need to provide extra information in your column references: by attaching \"left.\" and \"right.\" prefixes to your column names. \n", "\n", - "Let's create an example data frame:" + "This self-join example demonstrates a special case: what happens when the joining columns exist in both data frames? You need to provide extra information in your column references by attaching \"left.\" and \"right.\" prefixes to your column names. 
\n", "\n", - "Let's create an example data frame:" + "Create an example data frame:" ] }, { @@ -672,7 +665,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We want to compare the weights of these animals, and output all the pairs where the animal on the left is heavier than the animal on the right. In this case, we use `left.animal` and `right.animal` to differentiate the data sources:" + "You want to compare the weights of these animals, and output all the pairs where the animal on the left is heavier than the animal on the right. In this case, you use `left.animal` and `right.animal` to differentiate the data sources:" ] }, { @@ -781,7 +774,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Semantic aggregation merges all the values in a column into one. At this moment you can only aggregate a single column in each oeprator call. Let's create an example:" + "Semantic aggregation merges all the values in a column into one. At this moment you can only aggregate a single column in each operator call.\n", + "\n", + "Here is an example:" ] }, { @@ -884,7 +879,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's ask LLM to find the oldest movie:" + "Ask the LLM to find the oldest movie:" ] }, { @@ -922,7 +917,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Instead of going through each row one by one, this operator batches multiple rows in a single request towards LLM. It then aggregates all the batched results with the same technique, until there is only one value left. You could set the batch size with `max_agg_rows` parameter, which defaults to 10." + "Instead of going through each row one by one, this operator first batches rows to get many aggregation results. It then repeatedly batches those results for aggregation, until there is only one value left. You can set the batch size with the `max_agg_rows` parameter, which defaults to 10." 
] }, { @@ -952,7 +947,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We want to find the top two most popular pets:" + "You want to find the top two most popular pets:" ] }, { @@ -1027,7 +1022,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Under the hood, the semantic top K operator performs pair-wise comparisons with LLM. It also adopts the quick select algorithm, which means the top K results are returns in the order of their indices instead of their ranks." + "Under the hood, the semantic top K operator performs pair-wise comparisons with LLM. The top K results are returned in the order of their indices instead of their ranks." ] }, { @@ -1041,7 +1036,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Semantic search searches the most similar values to your qury within a single column. Here is an example:" + "Semantic search searches the most similar values to your query within a single column. Here is an example:" ] }, { @@ -1124,7 +1119,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We want to get the top 2 creatures that are most similar to \"monkey\":" + "You want to get the top 2 creatures that are most similar to \"monkey\":" ] }, { @@ -1206,7 +1201,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Notice that we are using a text embedding model this time. This model generates embedding vectors for both your query as well as the values in the search space. The operator then uses BigQuery's built-in VECTOR_SEARCH function to find the nearest neighbors of your query.\n", + "Notice that you are using a text embedding model this time. This model generates embedding vectors for both your query as well as the values in the search space. The operator then uses BigQuery's built-in VECTOR_SEARCH function to find the nearest neighbors of your query.\n", "\n", "In addition, `score_column` is an optional parameter for storing the distances between the results and your query. 
If not set, the score column won't be attached to the result." ] @@ -1222,7 +1217,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When you have multiple queries to search in the same value space, you could use similarity join to simplify your call. For example:" + "When you want to perform multiple similarity queries in the same value space, you could use similarity join to simplify your call. For example:" ] }, { @@ -1239,7 +1234,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this example, we want to pick the most related animal from `df2` for each value in `df1`, and this is how it's done:" + "In this example, you want to pick the most related animal from `df2` for each value in `df1`." ] }, { @@ -1336,14 +1331,14 @@ } ], "source": [ - "df1.semantics.sim_join(df2, left_on='animal', right_on='animal', top_k=1, model= text_embedding_model, score_column='distance')" + "df1.semantics.sim_join(df2, left_on='animal', right_on='animal', top_k=1, model=text_embedding_model, score_column='distance')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "!! **Important** Like semantic join, this operator can also be very expensive. To guard against unexpected processing of large dataset, use the `max_rows` parameter to provide a threshold. " + "!! **Important** Like semantic join, this operator can also be very expensive. To guard against unexpected processing of large datasets, use the `max_rows` parameter to specify a threshold. 
" ] }, { @@ -1373,7 +1368,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We want to cluster these products into 3 groups, and this is how:" + "You want to cluster these products into 3 groups:" ] }, { @@ -1420,17 +1415,17 @@ " \n", " 0\n", " Smartphone\n", - " 1\n", + " 2\n", " \n", " \n", " 1\n", " Laptop\n", - " 1\n", + " 2\n", " \n", " \n", " 2\n", " Coffee Maker\n", - " 3\n", + " 2\n", " \n", " \n", " 3\n", @@ -1449,9 +1444,9 @@ ], "text/plain": [ " Product Cluster ID\n", - "0 Smartphone 1\n", - "1 Laptop 1\n", - "2 Coffee Maker 3\n", + "0 Smartphone 2\n", + "1 Laptop 2\n", + "2 Coffee Maker 2\n", "3 T-shirt 2\n", "4 Jeans 2\n", "\n", @@ -1471,7 +1466,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This operator uses the the embedding model to generate vectors for each value, and then uses KMeans algorithm to group them." + "This operator uses the embedding model to generate vectors for each value, and then applies the KMeans algorithm for clustering." ] }, { @@ -1485,7 +1480,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this section we will use BigQuery's public data of hacker news to perform some heavy work. First, let's load 3K rows from the table:" + "In this section, you will use BigQuery's public Hacker News dataset to perform some heavy work. We recommend that you read the code without executing it, to save time and money. 
The execution results are attached after each cell for your reference.\n", + "\n", + "First, load 3K rows from the table:" ] }, { @@ -1853,7 +1850,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Then, let's keep only the rows that have text content:" + "Then, keep only the rows that have text content:" ] }, { @@ -1864,7 +1861,7 @@ { "data": { "text/plain": [ - "2554" + "2555" ] }, "execution_count": 26, @@ -1881,7 +1878,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's calculate the average text length in all the rows:" + "You can get an idea of the input token length by calculating the average string length." ] }, { @@ -1892,7 +1889,7 @@ { "data": { "text/plain": [ - "390.6558339859039" + "390.61878669276047" ] }, "execution_count": 27, @@ -1908,7 +1905,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now it's LLM's turn. Let's keep the rows in which the text is talking about iPhone. This will take several minutes to finish." + "Now it's LLM's turn. You want to keep only the rows whose texts are talking about iPhone. This will take several minutes to finish." ] }, { @@ -1982,7 +1979,7 @@ " comment\n", " \n", " \n", - " 1513\n", + " 1512\n", " <NA>\n", " Why would this take a week? i(phone)OS was ori...\n", " TheOtherHobbes\n", @@ -1991,7 +1988,7 @@ " comment\n", " \n", " \n", - " 1560\n", + " 1559\n", " <NA>\n", " &gt;or because Apple drama brings many clicks?...\n", " weberer\n", @@ -2009,15 +2006,15 @@ "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", "419 Well last time I got angry down votes for sayi... drieddust \n", "812 New iPhone should be announced on September. L... meerita \n", - "1513 Why would this take a week? i(phone)OS was ori... TheOtherHobbes \n", - "1560 >or because Apple drama brings many clicks?... weberer \n", + "1512 Why would this take a week? i(phone)OS was ori... TheOtherHobbes \n", + "1559 >or because Apple drama brings many clicks?... 
weberer \n", "\n", " score timestamp type \n", "9 2023-04-21 16:45:13+00:00 comment \n", "419 2021-01-11 19:27:27+00:00 comment \n", "812 2019-07-30 20:54:42+00:00 comment \n", - "1513 2021-06-08 09:25:24+00:00 comment \n", - "1560 2022-09-05 13:16:02+00:00 comment \n", + "1512 2021-06-08 09:25:24+00:00 comment \n", + "1559 2022-09-05 13:16:02+00:00 comment \n", "\n", "[5 rows x 6 columns]" ] @@ -2036,7 +2033,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The performance of the semantic operators depends on the length of your input as well as your quota. Here are my benchmarks for running the previous operation over data of different sizes.\n", + "The performance of the semantic operators depends on the length of your input as well as your quota. Here are our benchmarks for running the previous operation over data of different sizes.\n", "\n", "* 800 Rows -> 1m 21.3s\n", "* 2550 Rows -> 5m 9s\n", @@ -2049,7 +2046,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now let's use LLM to summarize the sentiments towards iPhone:" + "Now, use LLM to summarize the sentiments towards iPhone:" ] }, { @@ -2127,7 +2124,7 @@ " Excited anticipation.\n", " \n", " \n", - " 1513\n", + " 1512\n", " <NA>\n", " Why would this take a week? i(phone)OS was ori...\n", " TheOtherHobbes\n", @@ -2137,7 +2134,7 @@ " Frustrated, critical, obvious.\n", " \n", " \n", - " 1560\n", + " 1559\n", " <NA>\n", " &gt;or because Apple drama brings many clicks?...\n", " weberer\n", @@ -2156,15 +2153,15 @@ "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", "419 Well last time I got angry down votes for sayi... drieddust \n", "812 New iPhone should be announced on September. L... meerita \n", - "1513 Why would this take a week? i(phone)OS was ori... TheOtherHobbes \n", - "1560 >or because Apple drama brings many clicks?... weberer \n", + "1512 Why would this take a week? i(phone)OS was ori... 
TheOtherHobbes \n", + "1559 >or because Apple drama brings many clicks?... weberer \n", "\n", " score timestamp type \\\n", "9 2023-04-21 16:45:13+00:00 comment \n", "419 2021-01-11 19:27:27+00:00 comment \n", "812 2019-07-30 20:54:42+00:00 comment \n", - "1513 2021-06-08 09:25:24+00:00 comment \n", - "1560 2022-09-05 13:16:02+00:00 comment \n", + "1512 2021-06-08 09:25:24+00:00 comment \n", + "1559 2022-09-05 13:16:02+00:00 comment \n", "\n", " sentiment \n", "9 Frustrated, but hopeful. \n", @@ -2173,9 +2170,9 @@ " \n", "812 Excited anticipation. \n", " \n", - "1513 Frustrated, critical, obvious. \n", + "1512 Frustrated, critical, obvious. \n", " \n", - "1560 Negative, clickbait, Apple. \n", + "1559 Negative, clickbait, Apple. \n", " \n", "\n", "[5 rows x 7 columns]" @@ -2194,7 +2191,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here is another example: we count the number of rows whose authors have animals in their names." + "Here is another example: count the number of rows whose authors have animals in their names." ] }, { @@ -2206,7 +2203,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py:3577: UserWarning: Reading cached table from 2024-12-27 01:00:10.095976+00:00 to avoid incompatibilies with previous reads of this table. To read the latest version, set `use_cache=False` or close the current session with Session.close() or bigframes.pandas.close_session().\n", + "/usr/local/google/home/sycai/src/python-bigquery-dataframes/venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py:3577: UserWarning: Reading cached table from 2024-12-27 21:39:10.129973+00:00 to avoid incompatibilies with previous reads of this table. 
To read the latest version, set `use_cache=False` or close the current session with Session.close() or bigframes.pandas.close_session().\n", " exec(code_obj, self.user_global_ns, self.user_ns)\n" ] }, @@ -2938,7 +2935,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here are my performance numbers:\n", + "Here are our performance numbers:\n", "* 3000 rows -> 6m 9.2s\n", "* 10000 rows -> 26m 42.4s" ]