From 00229dce4165d2f636016a7a3fb0523e622cbec0 Mon Sep 17 00:00:00 2001 From: Cody Date: Thu, 26 Sep 2024 12:09:35 -0400 Subject: [PATCH] update --- .../index/execute-results/html.json | 8 ++++---- docs/posts/walking-talking-cube/index.qmd | 12 +++++++++--- 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/docs/_freeze/posts/walking-talking-cube/index/execute-results/html.json b/docs/_freeze/posts/walking-talking-cube/index/execute-results/html.json index 5b919921eda3..5fa90f881ebb 100644 --- a/docs/_freeze/posts/walking-talking-cube/index/execute-results/html.json +++ b/docs/_freeze/posts/walking-talking-cube/index/execute-results/html.json @@ -1,15 +1,15 @@ { - "hash": "65b2781aabc2165989aebafdc5e7fec4", + "hash": "2f6fcc0eda9cbc7e3e1763d49288b886", "result": { "engine": "jupyter", - "markdown": "---\ntitle: \"Taking a random cube for a walk and making it talk\"\nauthor: \"Cody Peterson\"\ndate: \"2024-09-24\"\nimage: thumbnail.png\ncategories:\n - blog\n - duckdb\n - udfs\n---\n\n\n***Synthetic data with Ibis, DuckDB, Python UDFs, and Faker.***\n\n## A random cube\n\nWe'll generate a random cube of data with Ibis (default DuckDB backend) and\nvisualize it as a 3D line plot:\n\n::: {#826f56f9 .cell execution_count=2}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show me the code!\"}\nimport ibis # <1>\nimport ibis.selectors as s\nimport plotly.express as px # <1>\n\nibis.options.interactive = True # <2>\nibis.options.repr.interactive.max_rows = 5 # <2>\n\ncon = ibis.connect(\"duckdb://synthetic.ddb\") # <3>\n\nif \"source\" in con.list_tables():\n t = con.table(\"source\") # <4>\nelse:\n lookback = ibis.interval(days=1) # <5>\n step = ibis.interval(seconds=1) # <5>\n\n t = (\n (\n ibis.range( # <6>\n ibis.now() - lookback,\n ibis.now(),\n step=step,\n ) # <6>\n .unnest() # <7>\n .name(\"timestamp\") # <8>\n .as_table() # <9>\n )\n .mutate(\n index=(ibis.row_number().over(order_by=\"timestamp\")), # <10>\n **{col: 2 * (ibis.random() - 0.5) 
for col in [\"a\", \"b\", \"c\"]}, # <11>\n )\n .mutate(color=ibis._[\"index\"].histogram(nbins=8)) # <12>\n .drop(\"index\") # <13>\n .relocate(\"timestamp\", \"color\") # <14>\n .order_by(\"timestamp\") # <15>\n )\n\n t = con.create_table(\"source\", t.to_pyarrow()) # <16>\n\nc = px.line_3d( # <17>\n t,\n x=\"a\",\n y=\"b\",\n z=\"c\",\n color=\"color\",\n hover_data=[\"timestamp\"],\n) # <17>\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n1. Import the necessary libraries.\n2. Enable interactive mode for Ibis.\n3. Connect to an on-disk DuckDB database.\n4. Load the table if it already exists.\n5. Define the time range and step for the data.\n6. Create the array of timestamps.\n7. Unnest the array to a column.\n8. Name the column \"timestamp\".\n9. Convert the column into a table.\n10. Create a monotonically increasing index column.\n11. Create three columns of random numbers.\n12. Create a color column based on the index (help visualize the time series).\n13. Drop the index column.\n14. Rearrange the columns.\n15. Order the table by timestamp.\n16. Store the table in the on-disk database.\n17. Create a 3D line plot of the data.\n\n## Walking\n\nWe have a random cube of data:\n\n::: {#a751080f .cell execution_count=3}\n``` {.python .cell-code}\nt\n```\n\n::: {.cell-output .cell-output-display execution_count=22}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓\n┃ timestamp                color  a          b          c         ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩\n│ timestamp(6)int64float64float64float64   │\n├─────────────────────────┼───────┼───────────┼───────────┼───────────┤\n│ 2024-09-15 23:17:45.7430-0.1506630.458096-0.206626 │\n│ 2024-09-15 23:17:46.74300.362739-0.7073630.895386 │\n│ 2024-09-15 23:17:47.74300.2684200.062388-0.991480 │\n│ 2024-09-15 23:17:48.7430-0.752077-0.8998760.604363 │\n│ 2024-09-15 23:17:49.74300.9502230.0495680.373329 │\n│  │\n└─────────────────────────┴───────┴───────────┴───────────┴───────────┘\n
\n```\n:::\n:::\n\n\nBut we need to make it [walk](https://en.wikipedia.org/wiki/Random_walk). We'll\nuse a window function to calculate the cumulative sum of each column:\n\n::: {.panel-tabset}\n\n## Without column selectors\n\n::: {#b577b412 .cell execution_count=4}\n``` {.python .cell-code}\nwindow = ibis.window(order_by=\"timestamp\", preceding=None, following=0)\nwalked = t.select(\n \"timestamp\",\n \"color\",\n a=t[\"a\"].sum().over(window),\n b=t[\"b\"].sum().over(window),\n c=t[\"c\"].sum().over(window),\n).order_by(\"timestamp\")\nwalked\n```\n\n::: {.cell-output .cell-output-display execution_count=23}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓\n┃ timestamp                color  a          b          c         ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩\n│ timestamp(6)int64float64float64float64   │\n├─────────────────────────┼───────┼───────────┼───────────┼───────────┤\n│ 2024-09-15 23:17:45.7430-0.1506630.458096-0.206626 │\n│ 2024-09-15 23:17:46.74300.212076-0.2492670.688759 │\n│ 2024-09-15 23:17:47.74300.480496-0.186879-0.302721 │\n│ 2024-09-15 23:17:48.7430-0.271581-1.0867550.301642 │\n│ 2024-09-15 23:17:49.74300.678642-1.0371870.674971 │\n│  │\n└─────────────────────────┴───────┴───────────┴───────────┴───────────┘\n
\n```\n:::\n:::\n\n\n## With column selectors\n\n::: {#2a10c7ab .cell execution_count=5}\n``` {.python .cell-code}\nwindow = ibis.window(order_by=\"timestamp\", preceding=None, following=0)\nwalked = t.select(\n \"timestamp\",\n \"color\",\n s.across(\n s.c(\"a\", \"b\", \"c\"), # <1>\n ibis._.sum().over(window), # <2>\n ),\n).order_by(\"timestamp\")\nwalked\n```\n\n::: {.cell-output .cell-output-display execution_count=24}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓\n┃ timestamp                color  a          b          c         ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩\n│ timestamp(6)int64float64float64float64   │\n├─────────────────────────┼───────┼───────────┼───────────┼───────────┤\n│ 2024-09-15 23:17:45.7430-0.1506630.458096-0.206626 │\n│ 2024-09-15 23:17:46.74300.212076-0.2492670.688759 │\n│ 2024-09-15 23:17:47.74300.480496-0.186879-0.302721 │\n│ 2024-09-15 23:17:48.7430-0.271581-1.0867550.301642 │\n│ 2024-09-15 23:17:49.74300.678642-1.0371870.674971 │\n│  │\n└─────────────────────────┴───────┴───────────┴───────────┴───────────┘\n
\n```\n:::\n:::\n\n\n1. Alternatively, you can use `s.of_type(float)` to select all float columns.\n2. Use the `ibis._` selector to reference a deferred column expression.\n\n:::\n\nWhile the first few rows may look similar to the cube, the 3D line plot does\nnot:\n\n::: {#de9f2c9d .cell execution_count=6}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show me the code!\"}\nc = px.line_3d(\n walked,\n x=\"a\",\n y=\"b\",\n z=\"c\",\n color=\"color\",\n hover_data=[\"timestamp\"],\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n## Talking\n\nWe've made our random cube and we've made it walk, but now we want to make it\ntalk. At this point, you might be questioning the utility of this blog post --\nwhat are we doing and why? The purpose is to demonstrate generating synthetic\ndata that can look realistic. We achieve this by building in randomness (e.g. a\nrandom walk can be used to simulate stock prices) and also by using that\nrandomness to inform the generation of non-numeric synthetic data (e.g. the\nticker symbol of a stock).\n\n### Faking it\n\nLet's demonstrate this concept by pretending we have an application where users\ncan review a location they're at. The user's name, comment, location, and device\ninfo are stored in our database for their review at a given timestamp.\n\n[Faker](https://github.com/joke2k/faker) is a commonly used Python library for\ngenerating fake data. We'll use it to generate fake names, comments, locations,\nand device info for our reviews:\n\n::: {#fdf4b09f .cell execution_count=7}\n``` {.python .cell-code}\nfrom faker import Faker\n\nfake = Faker()\n\nres = (\n fake.name(),\n fake.sentence(),\n fake.location_on_land(),\n fake.user_agent(),\n fake.ipv4(),\n)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=26}\n```\n('Sandra Brown',\n 'By present prepare order apply decide discussion.',\n ('41.84364', '-87.71255', 'South Lawndale', 'US', 'America/Chicago'),\n 'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_6_0 rv:6.0; uk-UA) AppleWebKit/532.50.5 (KHTML, like Gecko) Version/4.1 Safari/532.50.5',\n '138.226.52.172')\n```\n:::\n:::\n\n\nWe can use our random numbers to influence the fake data generation in a Python\nUDF:\n\n\n\n::: {#c8c7ae75 .cell execution_count=9}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show me the code!\"}\nimport ibis.expr.datatypes as dt\n\nfrom datetime import datetime, timedelta\n\nibis.options.repr.interactive.max_length = 5\n\nrecord_schema = dt.Struct(\n {\n \"timestamp\": 
datetime,\n \"name\": str,\n \"comment\": str,\n \"location\": list[str],\n \"device\": dt.Struct(\n {\n \"browser\": str,\n \"ip\": str,\n }\n ),\n }\n)\n\n\n@ibis.udf.scalar.python\ndef faked_batch(\n timestamp: datetime,\n a: float,\n b: float,\n c: float,\n batch_size: int = 8,\n) -> dt.Array(record_schema):\n \"\"\"\n Generate records of fake data.\n \"\"\"\n value = (a + b + c) / 3\n\n res = [\n {\n \"timestamp\": timestamp + timedelta(seconds=0.1 * i),\n \"name\": fake.name() if value >= 0.5 else fake.first_name(),\n \"comment\": fake.sentence(),\n \"location\": fake.location_on_land(),\n \"device\": {\n \"browser\": fake.user_agent(),\n \"ip\": fake.ipv4() if value >= 0 else fake.ipv6(),\n },\n }\n for i in range(batch_size)\n ]\n\n return res\n\n\nif \"faked\" in con.list_tables():\n faked = con.table(\"faked\")\nelse:\n faked = (\n t.mutate(\n faked=faked_batch(t[\"timestamp\"], t[\"a\"], t[\"b\"], t[\"c\"]),\n )\n .select(\n \"a\",\n \"b\",\n \"c\",\n ibis._[\"faked\"].unnest(),\n )\n .unpack(\"faked\")\n .drop(\"a\", \"b\", \"c\")\n )\n\n faked = con.create_table(\"faked\", faked)\n\nfaked\n```\n\n::: {.cell-output .cell-output-display execution_count=28}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ timestamp                name      comment                                            location                                                         device                                                                                                                     ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp(6)stringstringarray<string>struct<browser: string, ip: string>                                                                                        │\n├─────────────────────────┼──────────┼───────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤\n│ 2024-09-15 23:17:45.743Logan   Church you use page control value.               ['23.04419', '-82.00919', 'Jaruco', 'CU', 'America/Havana']{'browser': 'Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 6.1; Trident/5.0)', 'ip': '159.41.148.171'}                     │\n│ 2024-09-15 23:17:45.843VictoriaCourse provide too look loss.                    ['57.30185', '39.85331', 'Gavrilov-Yam', 'RU', 'Europe/Moscow']{'browser': 'Mozilla/5.0 (iPad; CPU iPad OS 17_1_1 like Mac OS X) AppleWebKit/535.2 (KHTML, l'+54, 'ip': '139.151.36.97'}  │\n│ 2024-09-15 23:17:45.943Mike    Small now air similar ground even able finally.  
['51.20219', '7.36027', 'Radevormwald', 'DE', 'Europe/Berlin']{'browser': 'Mozilla/5.0 (iPod; U; CPU iPhone OS 3_1 like Mac OS X; gd-GB) AppleWebKit/533.32'+66, 'ip': '221.105.82.36'}  │\n│ 2024-09-15 23:17:46.043Kathryn Career watch suggest.                            ['33.45122', '-86.99666', 'Hueytown', 'US', 'America/Chicago']{'browser': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows CE; Trident/4.0)', 'ip': '183.74.249.25'}                          │\n│ 2024-09-15 23:17:46.143CristinaAddress table poor culture approach around learn.['27.9247', '78.40102', 'Chharra', 'IN', 'Asia/Kolkata']{'browser': 'Mozilla/5.0 (iPad; CPU iPad OS 6_1_6 like Mac OS X) AppleWebKit/532.1 (KHTML, li'+53, 'ip': '139.150.116.75'} │\n│                                                                                                                           │\n└─────────────────────────┴──────────┴───────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nAnd now we have a \"realistic\" dataset of fake reviews matching our desired\nschema. You can adjust this to match the schema and expected distributions of\nyour own data and scale it up as needed.\n\n### GenAI/LLMs\n\nThe names and locations from Faker are bland and unrealistic. The comments are\nnonsensical. ~~And most importantly, we haven't filled our quota for blogs\nmentioning AI.~~ You could [use language models in Ibis UDFs to generate more\nrealistic synthetic data](../lms-for-data/index.qmd). In a future blog post, we\nwill use \"open source\" language models to do this locally for free.\n\n## Next steps\n\nIf you've followed along, you have a `synthetic.ddb` file with a couple tables:\n\n::: {#48f279e9 .cell execution_count=10}\n``` {.python .cell-code}\ncon.list_tables()\n```\n\n::: {.cell-output .cell-output-display execution_count=29}\n```\n['faked', 'source']\n```\n:::\n:::\n\n\nWe can estimate the size of data generated:\n\n::: {#14adca58 .cell execution_count=11}\n``` {.python .cell-code}\nimport os\n\nsize_in_mbs = os.path.getsize(\"synthetic.ddb\") / (1024 * 1024)\nprint(f\"synthetic.ddb: {size_in_mbs:.2f} MBs\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nsynthetic.ddb: 56.76 MBs\n```\n:::\n:::\n\n\nYou can build from here to generate realistic synthetic data at any scale for\nany use case.\n\n", + "markdown": "---\ntitle: \"Taking a random cube for a walk and making it talk\"\nauthor: \"Cody Peterson\"\ndate: \"2024-09-26\"\nimage: thumbnail.png\ncategories:\n - blog\n - duckdb\n - udfs\n---\n\n***Synthetic data with Ibis, DuckDB, Python UDFs, and Faker.***\n\nTo follow along, install the required libraries:\n\n```bash\npip install 'ibis-framework[duckdb]' faker plotly\n```\n\n## A random cube\n\nWe'll generate a random cube of data with Ibis (default DuckDB backend) and\nvisualize it as a 3D line plot:\n\n::: {#0daf46ce .cell execution_count=1}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show me the 
code!\"}\nimport ibis # <1>\nimport ibis.selectors as s\nimport plotly.express as px # <1>\n\nibis.options.interactive = True # <2>\nibis.options.repr.interactive.max_rows = 5 # <2>\n\ncon = ibis.connect(\"duckdb://synthetic.ddb\") # <3>\n\nif \"source\" in con.list_tables():\n t = con.table(\"source\") # <4>\nelse:\n lookback = ibis.interval(days=1) # <5>\n step = ibis.interval(seconds=1) # <5>\n\n t = (\n (\n ibis.range( # <6>\n ibis.now() - lookback,\n ibis.now(),\n step=step,\n ) # <6>\n .unnest() # <7>\n .name(\"timestamp\") # <8>\n .as_table() # <9>\n )\n .mutate(\n index=(ibis.row_number().over(order_by=\"timestamp\")), # <10>\n **{col: 2 * (ibis.random() - 0.5) for col in [\"a\", \"b\", \"c\"]}, # <11>\n )\n .mutate(color=ibis._[\"index\"].histogram(nbins=8)) # <12>\n .drop(\"index\") # <13>\n .relocate(\"timestamp\", \"color\") # <14>\n .order_by(\"timestamp\") # <15>\n )\n\n t = con.create_table(\"source\", t.to_pyarrow()) # <16>\n\nc = px.line_3d( # <17>\n t,\n x=\"a\",\n y=\"b\",\n z=\"c\",\n color=\"color\",\n hover_data=[\"timestamp\"],\n) # <17>\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n1. Import the necessary libraries.\n2. Enable interactive mode for Ibis.\n3. Connect to an on-disk DuckDB database.\n4. Load the table if it already exists.\n5. Define the time range and step for the data.\n6. Create the array of timestamps.\n7. Unnest the array to a column.\n8. Name the column \"timestamp\".\n9. Convert the column into a table.\n10. Create a monotonically increasing index column.\n11. Create three columns of random numbers.\n12. Create a color column based on the index (help visualize the time series).\n13. Drop the index column.\n14. Rearrange the columns.\n15. Order the table by timestamp.\n16. Store the table in the on-disk database.\n17. Create a 3D line plot of the data.\n\n## Walking\n\nWe have a random cube of data:\n\n::: {#921b1a6e .cell execution_count=2}\n``` {.python .cell-code}\nt\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓\n┃ timestamp                color  a          b          c         ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩\n│ timestamp(6)int64float64float64float64   │\n├─────────────────────────┼───────┼───────────┼───────────┼───────────┤\n│ 2024-07-23 23:35:06.0100-0.837407-0.6817160.692806 │\n│ 2024-07-23 23:35:07.01000.307479-0.923701-0.479673 │\n│ 2024-07-23 23:35:08.01000.1361950.6045830.078360 │\n│ 2024-07-23 23:35:09.0100-0.2618670.9252870.339049 │\n│ 2024-07-23 23:35:10.01000.8136230.255287-0.079172 │\n│  │\n└─────────────────────────┴───────┴───────────┴───────────┴───────────┘\n
\n```\n:::\n:::\n\n\nBut we need to make it [walk](https://en.wikipedia.org/wiki/Random_walk). We'll\nuse a window function to calculate the cumulative sum of each column:\n\n::: {.panel-tabset}\n\n## Without column selectors\n\n::: {#8e162bae .cell execution_count=3}\n``` {.python .cell-code}\nwindow = ibis.window(order_by=\"timestamp\", preceding=None, following=0)\nwalked = t.select(\n \"timestamp\",\n \"color\",\n a=t[\"a\"].sum().over(window),\n b=t[\"b\"].sum().over(window),\n c=t[\"c\"].sum().over(window),\n).order_by(\"timestamp\")\nwalked\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ timestamp                color  a          b          c        ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ timestamp(6)int64float64float64float64  │\n├─────────────────────────┼───────┼───────────┼───────────┼──────────┤\n│ 2024-07-23 23:35:06.0100-0.837407-0.6817160.692806 │\n│ 2024-07-23 23:35:07.0100-0.529928-1.6054170.213133 │\n│ 2024-07-23 23:35:08.0100-0.393733-1.0008340.291492 │\n│ 2024-07-23 23:35:09.0100-0.655600-0.0755470.630542 │\n│ 2024-07-23 23:35:10.01000.1580240.1797400.551369 │\n│  │\n└─────────────────────────┴───────┴───────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n## With column selectors\n\n::: {#b7a45e7e .cell execution_count=4}\n``` {.python .cell-code}\nwindow = ibis.window(order_by=\"timestamp\", preceding=None, following=0)\nwalked = t.select(\n \"timestamp\",\n \"color\",\n s.across(\n s.c(\"a\", \"b\", \"c\"), # <1>\n ibis._.sum().over(window), # <2>\n ),\n).order_by(\"timestamp\")\nwalked\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ timestamp                color  a          b          c        ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ timestamp(6)int64float64float64float64  │\n├─────────────────────────┼───────┼───────────┼───────────┼──────────┤\n│ 2024-07-23 23:35:06.0100-0.837407-0.6817160.692806 │\n│ 2024-07-23 23:35:07.0100-0.529928-1.6054170.213133 │\n│ 2024-07-23 23:35:08.0100-0.393733-1.0008340.291492 │\n│ 2024-07-23 23:35:09.0100-0.655600-0.0755470.630542 │\n│ 2024-07-23 23:35:10.01000.1580240.1797400.551369 │\n│  │\n└─────────────────────────┴───────┴───────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n1. Alternatively, you can use `s.of_type(float)` to select all float columns.\n2. Use the `ibis._` selector to reference a deferred column expression.\n\n:::\n\nWhile the first few rows may look similar to the cube, the 3D line plot does\nnot:\n\n::: {#09c4a94f .cell execution_count=5}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show me the code!\"}\nc = px.line_3d(\n walked,\n x=\"a\",\n y=\"b\",\n z=\"c\",\n color=\"color\",\n hover_data=[\"timestamp\"],\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n## Talking\n\nWe've made our random cube and we've made it walk, but now we want to make it\ntalk. At this point, you might be questioning the utility of this blog post --\nwhat are we doing and why? The purpose is to demonstrate generating synthetic\ndata that can look realistic. We achieve this by building in randomness (e.g. a\nrandom walk can be used to simulate stock prices) and also by using that\nrandomness to inform the generation of non-numeric synthetic data (e.g. the\nticker symbol of a stock).\n\n### Faking it\n\nLet's demonstrate this concept by pretending we have an application where users\ncan review a location they're at. The user's name, comment, location, and device\ninfo are stored in our database for their review at a given timestamp.\n\n[Faker](https://github.com/joke2k/faker) is a commonly used Python library for\ngenerating fake data. We'll use it to generate fake names, comments, locations,\nand device info for our reviews:\n\n::: {#67d9f452 .cell execution_count=6}\n``` {.python .cell-code}\nfrom faker import Faker\n\nfake = Faker()\n\nres = (\n fake.name(),\n fake.sentence(),\n fake.location_on_land(),\n fake.user_agent(),\n fake.ipv4(),\n)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```\n('Robyn Foster',\n 'Employee security there meeting.',\n ('41.75338', '-86.11084', 'Granger', 'US', 'America/Indiana/Indianapolis'),\n 'Mozilla/5.0 (iPod; U; CPU iPhone OS 3_2 like Mac OS X; unm-US) AppleWebKit/533.16.7 (KHTML, like Gecko) Version/3.0.5 Mobile/8B118 Safari/6533.16.7',\n '119.243.96.150')\n```\n:::\n:::\n\n\nWe can use our random numbers to influence the fake data generation in a Python\nUDF:\n\n\n\n::: {#907d2dc8 .cell execution_count=8}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show me the code!\"}\nimport ibis.expr.datatypes as dt\n\nfrom datetime import datetime, timedelta\n\nibis.options.repr.interactive.max_length = 5\n\nrecord_schema = dt.Struct(\n {\n \"timestamp\": 
datetime,\n \"name\": str,\n \"comment\": str,\n \"location\": list[str],\n \"device\": dt.Struct(\n {\n \"browser\": str,\n \"ip\": str,\n }\n ),\n }\n)\n\n\n@ibis.udf.scalar.python\ndef faked_batch(\n timestamp: datetime,\n a: float,\n b: float,\n c: float,\n batch_size: int = 8,\n) -> dt.Array(record_schema):\n \"\"\"\n Generate records of fake data.\n \"\"\"\n value = (a + b + c) / 3\n\n res = [\n {\n \"timestamp\": timestamp + timedelta(seconds=0.1 * i),\n \"name\": fake.name() if value >= 0.5 else fake.first_name(),\n \"comment\": fake.sentence(),\n \"location\": fake.location_on_land(),\n \"device\": {\n \"browser\": fake.user_agent(),\n \"ip\": fake.ipv4() if value >= 0 else fake.ipv6(),\n },\n }\n for i in range(batch_size)\n ]\n\n return res\n\n\nif \"faked\" in con.list_tables():\n faked = con.table(\"faked\")\nelse:\n faked = (\n t.mutate(\n faked=faked_batch(t[\"timestamp\"], t[\"a\"], t[\"b\"], t[\"c\"]),\n )\n .select(\n \"a\",\n \"b\",\n \"c\",\n ibis._[\"faked\"].unnest(),\n )\n .unpack(\"faked\")\n .drop(\"a\", \"b\", \"c\")\n )\n\n faked = con.create_table(\"faked\", faked)\n\nfaked\n```\n\n::: {.cell-output .cell-output-display execution_count=8}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ timestamp                name    comment                                  location                                                          ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp(6)stringstringarray<string>                                                     │\n├─────────────────────────┼────────┼─────────────────────────────────────────┼───────────────────────────────────────────────────────────────────┤\n│ 2024-07-23 23:35:06.010GlendaThan available eye.                    ['13.65805', '102.56365', 'Paoy Paet', 'KH', 'Asia/Phnom_Penh']   │\n│ 2024-07-23 23:35:06.110TrevorAbility commercial admit adult he.     ['56.9083', '60.8019', 'Beryozovsky', 'RU', 'Asia/Yekaterinburg'] │\n│ 2024-07-23 23:35:06.210Janet Sign fact time against energy.         ['25.66795', '85.83636', 'Dalsingh Sarai', 'IN', 'Asia/Kolkata']  │\n│ 2024-07-23 23:35:06.310AngelaHappen Democrat public office whatever.['45.78071', '12.84052', 'Portogruaro', 'IT', 'Europe/Rome']      │\n│ 2024-07-23 23:35:06.410Donna Travel none coach crime within lawyer. ['28.15112', '-82.46148', 'Lutz', 'US', 'America/New_York']       │\n│                                                                  │\n└─────────────────────────┴────────┴─────────────────────────────────────────┴───────────────────────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nAnd now we have a \"realistic\" dataset of fake reviews matching our desired\nschema. You can adjust this to match the schema and expected distributions of\nyour own data and scale it up as needed.\n\n### GenAI/LLMs\n\nThe names and locations from Faker are bland and unrealistic. The comments are\nnonsensical. ~~And most importantly, we haven't filled our quota for blogs\nmentioning AI.~~ You could [use language models in Ibis UDFs to generate more\nrealistic synthetic data](../lms-for-data/index.qmd). We could use \"open source\"\nlanguage models to do this locally for free, an exercise left to the reader.\n\n## Next steps\n\nIf you've followed along, you have a `synthetic.ddb` file with a couple tables:\n\n::: {#80c2fa0f .cell execution_count=9}\n``` {.python .cell-code}\ncon.list_tables()\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```\n['faked', 'source']\n```\n:::\n:::\n\n\nWe can estimate the size of data generated:\n\n::: {#eef14b3e .cell execution_count=10}\n``` {.python .cell-code}\nimport os\n\nsize_in_mbs = os.path.getsize(\"synthetic.ddb\") / (1024 * 1024)\nprint(f\"synthetic.ddb: {size_in_mbs:.2f} MBs\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nsynthetic.ddb: 54.51 MBs\n```\n:::\n:::\n\n\nYou can build from here to generate realistic synthetic data at any scale for\nany use case.\n\n", "supporting": [ - "index_files" + "index_files/figure-html" ], "filters": [], "includes": { "include-in-header": [ - "\n\n\n\n\n" + "\n\n\n\n\n" ] } } diff --git a/docs/posts/walking-talking-cube/index.qmd b/docs/posts/walking-talking-cube/index.qmd index 1d1c3375a6c5..4299072bbbe3 100644 --- a/docs/posts/walking-talking-cube/index.qmd +++ b/docs/posts/walking-talking-cube/index.qmd @@ -1,7 +1,7 @@ --- title: "Taking a random cube for a walk and making it talk" author: "Cody Peterson" -date: "2024-09-24" +date: "2024-09-26" image: thumbnail.png categories: - blog @@ -11,6 +11,12 @@ categories: ***Synthetic data 
with Ibis, DuckDB, Python UDFs, and Faker.*** +To follow along, install the required libraries: + +```bash +pip install 'ibis-framework[duckdb]' faker plotly +``` + ## A random cube We'll generate a random cube of data with Ibis (default DuckDB backend) and @@ -280,8 +286,8 @@ your own data and scale it up as needed. The names and locations from Faker are bland and unrealistic. The comments are nonsensical. ~~And most importantly, we haven't filled our quota for blogs mentioning AI.~~ You could [use language models in Ibis UDFs to generate more -realistic synthetic data](../lms-for-data/index.qmd). In a future blog post, we -will use "open source" language models to do this locally for free. +realistic synthetic data](../lms-for-data/index.qmd). We could use "open source" +language models to do this locally for free, an exercise left to the reader. ## Next steps