langchain-ai · amotl · Oct 28, 2024 · Oct 28, 2024 · amotl · Oct 29, 2024
diff --git a/docs/docs/.gitignore b/docs/docs/.gitignore
@@ -4,4 +4,5 @@ node_modules/
 
 .docusaurus
 .cache-loader
-docs/api
+docs/api
+example.sqlite
diff --git a/docs/docs/how_to/document_loader_sql_database.ipynb b/docs/docs/how_to/document_loader_sql_database.ipynb
@@ -0,0 +1,360 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# SQL Database\n",
-    "# SQL Database\n",
+    "# SQLDatabaseLoader\n",
-    "# SQL Database\n",
+    "# SQLDatabaseLoader\n",
+    "\n",
+    "## About\n",
+    "\n",
+    "The `SQLDatabaseLoader` loads records from any database supported by\n",
+    "[SQLAlchemy], see [SQLAlchemy dialects] for the whole list of supported\n",
+    "SQL databases and dialects.\n",
+    "\n",
+    "For talking to the database, the document loader uses the [SQLDatabase]\n",
+    "utility from the LangChain integration toolkit.\n",
-    "For talking to the database, the document loader uses the [SQLDatabase]\n",
-    "utility from the LangChain integration toolkit.\n",
+    "For talking to the database, the document loader uses the [SQLDatabase]\n",
+    "utility.\n",
-    "For talking to the database, the document loader uses the [SQLDatabase]\n",
-    "utility from the LangChain integration toolkit.\n",
+    "For talking to the database, the document loader uses the [SQLDatabase]\n",
+    "utility.\n",
+    "\n",
+    "You can either use plain SQL for querying, or use an SQLAlchemy `Select`\n",
+    "statement object, if you are using SQLAlchemy-Core or -ORM.\n",
+    "\n",
+    "You can select which columns to place into the document, which columns\n",
+    "to place into its metadata, which columns to use as a `source` attribute\n",
+    "in metadata, and whether to include the result row number and/or the SQL\n",
+    "query expression into the metadata.\n",
+    "\n",
+    "## What's inside\n",
+    "\n",
+    "This notebook covers how to load documents from an [SQLite] database,\n",
+    "using the [SQLAlchemy] document loader.\n",
+    "\n",
+    "It loads the result of a database query with one document per row.\n",
+    "\n",
+    "[SQLAlchemy]: https://www.sqlalchemy.org/\n",
+    "[SQLAlchemy dialects]: https://docs.sqlalchemy.org/en/latest/dialects/\n",
+    "[SQLDatabase]: https://python.langchain.com/docs/integrations/toolkits/sql_database\n",
-    "[SQLDatabase]: https://python.langchain.com/docs/integrations/toolkits/sql_database\n",
+    "[SQLDatabase]: https://python.langchain.com/api_reference/community/utilities/langchain_community.utilities.sql_database.SQLDatabase.html\n",
-    "[SQLDatabase]: https://python.langchain.com/docs/integrations/toolkits/sql_database\n",
+    "[SQLDatabase]: https://python.langchain.com/api_reference/community/utilities/langchain_community.utilities.sql_database.SQLDatabase.html\n",
+    "[SQLite]: https://sqlite.org/\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "## Prerequisites"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "#!pip install langchain langchain-community sqlalchemy termsql"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "Populate SQLite database with example input data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Nationals|81.34|98\r\n",
+      "Reds|82.2|97\r\n",
+      "Yankees|197.96|95\r\n",
+      "Giants|117.62|94\r\n",
+      "Braves|83.31|94\r\n",
+      "Athletics|55.37|94\r\n",
+      "Rangers|120.51|93\r\n",
+      "Orioles|81.43|93\r\n",
+      "Rays|64.17|90\r\n",
+      "Angels|154.49|89\r\n",
+      "Tigers|132.3|88\r\n",
+      "Cardinals|110.3|88\r\n",
+      "Dodgers|95.14|86\r\n",
+      "White Sox|96.92|85\r\n",
+      "Brewers|97.65|83\r\n",
+      "Phillies|174.54|81\r\n",
+      "Diamondbacks|74.28|81\r\n",
+      "Pirates|63.43|79\r\n",
+      "Padres|55.24|76\r\n",
+      "Mariners|81.97|75\r\n",
+      "Mets|93.35|74\r\n",
+      "Blue Jays|75.48|73\r\n",
+      "Royals|60.91|72\r\n",
+      "Marlins|118.07|69\r\n",
+      "Red Sox|173.18|69\r\n",
+      "Indians|78.43|68\r\n",
+      "Twins|94.08|66\r\n",
+      "Rockies|78.06|64\r\n",
+      "Cubs|88.19|61\r\n",
+      "Astros|60.65|55\r\n",
+      "||\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!termsql --infile=./example_data/mlb_teams_2012.csv --head --csv --outfile=example.sqlite --table=payroll"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "## Basic usage"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "\n",
+    "from langchain_community.document_loaders import SQLDatabaseLoader\n",
+    "\n",
+    "loader = SQLDatabaseLoader(\n",
+    "    \"SELECT * FROM payroll LIMIT 2\",\n",
+    "    url=\"sqlite:///example.sqlite\",\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Team: Nationals\\nPayroll (millions): 81.34\\nWins: 98'),\n",
+      " Document(page_content='Team: Reds\\nPayroll (millions): 82.2\\nWins: 97')]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Specify which columns are content vs. metadata\n",
+    "\n",
+    "Use the `page_content_mapper` keyword argument to optionally customize how to derive\n",
+    "a page content string from an input database record / row. By default, all columns\n",
+    "will be used.\n",
+    "\n",
+    "Use the `metadata_mapper` keyword argument to optionally customize how to derive\n",
+    "a document metadata dictionary from an input database record / row. By default,\n",
+    "document metadata will be empty."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import functools\n",
+    "\n",
+    "# Configure built-in page content mapper to include only specified columns.\n",
+    "row_to_content = functools.partial(\n",
+    "    SQLDatabaseLoader.page_content_default_mapper, column_names=[\"Team\", \"Wins\"]\n",
+    ")\n",
+    "\n",
+    "# Configure built-in metadata dictionary mapper to include specified columns.\n",
+    "row_to_metadata = functools.partial(\n",
+    "    SQLDatabaseLoader.metadata_default_mapper, column_names=[\"Payroll (millions)\"]\n",
+    ")\n",
+    "\n",
+    "loader = SQLDatabaseLoader(\n",
+    "    \"SELECT * FROM payroll LIMIT 2\",\n",
+    "    url=\"sqlite:///example.sqlite\",\n",
+    "    page_content_mapper=row_to_content,\n",
+    "    metadata_mapper=row_to_metadata,\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Team: Nationals\\nWins: 98', metadata={'Payroll (millions)': 81.34}),\n",
+      " Document(page_content='Team: Reds\\nWins: 97', metadata={'Payroll (millions)': 82.2})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Those examples demonstrate how to use custom functions to define arbitrary\n",
+    "mapping rules by using Python code.\n",
+    "```python\n",
+    "def page_content_mapper(row: sa.RowMapping, column_names: Optional[List[str]] = None) -> str:\n",
+    "    return f\"Team: {row['Team']}\"\n",
+    "```\n",
+    "```python\n",
+    "def metadata_default_mapper(row: sa.RowMapping, column_names: Optional[List[str]] = None) -> Dict[str, Any]:\n",
+    "    return {\"team\": row['Team']}\n",
+    "```"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Specify column(s) to identify the document source\n",
+    "\n",
+    "Use the `source_columns` option to specify the columns to use as a \"source\" for the\n",
+    "document created from each row. This is useful for identifying documents through\n",
+    "their metadata. Typically, you may use the primary key column(s) for that purpose."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = SQLDatabaseLoader(\n",
+    "    \"SELECT * FROM payroll LIMIT 2\",\n",
+    "    url=\"sqlite:///example.sqlite\",\n",
+    "    source_columns=[\"Team\"],\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Team: Nationals\\nPayroll (millions): 81.34\\nWins: 98', metadata={'source': 'Nationals'}),\n",
+      " Document(page_content='Team: Reds\\nPayroll (millions): 82.2\\nWins: 97', metadata={'source': 'Reds'})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Enrich metadata with row number and/or original SQL query\n",
+    "\n",
+    "Use the `include_rownum_into_metadata` and `include_query_into_metadata` options to\n",
+    "optionally populate the `metadata` dictionary with corresponding information.\n",
+    "\n",
+    "Having the `query` within metadata is useful when using documents loaded from\n",
+    "database tables for chains that answer questions using their origin queries."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "outputs": [],
+   "source": [
+    "loader = SQLDatabaseLoader(\n",
+    "    \"SELECT * FROM payroll LIMIT 2\",\n",
+    "    url=\"sqlite:///example.sqlite\",\n",
+    "    include_rownum_into_metadata=True,\n",
+    "    include_query_into_metadata=True,\n",
+    ")\n",
+    "documents = loader.load()"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Team: Nationals\\nPayroll (millions): 81.34\\nWins: 98', metadata={'row': 0, 'query': 'SELECT * FROM payroll LIMIT 2'}),\n",
+      " Document(page_content='Team: Reds\\nPayroll (millions): 82.2\\nWins: 97', metadata={'row': 1, 'query': 'SELECT * FROM payroll LIMIT 2'})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(documents)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}