Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Add how-to guide for SQLDatabaseLoader #27696

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion docs/docs/.gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,5 @@ node_modules/

.docusaurus
.cache-loader
docs/api
docs/api
example.sqlite
360 changes: 360 additions & 0 deletions docs/docs/how_to/document_loader_sql_database.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,360 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SQL Database\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"# SQL Database\n",
"# SQLDatabaseLoader\n",

"\n",
"## About\n",
"\n",
"The `SQLDatabaseLoader` loads records from any database supported by\n",
"[SQLAlchemy], see [SQLAlchemy dialects] for the whole list of supported\n",
"SQL databases and dialects.\n",
"\n",
"For talking to the database, the document loader uses the [SQLDatabase]\n",
"utility from the LangChain integration toolkit.\n",
Comment on lines +15 to +16
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"For talking to the database, the document loader uses the [SQLDatabase]\n",
"utility from the LangChain integration toolkit.\n",
"For talking to the database, the document loader uses the [SQLDatabase]\n",
"utility.\n",

"\n",
"You can either use plain SQL for querying, or use an SQLAlchemy `Select`\n",
"statement object, if you are using SQLAlchemy-Core or -ORM.\n",
"\n",
"You can select which columns to place into the document, which columns\n",
"to place into its metadata, which columns to use as a `source` attribute\n",
"in metadata, and whether to include the result row number and/or the SQL\n",
"query expression into the metadata.\n",
"\n",
"## What's inside\n",
"\n",
"This notebook covers how to load documents from an [SQLite] database,\n",
"using the [SQLAlchemy] document loader.\n",
"\n",
"It loads the result of a database query with one document per row.\n",
"\n",
"[SQLAlchemy]: https://www.sqlalchemy.org/\n",
"[SQLAlchemy dialects]: https://docs.sqlalchemy.org/en/latest/dialects/\n",
"[SQLDatabase]: https://python.langchain.com/docs/integrations/toolkits/sql_database\n",
Copy link
Contributor Author

@amotl amotl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to the SQLDatabase Toolkit was probably wrong, so let's better link to the Utility instead?

Suggested change
"[SQLDatabase]: https://python.langchain.com/docs/integrations/toolkits/sql_database\n",
"[SQLDatabase]: https://python.langchain.com/api_reference/community/utilities/langchain_community.utilities.sql_database.SQLDatabase.html\n",

Is there a better piece of dedicated documentation about this component elsewhere in the documentation tree, i.e. are you advising to use a different target link here?

"[SQLite]: https://sqlite.org/\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Prerequisites"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install langchain langchain-community sqlalchemy termsql"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Populate SQLite database with example input data."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nationals|81.34|98\r\n",
"Reds|82.2|97\r\n",
"Yankees|197.96|95\r\n",
"Giants|117.62|94\r\n",
"Braves|83.31|94\r\n",
"Athletics|55.37|94\r\n",
"Rangers|120.51|93\r\n",
"Orioles|81.43|93\r\n",
"Rays|64.17|90\r\n",
"Angels|154.49|89\r\n",
"Tigers|132.3|88\r\n",
"Cardinals|110.3|88\r\n",
"Dodgers|95.14|86\r\n",
"White Sox|96.92|85\r\n",
"Brewers|97.65|83\r\n",
"Phillies|174.54|81\r\n",
"Diamondbacks|74.28|81\r\n",
"Pirates|63.43|79\r\n",
"Padres|55.24|76\r\n",
"Mariners|81.97|75\r\n",
"Mets|93.35|74\r\n",
"Blue Jays|75.48|73\r\n",
"Royals|60.91|72\r\n",
"Marlins|118.07|69\r\n",
"Red Sox|173.18|69\r\n",
"Indians|78.43|68\r\n",
"Twins|94.08|66\r\n",
"Rockies|78.06|64\r\n",
"Cubs|88.19|61\r\n",
"Astros|60.65|55\r\n",
"||\r\n"
]
}
],
"source": [
"!termsql --infile=./example_data/mlb_teams_2012.csv --head --csv --outfile=example.sqlite --table=payroll"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Basic usage"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pprint import pprint\n",
"\n",
"from langchain_community.document_loaders import SQLDatabaseLoader\n",
"\n",
"loader = SQLDatabaseLoader(\n",
" \"SELECT * FROM payroll LIMIT 2\",\n",
" url=\"sqlite:///example.sqlite\",\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Team: Nationals\\nPayroll (millions): 81.34\\nWins: 98'),\n",
" Document(page_content='Team: Reds\\nPayroll (millions): 82.2\\nWins: 97')]\n"
]
}
],
"source": [
"pprint(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Specify which columns are content vs. metadata\n",
"\n",
"Use the `page_content_mapper` keyword argument to optionally customize how to derive\n",
"a page content string from an input database record / row. By default, all columns\n",
"will be used.\n",
"\n",
"Use the `metadata_mapper` keyword argument to optionally customize how to derive\n",
"a document metadata dictionary from an input database record / row. By default,\n",
"document metadata will be empty."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"import functools\n",
"\n",
"# Configure built-in page content mapper to include only specified columns.\n",
"row_to_content = functools.partial(\n",
" SQLDatabaseLoader.page_content_default_mapper, column_names=[\"Team\", \"Wins\"]\n",
")\n",
"\n",
"# Configure built-in metadata dictionary mapper to include specified columns.\n",
"row_to_metadata = functools.partial(\n",
" SQLDatabaseLoader.metadata_default_mapper, column_names=[\"Payroll (millions)\"]\n",
")\n",
"\n",
"loader = SQLDatabaseLoader(\n",
" \"SELECT * FROM payroll LIMIT 2\",\n",
" url=\"sqlite:///example.sqlite\",\n",
" page_content_mapper=row_to_content,\n",
" metadata_mapper=row_to_metadata,\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Team: Nationals\\nWins: 98', metadata={'Payroll (millions)': 81.34}),\n",
" Document(page_content='Team: Reds\\nWins: 97', metadata={'Payroll (millions)': 82.2})]\n"
]
}
],
"source": [
"pprint(documents)"
]
},
{
"cell_type": "markdown",
"source": [
"Those examples demonstrate how to use custom functions to define arbitrary\n",
"mapping rules by using Python code.\n",
"```python\n",
"def page_content_mapper(row: sa.RowMapping, column_names: Optional[List[str]] = None) -> str:\n",
" return f\"Team: {row['Team']}\"\n",
"```\n",
"```python\n",
"def metadata_default_mapper(row: sa.RowMapping, column_names: Optional[List[str]] = None) -> Dict[str, Any]:\n",
" return {\"team\": row['Team']}\n",
"```"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Specify column(s) to identify the document source\n",
"\n",
"Use the `source_columns` option to specify the columns to use as a \"source\" for the\n",
"document created from each row. This is useful for identifying documents through\n",
"their metadata. Typically, you may use the primary key column(s) for that purpose."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"loader = SQLDatabaseLoader(\n",
" \"SELECT * FROM payroll LIMIT 2\",\n",
" url=\"sqlite:///example.sqlite\",\n",
" source_columns=[\"Team\"],\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Team: Nationals\\nPayroll (millions): 81.34\\nWins: 98', metadata={'source': 'Nationals'}),\n",
" Document(page_content='Team: Reds\\nPayroll (millions): 82.2\\nWins: 97', metadata={'source': 'Reds'})]\n"
]
}
],
"source": [
"pprint(documents)"
]
},
{
"cell_type": "markdown",
"source": [
"## Enrich metadata with row number and/or original SQL query\n",
"\n",
"Use the `include_rownum_into_metadata` and `include_query_into_metadata` options to\n",
"optionally populate the `metadata` dictionary with corresponding information.\n",
"\n",
"Having the `query` within metadata is useful when using documents loaded from\n",
"database tables for chains that answer questions using their origin queries."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 49,
"outputs": [],
"source": [
"loader = SQLDatabaseLoader(\n",
" \"SELECT * FROM payroll LIMIT 2\",\n",
" url=\"sqlite:///example.sqlite\",\n",
" include_rownum_into_metadata=True,\n",
" include_query_into_metadata=True,\n",
")\n",
"documents = loader.load()"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 50,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Team: Nationals\\nPayroll (millions): 81.34\\nWins: 98', metadata={'row': 0, 'query': 'SELECT * FROM payroll LIMIT 2'}),\n",
" Document(page_content='Team: Reds\\nPayroll (millions): 82.2\\nWins: 97', metadata={'row': 1, 'query': 'SELECT * FROM payroll LIMIT 2'})]\n"
]
}
],
"source": [
"pprint(documents)"
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Loading
Loading