CrateDB: Documentation about Vector Store, Document Loader, and Memory
amotl committed Oct 29, 2024
1 parent 0606aab commit 5f04f9b
Showing 6 changed files with 1,430 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/docs/.gitignore
@@ -4,4 +4,5 @@ node_modules/

.docusaurus
.cache-loader
docs/api
docs/api
example.sqlite
276 changes: 276 additions & 0 deletions docs/docs/integrations/document_loaders/cratedb.ipynb
@@ -0,0 +1,276 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CrateDB Document Loader\n",
"\n",
"> [CrateDB] is capable of performing both vector and lexical search.\n",
"> It is built on top of the Apache Lucene library, talks SQL,\n",
"> is PostgreSQL-compatible, and scales like Elasticsearch.\n",
"\n",
"This notebook covers how to get started with the CrateDB document loader.\n",
"\n",
"The CrateDB document loader is based on [SQLAlchemy], and uses LangChain's\n",
"SQLDatabaseLoader. It loads the result of a database query with one document\n",
"per row.\n",
"\n",
"[CrateDB]: https://github.com/crate/crate\n",
"[SQLAlchemy]: https://www.sqlalchemy.org/\n",
"\n",
"## Overview\n",
"\n",
"The `CrateDBLoader` class helps you get your unstructured content from CrateDB\n",
"into LangChain's `Document` format.\n",
"\n",
"You must provide an SQLAlchemy-compatible connection string, and a query\n",
"expression in SQL format. \n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS support|\n",
"|:-----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------| :---: | :---: | :---: |\n",
"| [CrateDBLoader](https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html) | [langchain_box](https://python.langchain.com/api_reference/cratedb/index.html) | ✅ | ❌ | ❌ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Async Support\n",
"| :---: | :---: | :---: | \n",
"| CrateDBLoader | ✅ | ❌ | \n",
"\n",
"## Setup\n",
"\n",
"You can run CrateDB Community Edition on your premises, or you can use CrateDB Cloud.\n",
"\n",
"### Credentials\n",
"\n",
"You will supply credentials through a regular SQLAlchemy connection string, like\n",
"`crate://username:[email protected]/`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install the **langchain-community** and **sqlalchemy-cratedb** packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community sqlalchemy-cratedb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"Now, initialize the loader and start loading documents. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import CrateDBLoader\n",
"\n",
"loader = CrateDBLoader(\"SELECT * FROM sys.summits\", url=\"crate://crate@localhost/\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": "## Load"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"documents = loader.load()\n",
"print(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Lazy Load\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all PyMuPDFLoader features and configurations head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Tutorial\n",
"\n",
"### Populate database."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!crash < ./example_data/mlb_teams_2012.sql\n",
"!crash --command \"REFRESH TABLE mlb_teams_2012;\""
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": "### Usage"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pprint import pprint\n",
"\n",
"from langchain.document_loaders import CrateDBLoader\n",
"\n",
"CONNECTION_STRING = \"crate://crate@localhost/\"\n",
"\n",
"loader = CrateDBLoader(\n",
" 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
" url=CONNECTION_STRING,\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pprint(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Specifying Which Columns are Content vs Metadata"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = CrateDBLoader(\n",
" 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
" url=CONNECTION_STRING,\n",
" page_content_columns=[\"Team\"],\n",
" metadata_columns=[\"Payroll (millions)\"],\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Adding Source to Metadata"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = CrateDBLoader(\n",
" 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
" url=CONNECTION_STRING,\n",
" source_columns=[\"Team\"],\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(documents)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
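The "Lazy Load" cell in the notebook collects documents into pages of ten while iterating over `loader.lazy_load()`. As written, documents left over after the last full page stay in `page` and are never processed. The following is a minimal sketch of the same paging pattern that also yields the final partial page. The helper name `batched` and the fake document generator are assumptions made for illustration; they are not part of LangChain or of this commit.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including a final partial list."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most `size` items without exhausting the source early.
        page = list(islice(iterator, size))
        if not page:
            return
        yield page


# Hypothetical stand-in for `loader.lazy_load()`: any iterator of documents works.
fake_documents = (f"doc-{i}" for i in range(25))

pages = list(batched(fake_documents, 10))
print([len(page) for page in pages])  # prints [10, 10, 5]
```

With a real loader, `for page in batched(loader.lazy_load(), 10): ...` would be the equivalent call, and each `page` could be handed to whatever paged operation is needed.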
@@ -1,6 +1,7 @@
-- Provisioning table "mlb_teams_2012".
--
-- psql postgresql://postgres@localhost < mlb_teams_2012.sql
-- crash < mlb_teams_2012.sql

DROP TABLE IF EXISTS mlb_teams_2012;
CREATE TABLE mlb_teams_2012 ("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT);
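The notebook describes the loader as producing one document per result row, with selected columns becoming content and others becoming metadata. The sketch below illustrates that mapping against the table definition above. It makes several assumptions: the standard library's `sqlite3` stands in for CrateDB so the example runs without a server, `Doc` is a hypothetical stand-in for LangChain's `Document`, the `column: value` content format is a choice made for this sketch, and the inserted rows are made-up placeholders rather than the real data set. It is not the loader's actual implementation.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text content plus metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_docs(
    conn: sqlite3.Connection,
    query: str,
    content_columns: Sequence[str],
    metadata_columns: Sequence[str],
) -> List[Doc]:
    """Run `query` and build one Doc per result row."""
    cursor = conn.execute(query)
    names = [column[0] for column in cursor.description]
    docs = []
    for row in cursor:
        record = dict(zip(names, row))
        # Assumed format for this sketch: one "column: value" line per content column.
        content = "\n".join(f"{name}: {record[name]}" for name in content_columns)
        metadata = {name: record[name] for name in metadata_columns}
        docs.append(Doc(page_content=content, metadata=metadata))
    return docs


# In-memory SQLite database standing in for CrateDB (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE mlb_teams_2012 '
    '("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT)'
)
# Placeholder rows, not the real data set.
conn.executemany(
    "INSERT INTO mlb_teams_2012 VALUES (?, ?, ?)",
    [("Team B", 75.5, 80), ("Team A", 120.0, 95)],
)

docs = rows_to_docs(
    conn,
    'SELECT * FROM mlb_teams_2012 ORDER BY "Team"',
    content_columns=["Team"],
    metadata_columns=["Payroll (millions)"],
)
for doc in docs:
    print(doc)
```

Column names containing spaces or parentheses, such as `"Payroll (millions)"`, must be double-quoted in the SQL but are used as plain strings when selecting content and metadata columns.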