CrateDB: Documentation about Vector Store, Document Loader, and Memory

langchain-ai · Oct 29, 2024 · 5f04f9b
1 parent 0606aab

Showing 6 changed files with 1,430 additions and 1 deletion.
diff --git a/docs/docs/.gitignore b/docs/docs/.gitignore
@@ -4,4 +4,5 @@ node_modules/
 
 .docusaurus
 .cache-loader
-docs/api
+docs/api
+example.sqlite
diff --git a/docs/docs/integrations/document_loaders/cratedb.ipynb b/docs/docs/integrations/document_loaders/cratedb.ipynb
@@ -0,0 +1,276 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CrateDB Document Loader\n",
+    "\n",
+    "> [CrateDB] is capable of performing both vector and lexical search.\n",
+    "> It is built on top of the Apache Lucene library, talks SQL,\n",
+    "> is PostgreSQL-compatible, and scales like Elasticsearch.\n",
+    "\n",
+    "This notebook covers how to get started with the CrateDB document loader.\n",
+    "\n",
+    "The CrateDB document loader is based on [SQLAlchemy], and uses LangChain's\n",
+    "SQLDatabaseLoader. It loads the result of a database query with one document\n",
+    "per row.\n",
+    "\n",
+    "[CrateDB]: https://github.com/crate/crate\n",
+    "[SQLAlchemy]: https://www.sqlalchemy.org/\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "The `CrateDBLoader` class helps you get your unstructured content from CrateDB\n",
+    "into LangChain's `Document` format.\n",
+    "\n",
+    "You must provide an SQLAlchemy-compatible connection string, and a query\n",
+    "expression in SQL format. \n",
+    "\n",
+    "### Integration details\n",
+    "\n",
+    "| Class                                                                                                                                          | Package                                                                        | Local | Serializable | JS support|\n",
+    "|:-----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------| :---: | :---: |  :---: |\n",
+    "| [CrateDBLoader](https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html) | [langchain_cratedb](https://python.langchain.com/api_reference/cratedb/index.html) | ✅ | ❌ | ❌ |\n",
+    "### Loader features\n",
+    "| Source | Document Lazy Loading | Async Support |\n",
+    "| :---: | :---: | :---: |\n",
+    "| CrateDBLoader | ✅ | ❌ |\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "You can run CrateDB Community Edition on your premises, or you can use CrateDB Cloud.\n",
+    "\n",
+    "### Credentials\n",
+    "\n",
+    "You will supply credentials through a regular SQLAlchemy connection string, like\n",
+    "`crate://username:password@host/`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Installation\n",
+    "\n",
+    "Install the **langchain-community** and **sqlalchemy-cratedb** packages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install -qU langchain-community sqlalchemy-cratedb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Initialization\n",
+    "\n",
+    "Now, initialize the loader and start loading documents. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import CrateDBLoader\n",
+    "\n",
+    "loader = CrateDBLoader(\"SELECT * FROM sys.summits\", url=\"crate://crate@localhost/\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": "## Load"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents = loader.load()\n",
+    "print(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Lazy Load\n"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "page = []\n",
+    "for doc in loader.lazy_load():\n",
+    "    page.append(doc)\n",
+    "    if len(page) >= 10:\n",
+    "        # do some paged operation, e.g.\n",
+    "        # index.upsert(page)\n",
+    "\n",
+    "        page = []"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API reference\n",
+    "\n",
+    "For detailed documentation of all CrateDBLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "## Tutorial\n",
+    "\n",
+    "### Populate database"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!crash < ./example_data/mlb_teams_2012.sql\n",
+    "!crash --command \"REFRESH TABLE mlb_teams_2012;\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": "### Usage"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "\n",
+    "from langchain_community.document_loaders import CrateDBLoader\n",
+    "\n",
+    "CONNECTION_STRING = \"crate://crate@localhost/\"\n",
+    "\n",
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "pprint(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "### Specifying Which Columns are Content vs Metadata"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    "    page_content_columns=[\"Team\"],\n",
+    "    metadata_columns=[\"Payroll (millions)\"],\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pprint(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "### Adding Source to Metadata"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    "    source_columns=[\"Team\"],\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pprint(documents)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
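Note on the notebook above: it describes the loader as producing one document per result row, with optional `page_content_columns` and `metadata_columns`, and it shows a paging loop over `lazy_load()`. The following is a minimal standalone sketch of that idea, not the `SQLDatabaseLoader` implementation. It uses Python's built-in `sqlite3` instead of CrateDB so that it runs without a server. `Doc`, `rows_to_docs`, `in_pages`, the `key: value` content format, and the three placeholder rows are all assumptions introduced for illustration only.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document class."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_docs(
    conn: sqlite3.Connection,
    query: str,
    page_content_columns: Optional[Sequence[str]] = None,
    metadata_columns: Sequence[str] = (),
) -> Iterator[Doc]:
    """Lazily yield one Doc per result row (assumed mapping, for illustration)."""
    cursor = conn.execute(query)
    names = [column[0] for column in cursor.description]
    # Without an explicit selection, every column goes into the content.
    content_columns = list(page_content_columns) if page_content_columns else names
    for row in cursor:
        record = dict(zip(names, row))
        content = "\n".join(f"{name}: {record[name]}" for name in content_columns)
        metadata = {name: record[name] for name in metadata_columns}
        yield Doc(page_content=content, metadata=metadata)


def in_pages(docs: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Group documents into pages of at most `size` items."""
    page: List[Doc] = []
    for doc in docs:
        page.append(doc)
        if len(page) >= size:
            yield page
            page = []
    if page:
        # Flush the final, partially filled page.
        yield page


# Placeholder rows (made up), using the table layout from mlb_teams_2012.sql.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE mlb_teams_2012 ("Team" TEXT, "Payroll (millions)" REAL, "Wins" INTEGER)'
)
conn.executemany(
    "INSERT INTO mlb_teams_2012 VALUES (?, ?, ?)",
    [("Team A", 10.5, 90), ("Team B", 20.0, 80), ("Team C", 30.25, 70)],
)

docs = list(
    rows_to_docs(
        conn,
        'SELECT * FROM mlb_teams_2012 ORDER BY "Team"',
        page_content_columns=["Team"],
        metadata_columns=["Payroll (millions)"],
    )
)
pages = list(in_pages(docs, size=2))
```

Unlike the loop in the notebook's "Lazy Load" cell, which only acts on full pages, `in_pages` also yields the final partial page, so no documents are left unprocessed when the row count is not a multiple of the page size.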
diff --git a/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql b/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql
@@ -1,6 +1,7 @@
 -- Provisioning table "mlb_teams_2012".
 --
 -- psql postgresql://postgres@localhost < mlb_teams_2012.sql
+-- crash < mlb_teams_2012.sql
 
 DROP TABLE IF EXISTS mlb_teams_2012;
 CREATE TABLE mlb_teams_2012 ("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT);