From 056854c33c0e4bf51b735331e5ce7cd81e440a0d Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Tue, 29 Oct 2024 14:25:01 +0100 Subject: [PATCH] CrateDB: Documentation about Vector Store, Document Loader, and Memory --- docs/docs/.gitignore | 3 +- .../document_loaders/cratedb.ipynb | 273 ++++++++ .../example_data/mlb_teams_2012.sql | 1 + .../memory/cratedb_chat_message_history.ipynb | 356 +++++++++++ docs/docs/integrations/providers/cratedb.mdx | 203 ++++++ .../integrations/vectorstores/cratedb.ipynb | 582 ++++++++++++++++++ 6 files changed, 1417 insertions(+), 1 deletion(-) create mode 100644 docs/docs/integrations/document_loaders/cratedb.ipynb create mode 100644 docs/docs/integrations/memory/cratedb_chat_message_history.ipynb create mode 100644 docs/docs/integrations/providers/cratedb.mdx create mode 100644 docs/docs/integrations/vectorstores/cratedb.ipynb diff --git a/docs/docs/.gitignore b/docs/docs/.gitignore index 25a6e30a4b775..e586a74dfb131 100644 --- a/docs/docs/.gitignore +++ b/docs/docs/.gitignore @@ -4,4 +4,5 @@ node_modules/ .docusaurus .cache-loader -docs/api \ No newline at end of file +docs/api +example.sqlite diff --git a/docs/docs/integrations/document_loaders/cratedb.ipynb b/docs/docs/integrations/document_loaders/cratedb.ipynb new file mode 100644 index 0000000000000..2a54890dd3f43 --- /dev/null +++ b/docs/docs/integrations/document_loaders/cratedb.ipynb @@ -0,0 +1,273 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# CrateDB Document Loader\n", + "\n", + "> [CrateDB] is capable of performing both vector and lexical search.\n", + "> It is built on top of the Apache Lucene library, talks SQL,\n", + "> is PostgreSQL-compatible, and scales like Elasticsearch.\n", + "\n", + "This notebook covers how to get started with the CrateDB document loader.\n", + "\n", + "The CrateDB document loader is based on [SQLAlchemy], and uses LangChain's\n", + "SQLDatabaseLoader. 
It loads the result of a database query with one document\n", + "per row.\n", + "\n", + "[CrateDB]: https://github.com/crate/crate\n", + "[SQLAlchemy]: https://www.sqlalchemy.org/\n", + "\n", + "## Overview\n", + "\n", + "The `CrateDBLoader` class helps you get your unstructured content from CrateDB\n", + "into LangChain's `Document` format.\n", + "\n", + "You must provide an SQLAlchemy-compatible connection string, and a query\n", + "expression in SQL format. \n", + "\n", + "### Integration details\n", + "\n", + "| Class | Package | Local | Serializable | JS support|\n", + "|:-----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------| :---: | :---: | :---: |\n", + "| [CrateDBLoader](https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html) | [langchain_cratedb](https://python.langchain.com/api_reference/cratedb/index.html) | ✅ | ❌ | ❌ | \n", + "### Loader features\n", + "| Source | Document Lazy Loading | Async Support |\n", + "| :---: | :---: | :---: | \n", + "| CrateDBLoader | ✅ | ❌ | \n", + "\n", + "## Setup\n", + "\n", + "You can run CrateDB Community Edition on your premises, or you can use CrateDB Cloud.\n", + "\n", + "### Credentials\n", + "\n", + "You will supply credentials through a regular SQLAlchemy connection string, like\n", + "`crate://username:password@cratedb.example.org/`." + ] + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### Installation\n", + "\n", + "Install the **langchain-community** and **sqlalchemy-cratedb** packages."
+ ] + }, + { + "metadata": {}, + "cell_type": "code", + "source": "%pip install -qU langchain-community sqlalchemy-cratedb", + "outputs": [], + "execution_count": null + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## Initialization\n", + "\n", + "Now, initialize the loader and start loading documents. " + ] + }, + { + "metadata": {}, + "cell_type": "code", + "source": [ + "from langchain_community.document_loaders import CrateDBLoader\n", + "\n", + "loader = CrateDBLoader(\"SELECT * FROM sys.summits\", url=\"crate://crate@localhost/\")" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "source": "## Load", + "metadata": { + "collapsed": false + } + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "documents = loader.load()\n", + "print(documents)" + ] + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "## Lazy Load\n" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "page = []\n", + "for doc in loader.lazy_load():\n", + " page.append(doc)\n", + " if len(page) >= 10:\n", + " # do some paged operation, e.g.\n", + " # index.upsert(page)\n", + "\n", + " page = []" + ] + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all CrateDBLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Tutorial\n", + "\n", + "### Populate database."
+ ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "metadata": { + "tags": [] + }, + "source": [ + "!crash < ./example_data/mlb_teams_2012.sql\n", + "!crash --command \"REFRESH TABLE mlb_teams_2012;\"" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "source": "### Usage", + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "metadata": { + "tags": [] + }, + "source": [ + "from langchain_community.document_loaders import CrateDBLoader\n", + "from pprint import pprint\n", + "\n", + "CONNECTION_STRING = \"crate://crate@localhost/\"\n", + "\n", + "loader = CrateDBLoader(\n", + " 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n", + " url=CONNECTION_STRING,\n", + ")\n", + "documents = loader.load()" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "code", + "metadata": { + "tags": [] + }, + "source": [ + "pprint(documents)" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "### Specifying Which Columns are Content vs Metadata" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "loader = CrateDBLoader(\n", + " 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n", + " url=CONNECTION_STRING,\n", + " page_content_columns=[\"Team\"],\n", + " metadata_columns=[\"Payroll (millions)\"],\n", + ")\n", + "documents = loader.load()" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "pprint(documents)" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "### Adding Source to Metadata" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "loader = CrateDBLoader(\n", + " 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n", + " url=CONNECTION_STRING,\n", + " source_columns=[\"Team\"],\n", + ")\n", + "documents = loader.load()" + ], +
"outputs": [], + "execution_count": null + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "pprint(documents)" + ], + "outputs": [], + "execution_count": null + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql b/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql index 33cb765a38ebe..9df72ef19954a 100644 --- a/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql +++ b/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql @@ -1,6 +1,7 @@ -- Provisioning table "mlb_teams_2012". -- -- psql postgresql://postgres@localhost < mlb_teams_2012.sql +-- crash < mlb_teams_2012.sql DROP TABLE IF EXISTS mlb_teams_2012; CREATE TABLE mlb_teams_2012 ("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT); diff --git a/docs/docs/integrations/memory/cratedb_chat_message_history.ipynb b/docs/docs/integrations/memory/cratedb_chat_message_history.ipynb new file mode 100644 index 0000000000000..244b21d43d13f --- /dev/null +++ b/docs/docs/integrations/memory/cratedb_chat_message_history.ipynb @@ -0,0 +1,356 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# CrateDB Chat Message History\n", + "\n", + "This notebook demonstrates how to use the `CrateDBChatMessageHistory`\n", + "to manage chat history in CrateDB, for supporting conversational memory." 
+ ], + "metadata": { + "collapsed": false + }, + "id": "f22eab3f84cbeb37" + }, + { + "cell_type": "markdown", + "source": [ + "## Prerequisites" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "!#pip install langchain sqlalchemy-cratedb" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## Configuration\n", + "\n", + "To use the storage wrapper, you will need to configure two details.\n", + "\n", + "1. Session Id - a unique identifier of the session, like user name, email, chat id etc.\n", + "2. Database connection string: An SQLAlchemy-compatible URI that specifies the database\n", + " connection. It will be passed to SQLAlchemy create_engine function." + ], + "metadata": { + "collapsed": false + }, + "id": "f8f2830ee9ca1e01" + }, + { + "cell_type": "code", + "execution_count": 52, + "outputs": [], + "source": [ + "from langchain.memory.chat_message_histories import CrateDBChatMessageHistory\n", + "\n", + "CONNECTION_STRING = \"crate://crate@localhost:4200/?schema=example\"\n", + "\n", + "chat_message_history = CrateDBChatMessageHistory(\n", + "\tsession_id=\"test_session\",\n", + "\tconnection_string=CONNECTION_STRING\n", + ")" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## Basic Usage" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 53, + "outputs": [], + "source": [ + "chat_message_history.add_user_message(\"Hello\")\n", + "chat_message_history.add_ai_message(\"Hi\")" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-08-28T10:04:38.077748Z", + "start_time": "2023-08-28T10:04:36.105894Z" + } + }, + "id": "4576e914a866fb40" + }, + { + "cell_type": "code", + "execution_count": 61, + "outputs": [ + { + "data": { + "text/plain": "[HumanMessage(content='Hello', additional_kwargs={}, 
example=False),\n AIMessage(content='Hi', additional_kwargs={}, example=False)]" + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chat_message_history.messages" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-08-28T10:04:38.929396Z", + "start_time": "2023-08-28T10:04:38.915727Z" + } + }, + "id": "b476688cbb32ba90" + }, + { + "cell_type": "markdown", + "source": [ + "## Custom Storage Model\n", + "\n", + "The default data model stores only basic information about conversation messages.\n", + "It has two slots for message details: the session id and the message dictionary.\n", + "\n", + "If you want to store additional information, like message date, author, language etc.,\n", + "please provide an implementation for a custom message converter.\n", + "\n", + "This example demonstrates how to create a custom message converter, by implementing\n", + "the `BaseMessageConverter` interface." + ], + "metadata": { + "collapsed": false + }, + "id": "2e5337719d5614fd" + }, + { + "cell_type": "code", + "execution_count": 55, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "from typing import Any\n", + "\n", + "from langchain.memory.chat_message_histories.sql import BaseMessageConverter\n", + "from langchain.schema import AIMessage, BaseMessage, HumanMessage, SystemMessage\n", + "\n", + "import sqlalchemy as sa\n", + "from sqlalchemy.orm import declarative_base\n", + "\n", + "\n", + "Base = declarative_base()\n", + "\n", + "\n", + "class CustomMessage(Base):\n", + "\t__tablename__ = \"custom_message_store\"\n", + "\n", + "\tid = sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())\n", + "\tsession_id = sa.Column(sa.Text)\n", + "\ttype = sa.Column(sa.Text)\n", + "\tcontent = sa.Column(sa.Text)\n", + "\tcreated_at = sa.Column(sa.DateTime)\n", + "\tauthor_email = sa.Column(sa.Text)\n", + "\n", + "\n", + "class
CustomMessageConverter(BaseMessageConverter):\n", + "\tdef __init__(self, author_email: str):\n", + "\t\tself.author_email = author_email\n", + "\t\n", + "\tdef from_sql_model(self, sql_message: Any) -> BaseMessage:\n", + "\t\tif sql_message.type == \"human\":\n", + "\t\t\treturn HumanMessage(\n", + "\t\t\t\tcontent=sql_message.content,\n", + "\t\t\t)\n", + "\t\telif sql_message.type == \"ai\":\n", + "\t\t\treturn AIMessage(\n", + "\t\t\t\tcontent=sql_message.content,\n", + "\t\t\t)\n", + "\t\telif sql_message.type == \"system\":\n", + "\t\t\treturn SystemMessage(\n", + "\t\t\t\tcontent=sql_message.content,\n", + "\t\t\t)\n", + "\t\telse:\n", + "\t\t\traise ValueError(f\"Unknown message type: {sql_message.type}\")\n", + "\t\n", + "\tdef to_sql_model(self, message: BaseMessage, session_id: str) -> Any:\n", + "\t\tnow = datetime.now()\n", + "\t\treturn CustomMessage(\n", + "\t\t\tsession_id=session_id,\n", + "\t\t\ttype=message.type,\n", + "\t\t\tcontent=message.content,\n", + "\t\t\tcreated_at=now,\n", + "\t\t\tauthor_email=self.author_email\n", + "\t\t)\n", + "\t\n", + "\tdef get_sql_model_class(self) -> Any:\n", + "\t\treturn CustomMessage\n", + "\n", + "\n", + "if __name__ == \"__main__\":\n", + "\n", + "\tBase.metadata.drop_all(bind=sa.create_engine(CONNECTION_STRING))\n", + "\n", + "\tchat_message_history = CrateDBChatMessageHistory(\n", + "\t\tsession_id=\"test_session\",\n", + "\t\tconnection_string=CONNECTION_STRING,\n", + "\t\tcustom_message_converter=CustomMessageConverter(\n", + "\t\t\tauthor_email=\"test@example.com\"\n", + "\t\t)\n", + "\t)\n", + "\n", + "\tchat_message_history.add_user_message(\"Hello\")\n", + "\tchat_message_history.add_ai_message(\"Hi\")" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-08-28T10:04:41.510498Z", + "start_time": "2023-08-28T10:04:41.494912Z" + } + }, + "id": "fdfde84c07d071bb" + }, + { + "cell_type": "code", + "execution_count": 60, + "outputs": [ + { + "data": { + "text/plain": 
"[HumanMessage(content='Hello', additional_kwargs={}, example=False),\n AIMessage(content='Hi', additional_kwargs={}, example=False)]" + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chat_message_history.messages" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-08-28T10:04:43.497990Z", + "start_time": "2023-08-28T10:04:43.492517Z" + } + }, + "id": "4a6a54d8a9e2856f" + }, + { + "cell_type": "markdown", + "source": [ + "## Custom Name for Session Column\n", + "\n", + "The session id, a unique token identifying the session, is an important property of\n", + "this subsystem. If your database table stores it in a different column, you can use\n", + "the `session_id_field_name` keyword argument to adjust the name correspondingly." + ], + "metadata": { + "collapsed": false + }, + "id": "622aded629a1adeb" + }, + { + "cell_type": "code", + "execution_count": 57, + "outputs": [], + "source": [ + "import json\n", + "import typing as t\n", + "\n", + "from langchain.memory.chat_message_histories.cratedb import CrateDBMessageConverter\n", + "from langchain.schema import _message_to_dict\n", + "\n", + "\n", + "Base = declarative_base()\n", + "\n", + "class MessageWithDifferentSessionIdColumn(Base):\n", + "\t__tablename__ = \"message_store_different_session_id\"\n", + "\tid = sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())\n", + "\tcustom_session_id = sa.Column(sa.Text)\n", + "\tmessage = sa.Column(sa.Text)\n", + "\n", + "\n", + "class CustomMessageConverterWithDifferentSessionIdColumn(CrateDBMessageConverter):\n", + " def __init__(self):\n", + " self.model_class = MessageWithDifferentSessionIdColumn\n", + "\n", + " def to_sql_model(self, message: BaseMessage, custom_session_id: str) -> t.Any:\n", + " return self.model_class(\n", + " custom_session_id=custom_session_id, message=json.dumps(_message_to_dict(message))\n", + " )\n", + "\n", + "\n", + "if __name__ == 
\"__main__\":\n", + "\tBase.metadata.drop_all(bind=sa.create_engine(CONNECTION_STRING))\n", + "\n", + "\tchat_message_history = CrateDBChatMessageHistory(\n", + "\t\tsession_id=\"test_session\",\n", + "\t\tconnection_string=CONNECTION_STRING,\n", + "\t\tcustom_message_converter=CustomMessageConverterWithDifferentSessionIdColumn(),\n", + "\t\tsession_id_field_name=\"custom_session_id\",\n", + "\t)\n", + "\n", + "\tchat_message_history.add_user_message(\"Hello\")\n", + "\tchat_message_history.add_ai_message(\"Hi\")" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 58, + "outputs": [ + { + "data": { + "text/plain": "[HumanMessage(content='Hello', additional_kwargs={}, example=False),\n AIMessage(content='Hi', additional_kwargs={}, example=False)]" + }, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chat_message_history.messages" + ], + "metadata": { + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/docs/integrations/providers/cratedb.mdx b/docs/docs/integrations/providers/cratedb.mdx new file mode 100644 index 0000000000000..dde1adf25e983 --- /dev/null +++ b/docs/docs/integrations/providers/cratedb.mdx @@ -0,0 +1,203 @@ +# CrateDB + +This documentation section shows how to use the CrateDB vector store +functionality around [`FLOAT_VECTOR`] and [`KNN_MATCH`]. You will learn +how to use it for similarity search and other purposes. + + +## What is CrateDB? 
+ +[CrateDB] is an open-source, distributed, and scalable SQL analytics database +for storing and analyzing massive amounts of data in near real-time, even with +complex queries. It is PostgreSQL-compatible, based on [Lucene], and inherits +the shared-nothing distribution layer of [Elasticsearch]. + +It provides a distributed, multi-tenant-capable relational database and search +engine with HTTP and PostgreSQL interfaces, and schema-free objects. It supports +sharding, partitioning, and replication out of the box. + +CrateDB enables you to efficiently store billions of records and terabytes of +data, and to query them using SQL. + +- Provides a standards-based SQL interface for querying relational data, nested + documents, geospatial constraints, and vector embeddings at the same time. +- Improves your operations by storing time-series data, relational metadata, + and vector embeddings within a single database. +- Builds upon proven technologies from Lucene and Elasticsearch. + + +## CrateDB Cloud + +- Offers on-demand CrateDB clusters without operational overhead, + with enterprise-grade features and [ISO 27001] certification. +- The entry point to [CrateDB Cloud] is the [CrateDB Cloud Console]. +- Crate.io offers a free tier via [CrateDB Cloud CRFREE]. +- To get started, [sign up] for CrateDB Cloud, deploy a database cluster, + and follow the instructions below. + + +## Features + +The CrateDB adapter supports the _Vector Store_, _Document Loader_, +and _Conversational Memory_ subsystems of LangChain. + +### Vector Store + +`CrateDBVectorSearch` is an API wrapper around CrateDB's `FLOAT_VECTOR` type +and the corresponding `KNN_MATCH` function, based on SQLAlchemy and CrateDB's +SQLAlchemy dialect. It provides an interface to store and retrieve floating +point vectors, and to conduct similarity searches. + +Supports: +- Approximate nearest neighbor search. +- Euclidean distance.
+ +### Document Loader + +`CrateDBLoader` loads documents from a database table, using either an SQL +query expression or an SQLAlchemy selectable instance. + +### Conversational Memory + +`CrateDBChatMessageHistory` uses CrateDB to manage conversation history. + + +## Installation and Setup + +There are multiple ways to get started with CrateDB. + +### Install CrateDB on your local machine + +You can [download CrateDB], or use the [OCI image] to run CrateDB on Docker or Podman. +Note that this is not recommended for production use. + +```shell +docker run --rm -it --name=cratedb --publish=4200:4200 --publish=5432:5432 \ + --env=CRATE_HEAP_SIZE=4g crate/crate:nightly \ + -Cdiscovery.type=single-node +``` + +### Deploy a cluster on CrateDB Cloud + +[CrateDB Cloud] is a managed CrateDB service. Sign up for a [free trial]. + +### Install Client + +```bash +pip install crash langchain langchain-openai sqlalchemy-cratedb +``` + + +## Usage » Vector Store + +For a more detailed walkthrough of the `CrateDBVectorSearch` wrapper, there is also +a corresponding [Jupyter notebook](/docs/extras/integrations/vectorstores/cratedb.html). + +### Provide input data +The example uses the canonical `state_of_the_union.txt`. +```shell +wget https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt +``` + +### Set environment variables +Use a valid OpenAI API key and SQL connection string. This one fits a local instance of CrateDB. +```shell +export OPENAI_API_KEY=foobar +export CRATEDB_CONNECTION_STRING=crate://crate@localhost +``` + +### Example + +Load and index the documents, then run a query. +```python +from langchain.document_loaders import UnstructuredURLLoader +from langchain.embeddings.openai import OpenAIEmbeddings +from langchain.text_splitter import CharacterTextSplitter +from langchain.vectorstores import CrateDBVectorSearch + + +def main(): + # Load the document, split it into chunks, embed each chunk and load it into the vector store.
+ raw_documents = UnstructuredURLLoader(urls=["https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt"]).load() + text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) + documents = text_splitter.split_documents(raw_documents) + db = CrateDBVectorSearch.from_documents(documents, OpenAIEmbeddings()) + + query = "What did the president say about Ketanji Brown Jackson" + docs = db.similarity_search(query) + print(docs[0].page_content) + + +if __name__ == "__main__": + main() +``` + + +## Usage » Document Loader + +For a more detailed walkthrough of the `CrateDBLoader`, there is also a corresponding +[Jupyter notebook](/docs/extras/integrations/document_loaders/cratedb.html). + + +### Provide input data +```shell +wget https://github.com/crate-workbench/langchain/raw/cratedb/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql +crash < ./mlb_teams_2012.sql +crash --command "REFRESH TABLE mlb_teams_2012;" +``` + +### Load documents by SQL query +```python +from langchain.document_loaders import CrateDBLoader +from pprint import pprint + +def main(): + loader = CrateDBLoader( + 'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;', + url="crate://crate@localhost/", + ) + documents = loader.load() + pprint(documents) + +if __name__ == "__main__": + main() +``` + + +## Usage » Conversational Memory + +For a more detailed walkthrough of the `CrateDBChatMessageHistory`, there is also a corresponding +[Jupyter notebook](/docs/extras/integrations/memory/cratedb_chat_message_history.html).
+ +```python +from langchain.memory.chat_message_histories import CrateDBChatMessageHistory +from pprint import pprint + +def main(): + chat_message_history = CrateDBChatMessageHistory( + session_id="test_session", + connection_string="crate://crate@localhost/", + ) + chat_message_history.add_user_message("Hello") + chat_message_history.add_ai_message("Hi") + pprint(chat_message_history.messages) + +if __name__ == "__main__": + main() +``` + + +[CrateDB]: https://github.com/crate/crate +[CrateDB Cloud]: https://cratedb.com/product +[CrateDB Cloud Console]: https://console.cratedb.cloud/ +[CrateDB Cloud CRFREE]: https://community.crate.io/t/new-cratedb-cloud-edge-feature-cratedb-cloud-free-tier/1402 +[CrateDB SQLAlchemy dialect]: https://cratedb.com/docs/sqlalchemy-cratedb/ +[download CrateDB]: https://cratedb.com/download +[Elasticsearch]: https://github.com/elastic/elasticsearch +[`FLOAT_VECTOR`]: https://cratedb.com/docs/crate/reference/en/master/general/ddl/data-types.html#float-vector +[free trial]: https://cratedb.com/lp-crfree?utm_source=langchain +[ISO 27001]: https://cratedb.com/blog/cratedb-elevates-its-security-standards-and-achieves-iso-27001-certification +[`KNN_MATCH`]: https://cratedb.com/docs/crate/reference/en/master/general/builtins/scalar-functions.html#scalar-knn-match +[Lucene]: https://github.com/apache/lucene +[OCI image]: https://hub.docker.com/_/crate +[sign up]: https://console.cratedb.cloud/ diff --git a/docs/docs/integrations/vectorstores/cratedb.ipynb b/docs/docs/integrations/vectorstores/cratedb.ipynb new file mode 100644 index 0000000000000..b1ec55e364c97 --- /dev/null +++ b/docs/docs/integrations/vectorstores/cratedb.ipynb @@ -0,0 +1,582 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# CrateDB\n", + "\n", + "> [CrateDB] is capable of performing both vector and lexical search.\n", + "> It is built on top of the Apache Lucene library, talks SQL,\n", + "> is PostgreSQL-compatible, and scales like
Elasticsearch.\n", + "\n", + "This notebook shows how to use the CrateDB vector store functionality around\n", + "[`FLOAT_VECTOR`] and [`KNN_MATCH`]. You will learn how to use LangChain's\n", + "`CrateDBVectorSearch` adapter for similarity search and other purposes.\n", + "\n", + "It supports:\n", + "- Similarity Search with Euclidean Distance\n", + "- Maximal Marginal Relevance Search (MMR)\n", + "\n", + "## What is CrateDB?\n", + "\n", + "[CrateDB] is an open-source, distributed, and scalable SQL analytics database\n", + "for storing and analyzing massive amounts of data in near real-time, even with\n", + "complex queries. It is PostgreSQL-compatible, based on [Lucene], and inherits\n", + "the shared-nothing distribution layer of [Elasticsearch].\n", + "\n", + "This example uses the [Python client driver for CrateDB]. For more documentation,\n", + "see also [LangChain with CrateDB].\n", + "\n", + "\n", + "[CrateDB]: https://github.com/crate/crate\n", + "[Elasticsearch]: https://github.com/elastic/elasticsearch\n", + "[`FLOAT_VECTOR`]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#float-vector\n", + "[`KNN_MATCH`]: https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#scalar-knn-match\n", + "[LangChain with CrateDB]: /docs/extras/integrations/providers/cratedb.html\n", + "[Lucene]: https://github.com/apache/lucene\n", + "[Python client driver for CrateDB]: https://cratedb.com/docs/python/" + ] + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## Setup\n", + "\n", + "In order to use the CrateDB vector search you must install the sqlalchemy-cratedb package." 
+ ] + }, + { + "metadata": { + "tags": [], + "pycharm": { + "is_executing": true + } + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "# Install required packages: LangChain, OpenAI SDK, and the CrateDB SQLAlchemy adapter.\n", + "%pip install -qU langchain-community langchain-openai sqlalchemy-cratedb" + ] + }, + { + "metadata": {}, + "cell_type": "raw", + "source": [ + "### Credentials\n", + "\n", + "You will supply credentials through a regular SQLAlchemy connection string, like\n", + "`crate://username:password@cratedb.example.org/`." + ], + "outputs": null, + "execution_count": null + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "### OpenAI API key\n", + "\n", + "You need to provide an OpenAI API key, optionally using the environment\n", + "variable `OPENAI_API_KEY`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2023-09-09T08:02:16.802456Z", + "start_time": "2023-09-09T08:02:07.065604Z" + } + }, + "outputs": [], + "source": [ + "import os\n", + "import getpass\n", + "from dotenv import load_dotenv, find_dotenv\n", + "\n", + "# Run `export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY`.\n", + "# Get OpenAI api key from `.env` file.\n", + "# Otherwise, prompt for it.\n", + "_ = load_dotenv(find_dotenv())\n", + "OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY', getpass.getpass(\"OpenAI API key:\"))\n", + "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY" + ] + }, + { + "cell_type": "markdown", + "source": [ + "You also need to provide a connection string to your CrateDB database cluster,\n", + "optionally using the environment variable `CRATEDB_CONNECTION_STRING`.\n", + "\n", + "This example uses a CrateDB instance on your workstation, which you can start by\n", + "running [CrateDB using Docker]. 
Alternatively, you can also connect to a cluster\n", + "running on [CrateDB Cloud].\n", + "\n", + "[CrateDB Cloud]: https://console.cratedb.cloud/\n", + "[CrateDB using Docker]: https://cratedb.com/docs/guide/install/container/\n", + "\n", + "### CrateDB connection string\n", + "\n", + "You will need to supply an SQLAlchemy-compatible connection string." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import os\n", + "\n", + "CONNECTION_STRING = os.environ.get(\n", + " \"CRATEDB_CONNECTION_STRING\",\n", + " \"crate://crate@localhost:4200/?schema=langchain\",\n", + ")\n", + "\n", + "# For CrateDB Cloud, use:\n", + "# CONNECTION_STRING = os.environ.get(\n", + "# \"CRATEDB_CONNECTION_STRING\",\n", + "# \"crate://username:password@hostname:4200/?ssl=true&schema=langchain\",\n", + "# )" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2023-09-09T08:02:28.174088Z", + "start_time": "2023-09-09T08:02:28.162698Z" + } + }, + "outputs": [], + "source": [ + "\"\"\"\n", + "# Alternatively, the connection string can be assembled from individual\n", + "# environment variables.\n", + "import os\n", + "\n", + "CONNECTION_STRING = CrateDBVectorSearch.connection_string_from_db_params(\n", + " driver=os.environ.get(\"CRATEDB_DRIVER\", \"crate\"),\n", + " host=os.environ.get(\"CRATEDB_HOST\", \"localhost\"),\n", + " port=int(os.environ.get(\"CRATEDB_PORT\", \"4200\")),\n", + " database=os.environ.get(\"CRATEDB_DATABASE\", \"langchain\"),\n", + " user=os.environ.get(\"CRATEDB_USER\", \"crate\"),\n", + " password=os.environ.get(\"CRATEDB_PASSWORD\", \"\"),\n", + ")\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Import Python Modules\n", + "\n", + "You will start by importing all required modules." 
+ ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_text_splitters import CharacterTextSplitter\n", + "from langchain.vectorstores import CrateDBVectorSearch\n", + "from langchain_community.document_loaders import UnstructuredURLLoader\n", + "from langchain_core.documents import Document" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Manage vector store\n", + "\n", + "The `from_documents` example in \"Load and Index Documents\" creates a vector store from scratch. To work with\n", + "an existing vector store instead, you can initialize it directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "store = CrateDBVectorSearch(\n", + " collection_name=COLLECTION_NAME,\n", + " connection_string=CONNECTION_STRING,\n", + " embedding_function=embeddings,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Add items to vector store\n", + "\n", + "You can also add documents to an existing vector store."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "uuids = store.add_documents([Document(page_content=\"foo\")])" + ] + }, + { + "cell_type": "code", + "metadata": { + "jupyter": { + "is_executing": true + } + }, + "source": [ + "docs_with_score = store.similarity_search_with_score(\"foo\")" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "docs_with_score[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "docs_with_score[1]" + ] + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### Update items in vector store\n", + "\n", + "FIXME" + ] + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": "# Foo." + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### Delete items from vector store\n", + "You can delete documents by passing their IDs, as returned by `add_documents`, to `delete`." + ] + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": "store.delete(ids=[uuids[-1]])" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Overwriting a Vector Store\n", + "\n", + "If you have an existing collection, you can overwrite it by using `from_documents`,\n", + "and setting `pre_delete_collection=True`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "db = CrateDBVectorSearch.from_documents(\n", + " documents=docs,\n", + " embedding=embeddings,\n", + " collection_name=COLLECTION_NAME,\n", + " connection_string=CONNECTION_STRING,\n", + " pre_delete_collection=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "docs_with_score = db.similarity_search_with_score(\"foo\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "docs_with_score[0]" + ] + }, + { + "metadata": { + "collapsed": false + }, + "cell_type": "markdown", + "source": [ + "### Load and Index Documents\n", + "\n", + "Next, you will read input data and split it into chunks. The module will create a table\n", + "with the name of the collection. Make sure the collection name is unique, and\n", + "that you have permission to create a table."
+ ] + }, + { + "metadata": { + "collapsed": false, + "pycharm": { + "is_executing": true + } + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "# `UnstructuredURLLoader` expects a list of URLs, not a single string.\n", + "loader = UnstructuredURLLoader(urls=[\"https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt\"])\n", + "documents = loader.load()\n", + "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", + "docs = text_splitter.split_documents(documents)\n", + "\n", + "COLLECTION_NAME = \"state_of_the_union_test\"\n", + "\n", + "embeddings = OpenAIEmbeddings()\n", + "\n", + "db = CrateDBVectorSearch.from_documents(\n", + " embedding=embeddings,\n", + " documents=docs,\n", + " collection_name=COLLECTION_NAME,\n", + " connection_string=CONNECTION_STRING,\n", + ")" + ] + }, + { + "metadata": { + "collapsed": false + }, + "cell_type": "markdown", + "source": [ + "## Query vector store\n", + "\n", + "### Query directly\n", + "\n", + "#### Similarity search\n", + "Searching by Euclidean distance is the default."
+ ] + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2023-09-09T08:05:11.104135Z", + "start_time": "2023-09-09T08:05:10.548998Z" + } + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "docs_with_score = db.similarity_search_with_score(query)" + ] + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2023-09-09T08:05:13.532334Z", + "start_time": "2023-09-09T08:05:13.523191Z" + } + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "for doc, score in docs_with_score:\n", + " print(\"-\" * 80)\n", + " print(\"Score: \", score)\n", + " print(doc.page_content)\n", + " print(\"-\" * 80)" + ] + }, + { + "metadata": { + "collapsed": false + }, + "cell_type": "markdown", + "source": [ + "#### Maximal Marginal Relevance Search (MMR)\n", + "Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents." + ] + }, + { + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-09-09T08:05:23.276819Z", + "start_time": "2023-09-09T08:05:21.972256Z" + } + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": "docs_with_score = db.max_marginal_relevance_search_with_score(query)" + }, + { + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2023-09-09T08:05:27.478580Z", + "start_time": "2023-09-09T08:05:27.470138Z" + } + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "for doc, score in docs_with_score:\n", + " print(\"-\" * 80)\n", + " print(\"Score: \", score)\n", + " print(doc.page_content)\n", + " print(\"-\" * 80)" + ] + }, + { + "metadata": { + "collapsed": false + }, + "cell_type": "markdown", + "source": [ + "#### Searching in Multiple Collections\n", + "`CrateDBVectorSearchMultiCollection` is a special adapter which provides similarity search across\n", + "multiple collections. 
It cannot be used for indexing documents." + ] + }, + { + "metadata": { + "collapsed": false + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "from langchain.vectorstores.cratedb import CrateDBVectorSearchMultiCollection\n", + "\n", + "multisearch = CrateDBVectorSearchMultiCollection(\n", + " collection_names=[\"test_collection_1\", \"test_collection_2\"],\n", + " embedding_function=embeddings,\n", + " connection_string=CONNECTION_STRING,\n", + ")\n", + "docs_with_score = multisearch.similarity_search_with_score(query)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "### Query by turning into retriever" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "retriever = store.as_retriever()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(retriever)" + ] + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## Usage for retrieval-augmented generation\n", + "\n", + "For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:\n", + "\n", + "- [Tutorials: working with external knowledge](https://python.langchain.com/docs/tutorials/#working-with-external-knowledge)\n", + "- [How-to: Question and answer with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)\n", + "- [Retrieval conceptual docs](https://python.langchain.com/docs/concepts/retrieval)" + ] + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all `CrateDBVectorSearch` features and configurations,\n", + "head to the API reference:\n", + "https://python.langchain.com/api_reference/cratedb/vectorstores/langchain_cratedb.vectorstores.CrateDBVectorSearch.html" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language":
"python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.1" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}