diff --git a/README.md b/README.md index 86722cb..2cf5ca9 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,8 @@ Selfie personalizes text generation, augmenting both local and hosted Large Language Models (LLMs) with your personal data. Selfie exposes an OpenAI-compatible API that wraps the LLM of your choice, and automatically injects relevant context into each text generation request. +selfie-augmentation + ## Features * Automatically mix your data into chat and text completions using OpenAI-compatible clients like [OpenAI SDKs](https://platform.openai.com/docs/libraries), [SillyTavern](https://sillytavernai.com), and [Instructor](https://github.com/jxnl/instructor)* (untested). @@ -17,10 +19,10 @@ Selfie personalizes text generation, augmenting both local and hosted Large Lang * Runs locally by default to keep your data private. * Unopinionated compatibility with your LLM or provider of choice. * Easily switch to vanilla text generation modes. +* Directly and selectively query loaded data. On the roadmap: -* Load data using any [LlamaHub loader](https://llamahub.ai/?tab=loaders) (partial support is available through the API). -* Directly and selectively query loaded data. +* Load data using any [LlamaHub loader](https://llamahub.ai/?tab=loaders). * Easy deployment with Docker and pre-built executables. ## Overview @@ -60,56 +62,29 @@ This starts a local web server and should launch the UI in your browser at http: > Note: You can host selfie at a publicly-accessible URL with [ngrok](https://ngrok.com). Add your ngrok token (and optionally, ngrok domain) in `selfie/.env` and run `poetry run python -m selfie --share`. -### Step 1: Gather Messaging Data - -Future versions of Selfie will support loading any text data. For now, you can import chat logs from popular messaging platforms. - -> Note: If you don't have any chat logs or want to try the app first, you can use the example chat logs provided in the `example-chats` directory.) 
- -Export chats that you use frequently and contain information you want the LLM to know. - -#### Export Instructions - -The following links provide instructions for exporting chat logs from popular messaging platforms: - -* [WhatsApp](https://faq.whatsapp.com/1180414079177245/?cms_platform=android) -* [Google](https://takeout.google.com/settings/takeout) (select Messages from the list) - -These platforms are not yet supported, but you can create a parser in selfie/parsers/chats to support them (please contribute!): - -* [Instagram](https://help.instagram.com/181231772500920) -* [Facebook Messenger](https://www.facebook.com/help/messenger-app/713635396288741/?cms_platform=iphone-app&helpref=platform_switcher) -* [Telegram](https://www.maketecheasier.com/export-telegram-chat-history/) +### Step 1: Import Your Data -Ensure you ask permission of the friends who are also in the chats you export. +Selfie supports importing text data, with special processing for certain data formats, like chat logs from WhatsApp and ChatGPT. -[//]: # (You can also redact their name, messages, and other personal information in later steps.) +> Note: You can use the example files in the `example-chats` directory if you want to try out a data format that you don't have ready for import. -### Step 2: Import Messaging Data +To import data into Selfie: -1. Place your exported chat logs in a directory on your computer, e.g. `/home/alice/chats`. -2. Open the UI at http://localhost:8181. -3. Add your directory as a Data Source. Give it a name (e.g. My Chats), enter the **absolute** path, and click `Add Directory`. This must be a directory (i.e. folder), not a file. Example absolute path would be: `/Users/{you}/Projects/selfie/example-chats` -4. In the Documents table, select the exported chat logs you want to import, and click `Index`. +1. **Open the Add Data Page**: Access the UI and locate the Add Data section. +2.
**Select Data Source**: Choose the type of data you are uploading (e.g., WhatsApp, Text Files), picking the option that most closely matches your data format. +3. **Upload Files**: Choose your files and submit them for upload. -If this process is successful, your selected chat logs will show as indexed in the table. You can now use the API to connect to your LLM and generate personalized text completions. +Ensure you obtain consent from participants in the chats you wish to export. -[//]: # (1. Open http://localhost:8181/docs) -[//]: # (2. Find `POST /v1/index_documents/chat-processor`) -[//]: # (3. Upload one or more exported chat log files. To get these files, export them from platforms that you use frequently and contain information you want the LLM to know. Exports: [WhatsApp](https://faq.whatsapp.com/1180414079177245/?cms_platform=android) | [Google](https://takeout.google.com/settings/takeout) | [Instagram](https://help.instagram.com/181231772500920) | [Facebook Messenger](https://www.facebook.com/help/messenger-app/713635396288741/?cms_platform=iphone-app&helpref=platform_switcher) | [Telegram](https://www.maketecheasier.com/export-telegram-chat-history/). Ensure you ask permission of the friend who is also in the chat you export. You can also redact their name, messages, and other personal information in later steps.) -[//]: # (4. Copy, paste, and edit the example parser_configs JSON. Include one configuration object in the list for each file you upload.) -[//]: # () -[//]: # (![chat-processor.png](docs/images/chat-processor.png)) -[//]: # () -[//]: # (Setting `extract_importance` to `true` will give you better query results, but usually causes the import to take a while.) +Support for new types of data can be added by creating new data connectors in `selfie/connectors/` (instructions [here](./selfie/connectors/README.md), please contribute!).
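The connector interface that the new files in this diff implement (an `id` and display `name` set in `__init__`, plus `load_document`, `validate_configuration`, and `transform_for_embedding`) can be sketched roughly as follows. This is a hypothetical, self-contained sketch: `MyNotesConnector` and its plain-text configuration are invented for illustration, and the `DocumentDTO` defined here only stands in for `selfie.types.documents.DocumentDTO` so the snippet runs on its own.

```python
# Hypothetical sketch of a new data connector, modeled on the connectors
# added in this diff. The real BaseConnector and DocumentDTO live in the
# selfie package; minimal stand-ins are defined here for illustration.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class DocumentDTO:  # stand-in for selfie.types.documents.DocumentDTO
    content: str
    content_type: str
    name: str
    size: int


class MyNotesConnector:  # in selfie, this would subclass BaseConnector
    def __init__(self):
        self.id = "my_notes"    # unique connector id
        self.name = "My Notes"  # display name shown in the Add Data UI

    def load_document(self, configuration: Dict[str, Any]) -> List[DocumentDTO]:
        # The real connectors receive uploads as data URIs; here we assume
        # plain text keyed by file name, purely for illustration.
        return [
            DocumentDTO(
                content=text,
                content_type="text/plain",
                name=name,
                size=len(text.encode("utf-8")),
            )
            for name, text in configuration.get("files", {}).items()
        ]


docs = MyNotesConnector().load_document({"files": {"note.txt": "hello"}})
```

A real connector would also ship a `schema.json`/`uischema.json` pair describing its upload form, as the connectors below do.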
-### Step 3: Generate Personalized Text +### Step 2: Engage with Your Data -You can quickly verify if everything is in order by visiting the summarization endpoint in your browser: http://localhost:8181/v1/index_documents/summary?topic=travel ([docs](http://localhost:8181/docs#/default/get_index_documents_summary_v1_index_documents_summary_get)). +The Playground page includes a chat interface and a Search feature. Create an LLM persona by entering a name and bio, and try interacting with your data through conversation. You can also search your data for specific topics under Search. -Next, scroll down to the Playground section in the UI. Enter your name and a simple bio, and try asking some questions whose answers are in your chat logs. +You can interact with your data directly via the API; for instance, try viewing this link in your web browser: http://localhost:8181/v1/index_documents/summary?topic=travel. Detailed API documentation is available [here](http://localhost:8181/docs). -## Usage Guide +## API Usage Guide By default, Selfie augments text completions with local models using llama.cpp and a local txtai embeddings database. diff --git a/docs/images/playground-augmentation.png b/docs/images/playground-augmentation.png new file mode 100644 index 0000000..9c4369d Binary files /dev/null and b/docs/images/playground-augmentation.png differ diff --git a/poetry.lock b/poetry.lock index 2e76f87..4339821 100644 --- a/poetry.lock +++ b/poetry.lock @@ -1,4 +1,4 @@ -# This file is automatically @generated by Poetry 1.7.1 and should not be changed by hand. +# This file is automatically @generated by Poetry 1.6.1 and should not be changed by hand.
[[package]] name = "absl-py" @@ -3045,31 +3045,26 @@ python-versions = ">=3.8" files = [ {file = "PyMuPDF-1.23.22-cp310-none-macosx_10_9_x86_64.whl", hash = "sha256:e301481b8549d1d0497e1072563de9d8ea70b374263933e9906b69e1b358cf0c"}, {file = "PyMuPDF-1.23.22-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:e416948cc5050e012ebe28ee15c6bba23aaae41fd248fc9043eb7f934b30c303"}, - {file = "PyMuPDF-1.23.22-cp310-none-manylinux2014_aarch64.whl", hash = "sha256:9dc8470905648f9b55f4cd899854f60f0f3d6bec13984ada730bdc8727aa3a64"}, {file = "PyMuPDF-1.23.22-cp310-none-manylinux2014_x86_64.whl", hash = "sha256:8b27ce82af2edf973d1b83ff52641e29b53386de3e953a872d76ed2b1cf0a320"}, {file = "PyMuPDF-1.23.22-cp310-none-win32.whl", hash = "sha256:61b8a44a61504edd2f8975c15a7d56c8877c2e760600c263aa62f3f065ba42db"}, {file = "PyMuPDF-1.23.22-cp310-none-win_amd64.whl", hash = "sha256:23400f405b2a6b88c69676df3e8c2001eb655c910b7077fc2af2811c3c38a63e"}, {file = "PyMuPDF-1.23.22-cp311-none-macosx_10_9_x86_64.whl", hash = "sha256:ec4cd4894f8edde505856b9426c67e9c57318f5e283b44634ebc15a2ec9fe532"}, {file = "PyMuPDF-1.23.22-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:ad70aba698f2382f694902c49b258b7393247b715d429acc493b9d37ecbe96fe"}, - {file = "PyMuPDF-1.23.22-cp311-none-manylinux2014_aarch64.whl", hash = "sha256:c8a4d945a5980f996d4a4da9f385f33937aebd637417c091c4bef50f5a78dfe4"}, {file = "PyMuPDF-1.23.22-cp311-none-manylinux2014_x86_64.whl", hash = "sha256:68543c6958876d246e18d290bc2250633a84411806296d44306a015e6ff64239"}, {file = "PyMuPDF-1.23.22-cp311-none-win32.whl", hash = "sha256:78ec6364fee90bcefae7f036a3c115bf4ec85f5d7af56979f237c96fbb5fc57b"}, {file = "PyMuPDF-1.23.22-cp311-none-win_amd64.whl", hash = "sha256:5b8b7ad2a1d27c4a48219a913cd9b6b7d48eb443bc6ca12cea9287b2c7aede5d"}, {file = "PyMuPDF-1.23.22-cp312-none-macosx_10_9_x86_64.whl", hash = "sha256:5a7a720656b8efc00e5b3e42edbb74dd51484268b38edc34ba12dd7fc77d0048"}, {file = "PyMuPDF-1.23.22-cp312-none-macosx_11_0_arm64.whl", hash = 
"sha256:80887102345bc7452a5b45a69d1842131ab2d7652d272ce7b0619885775f6bfb"}, - {file = "PyMuPDF-1.23.22-cp312-none-manylinux2014_aarch64.whl", hash = "sha256:8bdb49633242dfe89c29345f58e06b55dfd834774db2e4481dad82ad89d1eb3b"}, {file = "PyMuPDF-1.23.22-cp312-none-manylinux2014_x86_64.whl", hash = "sha256:13cb22263fa5e9ec87f46f74ef3ba5ad57200a07764fcf5918aa64118056de58"}, {file = "PyMuPDF-1.23.22-cp312-none-win32.whl", hash = "sha256:d58b328faad077297efee0a808490149b1796a359f737fb74f9e2125632d0347"}, {file = "PyMuPDF-1.23.22-cp312-none-win_amd64.whl", hash = "sha256:2878fecb1cb4e1a03f33ca786672c236400a811f310e2fa2929c30445b88952c"}, {file = "PyMuPDF-1.23.22-cp38-none-macosx_10_9_x86_64.whl", hash = "sha256:c79b5eee74f4138b1bd0abc1ecd06e551ebabff2262ac88db625311709e08a9b"}, {file = "PyMuPDF-1.23.22-cp38-none-macosx_11_0_arm64.whl", hash = "sha256:96b3188bc12ce96e92673a8e9133328cb9fa050289ff9dd9e2a3716d46c8d62a"}, - {file = "PyMuPDF-1.23.22-cp38-none-manylinux2014_aarch64.whl", hash = "sha256:ec2385777a910a9531f6df58339dbcea65efc0c5e90384541550598111b658f4"}, {file = "PyMuPDF-1.23.22-cp38-none-manylinux2014_x86_64.whl", hash = "sha256:20fa2094b4adda4902e47842d4d34334e6f09dcd0db8ffb9fe71f45dd33349ab"}, {file = "PyMuPDF-1.23.22-cp38-none-win32.whl", hash = "sha256:c134307e4a2990599b291c6de28a3411851f08980e4bee05576361fdf726e3fe"}, {file = "PyMuPDF-1.23.22-cp38-none-win_amd64.whl", hash = "sha256:f03e131c5aadc63d15b2dff096ee520a9f66852a6712dd6005633bcad7c386ac"}, {file = "PyMuPDF-1.23.22-cp39-none-macosx_10_9_x86_64.whl", hash = "sha256:34380f046b117d10a4c06942a0cc19b843b8ce35c322543437f4d64673b64165"}, {file = "PyMuPDF-1.23.22-cp39-none-macosx_11_0_arm64.whl", hash = "sha256:475eb3cf564aef3a2de98df72f478a5508c9eadc7072bb8d59c9c7e5b6611ba8"}, - {file = "PyMuPDF-1.23.22-cp39-none-manylinux2014_aarch64.whl", hash = "sha256:67085a596a4413989ae449ca5ad41c82289359d2130ca1fdeae014cf163f92f3"}, {file = "PyMuPDF-1.23.22-cp39-none-manylinux2014_x86_64.whl", hash = 
"sha256:3c999c1ac8050afb2330349ac9a3c3b8fddd72749665a670ca4ba992b9c570f5"}, {file = "PyMuPDF-1.23.22-cp39-none-win32.whl", hash = "sha256:df5d47f63db5ad4a83cb89e35243a3d0a221be23c3535c8d953fc79a47bd6635"}, {file = "PyMuPDF-1.23.22-cp39-none-win_amd64.whl", hash = "sha256:4fb6f0bd2ce12eb2964e7f2b81568d0f9848207e347850095724af4b6bdecf96"}, @@ -3088,7 +3083,6 @@ python-versions = ">=3.8" files = [ {file = "PyMuPDFb-1.23.22-py3-none-macosx_10_9_x86_64.whl", hash = "sha256:9085a1e2fbf16f2820f9f7ad3d25e85f81d9b9eb0409110c1670d4cf5a27a678"}, {file = "PyMuPDFb-1.23.22-py3-none-macosx_11_0_arm64.whl", hash = "sha256:01016dd33220cef4ecaf929d09fd27a584dc3ec3e5c9f4112dfe63613ea35135"}, - {file = "PyMuPDFb-1.23.22-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:cf50e814db91f2a2325219302fbac229a23682c372cf8232aabd51ea3f18210e"}, {file = "PyMuPDFb-1.23.22-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:3ffa713ad18e816e584c8a5f569995c32d22f8ac76ab6e4a61f2d2983c4b73d9"}, {file = "PyMuPDFb-1.23.22-py3-none-win32.whl", hash = "sha256:d00e372452845aea624659c302d25e935052269fd3aafe26948301576d6f2ee8"}, {file = "PyMuPDFb-1.23.22-py3-none-win_amd64.whl", hash = "sha256:7c9c157281fdee9f296e666a323307dbf74cb38f017921bb131fa7bfcd39c2bd"}, @@ -3289,6 +3283,7 @@ files = [ {file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"}, {file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"}, {file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"}, + {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a08c6f0fe150303c1c6b71ebcd7213c2858041a7e01975da3a99aed1e7a378ef"}, {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = 
"sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"}, {file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"}, {file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"}, @@ -3937,7 +3932,7 @@ files = [ ] [package.dependencies] -greenlet = {version = "!=0.4.17", optional = true, markers = "platform_machine == \"aarch64\" or platform_machine == \"ppc64le\" or platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"AMD64\" or platform_machine == \"win32\" or platform_machine == \"WIN32\" or extra == \"asyncio\""} +greenlet = {version = "!=0.4.17", optional = true, markers = "platform_machine == \"win32\" or platform_machine == \"WIN32\" or platform_machine == \"AMD64\" or platform_machine == \"amd64\" or platform_machine == \"x86_64\" or platform_machine == \"ppc64le\" or platform_machine == \"aarch64\" or extra == \"asyncio\""} typing-extensions = ">=4.6.0" [package.extras] @@ -5105,4 +5100,4 @@ gpu = ["auto-gptq", "autoawq", "optimum"] [metadata] lock-version = "2.0" python-versions = ">=3.11,<3.12" -content-hash = "6e0c082803aee00b213535ca5868cd694bd03de280fbcf4630855d5a47486b5d" +content-hash = "3c3da1501932f8ba4f90732803b9bd0b3892c69ed285c91873d32d966464a358" diff --git a/pyproject.toml b/pyproject.toml index b55104b..c2b4203 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -7,6 +7,7 @@ readme = "README.md" [tool.poetry.dependencies] python = ">=3.11,<3.12" +beautifulsoup4 = "^4.12.3" fastapi = "^0.109.0" uvicorn = "^0.27.0" humanize = "^4.9.0" diff --git a/selfie-ui/src/app/components/Markdown.tsx b/selfie-ui/src/app/components/Markdown.tsx index e86b8c4..7c9eda6 100644 --- a/selfie-ui/src/app/components/Markdown.tsx +++ b/selfie-ui/src/app/components/Markdown.tsx @@ -5,7 +5,7 @@ import rehypeSanitize from 'rehype-sanitize'; export const Markdown = ({ 
content }: { content: string }) => { return ( diff --git a/selfie/connectors/factory.py b/selfie/connectors/factory.py index ea9c76f..b6c8616 100644 --- a/selfie/connectors/factory.py +++ b/selfie/connectors/factory.py @@ -1,3 +1,6 @@ +from selfie.connectors.text_files.connector import TextFilesConnector +from selfie.connectors.google_messages.connector import GoogleMessagesConnector +from selfie.connectors.telegram.connector import TelegramConnector from selfie.connectors.whatsapp.connector import WhatsAppConnector from selfie.connectors.chatgpt.connector import ChatGPTConnector @@ -5,8 +8,11 @@ class ConnectorFactory: # Register all document connectors here connector_registry = [ + ChatGPTConnector, + GoogleMessagesConnector, + TelegramConnector, + TextFilesConnector, WhatsAppConnector, - ChatGPTConnector ] connector_map = {} diff --git a/selfie/connectors/google_messages/connector.py b/selfie/connectors/google_messages/connector.py new file mode 100644 index 0000000..d35372a --- /dev/null +++ b/selfie/connectors/google_messages/connector.py @@ -0,0 +1,53 @@ +from abc import ABC +from typing import Any, List + +from selfie.connectors.base_connector import BaseConnector +from selfie.database import BaseModel +from selfie.embeddings import EmbeddingDocumentModel, DataIndex +from selfie.parsers.chat import ChatFileParser +from selfie.types.documents import DocumentDTO +from selfie.utils import data_uri_to_string + + +class GoogleMessagesConfiguration(BaseModel): + files: List[str] + + +class GoogleMessagesConnector(BaseConnector, ABC): + def __init__(self): + super().__init__() + self.id = "google_messages" + self.name = "Google Messages" + + def load_document(self, configuration: dict[str, Any]) -> List[DocumentDTO]: + config = GoogleMessagesConfiguration(**configuration) + + return [ + DocumentDTO( + content=data_uri_to_string(data_uri), + content_type="text/plain", + name="todo", + size=len(data_uri_to_string(data_uri).encode('utf-8')) + ) + for data_uri in 
config.files ] + + def validate_configuration(self, configuration: dict[str, Any]): + # TODO: check if file can be read from path + pass + + def transform_for_embedding(self, configuration: dict[str, Any], documents: List[DocumentDTO]) -> List[EmbeddingDocumentModel]: + return [ + embeddingDocumentModel + for document in documents + for embeddingDocumentModel in DataIndex.map_share_gpt_data( + ChatFileParser().parse_document( + document=document.content, + parser_type="google_messages", + mask=False, + document_name=document.name, + ).conversations, + source="google_messages", + source_document_id=document.id + ) + ] diff --git a/selfie/connectors/google_messages/documentation.md b/selfie/connectors/google_messages/documentation.md new file mode 100644 index 0000000..15560b7 --- /dev/null +++ b/selfie/connectors/google_messages/documentation.md @@ -0,0 +1,9 @@ +## Export Instructions + +Google Takeout is a service that allows you to download a copy of your data stored within Google products. To export your Google Messages chat history, follow the instructions below. + +1. Go to Google Takeout and log in to your Google account. +2. Select "Deselect all" and then scroll down to select "Messages" from the list of Google products. (Note: `Messages` may not appear in the list if you have not used Google Messages in the past.) +3. Click "Next step" and choose your delivery method, frequency, and file type. +4. Click "Create export" to start the process. Once completed, you will receive an email with a link to download your exported data. +5. Download the .zip file and extract the `.json` files in the `Messages` folder to access your chat files.
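The connectors in this diff receive each uploaded file as a data URI and decode it with `selfie.utils.data_uri_to_string`, whose implementation is not shown here. A rough stdlib-only stand-in, assuming the payload is either base64- or percent-encoded text, might look like:

```python
import base64
from urllib.parse import unquote


def data_uri_to_string(uri: str) -> str:
    """Rough stand-in for selfie.utils.data_uri_to_string (the real
    implementation is not part of this diff): decode a data URI's
    payload, base64- or percent-encoded, into text."""
    header, _, payload = uri.partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload).decode("utf-8")
    return unquote(payload)


decoded = data_uri_to_string("data:text/plain;base64,aGVsbG8=")  # "hello"
```

Note the connectors above call this twice per file (once for `content`, once for `size`); caching the decoded string would avoid the duplicate decode.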
diff --git a/selfie/connectors/google_messages/schema.json b/selfie/connectors/google_messages/schema.json new file mode 100644 index 0000000..8e777b5 --- /dev/null +++ b/selfie/connectors/google_messages/schema.json @@ -0,0 +1,14 @@ +{ + "title": "Upload Google Messages Conversations", + "type": "object", + "properties": { + "files": { + "type": "array", + "title": "Files", + "description": "Upload .json files exported from Google Messages", + "items": { + "type": "object" + } + } + } +} diff --git a/selfie/connectors/google_messages/uischema.json b/selfie/connectors/google_messages/uischema.json new file mode 100644 index 0000000..f4ef5a6 --- /dev/null +++ b/selfie/connectors/google_messages/uischema.json @@ -0,0 +1,8 @@ +{ + "files": { + "ui:widget": "nativeFile", + "ui:options": { + "accept": ".json" + } + } +} diff --git a/selfie/connectors/telegram/connector.py b/selfie/connectors/telegram/connector.py new file mode 100644 index 0000000..6c5f41a --- /dev/null +++ b/selfie/connectors/telegram/connector.py @@ -0,0 +1,53 @@ +from abc import ABC +from typing import Any, List + +from selfie.connectors.base_connector import BaseConnector +from selfie.database import BaseModel +from selfie.embeddings import EmbeddingDocumentModel, DataIndex +from selfie.parsers.chat import ChatFileParser +from selfie.types.documents import DocumentDTO +from selfie.utils import data_uri_to_string + + +class TelegramConfiguration(BaseModel): + files: List[str] + + +class TelegramConnector(BaseConnector, ABC): + def __init__(self): + super().__init__() + self.id = "telegram" + self.name = "Telegram" + + def load_document(self, configuration: dict[str, Any]) -> List[DocumentDTO]: + config = TelegramConfiguration(**configuration) + + return [ + DocumentDTO( + content=data_uri_to_string(data_uri), + content_type="text/plain", + name="todo", + size=len(data_uri_to_string(data_uri).encode('utf-8')) + ) + for data_uri in config.files + ] + + def validate_configuration(self, configuration: 
dict[str, Any]): + # TODO: check if file can be read from path + pass + + def transform_for_embedding(self, configuration: dict[str, Any], documents: List[DocumentDTO]) -> List[EmbeddingDocumentModel]: + return [ + embeddingDocumentModel + for document in documents + for embeddingDocumentModel in DataIndex.map_share_gpt_data( + ChatFileParser().parse_document( + document=document.content, + parser_type="telegram", + mask=False, + document_name=document.name, + ).conversations, + source="telegram", + source_document_id=document.id + ) + ] diff --git a/selfie/connectors/telegram/documentation.md b/selfie/connectors/telegram/documentation.md new file mode 100644 index 0000000..b7b5849 --- /dev/null +++ b/selfie/connectors/telegram/documentation.md @@ -0,0 +1,10 @@ +## Export Instructions + +To export your Telegram conversations, follow the instructions below (based on official export instructions). + +1. Install the Telegram desktop app from the official website if you haven't already. +2. Open the Telegram desktop app and log in. +3. Navigate to and open the chat you wish to export. +4. Click on the three dots (...) at the top-right of the chat window and select "Export chat history". +5. In the export settings, deselect all options to export text only. +6. Click "Export" to start the process. Once completed, click "Show my data" or navigate to the "Telegram Desktop" folder in your "Downloads" directory to access the exported `messages.html` file. 
diff --git a/selfie/connectors/telegram/schema.json b/selfie/connectors/telegram/schema.json new file mode 100644 index 0000000..995779f --- /dev/null +++ b/selfie/connectors/telegram/schema.json @@ -0,0 +1,14 @@ +{ + "title": "Upload Telegram Conversations", + "type": "object", + "properties": { + "files": { + "type": "array", + "title": "Files", + "description": "Upload .html files exported from Telegram", + "items": { + "type": "object" + } + } + } +} diff --git a/selfie/connectors/telegram/uischema.json b/selfie/connectors/telegram/uischema.json new file mode 100644 index 0000000..39b72a9 --- /dev/null +++ b/selfie/connectors/telegram/uischema.json @@ -0,0 +1,8 @@ +{ + "files": { + "ui:widget": "nativeFile", + "ui:options": { + "accept": ".html" + } + } +} diff --git a/selfie/connectors/text_files/connector.py b/selfie/connectors/text_files/connector.py new file mode 100644 index 0000000..d54c8fd --- /dev/null +++ b/selfie/connectors/text_files/connector.py @@ -0,0 +1,50 @@ +from abc import ABC +from typing import Any, List + +from llama_index.core.node_parser import SentenceSplitter + +from selfie.connectors.base_connector import BaseConnector +from selfie.database import BaseModel, DataManager +from selfie.embeddings import EmbeddingDocumentModel +from selfie.types.documents import DocumentDTO +from selfie.utils import data_uri_to_string + + +class TextFilesConfiguration(BaseModel): + files: List[str] + + +class TextFilesConnector(BaseConnector, ABC): + def __init__(self): + super().__init__() + self.id = "text_files" + self.name = "Text Files" + + def load_document(self, configuration: dict[str, Any]) -> List[DocumentDTO]: + config = TextFilesConfiguration(**configuration) + + return [ + DocumentDTO( + content=data_uri_to_string(data_uri), + content_type="text/plain", + name="todo", + size=len(data_uri_to_string(data_uri).encode('utf-8')) + ) + for data_uri in config.files + ] + + def validate_configuration(self, configuration: dict[str, Any]): + # TODO: 
check if file can be read from path + pass + + def transform_for_embedding(self, configuration: dict[str, Any], documents: List[DocumentDTO]) -> List[EmbeddingDocumentModel]: + return [ + EmbeddingDocumentModel( + text=text_chunk, + source="text_files", + timestamp=DataManager._extract_timestamp(document), + source_document_id=document.id, + ) + for document in documents + for text_chunk in SentenceSplitter(chunk_size=1024).split_text(document.content) + ] diff --git a/selfie/connectors/text_files/documentation.md b/selfie/connectors/text_files/documentation.md new file mode 100644 index 0000000..88992ac --- /dev/null +++ b/selfie/connectors/text_files/documentation.md @@ -0,0 +1,3 @@ +## Instructions + +Upload any text files. If there is a tailored connector for your content (e.g., WhatsApp chat exports), you should use that instead for higher quality results. \ No newline at end of file diff --git a/selfie/connectors/text_files/schema.json b/selfie/connectors/text_files/schema.json new file mode 100644 index 0000000..52b818d --- /dev/null +++ b/selfie/connectors/text_files/schema.json @@ -0,0 +1,14 @@ +{ + "title": "Upload Text Files", + "type": "object", + "properties": { + "files": { + "type": "array", + "title": "Files", + "description": "Upload files containing text", + "items": { + "type": "object" + } + } + } +} diff --git a/selfie/connectors/text_files/uischema.json b/selfie/connectors/text_files/uischema.json new file mode 100644 index 0000000..f4ef5a6 --- /dev/null +++ b/selfie/connectors/text_files/uischema.json @@ -0,0 +1,8 @@ +{ + "files": { + "ui:widget": "nativeFile", + "ui:options": { + "accept": ".txt,.md" + } + } +} diff --git a/selfie/parsers/chat/__init__.py b/selfie/parsers/chat/__init__.py index 95cb821..b0b1884 100644 --- a/selfie/parsers/chat/__init__.py +++ b/selfie/parsers/chat/__init__.py @@ -5,6 +5,7 @@ from typing import Dict import yaml +from selfie.parsers.chat.telegram import TelegramParser from selfie.parsers.chat.discord import
DiscordParser from selfie.parsers.chat.whatsapp import WhatsAppParser from selfie.parsers.chat.google import GoogleTakeoutMessagesParser @@ -47,6 +48,7 @@ class Parser(Enum): DISCORD = DiscordParser GOOGLE_MESSAGES = GoogleTakeoutMessagesParser CHATGPT = ChatGPTParser + TELEGRAM = TelegramParser class ChatFileParser: diff --git a/selfie/parsers/chat/base.py b/selfie/parsers/chat/base.py index 5a997d8..2569fbe 100644 --- a/selfie/parsers/chat/base.py +++ b/selfie/parsers/chat/base.py @@ -211,3 +211,17 @@ def extract_conversations(self, data: Any) -> ShareGPTConversation: """ raise NotImplementedError + +class HtmlBasedChatParser(JsonBasedChatParser): + def _parse_html_to_model_hook(self, html_string: str) -> Any: + raise NotImplementedError("This method should be implemented by subclasses.") + + def _can_parse_hook(self, document: str) -> bool: + try: + return super()._can_parse_hook(json.dumps(self._parse_html_to_model_hook(document).dict())) + except NotImplementedError: + return False + + def _parse_chat_hook(self, document: str) -> ShareGPTConversation: + model = self._parse_html_to_model_hook(document) + return self.extract_conversations(model) diff --git a/selfie/parsers/chat/telegram.py b/selfie/parsers/chat/telegram.py new file mode 100644 index 0000000..4bf4521 --- /dev/null +++ b/selfie/parsers/chat/telegram.py @@ -0,0 +1,63 @@ +from datetime import datetime +from bs4 import BeautifulSoup +from selfie.parsers.chat.base import HtmlBasedChatParser +from selfie.types.share_gpt import ShareGPTConversation, ShareGPTMessage + +from typing import List, Optional +from pydantic import BaseModel + + +class TelegramMessage(BaseModel): + id: Optional[str] + timestamp: Optional[str] + author: Optional[str] + content: Optional[str] + link: Optional[str] + + +class TelegramConversation(BaseModel): + title: Optional[str] + messages: List[TelegramMessage] + + +class TelegramParser(HtmlBasedChatParser): + SUPPORTED_SCHEMAS = [TelegramConversation] + + def 
_parse_html_to_model_hook(self, html_string: str) -> TelegramConversation: + soup = BeautifulSoup(html_string, 'html.parser') + title = soup.find('div', class_='text bold').text.strip() if soup.find('div', class_='text bold') else None + + messages = [] + for message_div in soup.find_all('div', class_='message'): + id = message_div.get('id') + timestamp = message_div.find('div', class_='pull_right').get('title') if message_div.find('div', class_='pull_right') else None + author = message_div.find('div', class_='from_name').text.strip() if message_div.find('div', class_='from_name') else None + content = message_div.find('div', class_='text').text.strip() if message_div.find('div', class_='text') else None + link = message_div.find('a')['href'] if message_div.find('a') else None + + if content: + messages.append(TelegramMessage( + id=id, + timestamp=timestamp, + author=author, + content=content, + link=link + )) + + return TelegramConversation(title=title, messages=messages) + + def extract_conversations(self, data: TelegramConversation) -> ShareGPTConversation: + share_gpt_messages = [] + + for message in data.messages: + timestamp = datetime.strptime(message.timestamp, "%d.%m.%Y %H:%M:%S %Z%z") if message.timestamp else None + from_user = message.author if message.author else "Unknown" + content = message.content if message.content else "No content" + + share_gpt_messages.append(ShareGPTMessage(**{ + 'from': from_user, + 'value': content, + 'timestamp': timestamp, + })) + + return ShareGPTConversation(conversations=share_gpt_messages)
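The `strptime` format used above, `"%d.%m.%Y %H:%M:%S %Z%z"`, implies timestamps of roughly the following shape in the exported HTML's `title` attributes. The sample value here is an assumption for illustration, not taken from a real Telegram export:

```python
from datetime import datetime, timedelta

# Parse an assumed Telegram-export timestamp with the parser's format string:
# %Z matches the zone name ("UTC"), %z the numeric offset ("+03:00").
ts = datetime.strptime("02.01.2024 15:04:05 UTC+03:00",
                       "%d.%m.%Y %H:%M:%S %Z%z")
print(ts.isoformat())  # 2024-01-02T15:04:05+03:00
```

The resulting datetime is timezone-aware, so downstream timestamp comparisons across conversations remain consistent.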