-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Alkemio generic VC engine #62
Conversation
Thanks for opening this pull request! Please check out our contributing guidelines. |
WalkthroughThe pull request introduces several significant changes across multiple files, focusing on enhancing the application's structure and functionality. Key modifications include updates to the Changes
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 10
🧹 Outside diff range and nitpick comments (17)
models.py (3)
1-1
: Remove unused importAzureChatOpenAI
.The
AzureChatOpenAI
import is not used in this file and can be removed to clean up the code.Apply this diff to remove the unused import:
-from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings +from langchain_openai import AzureOpenAIEmbeddings🧰 Tools
🪛 Ruff (0.8.0)
1-1:
langchain_openai.AzureChatOpenAI
imported but unusedRemove unused import:
langchain_openai.AzureChatOpenAI
(F401)
3-3
: Remove unused importSecretStr
.The
SecretStr
import frompydantic
is not utilized and can be safely removed.Apply this diff to remove the unused import:
-from pydantic import SecretStr
🧰 Tools
🪛 Ruff (0.8.0)
3-3:
pydantic.SecretStr
imported but unusedRemove unused import:
pydantic.SecretStr
(F401)
5-5
: Remove unused importChatMistralAI
.The
ChatMistralAI
import is not used in this file and can be removed to maintain clean code.Apply this diff to remove the unused import:
-from langchain_mistralai.chat_models import ChatMistralAI
🧰 Tools
🪛 Ruff (0.8.0)
5-5:
langchain_mistralai.chat_models.ChatMistralAI
imported but unusedRemove unused import:
langchain_mistralai.chat_models.ChatMistralAI
(F401)
ai_adapter.py (2)
5-5
: Remove duplicate import ofsetup_logger
.The
setup_logger
function is imported twice from thelogger
module on lines 4 and 5. The import on line 5 is unnecessary and should be removed.Apply this diff to remove the duplicate import:
-from logger import setup_logger
🧰 Tools
🪛 Ruff (0.8.0)
5-5: Redefinition of unused
setup_logger
from line 4Remove definition:
setup_logger
(F811)
26-26
: Correct the grammatical error in the error message.The error message has a grammatical error. It should read "Alkemio's Virtual Contributor is currently unavailable." without the article "the" before "Alkemio's".
Apply this diff to correct the message:
-result = f"{input.display_name} - the Alkemio's VirtualContributor is currently unavailable." +result = f"{input.display_name} - Alkemio's Virtual Contributor is currently unavailable."ingest.py (4)
26-26
: Remove unnecessaryf
prefixes from logging statements.The logging statements on lines 26, 28, 33, 39, and 43 use f-strings without any placeholders, which is unnecessary. Removing the
f
prefix cleans up the code.Apply this diff to remove the extraneous
f
prefixes:-logger.info(f"Generating embeddings.") +logger.info("Generating embeddings.") -logger.info(f"Embeddings generated.") +logger.info("Embeddings generated.") -logger.info(f"Local db exists. Load and add new data.") +logger.info("Local db exists. Load and add new data.") -logger.info(f"Local db does not exits. Initialise") +logger.info("Local db does not exist. Initialise") -logger.info(f"Local db updated.") +logger.info("Local db updated.")Also applies to: 28-28, 33-33, 39-39, 43-43
🧰 Tools
🪛 Ruff (0.8.0)
26-26: f-string without any placeholders
Remove extraneous
f
prefix(F541)
39-39
: Correct typo in log message: 'exits' should be 'exist'.There's a typo in the log message. It should read "Local db does not exist. Initialize".
Apply this diff to correct the typo:
-logger.info("Local db does not exits. Initialise") +logger.info("Local db does not exist. Initialize")🧰 Tools
🪛 Ruff (0.8.0)
39-39: f-string without any placeholders
Remove extraneous
f
prefix(F541)
51-51
: Correct typo in log message: 'Closning' should be 'Cloning'.There's a typo in the log message. It should read "Cloning repository: {repo_name}".
Apply this diff to correct the typo:
-logger.info(f"Closning repository: {repo_name}") +logger.info(f"Cloning repository: {repo_name}")
145-145
: Correct typo in log message: 'embedidng' should be 'embedding'.There's a typo in the log message. It should read "{len(documents)} files added for embedding".
Apply this diff to correct the typo:
-logger.info(f"{len(documents)} files added for embedidng") +logger.info(f"{len(documents)} files added for embedding")logger.py (2)
16-18
: Consider safer log file handling and rotationThe current implementation writes logs to the user's home directory without checking permissions or implementing log rotation. This could lead to:
- Permission issues in containerized environments
- Disk space problems due to unbounded log growth
Consider:
- Using a configurable log directory with proper permission checks
- Implementing log rotation using
RotatingFileHandler
orTimedRotatingFileHandler
Example implementation:
from logging.handlers import RotatingFileHandler # ... log_dir = os.path.join(env.local_path, 'logs') os.makedirs(log_dir, exist_ok=True) f_handler = RotatingFileHandler( os.path.join(log_dir, "app.log"), maxBytes=10*1024*1024, # 10MB backupCount=5 )
23-30
: Eliminate duplicate formatter definitionsThe console and file formatters are identical. Consider defining the format string once to improve maintainability.
log_format = '{"time": "%(asctime)s", "name": %(name)r, "level": "%(levelname)s", "message": %(message)r}' time_format = "%m-%d %H:%M:%S" formatter = logging.Formatter(log_format, time_format) c_handler.setFormatter(formatter) f_handler.setFormatter(formatter)main.py (1)
35-37
: Add graceful shutdown handlingThe engine startup lacks graceful shutdown handling for the async event loop.
Consider implementing signal handlers:
import signal def handle_shutdown(sig, frame): logger.info("Shutdown signal received") asyncio.get_event_loop().stop() signal.signal(signal.SIGTERM, handle_shutdown) signal.signal(signal.SIGINT, handle_shutdown) engine = AlkemioVirtualContributorEngine() engine.register_handler(query) try: asyncio.run(engine.start()) except KeyboardInterrupt: logger.info("Shutting down gracefully...")Dockerfile (1)
46-46
: Add HEALTHCHECK and security improvementsWhile the entry point change is correct, consider adding:
- A health check to help container orchestrators monitor the application
- A non-root user for better security
# Create non-root user RUN useradd -r -u 1001 -g root appuser RUN chown -R appuser:root /app # Add health check HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \ CMD curl -f http://localhost:${PORT}/health || exit 1 # Switch to non-root user USER appuser CMD ["poetry", "run", "python", "main.py"]Note: This assumes you'll implement a health check endpoint in the application. Would you like assistance in implementing the health check endpoint?
prompts.py (4)
7-7
: Fix typo in the prompt textThe word "exhaustive" is misspelled as "exthausive".
-3 - Provide an answer of 250 words or less that is professional, engaging, accurate and exthausive, based on the context delimited by triple pluses. \ +3 - Provide an answer of 250 words or less that is professional, engaging, accurate and exhaustive, based on the context delimited by triple pluses. \
12-15
: Consider using standard delimitersThe triple plus (+++) delimiter is uncommon. Consider using more standard delimiters like triple backticks (```) or triple quotes (""") for better readability and consistency with common practices.
29-37
: Standardize delimiter usageThe prompt uses inconsistent delimiters:
- Triple pluses (+++) for chat history
- Triple dashes (---) for the standalone query section
Consider using the same delimiter style throughout the file for consistency.
19-28
: Consider adding examples for clarityThe instructions for creating standalone queries could benefit from explicit examples, particularly for:
- Handling sentiment expressions
- Combining chat history with human input
This would help ensure consistent behavior across different implementations.
Example addition:
justice to the human input. 3. Do only return the standalone query, do not try to respond to the user query and do not return any other information. \ Never return the chat history delimited by triple pluses. + +Examples: +1. Sentiment input: + Human: "I love how Alkemio works!" + Output: "I love how Alkemio works!" + +2. Chat history combination: + History: "What is Alkemio?" + Human: "How does it handle collaboration?" + Output: "How does Alkemio handle collaboration?"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
📒 Files selected for processing (12)
.gitignore
(1 hunks)Dockerfile
(1 hunks)ai_adapter.py
(1 hunks)config.py
(1 hunks)def_ingest.py
(0 hunks)ingest.py
(1 hunks)logger.py
(1 hunks)main.py
(1 hunks)models.py
(1 hunks)prompts.py
(1 hunks)pyproject.toml
(2 hunks)virtual_contributor_engine_guidance.py
(0 hunks)
💤 Files with no reviewable changes (2)
- virtual_contributor_engine_guidance.py
- def_ingest.py
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🧰 Additional context used
🪛 Ruff (0.8.0)
models.py
1-1: langchain_openai.AzureChatOpenAI
imported but unused
Remove unused import: langchain_openai.AzureChatOpenAI
(F401)
3-3: pydantic.SecretStr
imported but unused
Remove unused import: pydantic.SecretStr
(F401)
5-5: langchain_mistralai.chat_models.ChatMistralAI
imported but unused
Remove unused import: langchain_mistralai.chat_models.ChatMistralAI
(F401)
main.py
3-3: Redefinition of unused ai_adapter
from line 1
Remove definition: ai_adapter
(F811)
4-4: Redefinition of unused asyncio
from line 2
Remove definition: asyncio
(F811)
ingest.py
26-26: f-string without any placeholders
Remove extraneous f
prefix
(F541)
28-28: f-string without any placeholders
Remove extraneous f
prefix
(F541)
33-33: f-string without any placeholders
Remove extraneous f
prefix
(F541)
39-39: f-string without any placeholders
Remove extraneous f
prefix
(F541)
43-43: f-string without any placeholders
Remove extraneous f
prefix
(F541)
ai_adapter.py
5-5: Redefinition of unused setup_logger
from line 4
Remove definition: setup_logger
(F811)
🔇 Additional comments (2)
pyproject.toml (1)
Line range hint 11-27
: Verify compatibility of updated and new dependencies.
Several dependencies have been updated and new ones added. Ensure that the new versions and additions are compatible with each other and with the existing codebase.
Run the following script to check for dependency conflicts:
Ensure that the command completes without errors, indicating that dependencies are resolvable.
✅ Verification successful
Based on the search results and the poetry.lock file from earlier, I can now generate the final response:
Dependencies are compatible with existing codebase usage patterns
The codebase shows consistent usage of the core dependencies:
- langchain ecosystem (core, community) is used for document processing and vector stores
- beautifulsoup4 is used for HTML parsing
- git/gitpython for repository operations
- FAISS for vector storage
The updated versions in pyproject.toml are compatible with these usage patterns, and the poetry.lock file shows successful resolution of all dependencies without conflicts.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for dependency conflicts using Poetry.
# Perform a dependency resolution check
poetry check
# Attempt to lock the dependencies
poetry lock --no-update
Length of output: 234
Script:
#!/bin/bash
# Let's check for any existing poetry.lock file and its content
fd poetry.lock --exec cat {}
# Let's also check if there are any requirements.txt files that might show resolved dependencies
fd requirements.txt --exec cat {}
# Check for any dependency-related files or documentation
fd -e txt -e md -e rst -g "*depend*" --exec cat {}
# Look for any import errors or dependency-related issues in Python files
rg -l "ImportError|ModuleNotFoundError|requirements|dependencies" --type py
# Check for actual usage of the new dependencies in the codebase
rg "from langchain_mistralai|from azure.ai.inference|from alkemio_virtual_contributor_engine" --type py
Length of output: 65845
Script:
#!/bin/bash
# Let's check for actual imports of the core dependencies in the codebase
rg "from langchain\." --type py
rg "from langchain_core\." --type py
rg "from langchain_community\." --type py
rg "from openai\." --type py
rg "from faiss" --type py
rg "from bs4" --type py
rg "import aio_pika|from aio_pika" --type py
rg "import aiormq|from aiormq" --type py
rg "import tiktoken|from tiktoken" --type py
rg "from langchain_openai\." --type py
rg "import git|from git" --type py
# Let's also check for any version constraints or dependency issues mentioned in documentation
rg -i "version|dependency|requirement" "*.md" "*.rst" "*.txt"
Length of output: 1054
prompts.py (1)
5-11
: Review the language handling logic
There's potential for confusion in the language handling:
- Step 1 instructs to translate questions to English
- Step 6 instructs to answer in {language}
- The {language} parameter has no default value
This could lead to inconsistent behavior if:
- The language parameter is not provided
- The input is in a different language and needs translation
Let's verify how the language parameter is handled:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (7)
config.py (3)
10-32
: Add class documentation and field descriptions.The
Env
class would benefit from a docstring explaining its purpose and documenting the configuration fields.Apply this diff:
@dataclass class Env: + """Configuration environment for the virtual contributor engine. + + Attributes: + model_name: The name of the LLM deployment + embeddings_model_name: The name of the embeddings model deployment + # ... document other fields ... + """ model_name: str
23-24
: Add validation for numeric configuration values.The chunk size, token limit, and temperature should be validated to ensure they are within acceptable ranges.
Consider adding property decorators with validation:
@property def chunk_size(self) -> int: return self._chunk_size @chunk_size.setter def chunk_size(self, value: int): if value <= 0: raise ValueError("chunk_size must be positive") self._chunk_size = valueAlso applies to: 26-26
72-72
: Consider implementing a singleton pattern or lazy initialization.The global instance creation could make unit testing difficult and may not be the most efficient approach if the configuration isn't always needed.
Consider using a singleton pattern:
class Env: _instance = None @classmethod def get_instance(cls): if cls._instance is None: cls._instance = cls() return cls._instanceprompts.py (4)
7-7
: Fix typo in the prompt textThere's a typo in the word "exthausive" which should be "exhaustive".
-3 - Provide an answer of 250 words or less that is professional, engaging, accurate and exthausive, based on the context delimited by triple pluses. \ +3 - Provide an answer of 250 words or less that is professional, engaging, accurate and exhaustive, based on the context delimited by triple pluses. \
6-6
: Fix grammatical error in context descriptionThe sentence about context has a grammatical error.
-2 - The text provided in the context delimited by triple pluses is retrieved from the Alkemio website is not part of the conversation with the user. +2 - The text provided in the context delimited by triple pluses, which is retrieved from the Alkemio website, is not part of the conversation with the user.
18-37
: Consider enhancing the prompt with error handlingThe
condense_prompt
could benefit from additional instructions for edge cases:
- How to handle empty chat history
- Maximum length for the standalone query
- Error handling for malformed or inappropriate input
Consider adding these instructions to the prompt:
condense_prompt = """ Create a single sentence standalone query based on the human input, using the following step-by-step instructions: 1. If the human input is expressing a sentiment, delete and ignore the chat history delimited by triple pluses. \ Then, return the human input containing the sentiment as the standalone query. Do NOT in any way respond to the human input, \ simply repeat it. 2. Otherwise, combine the chat history delimited by triple pluses and human input into a single standalone query that does \ justice to the human input. 3. Do only return the standalone query, do not try to respond to the user query and do not return any other information. \ Never return the chat history delimited by triple pluses. +4. If the chat history is empty, treat the human input as a standalone query. +5. Keep the standalone query concise, not exceeding 100 words. +6. If the input is inappropriate or malformed, return "Invalid input". +++ chat history:
1-37
: Consider adding type hints and documentationTo improve maintainability and code clarity, consider:
- Converting the prompts to a typed structure (e.g., dataclass or Enum)
- Adding docstrings to describe the expected placeholders and their formats
- Including examples of usage
Here's a suggested implementation:
from dataclasses import dataclass from typing import ClassVar @dataclass class SystemPrompts: """System prompts for the Alkemio virtual contributor. Attributes: chat_system_prompt: Template for chat responses about Alkemio. Placeholders: {language}: Target response language {context}: Retrieved context from Alkemio website condense_prompt: Template for creating standalone queries. Placeholders: {chat_history}: Previous conversation history {message}: Current user input """ chat_system_prompt: ClassVar[str] = """...""" condense_prompt: ClassVar[str] = """..."""
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
config.py
(1 hunks)ingest.py
(1 hunks)main.py
(1 hunks)prompts.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- main.py
🧰 Additional context used
🪛 Ruff (0.8.0)
ingest.py
26-26: f-string without any placeholders
Remove extraneous f
prefix
(F541)
28-28: f-string without any placeholders
Remove extraneous f
prefix
(F541)
33-33: f-string without any placeholders
Remove extraneous f
prefix
(F541)
39-39: f-string without any placeholders
Remove extraneous f
prefix
(F541)
43-43: f-string without any placeholders
Remove extraneous f
prefix
(F541)
🔇 Additional comments (3)
config.py (2)
1-8
: LGTM! Imports and environment setup are well organized.
The necessary imports are present and the environment variables are loaded appropriately using dotenv.
34-40
: Consider using URL validation for RabbitMQ configuration.
The RabbitMQ connection parameters should be validated to ensure they form a valid connection string.
Let's verify if there are any existing RabbitMQ connection utilities in the codebase:
ingest.py (1)
1-18
: LGTM: Well-organized imports and proper logger setup.
The imports are appropriate for the required functionality, and the logger setup follows best practices.
Congrats on merging your first pull request! We at Alkemio thank you for your contribution! |
Related to alkem-io/alkemio#1456
Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes
Dependency Updates
.gitignore
to exclude log files and specific directories from version control.