diff --git a/streamlit_app/README.md b/streamlit_app/README.md index cfb35be..b52dc84 100644 --- a/streamlit_app/README.md +++ b/streamlit_app/README.md @@ -26,17 +26,23 @@ OllaLab-Lean - Interactive Web Apps will be automatically installed following th If you have issues with the main installation or simply want to install just the OllaLab-Lean - Interactive Web Apps, you can follow the steps below. +### Download this folder. +- Use git clone or http download to download OllaLab Lean +- Get into FedRAMP-OllaLab-Lean/streamlit_app/app + ### Install Python There are several ways to install Python. You may find the official guide in [Official Python Downloads](https://www.python.org/downloads/). If you have Visual Studio Code installed, you may also follow [Getting Started with Python in VS Code](https://code.visualstudio.com/docs/python/python-tutorial) +On windows, you may also [Install Python from Microsoft App Store](https://learn.microsoft.com/en-us/windows/python/beginners) + ### Install Python Virtual Environment A virtual environment is created on top of an existing Python installation, known as the virtual environment’s “base” Python, and may optionally be isolated from the packages in the base environment, so only those explicitly installed in the virtual environment are available. More details are in [Creation of virtual environment](https://docs.python.org/3/library/venv.html) Virtual environments are created by executing the venv module: ``` -python -m venv /path/to/new/virtual/environment +python -m venv ./.venv ``` If successfuly, a folder ".venv" will be created in /path/to/new/virtual/environment. You will then need to invoke the virtual environment. Assuming you are at the folder containing the ".venv" folder for the virtual environment you've just set up. You can launch the virtual environment by: - On windows @@ -74,7 +80,11 @@ Feel free to check out [Additional info on installing packages within Python Vir Ollama is an AI tool that allows users to run large language models (LLMs) locally on their computer. Installation files of Ollama for Mac, Linux, and Windows can be found at [Official Ollama Installation Files](https://ollama.com/download) -On Mac, you can also use "brew install ollama" to install Ollama on Homebrew. +Command to install Ollama on Linux: +``` +curl -fsSL https://ollama.com/install.sh | sh +``` + To verify your Ollama installation, you may go to localhost:11434 or 127.0.0.1:11434. If the installation went well, you should see "Ollama is running". diff --git a/streamlit_app/app/entity_bridge/1_algorithm.md b/streamlit_app/app/entity_bridge/1_algorithm.md new file mode 100644 index 0000000..04fad3b --- /dev/null +++ b/streamlit_app/app/entity_bridge/1_algorithm.md @@ -0,0 +1,167 @@ +# Problem Statement + +We face the challenge of merging multiple datasets containing pairs of {Unique IDs, Entity Names}, where entities overlap but have different Unique IDs across datasets due to independent publishing sources. Our objective is to accurately merge these datasets based on Entity Names by leveraging Large Language Model (LLM) knowledge of real-world entities. The entities must represent actual entities, and the LLM's embedded knowledge should facilitate the merging process. + +# Input Description + +We have at least two data files of type csv, tsv, or xlsx. Each file has at least one pair of {Unique IDs, Entity Names}. A file may have up to two pairs of {Unique IDs, Entity Names} where one pair represent the parent and the other pair represents the child. 
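+
+For illustration, a hypothetical input file carrying both pairs might look like the sample below (the column names are examples only, not required headers):
+
+```
+parent_id,parent_name,child_id,child_name
+P001,Acme Corporation,C001,Acme Cloud Services
+P001,Acme Corporation,C002,Acme Analytics
+P002,Globex Inc,C003,Globex Research Lab
+```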
+A child has no children of its own, and a parent has at least one child. Child values never appear in the parent columns, meaning the Parent-Child relationship goes only one level deep (there are no grandparents).
+
+# Solution
+
+## 1\. Load files and normalize data
+
+### File Loading
+
+1. **File Upload**: The user uploads two or more files (e.g., F1, F2, F3).
+2. **Data Frame Creation**: Files are loaded into data frames (F1, F2, F3).
+3. **Handling Missing Data**:
+- The program checks for empty cells and allows the user to choose how to handle them:
+  - Remove rows with missing values.
+  - Fill missing values with defaults or placeholders.
+  - Skip processing fields with excessive missing data.
+4. **Field Selection with Validation**:
+- For each file, the user must select the following fields from a list of columns:
+  - Parent ID Field (optional).
+  - Parent Name Field (mandatory).
+  - Child ID Field (optional).
+  - Child Name Field (optional).
+- The program validates selected fields for correct data types and formats.
+- Provides immediate feedback if invalid fields are selected.
+5. **Reset Option**: A "Reset" button is always available to restart the process.
+6. **Saving Initial Data Frames**: Selected columns are saved to initial data frames (e.g., F1_initial, F2_initial).
+
+### Normalize IDs
+
+For each initial data frame:
+
+1. **Parent IDs**:
+- If the table has no Parent Name, the program halts and issues an error message; it does not proceed in this case.
+- If the table lacks a Parent ID but has Parent Names, generate a unique Parent ID for each unique Parent Name.
+2. **Child IDs**:
+- If the table lacks Child IDs but has Child Names, generate unique Child IDs for each unique Child Name.
+3. **Handling Missing Child Names**:
+- If Child Names are missing but Child IDs are present:
+  - Prompt the user to provide a naming convention.
+  - Use a placeholder or combine Parent Name with Child ID to create a meaningful Child Name.
+4. **Ensuring Data Integrity**:
+- Verify that after processing, each record has a Parent ID and Parent Name.
+- Implement error handling for records that still lack mandatory fields.
+
+### Normalize Entity Names
+
+To prevent over-normalization and preserve essential parts of entity names:
+
+1. **Duplicate Original Names**:
+- Create copies of Parent Name and Child Name columns with an _original suffix.
+
+2. **Normalization Steps** (a rough sketch follows this subsection):
+- Case Normalization: Convert all letters to uppercase.
+- Punctuation Removal: Remove dots (.), hyphens (-), underscores (_), commas (,), and other non-essential punctuation while maintaining readability.
+- Controlled Prefix/Suffix Removal:
+  - Use a predefined list of non-essential terms (e.g., "INC", "LLC", "CORP") to remove from Parent Names.
+  - Avoid removing words that are integral to the entity's identity.
+  - Allow users to customize the list of terms if necessary.
+- Logging Changes:
+  - Record all normalization actions in a log for transparency and debugging.
+
+3. **Retain Mappings**:
+- Maintain a mapping of original to normalized names to prevent confusion during later stages.
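+
+As a rough illustration of these steps, a normalization helper could look something like the sketch below. The function name, the stopword list, and the logging format are assumptions for illustration, not the final implementation.
+
+```python
+import re
+
+# Non-essential suffix terms; users could customize this list.
+DEFAULT_STOPWORDS = {"INC", "LLC", "CORP"}
+
+def normalize_name(name, stopwords=DEFAULT_STOPWORDS, actions_log=None):
+    """Uppercase, strip punctuation, and drop non-essential suffix terms."""
+    normalized = re.sub(r"[.\-_,]", " ", name.upper())   # case + punctuation removal
+    tokens = [t for t in normalized.split() if t not in stopwords]
+    normalized = " ".join(tokens)
+    if actions_log is not None:                          # record the change for transparency
+        actions_log.append(f"Normalized '{name}' -> '{normalized}'")
+    return normalized
+
+# Retain a mapping of original to normalized names.
+log = []
+mapping = {n: normalize_name(n, actions_log=log) for n in ["Acme, Inc.", "Globex Corp"]}
+# mapping == {'Acme, Inc.': 'ACME', 'Globex Corp': 'GLOBEX'}
+```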
+
+### Remove duplicated rows
+
+To ensure data remains manageable and analyzable:
+
+1. **Identify Duplicates Using Normalized Fields**:
+- Use combinations of normalized Parent IDs, Parent Names, Child IDs, and Child Names.
+
+2. **Handle Duplicates**:
+- Instead of merging data into lists within a cell, create a separate mapping table that relates duplicate IDs to a unique entity ID.
+- Preserve the tabular structure for compatibility with data analysis tools.
+
+## 2\. Construct Unique Parent Name List
+
+a. **Automated Matching Using Similarity Metrics**:
+- Apply string similarity algorithms (e.g., Levenshtein distance, Jaro-Winkler) to compute similarity scores between normalized Parent Names across all data frames.
+- Set a similarity threshold (e.g., 90%) to automatically group names above this threshold.
+- Implement efficient data structures (e.g., inverted indices, clustering) to reduce computational complexity.
+- Use blocking techniques to group entities and limit comparisons.
+
+b. **User Input for Ambiguous Cases**:
+- Present ambiguous matches (below the threshold) to the user for confirmation.
+
+c. **Create Unique Parent Names Data Frame**:
+- Assign a unique identifier (UniqueParentID) to each grouped entity.
+- Maintain mappings to original Parent IDs and names from each dataset.
+
+d. **User Interface Enhancements**:
+- Provide bulk actions to approve or reject suggested matches.
+- Allow adjustment of similarity thresholds and reprocess matches accordingly.
+- Allow saving UniqueParentID, normalized Parent Names, and the mapping of original Parent Names to normalized Parent Names to a unique parent name output file.
+
+## 3\. Construct Unique Children Name List
+
+Similar to constructing the unique parent name list:
+
+a. **Automated Matching Using Similarity Metrics**: Compute similarity scores between normalized Child Names across all data frames.
+
+b. **User Input for Ambiguous Cases**: Present ambiguous matches (below the threshold) to the user for confirmation.
+
+c. **Efficient Comparison Process**: Use optimized algorithms suitable for larger datasets.
+
+d. **Create Unique Child Names Data Frame**:
+- Assign unique identifiers (UniqueChildID) to each group of similar child entities.
+- Maintain mappings to original Child IDs and names.
+- Allow saving UniqueChildID, normalized Child Names, and the mapping of original Child Names to normalized Child Names to a unique child name output file.
+
+## 4\. Enrich Original Data Frames with UniqueParentID and UniqueChildID
+
+a. **Prepare for Enrichment**:
+- Duplicate original data frames (e.g., F1) to create enriched versions (e.g., F1_enriched).
+
+b. **Matching Using Normalized Names and IDs**:
+- Use the Parent Name or Child Name field to look up the matching UniqueParentID or UniqueChildID in the Unique Parent Name List or the Unique Child Name List.
+- If a match is found, add it to the UniqueParentID or UniqueChildID column.
+
+c. **Save to file**:
+- Allow the user to save/download the enriched data frame.
+
+## 5\. User Interface and Experience Improvements
+
+To enhance usability:
+
+- Progress Indicators:
+  - Display progress bars or status updates during lengthy operations.
+- Undo and Revert Options:
+  - Allow users to undo recent actions or revert to previous steps without restarting.
+- Help and Guidance:
+  - Provide tooltips, FAQs, and a help section within the interface.
+  - Offer examples and suggestions during field selection and parameter settings.
+- Error Handling:
+  - Present clear, actionable error messages.
+  - Guide users on how to resolve issues when they occur.
+
+## 6\. Testing, Validation, and Extensibility
+
+To ensure reliability and future-proofing:
+
+- Unit Testing:
+  - Develop unit tests for each function and component (a minimal example follows this list).
+  - Use test-driven development practices where feasible.
+- Performance Testing:
+  - Evaluate performance with datasets of varying sizes and complexities.
+  - Optimize algorithms based on profiling results.
+- Extensibility:
+  - Design the system to handle multiple datasets beyond just two.
+  - Support additional data formats (e.g., JSON, XML) and database connections.
+  - Modularize components to allow for easy updates and feature additions.
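+
+To make the unit-testing recommendation concrete, a minimal pytest-style sketch is shown below, using Python's built-in `difflib` for the similarity score. The helper name, entity names, and threshold values are illustrative assumptions, not the final test suite.
+
+```python
+# test_entity_matching.py -- illustrative sketch only
+from difflib import SequenceMatcher
+
+def calculate_similarity(s1, s2):
+    """Return a similarity score between 0.0 and 1.0."""
+    return SequenceMatcher(None, s1, s2).ratio()
+
+def test_similar_names_exceed_threshold():
+    # Near-duplicates of the same entity should score above the matching threshold.
+    assert calculate_similarity("ACME CORPORATION", "ACME CORP") > 0.7
+
+def test_unrelated_names_fall_below_threshold():
+    # Clearly different entities should not be auto-grouped.
+    assert calculate_similarity("ACME CORPORATION", "GLOBEX") < 0.5
+```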
+
+## 7\. Documentation and Support
+
+To aid users and developers:
+
+- Comprehensive Documentation:
+  - Provide a detailed user manual with step-by-step instructions.
+  - Include technical documentation for developers, outlining system architecture and codebase.
+- Logging and Auditability:
+  - Implement detailed logging of user actions and system processes.
+  - Store logs securely and provide access for audit purposes if needed.
+- Support Resources:
+  - Offer support channels such as email, chat, or forums.
+  - Regularly update documentation with FAQs and troubleshooting tips.
\ No newline at end of file
diff --git a/streamlit_app/app/entity_bridge/2_projectStructure.md b/streamlit_app/app/entity_bridge/2_projectStructure.md
new file mode 100644
index 0000000..f6c8ee0
--- /dev/null
+++ b/streamlit_app/app/entity_bridge/2_projectStructure.md
@@ -0,0 +1,645 @@
+# Entity Bridge - Structure
+
+## **Folder Structure and Files**
+
+```
+streamlit_app/
+├── app/
+│   ├── main.py                  # Main entry point of the Streamlit application
+│   ├── pages/
+│   │   └── Entity_Bridge.py     # Streamlit page for the Entity Bridge component
+│   ├── entity_bridge/           # Package containing modules for Entity Bridge
+│   │   ├── __init__.py          # Initialization file for the entity_bridge package
+│   │   ├── data_loader.py       # Module for loading and handling data files
+│   │   ├── data_normalizer.py   # Module for normalizing IDs and entity names
+│   │   ├── duplicate_remover.py # Module for identifying and removing duplicate rows
+│   │   ├── entity_matcher.py    # Module for matching entities across datasets
+│   │   ├── ui_helper.py         # Module containing UI helper functions
+│   │   ├── llm_integration.py   # Module for integrating with various LLM APIs
+│   │   └── utils.py             # Module containing utility functions
+```
+
+---
+
+### **Brief Descriptions**
+
+- **streamlit_app/app/main.py**: Main entry point of the Streamlit application that initializes the app and provides navigation.
+
+- **streamlit_app/app/pages/Entity_Bridge.py**: Streamlit page that implements the Entity Bridge component, handling user interactions and displaying results.
+
+- **streamlit_app/app/entity_bridge/**: Package containing all modules related to the Entity Bridge functionality.
+
+  - **__init__.py**: Initialization file for the entity_bridge package.
+
+  - **data_loader.py**: Module responsible for loading data files and handling missing data.
+
+  - **data_normalizer.py**: Module that normalizes IDs and entity names to ensure consistency.
+
+  - **duplicate_remover.py**: Module that identifies and removes duplicate rows from datasets.
+
+  - **entity_matcher.py**: Module that matches entities across datasets using similarity metrics.
+
+  - **ui_helper.py**: Module containing helper functions for building Streamlit UI components.
+
+  - **llm_integration.py**: Module that integrates with various Large Language Models (LLMs) for advanced entity matching.
+
+  - **utils.py**: Utility module containing shared functions used across the application.
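+
+With this layout in place, the application could be launched locally with the standard Streamlit CLI; the path below assumes the repository root as the working directory:
+
+```
+streamlit run streamlit_app/app/main.py
+```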
+ +--- + +## **Content of Each Proposed File** + +Below are the contents of each file, excluding specific function or method implementation details but including detailed docstrings for classes and functions. + +--- + +### **streamlit_app/app/main.py** + +```python +""" +Main Entry Point of the Streamlit Application + +This module initializes the Streamlit app and provides navigation between different pages. +""" + +import streamlit as st + +def main(): + """ + Main function to run the Streamlit application. + """ + st.set_page_config(page_title="Entity Bridge", layout="wide") + st.sidebar.title("Navigation") + page = st.sidebar.radio("Go to", ["Entity Bridge", "Other Page"]) + + if page == "Entity Bridge": + from pages import Entity_Bridge + Entity_Bridge.app() + else: + st.write("Welcome to the Other Page.") + +if __name__ == "__main__": + main() +``` + +--- + +### **streamlit_app/app/pages/Entity_Bridge.py** + +```python +""" +Entity Bridge - Streamlit Page + +This module defines the Entity Bridge component page for the Streamlit application. +""" + +import streamlit as st +from entity_bridge import data_loader +from entity_bridge import data_normalizer +from entity_bridge import duplicate_remover +from entity_bridge import entity_matcher +from entity_bridge import ui_helper +from entity_bridge import llm_integration + +def app(): + """ + Main function to run the Entity Bridge page. + + This function sets up the UI for the Entity Bridge, handles user inputs, + and displays the results after processing. + """ + st.title("Entity Bridge") + st.write("Merge multiple datasets containing entity information with overlapping entities.") + + # File upload section + uploaded_files = ui_helper.display_file_upload() + + if uploaded_files: + # Load and preprocess the data files + data_frames = data_loader.load_and_preprocess_files(uploaded_files) + + # Normalize IDs and Names + normalized_data_frames = data_normalizer.normalize_data_frames(data_frames) + + # Remove duplicate rows + deduplicated_data_frames = duplicate_remover.remove_duplicates(normalized_data_frames) + + # Construct unique parent and child name lists + unique_parents = entity_matcher.construct_unique_parent_list(deduplicated_data_frames) + unique_children = entity_matcher.construct_unique_child_list(deduplicated_data_frames) + + # Enrich original data frames with unique IDs + enriched_data_frames = entity_matcher.enrich_data_frames_with_unique_ids( + deduplicated_data_frames, unique_parents, unique_children + ) + + # Display or allow download of the enriched data + ui_helper.display_enriched_data(enriched_data_frames) + + # Save the resulting datasets if needed + ui_helper.download_enriched_data(enriched_data_frames) +``` + +--- + +### **streamlit_app/app/entity_bridge/__init__.py** + +```python +""" +Entity Bridge Package Initialization + +This package contains modules for the Entity Bridge application, +which facilitates the merging of datasets based on entity names. +""" + +# You can import commonly used functions or classes here +``` + +--- + +### **streamlit_app/app/entity_bridge/data_loader.py** + +```python +""" +Data Loader Module + +This module provides functions to load and handle data files, including +file I/O operations and initial preprocessing steps. +""" + +import pandas as pd +import streamlit as st + +def load_data(file): + """ + Load data from an uploaded file into a pandas DataFrame. + + Args: + file (UploadedFile): The file uploaded by the user. 
+ + Returns: + DataFrame: A pandas DataFrame containing the data from the file. + + Raises: + ValueError: If the file format is unsupported or an error occurs during loading. + """ + pass # Implementation goes here + +def handle_missing_data(df, strategy): + """ + Handle missing data in the DataFrame based on the specified strategy. + + Args: + df (DataFrame): The DataFrame to process. + strategy (str): The strategy to handle missing data ('remove', 'fill', 'skip'). + + Returns: + DataFrame: The DataFrame after handling missing data. + + Raises: + ValueError: If the strategy is unsupported. + """ + pass # Implementation goes here + +def load_and_preprocess_files(uploaded_files): + """ + Load and preprocess multiple uploaded files. + + Args: + uploaded_files (list): List of files uploaded by the user. + + Returns: + list: List of preprocessed pandas DataFrames. + + Side Effects: + Displays options and messages in the Streamlit UI. + """ + pass # Implementation goes here +``` + +--- + +### **streamlit_app/app/entity_bridge/data_normalizer.py** + +```python +""" +Data Normalizer Module + +This module includes functions to normalize IDs and entity names to ensure +consistent formatting across datasets. +""" + +import pandas as pd + +def normalize_ids(df, id_columns, name_columns): + """ + Normalize IDs in the DataFrame, generating new IDs if they are missing. + + Args: + df (DataFrame): The DataFrame to process. + id_columns (list): List of ID column names to normalize. + name_columns (list): List of name column names related to the IDs. + + Returns: + DataFrame: The DataFrame with normalized IDs. + """ + pass # Implementation goes here + +def normalize_entity_names(df, name_columns, custom_stopwords=None): + """ + Normalize entity names in the DataFrame by applying various text preprocessing steps. + + Args: + df (DataFrame): The DataFrame to process. + name_columns (list): List of name column names to normalize. + custom_stopwords (list, optional): List of custom stopwords to remove from names. + + Returns: + DataFrame: The DataFrame with normalized names. + """ + pass # Implementation goes here + +def normalize_data_frames(data_frames): + """ + Apply normalization to a list of DataFrames. + + Args: + data_frames (list): List of DataFrames to normalize. + + Returns: + list: List of normalized DataFrames. + """ + pass # Implementation goes here +``` + +--- + +### **streamlit_app/app/entity_bridge/duplicate_remover.py** + +```python +""" +Duplicate Remover Module + +This module provides functions to identify and remove duplicate rows from DataFrames. +""" + +import pandas as pd + +def identify_duplicates(df, subset_columns): + """ + Identify duplicate rows in the DataFrame based on subset columns. + + Args: + df (DataFrame): The DataFrame to check for duplicates. + subset_columns (list): List of column names to consider when identifying duplicates. + + Returns: + DataFrame: A DataFrame containing only duplicate rows. + """ + pass # Implementation goes here + +def remove_duplicates(df, subset_columns): + """ + Remove duplicate rows from the DataFrame based on subset columns. + + Args: + df (DataFrame): The DataFrame to process. + subset_columns (list): List of column names to consider when removing duplicates. + + Returns: + DataFrame: The DataFrame after removing duplicates. + """ + pass # Implementation goes here + +def remove_duplicates_from_data_frames(data_frames): + """ + Remove duplicates from a list of DataFrames. + + Args: + data_frames (list): List of DataFrames to process. 
+ + Returns: + list: List of DataFrames with duplicates removed. + """ + pass # Implementation goes here +``` + +--- + +### **streamlit_app/app/entity_bridge/entity_matcher.py** + +```python +""" +Entity Matcher Module + +This module provides functions to match entities across datasets using +similarity metrics and user input for ambiguous cases. +""" + +import pandas as pd + +def compute_similarity_scores(df_list, column_name): + """ + Compute similarity scores between entities across multiple DataFrames. + + Args: + df_list (list): List of DataFrames to compare. + column_name (str): The name of the column containing the entities. + + Returns: + DataFrame: A DataFrame containing pairs of entities and their similarity scores. + """ + pass # Implementation goes here + +def automated_entity_matching(similarity_df, threshold): + """ + Automatically match entities based on a similarity threshold. + + Args: + similarity_df (DataFrame): DataFrame containing similarity scores. + threshold (float): Similarity threshold for automatic matching. + + Returns: + DataFrame: DataFrame containing matched entities. + """ + pass # Implementation goes here + +def user_confirm_ambiguous_matches(ambiguous_matches): + """ + Present ambiguous matches to the user for confirmation. + + Args: + ambiguous_matches (DataFrame): DataFrame containing ambiguous entity matches. + + Returns: + DataFrame: DataFrame with user-confirmed matches. + """ + pass # Implementation goes here + +def construct_unique_parent_list(data_frames): + """ + Construct a unique parent entity list from the data frames. + + Args: + data_frames (list): List of DataFrames containing parent entities. + + Returns: + DataFrame: DataFrame containing unique parent entities with unique identifiers. + """ + pass # Implementation goes here + +def construct_unique_child_list(data_frames): + """ + Construct a unique child entity list from the data frames. + + Args: + data_frames (list): List of DataFrames containing child entities. + + Returns: + DataFrame: DataFrame containing unique child entities with unique identifiers. + """ + pass # Implementation goes here + +def enrich_data_frames_with_unique_ids(data_frames, unique_parents, unique_children): + """ + Enrich the original data frames with unique parent and child IDs. + + Args: + data_frames (list): List of original DataFrames to enrich. + unique_parents (DataFrame): DataFrame containing unique parent entities. + unique_children (DataFrame): DataFrame containing unique child entities. + + Returns: + list: List of enriched DataFrames. + """ + pass # Implementation goes here +``` + +--- + +### **streamlit_app/app/entity_bridge/ui_helper.py** + +```python +""" +UI Helper Module + +This module contains helper functions to build and manage the Streamlit UI components. +""" + +import streamlit as st + +def display_file_upload(): + """ + Display file upload widgets and return the uploaded files. + + Returns: + list: List of UploadedFile objects. + + Side Effects: + Renders file upload widgets in the Streamlit UI. + """ + uploaded_files = st.file_uploader("Upload one or more data files", type=['csv', 'tsv', 'xlsx'], accept_multiple_files=True) + return uploaded_files + +def display_missing_data_options(): + """ + Display options for handling missing data and return the user's choice. + + Returns: + str: The selected strategy for handling missing data ('remove', 'fill', 'skip'). + + Side Effects: + Renders radio buttons in the Streamlit UI. 
+ """ + options = ['Remove rows with missing values', 'Fill missing values with defaults', 'Skip processing fields with excessive missing data'] + choice = st.radio("Select how to handle missing data:", options) + strategy_mapping = { + 'Remove rows with missing values': 'remove', + 'Fill missing values with defaults': 'fill', + 'Skip processing fields with excessive missing data': 'skip' + } + return strategy_mapping.get(choice, 'remove') + +def display_enriched_data(enriched_data_frames): + """ + Display the enriched data frames in the Streamlit UI. + + Args: + enriched_data_frames (list): List of enriched DataFrames to display. + + Side Effects: + Renders data frames and relevant information in the Streamlit UI. + """ + pass # Implementation goes here + +def download_enriched_data(enriched_data_frames): + """ + Provide options to download the enriched data frames. + + Args: + enriched_data_frames (list): List of enriched DataFrames to download. + + Side Effects: + Adds download buttons to the Streamlit UI. + """ + pass # Implementation goes here + +def display_similarity_threshold_setting(default_threshold=0.9): + """ + Display a slider to adjust the similarity threshold. + + Args: + default_threshold (float): Default value for the similarity threshold. + + Returns: + float: The user-selected similarity threshold. + + Side Effects: + Renders a slider in the Streamlit UI. + """ + threshold = st.slider("Set similarity threshold for matching:", min_value=0.0, max_value=1.0, value=default_threshold, step=0.01) + return threshold +``` + +--- + +### **streamlit_app/app/entity_bridge/llm_integration.py** + +```python +""" +LLM Integration Module + +This module provides functions to integrate with various Large Language Models (LLMs) +such as OpenAI, Ollama, Anthropic, Google Vertex AI, and AWS Bedrock. +""" + +def setup_llm_client(provider, **credentials): + """ + Set up the LLM client based on the selected provider and credentials. + + Args: + provider (str): The name of the LLM provider ('ollama', 'openai', 'anthropic', 'vertexai', 'bedrock'). + **credentials: Keyword arguments containing necessary credentials. + + Returns: + object: An instance of the LLM client. + + Raises: + ValueError: If the provider is unsupported or credentials are missing. + """ + pass # Implementation goes here + +def generate_entity_mappings_with_llm(prompt, client, model_name): + """ + Generate entity mappings using the provided LLM client. + + Args: + prompt (str): The prompt to send to the LLM. + client (object): The LLM client instance. + model_name (str): The name of the LLM model to use. + + Returns: + dict: A dictionary containing the entity mappings generated by the LLM. + + Raises: + Exception: If the LLM generation fails. + """ + pass # Implementation goes here + +def integrate_llm_in_entity_matching(similarity_df, client, model_name): + """ + Use LLM to enhance entity matching, especially for ambiguous cases. + + Args: + similarity_df (DataFrame): DataFrame containing entities and similarity scores. + client (object): The LLM client instance. + model_name (str): The LLM model to use. + + Returns: + DataFrame: An updated DataFrame with improved entity matching. + + Side Effects: + May involve additional API calls to the LLM provider. + """ + pass # Implementation goes here +``` + +--- + +### **streamlit_app/app/entity_bridge/utils.py** + +```python +""" +Utilities Module + +This module contains utility functions used throughout the Entity Bridge application. 
+""" + +import uuid +import re +from difflib import SequenceMatcher + +def generate_unique_identifier(): + """ + Generate a unique identifier string. + + Returns: + str: A unique identifier generated using UUID4. + """ + return str(uuid.uuid4()) + +def calculate_similarity(s1, s2): + """ + Calculate the similarity between two strings using sequence matching. + + Args: + s1 (str): The first string. + s2 (str): The second string. + + Returns: + float: The similarity score between 0.0 and 1.0. + """ + return SequenceMatcher(None, s1, s2).ratio() + +def normalize_text(text, custom_stopwords=None): + """ + Normalize text by uppercase conversion, punctuation removal, and stopwords removal. + + Args: + text (str): The text to normalize. + custom_stopwords (list, optional): List of custom stopwords to remove. + + Returns: + str: The normalized text. + """ + pass # Implementation goes here + +def log_normalization_actions(actions_log, action_description): + """ + Record a normalization action to the actions log. + + Args: + actions_log (list): List maintaining logs of normalization actions. + action_description (str): Description of the normalization action performed. + + Side Effects: + Updates the actions_log list. + """ + actions_log.append(action_description) +``` + +--- + +## **Notes** + +- **LLM Integration Details**: The `llm_integration.py` module is designed to support multiple LLM providers. The `setup_llm_client` function handles the instantiation of clients for different providers based on provided credentials. This approach ensures that support for providers like OpenAI, Ollama, Anthropic, Google Vertex AI, and AWS Bedrock is modular and maintainable. + +- **Error Handling and Logging**: Throughout the modules, appropriate error handling should be implemented. For example, when loading files or making API calls, exceptions should be caught and informative messages displayed to the user. Logging should be implemented to keep track of actions and facilitate debugging. + +- **Modularity and Extensibility**: The code structure is modular, allowing for easy extension. For example, new normalization rules or matching algorithms can be added without significant changes to the existing codebase. + +- **User Interface Considerations**: The `ui_helper.py` module abstracts UI components, making it easier to manage and update the UI elements separately from the business logic. + +- **Data Security and Privacy**: Since user data and credentials are involved, ensure that sensitive information is handled securely. Avoid logging sensitive data and consider implementing secure methods for credential storage and usage. + +- **Comments and Documentation**: All modules and functions include detailed docstrings, following best practices for documentation, which aids in maintenance and collaborative development. diff --git a/streamlit_app/app/entity_bridge/3_developmentStrategy.md b/streamlit_app/app/entity_bridge/3_developmentStrategy.md new file mode 100644 index 0000000..41c039d --- /dev/null +++ b/streamlit_app/app/entity_bridge/3_developmentStrategy.md @@ -0,0 +1,390 @@ +Certainly! Developing a project like the Entity Bridge requires careful planning to ensure that each component works seamlessly with the others. Here's a recommended development strategy to guide you through building the application effectively: + +--- + +## **Development Strategy** + +### **1. 
Set Up the Project Structure** + +- **Initialize the Project Directory**: + - Create the directory structure as outlined in your proposed files. + - Ensure that you have a virtual environment set up to manage dependencies. + +- **Version Control**: + - Initialize Git in your project directory. + - Create a `.gitignore` file to exclude unnecessary files (e.g., virtual environment folders, `.DS_Store`, etc.). + +- **Install Required Packages**: + - Install essential packages such as `streamlit`, `pandas`, `numpy`, etc. + - Record all dependencies in a `requirements.txt` file for reproducibility. + +```bash +pip install streamlit pandas numpy +pip freeze > requirements.txt +``` + +--- + +### **2. Develop Core Utility Functions** + +#### **File: `utils.py`** + +- **Purpose**: Provides foundational functions used throughout the application. +- **Functions to Implement**: + - `generate_unique_identifier()` + - `calculate_similarity(s1, s2)` + - `normalize_text(text, custom_stopwords=None)` + - `log_normalization_actions(actions_log, action_description)` + +#### **Why Start Here?** + +- Utility functions are building blocks used across multiple modules. +- Implementing them first ensures consistency and reduces redundant code. +- Functions like `calculate_similarity` are critical for the entity matching process. + +--- + +### **3. Implement Data Loading and Handling** + +#### **File: `data_loader.py`** + +- **Purpose**: Handles file uploads and initial data processing. +- **Functions to Implement**: + - `load_data(file)` + - `handle_missing_data(df, strategy)` + - `load_and_preprocess_files(uploaded_files)` + +#### **Steps**: + +- **Develop `load_data(file)`**: + - Write code to read CSV, TSV, and XLSX files into pandas DataFrames. + - Handle parsing errors and provide meaningful error messages. + +- **Implement `handle_missing_data(df, strategy)`**: + - Support strategies such as 'remove', 'fill', or 'skip'. + - Use pandas functions like `dropna()`, `fillna()`, etc. + +- **Test Loading and Handling**: + - Create sample data files to test the loading and missing data handling. + - Ensure that the function handles different file types correctly. + +--- + +### **4. Build Data Normalization Module** + +#### **File: `data_normalizer.py`** + +- **Purpose**: Normalizes IDs and entity names to ensure consistency. +- **Functions to Implement**: + - `normalize_ids(df, id_columns, name_columns)` + - `normalize_entity_names(df, name_columns, custom_stopwords=None)` + - `normalize_data_frames(data_frames)` + +#### **Steps**: + +- **Implement `normalize_ids`**: + - Check for missing IDs and generate unique IDs using `generate_unique_identifier()`. + - Ensure IDs are strings to maintain consistency. + +- **Implement `normalize_entity_names`**: + - Convert names to uppercase. + - Remove specified punctuation. + - Implement controlled prefix/suffix removal using a custom stopwords list. + - Use `normalize_text()` utility function. + +- **Test Normalization Functions**: + - Apply functions to sample data. + - Verify that normalization behaves as expected. + +--- + +### **5. Create Duplicate Remover Module** + +#### **File: `duplicate_remover.py`** + +- **Purpose**: Identifies and removes duplicate rows based on normalized data. 
+- **Functions to Implement**: + - `identify_duplicates(df, subset_columns)` + - `remove_duplicates(df, subset_columns)` + - `remove_duplicates_from_data_frames(data_frames)` + +#### **Steps**: + +- **Implement `identify_duplicates` and `remove_duplicates`**: + - Use pandas' `duplicated()` and `drop_duplicates()` methods. + - Ensure that duplicates are identified based on the right combination of columns. + +- **Test Duplicate Removal**: + - Introduce duplicates in test data and verify that they are correctly identified and removed. + +--- + +### **6. Develop UI Components for Data Upload and Options** + +#### **File: `ui_helper.py`** + +- **Purpose**: Manages Streamlit UI components for user interaction. +- **Functions to Implement**: + - `display_file_upload()` + - `display_missing_data_options()` + - `display_similarity_threshold_setting()` + +#### **Steps**: + +- **Implement `display_file_upload()`**: + - Use `st.file_uploader()` to allow multiple file uploads. + - Test UI by running a simple Streamlit app. + +- **Implement `display_missing_data_options()`**: + - Provide options using `st.radio()` or `st.selectbox()`. + +- **Integrate UI Components**: + - Create a minimal Streamlit app (`main.py`) to test UI elements. + - Ensure that user selections are captured correctly. + +--- + +### **7. Integrate Data Modules with the UI** + +#### **File: `app.py` or `main.py`** + +- **Purpose**: Serves as the main entry point, integrating backend modules with UI. +- **Steps**: + +- **Set Up Streamlit App Structure**: + - Import necessary modules. + - Create a function `main()` that orchestrates the flow. + +- **Implement Data Loading Workflow**: + - Use UI components to get user input. + - Call `data_loader.load_and_preprocess_files(uploaded_files)`. + +- **Test End-to-End Flow**: + - Run the app and ensure data is loaded and displayed. + +--- + +### **8. Develop Entity Matching Logic** + +#### **File: `entity_matcher.py`** + +- **Purpose**: Matches entities across datasets using similarity metrics. +- **Functions to Implement**: + - `compute_similarity_scores(df_list, column_name)` + - `automated_entity_matching(similarity_df, threshold)` + - `construct_unique_parent_list(data_frames)` + - `construct_unique_child_list(data_frames)` + +#### **Steps**: + +- **Implement `compute_similarity_scores`**: + - Use similarity metrics such as Levenshtein distance. + - Consider using libraries like `fuzzywuzzy` or `RapidFuzz` for efficiency. + +- **Implement `automated_entity_matching`**: + - Group entities that exceed the similarity threshold. + +- **Test Matching Logic**: + - Use sample data with known matches to validate accuracy. + +--- + +### **9. Enhance UI for Entity Matching** + +#### **File: `ui_helper.py`** + +- **Functions to Implement**: + - `display_enriched_data(enriched_data_frames)` + - `download_enriched_data(enriched_data_frames)` + +#### **Steps**: + +- **Implement `display_enriched_data()`**: + - Use `st.dataframe()` or `st.table()` to display data. + - Paginate if the data is large. + +- **Implement `download_enriched_data()`**: + - Provide download links using `st.download_button()`. + +- **Integrate With Main App**: + - Update `app.py` or `Entity_Bridge.py` to call these functions after matching. + +--- + +### **10. Incorporate LLM Integration** + +#### **File: `llm_integration.py`** + +- **Purpose**: Enhances entity matching using LLMs for ambiguous cases. 
+- **Functions to Implement**: + - `setup_llm_client(provider, **credentials)` + - `generate_entity_mappings_with_llm(prompt, client, model_name)` + - `integrate_llm_in_entity_matching(similarity_df, client, model_name)` + +#### **Steps**: + +- **Start with One LLM Provider**: + - Implement integration with OpenAI API first. + - Ensure API keys are securely handled. + +- **Implement `setup_llm_client()`**: + - Create a function to initialize the OpenAI client. + +- **Implement LLM-based Matching**: + - Modify `automated_entity_matching` to use LLM outputs for improved accuracy. + +- **Test LLM Integration**: + - Use prompts to match entities and assess the quality of results. + +--- + +### **11. Finalize Data Enrichment** + +#### **File: `entity_matcher.py`** + +- **Functions to Implement**: + - `enrich_data_frames_with_unique_ids(data_frames, unique_parents, unique_children)` + +#### **Steps**: + +- **Implement Enrichment Function**: + - Map unique IDs back to the original data frames. + - Ensure that the enriched data frames retain all original information plus the new IDs. + +- **Validate Enriched Data**: + - Check that the mapping is correct and no data is lost. + +--- + +### **12. Add Error Handling and Logging** + +- **Implement Error Handling**: + - Use try-except blocks where necessary. + - Provide meaningful error messages to the user using `st.error()`. + +- **Implement Logging**: + - Use Python’s `logging` module to record events. + - Configure logging levels for development and production. + +--- + +### **13. Perform Testing** + +- **Unit Tests**: + - Write tests for utility functions and modules. + - Use frameworks like `unittest` or `pytest`. + +- **Integration Tests**: + - Test the entire workflow with different datasets. + - Ensure that the application behaves as expected under various conditions. + +--- + +### **14. Improve User Interface and Experience** + +#### **Enhancements**: + +- **Progress Indicators**: + - Use `st.progress()` to show processing status during long operations. + +- **User Feedback**: + - Add `st.success()`, `st.warning()`, and `st.info()` messages where appropriate. + +- **Help and Guidance**: + - Include tooltips and explanations using `st.tooltip()` or help arguments. + +- **Reset Options**: + - Implement a reset button to clear the session state. + +--- + +### **15. Documentation and Support** + +- **Code Documentation**: + - Ensure all modules and functions have comprehensive docstrings. + +- **User Guide**: + - Create a `README.md` with instructions on how to use the application. + +- **Technical Documentation**: + - Document the architecture and design decisions. + +--- + +### **16. Deployment** + +- **Prepare for Deployment**: + - Ensure all dependencies are listed in `requirements.txt`. + - Remove any hard-coded paths or debug code. + +- **Deploy the Application**: + - Consider using Streamlit Sharing, Heroku, or AWS for deployment. + - Securely manage secrets and API keys. + +--- + +### **17. Extensibility and Future Enhancements** + +- **Add Support for More LLMs**: + - Extend `llm_integration.py` to include other providers. + +- **Optimize Performance**: + - For large datasets, consider optimizing similarity calculations. + +- **UI Improvements**: + - Implement advanced features like undo actions or state management. + +--- + +## **Development Tips** + +- **Agile Approach**: + - Develop in sprints, focusing on delivering a functional component each time. + - Continuously integrate and test new code. 
+ +- **Version Control Best Practices**: + - Commit early and often. + - Use feature branches for new functionalities. + +- **Continuous Testing**: + - Run tests after each significant change. + - Consider setting up automated tests. + +- **Collaboration**: + - If working in a team, use pull requests and code reviews. + +- **Security**: + - Do not commit secrets or API keys. + - Use environment variables or a `.streamlit/secrets.toml` file for sensitive information. + +- **Feedback Loop**: + - Gather feedback from users and stakeholders. + - Iterate on the design based on feedback. + +--- + +## **Suggested Order of File Development** + +1. **`utils.py`**: Provides foundational functions needed elsewhere. + +2. **`data_loader.py`**: Enables data to be loaded and is essential for testing subsequent modules. + +3. **`data_normalizer.py`**: Depends on data from `data_loader.py` and utilities from `utils.py`. + +4. **`duplicate_remover.py`**: Requires normalized data to function correctly. + +5. **`ui_helper.py`**: Early development allows you to test the UI and get immediate feedback. + +6. **`main.py` and `Entity_Bridge.py`**: Integrate the above modules to form the core application. + +7. **`entity_matcher.py`**: Introduces core functionality after data can be loaded, normalized, and displayed. + +8. **`llm_integration.py`**: Optional for initial versions but adds significant value when implemented. + +9. **Testing and Documentation**: Should be done continuously but formalized once the core functionality is in place. + +--- + +By following this structured approach, you develop the application incrementally, ensuring that each component works correctly before moving on to the next. This reduces complexity and helps in isolating and fixing issues early in the development process. + +Remember to frequently run your application and test it with different datasets to ensure robustness and reliability. Good luck with your development! \ No newline at end of file diff --git a/streamlit_app/app/pages/Simple_Chat.py b/streamlit_app/app/pages/Simple_Chat.py index a09e5f7..4824540 100644 --- a/streamlit_app/app/pages/Simple_Chat.py +++ b/streamlit_app/app/pages/Simple_Chat.py @@ -197,7 +197,7 @@ def view_prompt_variable(var_name, var_data): try: models_response = openai_client.models.list() # Filter models to only include chat models - allowed_models = ['gpt-3.5-turbo', 'gpt-4', 'gpt-4o', 'gpt-4o-mini', 'o1-mini'] + allowed_models = ['gpt-3.5-turbo', 'gpt-4', 'gpt-4o', 'gpt-4o-mini', 'o1-mini', 'o1-preview'] model_names = [model.id for model in models_response.data if model.id in allowed_models] if not model_names: st.warning("No permitted models available for your OpenAI API key.")