Deepmark AI provides a testing environment for assessing large language models (LLMs) on task-specific metrics and on your own data, so your GenAI-powered solution has predictable and reliable performance.

IngestAI/deepmark

Deepmark AI empowers Generative AI builders to make informed decisions when choosing among Large Language Models (LLMs), enabling seamless assessment of various LLMs on your own data, so your AI applications have predictable and reliable performance.

Introduction

Artificial Intelligence (AI) is expected to contribute approximately $15.7 trillion to the global economy by 2030, according to a study by PwC. As AI plays a growing role across domains, Generative AI and Large Language Models (LLMs) have emerged as powerful building blocks for AI-powered applications capable of generating substantial business value.

Why Are We Doing This? - Problem Statement

AI sparked a revolution in the last decade, and AI subject matter experts at MIT (https://horizon.mit.edu/about-us) believe that Generative AI will further transform domains such as code development, chatbots, and audio/video, among many others. With the rise of Generative AI companies such as OpenAI and products such as ChatGPT, legal, ethical, and trust issues have emerged. These challenges call for rigorous assessment of these products, including metrics that help improve or rank the models driving the technology. The lack of such assessment is also a roadblock to the adoption of GenAI in many companies today.

According to a recent HBR report, Generative AI cannot operate on a set-it-and-forget-it basis: the tools need constant oversight.

Although assessment metrics are clearly defined, and intrinsic metrics are normally assessed almost instantly when an LLM is released, there are no available tools (open-source or proprietary) that enable developers to seamlessly make task-specific (extrinsic) assessments on their own data. The closest solution is LangChain's LangSmith, which is still in closed beta and is not mature enough to provide the comprehensive extrinsic metrics that are essential for adoption.

In summary, organizations need to be able to assess LLMs on their own data to deliver verifiable results that balance accuracy, precision, recall (the model’s ability to correctly identify positive cases within a given dataset), and reliability. Reliability matters because models can produce different answers to the same prompt, which impedes the user’s ability to assess the accuracy of outputs.
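As a rough illustration of the accuracy, precision, and recall mentioned above, the following minimal sketch scores binary predictions against expected labels. The `ClassificationScores` and `score_binary` names are hypothetical and are not part of Deepmark AI.

```python
from dataclasses import dataclass


@dataclass
class ClassificationScores:
    accuracy: float
    precision: float
    recall: float


def score_binary(expected: list[bool], predicted: list[bool]) -> ClassificationScores:
    """Compute accuracy, precision, and recall for binary labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one example is required")

    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)          # true positives
    fp = sum(1 for e, p in pairs if not e and p)      # false positives
    fn = sum(1 for e, p in pairs if e and not p)      # false negatives
    tn = len(pairs) - tp - fp - fn                    # true negatives

    # Fall back to 0.0 when a denominator is empty (no positives seen).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return ClassificationScores(accuracy, precision, recall)
```

Recall here is the share of truly positive cases that the model identified, matching the definition given in the paragraph above.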

Our Solution

To address this challenge of reliability, we (IngestAI Labs) have developed Deepmark AI, a benchmarking tool that enables assessment of large language models (LLMs) on various extrinsic (task-specific) metrics on your own data. It has pre-built integrations with leading Generative AI APIs such as GPT-4, Anthropic, GPT-3.5 Turbo, Cohere, AI21, and others.

Current GenAI (LLM) Assessment Metrics

When it comes to assessing the performance of LLMs, there are two main types of metrics that can be used: intrinsic and extrinsic.

Examples of intrinsic metrics include, but are not limited to:

  • Entropy,
  • Perplexity,
  • Coherence, etc.

Extrinsic metrics, also called task-specific metrics, may include:

  • Accuracy,
  • Latency,
  • Cost, etc.

These lists are not exhaustive: specific applications may call for additional or alternative metrics depending on context and requirements. Task-specific metrics such as latency, accuracy, and cost are among the most commonly used.

Deepmark AI provides a testing environment for large language models (LLMs) that lets GenAI developers quickly diagnose inaccuracies and performance issues. With Deepmark AI, developers of Generative AI applications can run multiple LLMs for hundreds or thousands of iterations over specific tasks (question answering, sentiment analysis, NER, etc.) and get assessment results in seconds.
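The idea of running several models over many iterations of the same tasks can be sketched as a simple loop. This is an illustration only, not Deepmark AI's implementation: `ModelFn` and `run_benchmark` are hypothetical names, and each model is assumed to be a plain function from prompt to text standing in for a real API client.

```python
from typing import Callable

# Hypothetical stand-in for an LLM API client: takes a prompt, returns text.
ModelFn = Callable[[str], str]


def run_benchmark(
    models: dict[str, ModelFn],
    tasks: list[tuple[str, str]],
    iterations: int,
) -> dict[str, float]:
    """Return exact-match accuracy per model over all tasks and iterations."""
    total = len(tasks) * iterations
    if total <= 0:
        raise ValueError("need at least one task and one iteration")

    results: dict[str, float] = {}
    for name, model in models.items():
        correct = 0
        for _ in range(iterations):
            for prompt, expected in tasks:
                answer = model(prompt)
                # Compare case-insensitively, ignoring surrounding whitespace.
                if answer.strip().lower() == expected.strip().lower():
                    correct += 1
        results[name] = correct / total
    return results
```

Repeating each task several times matters because the same prompt can yield different answers across calls, so a single pass may overstate or understate a model's accuracy.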

Deepmark AI is a tool specifically designed for Generative AI builders. It focuses on iterative assessment of extrinsic (task-specific) metrics to identify the most predictable, reliable, and cost-effective Generative AI models for the unique needs of a particular use case. Deepmark AI supports comprehensive assessment of important GenAI performance metrics, such as:

  • Question answering accuracy
  • Text classification accuracy
  • PII recognition accuracy
  • Named entity recognition (NER) accuracy
  • Summarization quality (Relevance)
  • Sentiment analysis accuracy
  • Cost analysis
  • Failure rate
  • Accuracy
  • Latency

Deepmark AI empowers developers and organizations to make informed decisions when navigating through the most important performance metrics of Large Language Models.

User Adoption:

Since its launch in February 2023, the IngestAI Labs platform (Playground, AI Aggregator, App Builder) has quickly gained popularity as a community-driven platform for exploration, experimentation, and rapid prototyping of various AI use cases.

The platform has gained significant industry recognition:

  • StartX AI Series,
  • ProductHunt Product #1 of the Day,
  • Accelerated by the PLUGandPLAY Silicon Valley program, and
  • Backed by the Cohere Acceleration Program.

In less than one year, IngestAI has built a user base of over 40,000 individuals, with nearly 15,000 monthly active users and a few NASDAQ-traded companies among its customers and in the pipeline. This level of traction speaks to the platform's ability to attract and engage users and generate business value.

Key features of Deepmark AI include

Reliability Assessment

Reliability is a critical factor in determining the effectiveness of Generative AI models. DeepMark.AI offers comprehensive reliability assessments by evaluating model performance under various conditions and capturing potential failure points. This enables developers to identify areas for improvement and enhance the overall reliability of their AI applications.
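One simple way to quantify reliability is to send the same prompt several times and measure how often the answers agree. The sketch below assumes that definition; `consistency_rate` is a hypothetical helper, not a DeepMark.AI function.

```python
from collections import Counter


def consistency_rate(answers: list[str]) -> float:
    """Share of repeated runs that agree with the most common answer."""
    if not answers:
        raise ValueError("at least one answer is required")
    # Normalize so trivial differences in case or whitespace do not count.
    normalized = [answer.strip().lower() for answer in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)
```

A rate of 1.0 means every run returned the same answer; lower values indicate that the model's output varies for an identical prompt.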

Accuracy Evaluation

Ensuring the accuracy of Generative AI models is essential for generating high-quality outputs. DeepMark.AI provides developers with tools to rigorously evaluate the accuracy of their models through extensive testing and validation procedures. By leveraging advanced statistical techniques and comparison methodologies, developers can derive meaningful insights into the accuracy of their Generative AI applications.
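The text does not say which statistical techniques are used. As one assumed example, a Wilson score interval shows how much uncertainty remains around an accuracy figure measured on a finite number of test cases. `wilson_interval` is a hypothetical helper written for this illustration.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 50 correct answers out of 100, the interval spans roughly 0.40 to 0.60, which is a reminder that two models a few points apart on a small test set may not be meaningfully different.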

Cost Analysis

Understanding the cost implications before deploying Generative AI models is vital for optimizing resource allocation and maximizing return on investment. DeepMark.AI incorporates cost analysis, enabling developers to make precise estimations of the financial requirements associated with running their AI applications on different GenAI models. By providing cost projections, DeepMark.AI helps developers make informed decisions to achieve cost-effective solutions.
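A cost projection of this kind typically multiplies expected token counts by per-token prices. The sketch below assumes per-1,000-token pricing with separate input and output rates; the `Pricing` class, the `estimate_cost` function, and any prices passed to them are placeholders, not actual provider rates or DeepMark.AI code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in USD per 1,000 tokens (placeholder values, not real rates)."""
    input_per_1k: float
    output_per_1k: float


def estimate_cost(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    requests: int = 1,
) -> float:
    """Estimate total cost for a number of requests of the same size."""
    per_request = (
        (input_tokens / 1000) * pricing.input_per_1k
        + (output_tokens / 1000) * pricing.output_per_1k
    )
    return per_request * requests
```

For example, with placeholder rates of 0.01 (input) and 0.03 (output), 1,000 requests of 500 input and 200 output tokens each would be estimated at 11.00 USD.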

Relevance Assessment

Ensuring the relevance of generated outputs is critical, especially in applications where Generative AI is employed to address specific use cases. DeepMark.AI facilitates relevance assessment by providing developers with tools to compare generated outputs against desired criteria. This allows developers to fine-tune their models and ensure the generated content aligns with the intended goals and requirements.
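How relevance is scored is not specified here. As a stand-in, the sketch below compares a generated text against a reference using word-overlap F1, a simple and common proxy; `token_f1` is a hypothetical helper and is not claimed to be the method DeepMark.AI uses.

```python
import re
from collections import Counter


def token_f1(generated: str, reference: str) -> float:
    """Word-overlap F1 between a generated text and a reference text."""
    gen_tokens = re.findall(r"\w+", generated.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not gen_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection counts each shared word at most as often
    # as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Word overlap rewards shared vocabulary rather than shared meaning, so it is best treated as a quick first check rather than a full relevance judgment.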

Latency Assessment

The assessment of latency in APIs for Generative AI models is of critical importance to deliver high-quality, efficient AI-powered applications. Latency denotes the time taken to get a response after a request is made and is a potential indicator of performance. By evaluating latency, AI developers can identify inefficiencies and ensure that AI applications perform at an optimal speed. This contributes to overall user satisfaction and impacts the reliability and credibility of AI applications.
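Latency as defined above (time from request to response) can be measured by timing repeated calls and summarizing the distribution. In this sketch, `call` is a hypothetical zero-argument function wrapping one API request, and the 95th percentile uses the nearest-rank method.

```python
import math
import statistics
import time
from typing import Callable


def measure_latencies(call: Callable[[], object], runs: int) -> list[float]:
    """Time `runs` invocations of `call` and return latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: list[float]) -> dict[str, float]:
    """Return mean, median, nearest-rank p95, and max latency."""
    if not latencies:
        raise ValueError("at least one latency is required")
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }
```

Reporting a high percentile alongside the mean is useful because a small number of slow responses can dominate the user experience without moving the average much.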

Failure Rate Assessment

Assessing and monitoring failure rates over hundreds or thousands of requests is essential to evaluating the robustness of Generative AI applications. DeepMark.AI offers failure-rate assessment capabilities, allowing developers to track failure rates at various scales, from hundreds to thousands of requests per second. By providing insights into potential failure patterns, DeepMark.AI enables developers to proactively address issues and maintain optimal performance.
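A failure rate is the share of requests that do not produce a usable response. The sketch below assumes that a raised exception or an empty response counts as a failure; `failure_rate` and the `call` parameter are hypothetical, with `call` standing in for one API request identified by its index.

```python
from typing import Callable


def failure_rate(call: Callable[[int], str], requests: int) -> float:
    """Fraction of requests that raise an error or return an empty response."""
    if requests <= 0:
        raise ValueError("requests must be positive")

    failures = 0
    for index in range(requests):
        try:
            response = call(index)
        except Exception:
            # Timeouts, rate limits, and server errors would surface here.
            failures += 1
            continue
        if not response.strip():
            # Assumption: an empty response is treated as a failure.
            failures += 1
    return failures / requests
```

This version issues requests one after another; measuring behavior under concurrent load would need a different harness.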

Key Benefits of Deepmark AI

Incorporating the DeepMark.AI technology developed by IngestAI Labs into an AI development process can yield numerous advantages, including:

Predictability and Cost-effectiveness

DeepMark.AI prioritizes predictability and cost-effectiveness by providing developers with reliable assessment metrics, cost estimations, and optimization recommendations. This empowers developers to make informed decisions, reducing the risks associated with designing and deploying Generative AI applications.

Data-driven Decision-making

By leveraging data and rigor, DeepMark.AI enables organizations to move away from relying solely on intuition when assessing Generative AI models. This data-driven approach instills confidence in the decision-making process, allowing for greater precision and accuracy in AI applications development.

Enhances Application Quality

The ability of DeepMark.AI to comprehensively assess reliability, accuracy, relevance, and cost-efficiency contributes to the overall quality of AI applications. Through continuous monitoring or periodic assessment, developers can iteratively improve their models (e.g. by improving metaprompts or fine-tuning), ensuring optimal performance and user satisfaction.

Path Forward

IngestAI is building its own bias detection model based on a proprietary comparative dataset of more than 7.5 million varied requests and responses from different large language models. These are being labeled and used to train, test, and refine the identification of bias-related contexts, as well as the real-time detection and resolution of biases and unsafe prompts or responses.

Deepmark AI is a tool for AI application developers, built on top of proprietary ML models, that provides reliable assessments of predictability, accuracy, cost-efficiency, and other benchmark metrics. By prioritizing safety, truthfulness, predictability, and cost-effectiveness, while leveraging data and rigor, Deepmark AI empowers developers to build high-quality, reliable Generative AI-powered applications. With its comprehensive features and benefits, Deepmark AI opens up new possibilities for organizations seeking to harness the true potential of Generative AI.

IngestAI DeepMark Setup via Docker Image

Docker Image: https://hub.docker.com/r/embedditor/deepmark

You can find detailed instructions on the Docker web page.

IngestAI DeepMark Setup via GitHub

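  1. Install Laravel

  2. Run php artisan storage:link to create the public storage symlink

  3. Run php artisan queue:table to generate the migration for the queue jobs table

  4. Run php artisan migrate to apply the database migrations

  5. Set BEARER_TOKEN in the .env file

  6. Use the token from step 5 as the value of the HTTP header "X-Bearer-Token" in your requests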
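As a sketch of step 6, the snippet below builds an HTTP request that carries the token in the "X-Bearer-Token" header. The `build_request` helper and the URL in the usage note are hypothetical; this README does not list the actual API endpoints, and reading the token from a `BEARER_TOKEN` environment variable is an assumption made for the example.

```python
import os
import urllib.request


def build_request(url: str, token: str) -> urllib.request.Request:
    """Build a request that authenticates with the X-Bearer-Token header."""
    return urllib.request.Request(
        url,
        headers={
            "X-Bearer-Token": token,
            "Accept": "application/json",
        },
    )


# Assumption: the same value set as BEARER_TOKEN in .env is exported locally.
token = os.environ.get("BEARER_TOKEN", "")
```

To send it, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_request("http://localhost:8000/api/example", token))`, replacing the placeholder URL with a real endpoint of your deployment.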

Install frontend

  1. Install node.js and npm on your local machine; see the documentation at https://nodejs.org/
  2. The stable node.js version for this project is 16.16.0. You can use nvm (https://github.com/nvm-sh/nvm) to install several node versions on one machine
  3. Go to the project root directory and run npm i in your terminal
  4. To build the project for development, run npm run dev; for production, run npm run build
  5. For the local version, follow the link shown in the terminal
