
feat: Evaluate agents with GenAI Model Eval #1555

Merged
holtskinner merged 15 commits into GoogleCloudPlatform:main on Dec 18, 2024

Conversation

inardini
Contributor

@inardini inardini commented Dec 18, 2024

Description

This PR adds notebooks that use the new Gen AI Evaluation service for agent evaluation.

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
    • Your account is listed in CODEOWNERS for the file(s).
  • Ensure your Pull Request title follows the https://www.conventionalcommits.org/ specification.
  • Ensure the tests and linter pass (run nox -s format from the repository root to format).
  • Appropriate docs were updated (if necessary)

@inardini inardini requested a review from a team as a code owner December 18, 2024 10:03

Contributor

@code-review-assist code-review-assist bot left a comment
Hi @inardini, Gemini is currently reviewing this pull request. In the meantime, here's a summary for you and other reviewers to quickly grasp the changes and their purpose:

This pull request aims to evaluate agents using the Vertex AI GenAI Model Eval. The primary changes involve the addition of three new Jupyter notebooks:

  • gemini/evaluation/evaluating_crewai_agent.ipynb: This notebook demonstrates how to evaluate a CrewAI agent using Vertex AI Gen AI Evaluation. It covers building a local agent, preparing an evaluation dataset, and evaluating single tool usage, trajectory, and response generation (a minimal sketch of the trajectory evaluation follows this list).
  • gemini/evaluation/evaluating_langgraph_agent.ipynb: This notebook mirrors the CrewAI notebook for a LangGraph agent, adapting the same evaluation process to the LangGraph framework.
  • gemini/reasoning-engine/evaluating_crewai_agent_reasoning_engine_customized_template.ipynb: This notebook shows how to evaluate a CrewAI agent deployed on Vertex AI Reasoning Engine using a customized template. It covers building, deploying, and evaluating the agent with the same techniques as the previous notebooks, and also demonstrates a custom evaluation metric (see the sketch after the summary below).
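
For reviewers skimming without opening the notebooks, here is a minimal sketch of what a trajectory evaluation with the Gen AI Evaluation SDK looks like. The metric names, column schema, and tool-call format follow the service's documentation; the project ID and example tool call are placeholders, so the notebooks' actual code may differ:

```python
# Minimal sketch: trajectory evaluation with the Vertex AI Gen AI Evaluation SDK.
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # placeholder project

# Each trajectory is a list of tool calls the agent made (predicted)
# or should have made (reference).
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Get the product details for shoes"],
        "predicted_trajectory": [
            [{"tool_name": "get_product_details", "tool_input": {"product_name": "shoes"}}]
        ],
        "reference_trajectory": [
            [{"tool_name": "get_product_details", "tool_input": {"product_name": "shoes"}}]
        ],
    }
)

trajectory_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["trajectory_exact_match", "trajectory_precision", "trajectory_recall"],
)
result = trajectory_eval_task.evaluate()
print(result.summary_metrics)
```

Here result.summary_metrics aggregates per-metric scores across the dataset, while result.metrics_table holds the row-level results the notebooks visualize.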

All three notebooks provide detailed instructions, code examples, and visualizations to aid in understanding and replicating the evaluation process. They appear well-structured and comprehensive.
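
As a companion to the custom-metric mention above, here is a sketch of how a model-based custom metric can be defined with PointwiseMetric from the same SDK. The metric name, prompt template, and dataset are illustrative assumptions, not the notebook's exact code:

```python
# Sketch: a hypothetical model-based custom metric via PointwiseMetric.
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask, PointwiseMetric

vertexai.init(project="your-project-id", location="us-central1")  # placeholder project

# Illustrative criteria: does the final response follow from the tool calls?
# Curly-brace placeholders are filled from the eval dataset columns.
response_follows_trajectory = PointwiseMetric(
    metric="response_follows_trajectory",  # hypothetical metric name
    metric_prompt_template=(
        "You are evaluating whether an agent's response is consistent with the "
        "tools it called.\n"
        "Prompt: {prompt}\n"
        "Predicted trajectory: {predicted_trajectory}\n"
        "Response: {response}\n"
        "Score 1 if the response follows logically from the trajectory, "
        "otherwise 0, and explain briefly."
    ),
)

eval_dataset = pd.DataFrame(
    {
        "prompt": ["Get the product details for shoes"],
        "predicted_trajectory": [
            [{"tool_name": "get_product_details", "tool_input": {"product_name": "shoes"}}]
        ],
        "response": ["Shoes are available in sizes 6-12 for $49.99."],
    }
)

eval_task = EvalTask(dataset=eval_dataset, metrics=[response_follows_trajectory])
result = eval_task.evaluate()
print(result.summary_metrics)
```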


Here's a little haiku to lighten the mood:

Code flows like a stream,
Tests run, a gentle cascade,
Bugs vanish in peace.

@inardini inardini changed the title evaluate agents with GenAI Model Eval feat: Evaluate agents with GenAI Model Eval Dec 18, 2024
Contributor

@code-review-assist code-review-assist bot left a comment

This pull request introduces notebooks demonstrating agent evaluation using the Vertex AI Gen AI Evaluation service. The notebooks cover various agent frameworks (CrewAI, LangGraph) and deployment scenarios (local, Reasoning Engine). The structure is generally clear, with good explanations of the evaluation process and metrics. The use of helper functions for visualization and data handling is a plus, making the notebooks more user-friendly. The inclusion of Bring Your Own Dataset (BYOD) examples adds valuable flexibility.

However, there are areas for improvement, particularly in code consistency, error handling, and adherence to best practices. Some sections could benefit from more detailed explanations or examples, especially when introducing new concepts or tools. Additionally, there are opportunities to improve code efficiency and robustness. Addressing these points will enhance the overall quality and educational value of the notebooks.

@holtskinner holtskinner merged commit 08722bf into GoogleCloudPlatform:main Dec 18, 2024
5 checks passed