Add text generation task type #569

Merged · 162 commits merged into main from add_llm_guided_metrics on Jul 26, 2024
Conversation

@bnativi (Contributor) commented Apr 29, 2024

Improvements

  • Added a new evaluate_text_generation task type that calculates nine new metrics. These include text comparison metrics, which compare a prediction string to a groundtruth string, and llm-guided metrics, which only sometimes require a groundtruth (see the first sketch after this list). The metrics are:
    • AnswerRelevance (Q&A, llm-guided)
    • Bias (general text generation, llm-guided)
    • BLEU (text comparison)
    • Coherence (general text generation, llm-guided)
    • ContextRelevance (RAG, llm-guided)
    • Faithfulness (RAG, llm-guided)
    • Hallucination (RAG, llm-guided)
    • ROUGE (text comparison)
    • Toxicity (general text generation, llm-guided)
  • Added a text generation notebook with three example use cases (RAG, summarization, and content generation).
  • Added WrappedOpenAIClient and WrappedMistralAIClient to handle llm calls and llm-guided metric computations (see the second sketch after this list).
  • Changed the Docker base image from alpine to slim.
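
For context on the text comparison metrics, the sketch below reproduces BLEU and ROUGE with standard libraries. It is illustrative only: it assumes the nltk and rouge-score packages are installed, the example strings are made up, and it is not the implementation added in this PR.

```python
# Illustrative only: compare a prediction string to a groundtruth string,
# mirroring what the BLEU and ROUGE text comparison metrics measure.
# Assumes `pip install nltk rouge-score`; not the code added in this PR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

groundtruth = "Mike Trout plays center field for the Los Angeles Angels."
prediction = "Mike Trout is a center fielder for the Angels."

# BLEU: n-gram precision of the prediction against the reference.
bleu = sentence_bleu(
    [groundtruth.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap (unigram and longest common subsequence here).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(target=groundtruth, prediction=prediction)

print(f"BLEU: {bleu:.3f}")
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})
```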

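The llm-guided metrics follow a different pattern: the metric value comes from prompting an LLM to judge the generated text. The second sketch below shows that pattern with the OpenAI client. The prompt, 1-5 scale, model name, and parsing are assumptions made for illustration; they are not the prompts or logic used by WrappedOpenAIClient in this PR.

```python
# Minimal sketch of an llm-guided metric (coherence-style scoring).
# The prompt, scale, model, and parsing here are illustrative assumptions;
# the actual WrappedOpenAIClient defines its own prompts and parsing.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def coherence_score(generated_text: str, model: str = "gpt-4o-mini") -> int:
    """Ask the LLM to rate the coherence of `generated_text` on a 1-5 scale."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You grade text. Reply with a single integer from 1 "
                           "(incoherent) to 5 (fully coherent), and nothing else.",
            },
            {"role": "user", "content": generated_text},
        ],
    )
    return int(response.choices[0].message.content.strip())

print(coherence_score("The quick brown fox jumped over the lazy dog twice."))
```
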
Testing

  • Added API functional and unit tests for the text generation metrics and llm clients.
  • Added client-side integration tests for text generation metrics.
  • Added external integration tests to exercise the llm-guided metrics against OpenAI's API and Mistral's API. Because those APIs do not give us fully deterministic control, the integration tests only check that valid metrics are returned and do not check the exact metric values (see the test sketch after this list).
    • These should only run on merge to main, and not on pushes to other branches (evidence: tests don't run when not merging to main).
    • They pass when run (evidence: pass on GitHub when OPENAI_API_KEY is set).
    • If I purposely try to make them fail, say by setting OPENAI_API_KEY to "", then they fail as expected (evidence: fail on GitHub when OPENAI_API_KEY is not set).
    • The secret API keys should only be available to the external API integration tests and not to the rest of the integration tests. When I added a test to integration_tests/client/ that makes non-mocked OpenAI API calls, that test fails because the API key is not available to it (evidence: regular integration tests fail when they try to make non-mocked OpenAI API calls).
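
As a rough illustration of the testing approach described above, a test of this shape skips itself when the API key is not configured and asserts only that the returned metric is valid, not that it equals an exact value. The helper, model, and metric range below are hypothetical stand-ins, not the actual tests added in this PR.

```python
# Hypothetical sketch of the external-API test pattern: assert that the
# llm-guided metric is a valid value rather than an exact number, since the
# live OpenAI and Mistral APIs are not deterministic.
import os
import pytest
from openai import OpenAI

requires_openai = pytest.mark.skipif(
    not os.environ.get("OPENAI_API_KEY"),
    reason="external OpenAI integration tests only run when the key is set",
)

def llm_guided_coherence(text: str) -> int:
    """Stand-in for the llm-guided coherence call under test (hypothetical)."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with one integer from 1 to 5 "
                                          "rating the coherence of the text."},
            {"role": "user", "content": text},
        ],
    )
    return int(response.choices[0].message.content.strip())

@requires_openai
def test_coherence_metric_is_valid():
    score = llm_guided_coherence("Some generated text to grade.")
    # Only check that a valid metric came back; the live API is not
    # deterministic, so no exact-value assertion is made.
    assert 1 <= score <= 5
```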

b.nativi added 30 commits April 19, 2024 22:07
@ntlind (Contributor) commented Jul 15, 2024

The code, tests, and notebook all look good to me. I'm ready to approve once we figure out why the benchmarks are now failing.

@bnativi merged commit b1d5030 into main on Jul 26, 2024
12 checks passed
@bnativi deleted the add_llm_guided_metrics branch on July 26, 2024 20:38