
Create MVP AI console #934

Merged · wwwillchen merged 5 commits from ai-console into google:main on Sep 10, 2024
Conversation

@wwwillchen (Collaborator) commented Sep 10, 2024

Overview

This PR introduces an AI Console, a Mesop CRUD and dashboard app, along with a core set of modules that provide a more structured approach, particularly around the entities and their persistence. See console.py for an overview of the entities.
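
As a rough sketch of the shape of these entities (the field names below are placeholders, not the actual schema - see console.py and the ai/common modules for the real definitions):

```python
# Hypothetical sketch only - field names are illustrative, not the real schema.
from dataclasses import dataclass


@dataclass
class Producer:
  """Pairs a model with the prompt context/settings used to produce output."""

  id: str
  model_id: str
  prompt_context_id: str
  temperature: float = 0.0


@dataclass
class GoldenExample:
  """A curated input/output pair used for evals and fine-tuning."""

  id: str
  prompt: str
  expected_output: str
```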

Workflows supported

  • Create/update producers, models, prompt contexts, prompt fragments, expected examples (for evals), golden examples
  • Run evals
  • AI service (for editor toolbar) uses producers

Screenshot:

[screenshot of the AI Console dashboard]

Future work

Feature parity

There are still a couple of things left to do to reach feature parity with our existing AI modules, and then we can delete those:

  • Format golden dataset (to upload for fine-tuning) - @richard-to, if you can help with this, since you did it for Gemini, that'd be helpful :)
  • Support example variables in prompt fragments - this could allow us to do more powerful few-shot prompting, which is what I think you were going for earlier (rough sketch below).
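
Roughly the idea for example variables (the fragment syntax and names below are made up to illustrate, not the actual API):

```python
# Hypothetical sketch - the fragment syntax and variable names are illustrative only.
FRAGMENT = "<example>\nInput: {example_input}\nOutput: {example_output}\n</example>"

few_shot_examples = [
  {"example_input": "add a button", "example_output": "me.button('Click me')"},
  {"example_input": "show a greeting", "example_output": "me.text('Hello, world!')"},
]

# Expand the fragment once per example to build the few-shot section of the prompt.
few_shot_block = "\n".join(FRAGMENT.format(**ex) for ex in few_shot_examples)
```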

More ideas

Having a UI makes it easier to improve our workflows in the future, for example, we could:

  • Create a button to generate more prompts for expected examples.
  • Create a button to turn an evaluated example into an expected example (e.g. simulate a follow-up interaction, e.g. editing a specific component) or into a golden example (e.g. save the best evaluated example).
  • Support selecting different producers from the editor toolbar (this makes it easier to experiment with different models/settings in a more realistic workflow).
  • Support pasting before/after to create a golden example (and generate the diff)

@richard-to (Collaborator) left a comment


Very cool. This looks pretty awesome. I haven't gotten a chance to actually run and test out the code yet, but I'm definitely curious how the workflow will feel compared to the more text- and command-line-driven approach we used previously.

Definitely seems like an improvement over the previous workflow.

Also, the lambdas on the event handlers! Didn't know we could do that now. Makes things much nicer.
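
For anyone following along, roughly the pattern I mean (the page and handler names here are made up, just to illustrate):

```python
# Hypothetical sketch - a lambda closing over per-row data as a Mesop click handler.
import mesop as me


def delete_example(example_id: str):
  print(f"deleting {example_id}")  # stand-in for the real persistence call


@me.page(path="/examples")
def examples_page():
  for example_id in ["ex-1", "ex-2"]:
    # The lambda captures example_id via a default arg, so each button
    # acts on its own row instead of whichever id the loop ended on.
    me.button("Delete", on_click=lambda e, ex=example_id: delete_example(ex))
```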

Review comments (resolved) on:
  • ai/src/service.py
  • ai/src/console.py
  • ai/src/ai/console/scaffold.py
  • ai/src/ai/common/executor.py
  • ai/src/ai/common/prompt_fragment.py
  • ai/src/ai/console/pages/add_edit_eval_page.py
@richard-to (Collaborator)

> Support selecting different producers from the editor toolbar (this makes it easier to experiment with different models/settings in a more realistic workflow).

Yes, that would be pretty helpful. I think one case is supporting a producer that returns the full output (and not just the diff). I think that affects the visual editor UI slightly since it shows the diff fragment.

Also helpful, I think, would be a way to have goldens that could be used for both diff and full outputs. I think it's probably already possible with the patched.py output, but I haven't tested whether that's the case.

I do think that, in the end, the diff approach will be the most efficient, especially once it can handle multiple diff changes. For example, one question I had was what the diff would look like if I wanted to add a new function at, say, line X of the code, which is a blank line. How would it determine which blank line to replace? I was also wondering if returning the diff format in JSON could improve things, especially with an enforced JSON structured output.

> Create a button to turn an evaluated example into an expected example (e.g. simulate a follow-up interaction, e.g. editing a specific component) or into a golden example (e.g. save the best evaluated example).

Yeah I think that would be very useful. It could also be helpful if more than one example could be generated per example input.

> Support pasting before/after to create a golden example (and generate the diff)

Yes +1

@wwwillchen (Collaborator, Author)

> > Support selecting different producers from the editor toolbar (this makes it easier to experiment with different models/settings in a more realistic workflow).
>
> Yes, that would be pretty helpful. I think one case is supporting a producer that returns the full output (and not just the diff). I think that affects the visual editor UI slightly since it shows the diff fragment.
>
> Also helpful, I think, would be a way to have goldens that could be used for both diff and full outputs. I think it's probably already possible with the patched.py output, but I haven't tested whether that's the case.

+1 - agree it should be doable.

> I do think that, in the end, the diff approach will be the most efficient, especially once it can handle multiple diff changes. For example, one question I had was what the diff would look like if I wanted to add a new function at, say, line X of the code, which is a blank line. How would it determine which blank line to replace? I was also wondering if returning the diff format in JSON could improve things, especially with an enforced JSON structured output.

I think you need to have bigger replacement targets, otherwise you'll get ambiguity of what to replace.

https://aider.chat/docs/unified-diffs.html
https://aider.chat/docs/benchmarks.html

I think trying out udiff would be interesting, as models like Gemini Flash don't seem to understand the diff patch format and want to return the unified diff format instead.
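
A tiny made-up illustration of the ambiguity problem, using plain string replacement as a stand-in for the patcher (not our actual code):

```python
# Hypothetical sketch - why a bare blank line is a bad replacement target:
# both blank lines below match, so a naive patcher just picks the first one.
source = (
  "def foo():\n"
  "  return 1\n"
  "\n"
  "def bar():\n"
  "  return 2\n"
  "\n"
)

new_function = "def baz():\n  return 3\n\n"

# We intend to insert baz() after bar(), but the first blank line wins,
# so baz() lands after foo() instead.
patched = source.replace("\n\n", "\n\n" + new_function, 1)
print(patched)

# Including surrounding context lines (e.g. "  return 2\n\n") in the
# replacement target removes the ambiguity.
```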

I'm a little bearish on the JSON format because I've seen Gemini return very strange results when you enforce a JSON schema on the response (the model starts repeating tokens over and over again). There's also a paper suggesting that structured responses can hurt performance: https://arxiv.org/abs/2408.02442

Of course, like everything else in AI, it's worth experimenting :)

> > Create a button to turn an evaluated example into an expected example (e.g. simulate a follow-up interaction, e.g. editing a specific component) or into a golden example (e.g. save the best evaluated example).
>
> Yeah I think that would be very useful. It could also be helpful if more than one example could be generated per example input.

Agree, I've noticed that generating more outputs with the exact same input can get significantly different results.

> > Support pasting before/after to create a golden example (and generate the diff)
>
> Yes +1

@wwwillchen (Collaborator, Author)

Thanks for the detailed review!

@wwwillchen merged commit 6f021e8 into google:main on Sep 10, 2024 (1 of 2 checks passed)
@wwwillchen deleted the ai-console branch on September 10, 2024 at 23:33
@wwwillchen (Collaborator, Author) commented Sep 11, 2024 via email
