Use guardrails for code validation #345

Closed
yje-arch opened this issue Sep 18, 2023 · 9 comments
Labels: bug, Stale

Comments

@yje-arch

yje-arch commented Sep 18, 2023

Hi,

I am the maintainer of YiVal (https://github.com/YiVal/YiVal). Currently we are trying to use guardrails to help us generate valid Python code.

To get started, I did a quick evaluation; here is the code:
https://github.com/YiVal/YiVal/blob/master/demo/guardrails/run_leetcode.py

I downloaded 80 leetcode questions and asked gpt-3.5-turbo to generate Python code for each one; pass/fail is basically judged by whether the generated code can be executed. I just followed the colab here:
https://github.com/ShreyaR/guardrails/blob/main/docs/examples/bug_free_python_code.ipynb
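For reference, the pass/fail check amounts to something like this (a minimal sketch; passes_check is a made-up name, the real harness lives in run_leetcode.py):

import ast

def passes_check(code: str) -> bool:
    """Return True if the generated code parses and runs without raising."""
    try:
        ast.parse(code)  # is it syntactically valid Python?
        exec(code, {})   # does it execute without an exception?
        return True
    except Exception:
        return False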

I used the same prompt without guardrails as a comparison. Here are the results:
[Screenshot 2023-09-17: pass/fail and token-usage comparison, with and without guardrails]

As you can see, with guardrails the failure rate is higher and more tokens are used compared to the raw GPT API.

I am wondering if there is anything wrong with what I am doing here. Figuring this out could also help others who might hit the same issue.

Thanks in advance for taking a look!

@yje-arch yje-arch added the bug label Sep 18, 2023
@zsimjee
Collaborator

zsimjee commented Sep 18, 2023

Hi,

Thanks for the detailed issue. The additional tokens may be from the gr.json prompt key, which adds a decent amount of weight to the prompt. Do you have a comparison in which that is not used? I think that would be good data to collect.

Another thing you could do to reduce the token count is to use string-style validation instead of pydantic/structured validation. You can apply the BugFreePython validator to a string guard as described here: https://docs.guardrailsai.com/defining_guards/strings/. With a little bit of prompt engineering, I think you can slim down your token count considerably using this approach (see the sketch below). Would love to see results/help with this longer term! Feel free to post here or work with us on Discord!
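A rough sketch of that string-style setup, reusing the pieces from your script (BugFreePython, prompt, and leetcode_problem stand in for whatever you already have):

guard = gd.Guard.from_string(
    validators=[BugFreePython(on_fail="reask")],  # validate the string output directly
    prompt=prompt,  # plain-text prompt, no pydantic schema
    description="python code solving the given leetcode problem",
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    prompt_params={"leetcode_problem": leetcode_problem},
    model="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0,
    num_reasks=3,
)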

@yje-arch
Author

yje-arch commented Sep 18, 2023

Hi,
Thank you for your response!

I take your point about the token usage; retries naturally lead to more tokens being used.

However, I am particularly concerned about the quality of the output. As you can see from the attached image above, the plain OpenAI call used to generate the Python code:

response = await openai.ChatCompletion.acreate(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0,
    max_tokens=1000
)

appears to perform better than the code using guardrails:

guard = gd.Guard.from_pydantic(
    output_class=BugFreePythonCode, prompt=prompt_guardrail
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    prompt_params={"leetcode_problem": leetcode_problem},
    model="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0,
    num_reasks=3,
)

The accuracy achieved using the plain OpenAI API is 0.625, while it is 0.55 when using guardrails (the prompt is pretty much the same). This difference is significant and seems to contradict the purpose of using guardrails. Could you please help me understand if there is something I am missing, or if there are any recommendations you could provide to improve the accuracy with guardrails?

Thank you for your time and assistance.

@irgolic
Contributor

irgolic commented Sep 19, 2023

LLMs are better at generating code in a block than in a JSON field. Try using Guard.from_string, targeting a string output type, instead of Guard.from_pydantic, which generates dicts.

@yje-arch
Author

Thanks, I tried the following:

guard = gd.Guard.from_string(
    validators=[BugFreePython(on_fail="reask")],
    prompt=prompt,
    description="leetcode problem",
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    num_reasks=3,
)

And the validated_response is not executable since it is returned as a string; under the hood BugFreePython uses ast.parse, which the string can pass, but the text as returned cannot be executed by exec().

An example response:

Validated Output:

Sure! Here's a Python code snippet that solves the problem:

```python
def length_of_longest_substring(s):
    # Create a dictionary to store the characters and their indices
    char_map = {}
    # Initialize variables to keep track of the starting index and the longest substring length
    start = 0
    max_length = 0

    # Iterate through the string
    for i in range(len(s)):
        # Check if the current character is already in the dictionary and its index is greater than or equal to the start index
        if s in char_map and char_map[s] >= start:
            # Update the start index to the next character after the repeated character
            start = char_map[s] + 1
        # Update the dictionary with the current character and its index
        char_map[s] = i
        # Update the max length if the current substring length is greater
        max_length = max(max_length, i - start + 1)

    return max_length

# Test the function with the given examples
s1 = "abcabcbb"
print(length_of_longest_substring(s1))  # Output: 3

s2 = "bbbbb"
print(length_of_longest_substring(s2))  # Output: 1

s3 = "pwwkew"
print(length_of_longest_substring(s3))  # Output: 3
```

This code snippet uses a sliding window approach to find the length of the longest substring without repeating characters. It keeps track of the starting index of the current substring and updates it whenever a repeated character is found. The function returns the maximum length encountered during the iteration.
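A minimal workaround sketch on the caller's side (extract_python_block is a hypothetical helper, not guardrails functionality): pull the fenced code out of the response before calling exec():

import re

def extract_python_block(text: str) -> str:
    """Return the contents of the first ```python fence, or the text unchanged."""
    match = re.search(r"```python\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

code = extract_python_block(validated_response)
exec(code)  # only the code inside the fence runs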

@irgolic
Contributor

irgolic commented Sep 21, 2023

Some more prompt engineering might help, like asking it in the prompt to only return a python code block without any surrounding text.
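For example, the prompt could end with something like this (illustrative wording only; ${leetcode_problem} is the prompt variable the script already fills in via prompt_params):

prompt = """
Given the following leetcode problem, write a Python solution.

${leetcode_problem}

Return ONLY a single Python code block, with no explanation and no text
before or after the code block.
"""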

@yje-arch
Author

Thanks, does guardrails support this natively?

@irgolic
Contributor

irgolic commented Sep 25, 2023

This is the example we've got, though it generates the code as part of a JSON. https://docs.guardrailsai.com/examples/bug_free_python_code/#step-3-wrap-the-llm-api-call-with-guard

Have you tried using the prompt in that example, generating a string instead of a JSON?


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the Stale label Aug 22, 2024

github-actions bot commented Sep 5, 2024

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 5, 2024