Use guardrails for code validation #345

Closed
yje-arch opened this issue Sep 18, 2023 · 9 comments
Labels: bug, Stale

Comments

@yje-arch

yje-arch commented Sep 18, 2023

Hi,

I am the maintainer of YiVal (https://github.com/YiVal/YiVal). Currently we are trying to use guardrails to help us generate valid Python code.

To get started, I did a quick evaluation; here is the code:
https://github.com/YiVal/YiVal/blob/master/demo/guardrails/run_leetcode.py

I downloaded 80 leetcode questions and asked gpt-3.5-turbo to generate Python code for each one; pass/fail is basically judged by whether the generated code can be executed. I just followed the colab here:
https://github.com/ShreyaR/guardrails/blob/main/docs/examples/bug_free_python_code.ipynb
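For reference, the pass/fail check amounts to something like this (a minimal sketch; passes_check is a made-up name, the real harness lives in run_leetcode.py):

import ast

def passes_check(code: str) -> bool:
    """Return True if the generated code parses and runs without raising."""
    try:
        ast.parse(code)  # is it syntactically valid Python?
        exec(code, {})   # does it execute without an exception?
        return True
    except Exception:
        return False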

I used the same prompt without guardrails as a comparison. Here are the results:
[Screenshot 2023-09-17: pass/fail and token-usage comparison, with and without guardrails]

As you can see, with guardrails the failure rate is higher and more tokens are used compared to the raw GPT API.

I am wondering if there is anything wrong with what I am doing here. Figuring this out could also help others who might hit the same issue.

Thanks in advance for taking a look!

@yje-arch yje-arch added the bug label Sep 18, 2023
@zsimjee
Collaborator

zsimjee commented Sep 18, 2023

Hi,

Thanks for the detailed issue. The additional tokens may be from the gr.json prompt key, which adds a decent amount of weight to the prompt. Do you have a comparison in which that is not used? I think that would be good data to collect.

Another thing you could do to reduce the token count is to use string-style validation instead of pydantic/structured validation. You can apply the BugFreePython validator to a string guard as described here: https://docs.guardrailsai.com/defining_guards/strings/. With a little bit of prompt engineering, I think you can slim down your token count considerably using this approach (see the sketch below). Would love to see results/help with this longer term! Feel free to post here or work with us on Discord!
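A rough sketch of that string-style setup, reusing the pieces from your script (BugFreePython, prompt, and leetcode_problem stand in for whatever you already have):

guard = gd.Guard.from_string(
    validators=[BugFreePython(on_fail="reask")],  # validate the string output directly
    prompt=prompt,  # plain-text prompt, no pydantic schema
    description="python code solving the given leetcode problem",
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    prompt_params={"leetcode_problem": leetcode_problem},
    model="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0,
    num_reasks=3,
)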

@yje-arch
Author

yje-arch commented Sep 18, 2023

Hi,
Thank you for your response!

I take your point about the token usage; retries naturally lead to more tokens being used.

However, I am particularly concerned about the quality of the output. As you can see from the attached image above, the plain OpenAI call used to generate the Python code:

response = await openai.ChatCompletion.acreate(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0,
    max_tokens=1000
)

appears to perform better than the code using guardrails:

guard = gd.Guard.from_pydantic(
    output_class=BugFreePythonCode, prompt=prompt_guardrail
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    prompt_params={"leetcode_problem": leetcode_problem},
    model="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0,
    num_reasks=3,
)

The accuracy achieved using the plain OpenAI API is 0.625, while it is 0.55 when using guardrails (the prompt is pretty much the same). This difference is significant and seems to contradict the purpose of using guardrails. Could you please help me understand if there is something I am missing, or if there are any recommendations you could provide to improve the accuracy with guardrails?

Thank you for your time and assistance.

@irgolic
Contributor

irgolic commented Sep 19, 2023

LLMs are better at generating code in a block than in a JSON field. Try using Guard.from_string, targeting a string output type, instead of Guard.from_pydantic, which generates dicts.

@yje-arch
Author

Thanks, I tried the following:

guard = gd.Guard.from_string(
    validators=[BugFreePython(on_fail="reask")],
    prompt=prompt,
    description="leetcode problem",
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    num_reasks=3,
)

And the validated_response is not executable since it is returned as a string; under the hood BugFreePython uses ast.parse, which the string can pass, but the text as returned cannot be executed by exec().

An example response:

Validated Output:

Sure! Here's a Python code snippet that solves the problem:

```python
def length_of_longest_substring(s):
    # Create a dictionary to store the characters and their indices
    char_map = {}
    # Initialize variables to keep track of the starting index and the longest substring length
    start = 0
    max_length = 0

    # Iterate through the string
    for i in range(len(s)):
        # Check if the current character is already in the dictionary and its index is greater than or equal to the start index
        if s in char_map and char_map[s] >= start:
            # Update the start index to the next character after the repeated character
            start = char_map[s] + 1
        # Update the dictionary with the current character and its index
        char_map[s] = i
        # Update the max length if the current substring length is greater
        max_length = max(max_length, i - start + 1)

    return max_length

# Test the function with the given examples
s1 = "abcabcbb"
print(length_of_longest_substring(s1))  # Output: 3

s2 = "bbbbb"
print(length_of_longest_substring(s2))  # Output: 1

s3 = "pwwkew"
print(length_of_longest_substring(s3))  # Output: 3
```

This code snippet uses a sliding window approach to find the length of the longest substring without repeating characters. It keeps track of the starting index of the current substring and updates it whenever a repeated character is found. The function returns the maximum length encountered during the iteration.
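A minimal workaround sketch on the caller's side (extract_python_block is a hypothetical helper, not guardrails functionality): pull the fenced code out of the response before calling exec():

import re

def extract_python_block(text: str) -> str:
    """Return the contents of the first ```python fence, or the text unchanged."""
    match = re.search(r"```python\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

code = extract_python_block(validated_response)
exec(code)  # only the code inside the fence runs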

@irgolic
Contributor

irgolic commented Sep 21, 2023

Some more prompt engineering might help, like asking it in the prompt to only return a python code block without any surrounding text.
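For example, the prompt could end with something like this (illustrative wording only; ${leetcode_problem} is the prompt variable the script already fills in via prompt_params):

prompt = """
Given the following leetcode problem, write a Python solution.

${leetcode_problem}

Return ONLY a single Python code block, with no explanation and no text
before or after the code block.
"""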

@yje-arch
Author

Thanks, does guardrails support this natively?

@irgolic
Contributor

irgolic commented Sep 25, 2023

This is the example we've got, though it generates the code as part of a JSON. https://docs.guardrailsai.com/examples/bug_free_python_code/#step-3-wrap-the-llm-api-call-with-guard

Have you tried using the prompt in that example, generating a string instead of a JSON?


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the Stale label Aug 22, 2024

github-actions bot commented Sep 5, 2024

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 5, 2024