Issue about tool call metrics. #1658

Open
MattZ-99 opened this issue Nov 12, 2024 · 2 comments · May be fixed by #1687
Labels
module-metrics (this is part of metrics module) · question (Further information is requested)

Comments

@MattZ-99

Hi all, I was pleasantly surprised to see the ToolUse metrics in Ragas v0.2.

I have already implemented several useful metrics on top of the base Metric class, and now that the ToolUse section has been updated, I'd be glad to contribute them to the project.

As I'm new here, I'd like to have a discussion before opening a pull request.

  1. Metrics for parallel calling.

Parallel function calling is now a common feature of current LLMs, where "parallel" means multiple independent, unordered function calls. For example, the Berkeley Function-Calling Leaderboard provides a dedicated "parallel" category.

The case below shows a parallel tool call that fetches two pieces of information.

import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

# The agent answers with two independent tool calls in a single turn.
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

# The reference lists the same two calls, just in the opposite order.
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)

Both orderings of the tool calls should be acceptable, but the current ToolCallAccuracy only supports ordered (sequence-aligned) tool calls. That is, ToolCallAccuracy is not suitable for parallel calling.

scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0  (both calls are present, but their order differs from the reference)

As for the solution, I have two ideas:

  1. Add another metric, 'ToolCallParallelAccuracy', which focuses on parallel tool calls.
  2. Add a new parameter to ToolCallAccuracy, 'ordered_tool_calls=True/False', indicating whether the tool calls must appear in order.

Which one do you think is more flexible?
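
To make the idea concrete, here is a minimal sketch of the order-insensitive comparison that either option would need. tool_call_key and parallel_tool_call_accuracy are hypothetical helpers for illustration, not part of the current Ragas API; the sketch only assumes that ToolCall exposes .name and .args with hashable values:

from collections import Counter

def tool_call_key(tool_call):
    # Hashable representation of a tool call: its name plus sorted args.
    return (tool_call.name, tuple(sorted(tool_call.args.items())))

def parallel_tool_call_accuracy(pred_tool_calls, reference_tool_calls):
    # Order-insensitive match: fraction of reference calls found in the
    # prediction, counted as a multiset so duplicates are not double-counted.
    if not reference_tool_calls:
        return 0.0
    pred = Counter(tool_call_key(tc) for tc in pred_tool_calls)
    ref = Counter(tool_call_key(tc) for tc in reference_tool_calls)
    matched = sum((pred & ref).values())
    return matched / len(reference_tool_calls)

With this comparison, the parallel example above scores 1.0 whichever order the two calls appear in, while a prediction that checks the same city twice would only score 0.5.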

Besides this, I also have a small concern about the current ToolCallAccuracy.

# ref from _tool_call_accuracy:76~
sequence_aligned = int(
    self.is_sequence_aligned(tool_call_pred_sequence, tool_call_ref_sequence)
)

if pred_tool_calls:
    score = 0.0
    reference_tool_calls = sample.reference_tool_calls
    for ref_tool_call in reference_tool_calls:
        for pred_tool_call in pred_tool_calls:
            ...

Since tool_call_pred_sequence and tool_call_ref_sequence have already been checked for alignment, reference_tool_calls and pred_tool_calls should not then be compared with a double loop. The double loop produces the following inconsistency, as the two examples below show (a position-wise sketch follows them):

# Case 1: two different tools called in parallel; the reference order is reversed
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0

# Case 2: same tool called twice; argument order swapped relative to the reference
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0  (inconsistent with Case 1, which scored 0.0)
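
For comparison, a strictly position-wise check over the already-aligned sequences would at least make ordered scoring consistent: both cases above would score 0.0. A rough sketch with a hypothetical helper, not the actual Ragas implementation:

def ordered_tool_call_accuracy(pred_tool_calls, reference_tool_calls):
    # Compare call i in the prediction only with call i in the reference,
    # so matching arguments at a different position cannot raise the score.
    if not reference_tool_calls or len(pred_tool_calls) != len(reference_tool_calls):
        return 0.0
    matches = sum(
        1
        for pred, ref in zip(pred_tool_calls, reference_tool_calls)
        if pred.name == ref.name and pred.args == ref.args
    )
    return matches / len(reference_tool_calls)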
@MattZ-99
Author

Update: bug report

I'm now confident this is a bug. Below are several samples whose evaluation scores are clearly wrong.

# Case A: the prediction checks Shanghai twice and never checks New York,
# yet the score is a perfect 1.0.
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

# Case B: parallel calls to two different tools; reference order reversed
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0

# Case C: same tool called twice; arguments swapped relative to the reference
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

# Case D: across two turns, each call checks the wrong city for that turn
sample = [
    HumanMessage(content="What's the weather like in New York?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
    HumanMessage(content="So what about Shanghai?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

@shahules786
Member

Hey @MattZ-99, this is incredibly useful.

  1. Parallel tool calls: I also thought about this while building the tool call metrics. In real-world scenarios, both parallel and sequential tool calls can occur within one agent, so ideally a user should be able to evaluate both through a single interface (see the sketch after this list).
  2. The order of tool calls matters when they are not parallel; e.g., tool_Y might consume the output of tool_X, in which case it is important to ensure that tool_Y is only called after tool_X. There are many other such scenarios.
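
To illustrate point 1, here is a rough sketch of how a single interface could expose both behaviours, reusing the hypothetical ordered_tool_call_accuracy and parallel_tool_call_accuracy helpers sketched earlier in this thread; the class and flag names are illustrative, not an agreed design:

from dataclasses import dataclass

@dataclass
class ToolCallAccuracySketch:
    # When ordered=True, tool calls must match position by position
    # (tool_Y after tool_X); when ordered=False, they are compared as a
    # multiset, which is what parallel calls need.
    ordered: bool = True

    def score(self, pred_tool_calls, reference_tool_calls):
        if self.ordered:
            return ordered_tool_call_accuracy(pred_tool_calls, reference_tool_calls)
        return parallel_tool_call_accuracy(pred_tool_calls, reference_tool_calls)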
