Issue about tool call metrics. #1658

Open
MattZ-99 opened this issue Nov 12, 2024 · 2 comments · May be fixed by #1687
Labels
module-metrics (this is part of metrics module) · question (Further information is requested)

Comments

@MattZ-99

Hi all, I was pleasantly surprised to see the ToolUse metrics in Ragas v0.2.

I have already implemented several useful metrics on top of the base Metric class, and now that the ToolUse section has been updated, I'd be glad to contribute them to the project.

As I'm new here, I'd like to have a discussion before opening a pull request.

  1. Metrics for parallel calling.

Parallel function calling is now a common feature of current LLMs, where "parallel" means multiple independent, unordered function calls. For example, the Berkeley Function-Calling Leaderboard provides a dedicated "parallel" category.

The case below shows a parallel tool call that fetches two pieces of information.

import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

# The agent answers with two independent tool calls in a single turn.
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

# The reference lists the same two calls, just in the opposite order.
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)

Both orderings of the tool calls should be acceptable, but the current ToolCallAccuracy only supports ordered (sequence-aligned) tool calls. That is, ToolCallAccuracy is not suitable for parallel calling.

scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0  (both calls are present, but their order differs from the reference)

As for the solution, I have two ideas:

  1. Add another metric, 'ToolCallParallelAccuracy', which focuses on parallel tool calls.
  2. Add a new parameter to ToolCallAccuracy, 'ordered_tool_calls=True/False', indicating whether the tool calls must appear in order.

Which one do you think is more flexible?
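
To make the idea concrete, here is a minimal sketch of the order-insensitive comparison that either option would need. tool_call_key and parallel_tool_call_accuracy are hypothetical helpers for illustration, not part of the current Ragas API; the sketch only assumes that ToolCall exposes .name and .args with hashable values:

from collections import Counter

def tool_call_key(tool_call):
    # Hashable representation of a tool call: its name plus sorted args.
    return (tool_call.name, tuple(sorted(tool_call.args.items())))

def parallel_tool_call_accuracy(pred_tool_calls, reference_tool_calls):
    # Order-insensitive match: fraction of reference calls found in the
    # prediction, counted as a multiset so duplicates are not double-counted.
    if not reference_tool_calls:
        return 0.0
    pred = Counter(tool_call_key(tc) for tc in pred_tool_calls)
    ref = Counter(tool_call_key(tc) for tc in reference_tool_calls)
    matched = sum((pred & ref).values())
    return matched / len(reference_tool_calls)

With this comparison, the parallel example above scores 1.0 whichever order the two calls appear in, while a prediction that checks the same city twice would only score 0.5.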

Besides this, I also have a small concern about the current ToolCallAccuracy.

# ref from _tool_call_accuracy:76~
sequence_aligned = int(
    self.is_sequence_aligned(tool_call_pred_sequence, tool_call_ref_sequence)
)

if pred_tool_calls:
    score = 0.0
    reference_tool_calls = sample.reference_tool_calls
    for ref_tool_call in reference_tool_calls:
        for pred_tool_call in pred_tool_calls:
            ...

Since tool_call_pred_sequence and tool_call_ref_sequence have already been checked for alignment, reference_tool_calls and pred_tool_calls should not then be compared with a double loop. The double loop produces the following inconsistency, as the two examples below show (a position-wise sketch follows them):

# Case 1: two different tools called in parallel; the reference order is reversed
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0

# Case 2: same tool called twice; argument order swapped relative to the reference
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0  (inconsistent with Case 1, which scored 0.0)
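
For comparison, a strictly position-wise check over the already-aligned sequences would at least make ordered scoring consistent: both cases above would score 0.0. A rough sketch with a hypothetical helper, not the actual Ragas implementation:

def ordered_tool_call_accuracy(pred_tool_calls, reference_tool_calls):
    # Compare call i in the prediction only with call i in the reference,
    # so matching arguments at a different position cannot raise the score.
    if not reference_tool_calls or len(pred_tool_calls) != len(reference_tool_calls):
        return 0.0
    matches = sum(
        1
        for pred, ref in zip(pred_tool_calls, reference_tool_calls)
        if pred.name == ref.name and pred.args == ref.args
    )
    return matches / len(reference_tool_calls)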
@MattZ-99
Author

Update: bug report

I'm now confident this is a bug. Below are several samples whose evaluation scores are clearly wrong.

# Case A: the prediction checks Shanghai twice and never checks New York,
# yet the score is a perfect 1.0.
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

# Case B: parallel calls to two different tools; reference order reversed
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0

# Case C: same tool called twice; arguments swapped relative to the reference
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

# Case D: across two turns, each call checks the wrong city for that turn
sample = [
    HumanMessage(content="What's the weather like in New York?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
    HumanMessage(content="So what about Shanghai?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

@shahules786
Member

Hey @MattZ-99, this is incredibly useful.

  1. Parallel tool calls: I also thought about this while building the tool call metrics. In real-world scenarios, both parallel and sequential tool calls can occur within one agent, so ideally a user should be able to evaluate both through a single interface (see the sketch after this list).
  2. The order of tool calls matters when they are not parallel; e.g., tool_Y might consume the output of tool_X, in which case it is important to ensure that tool_Y is only called after tool_X. There are many other such scenarios.
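
To illustrate point 1, here is a rough sketch of how a single interface could expose both behaviours, reusing the hypothetical ordered_tool_call_accuracy and parallel_tool_call_accuracy helpers sketched earlier in this thread; the class and flag names are illustrative, not an agreed design:

from dataclasses import dataclass

@dataclass
class ToolCallAccuracySketch:
    # When ordered=True, tool calls must match position by position
    # (tool_Y after tool_X); when ordered=False, they are compared as a
    # multiset, which is what parallel calls need.
    ordered: bool = True

    def score(self, pred_tool_calls, reference_tool_calls):
        if self.ordered:
            return ordered_tool_call_accuracy(pred_tool_calls, reference_tool_calls)
        return parallel_tool_call_accuracy(pred_tool_calls, reference_tool_calls)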
