Dria Pythonic Agent Benchmark (DPAB)

Published January 15, 2025

Introduction

The overwhelming majority, if not all (as far as we know), of large language model (LLM) function calling benchmarks work through JSON-based structured output, in which the model emits metadata such as the function name and the argument(s) to be passed to it [1]. This approach is straightforward and easy to implement, and it makes evaluation deterministic, which is great for reproducible benchmarks. However, structured output is neither the only way to do function calling nor, in our view, the best. Earlier this week, we released the first edition of Dria-Agent models, Dria-Agent-α-3B and Dria-Agent-α-7B. These models employ Pythonic Function Calling [2], which prompts the model to output a block of Python code that can be executed to produce the desired output. The motivations for this approach are explained in detail in the Dria-Agent-a blog post.
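To make the contrast concrete, here is a rough, hypothetical illustration of the two output styles. The tool names (get_weather, send_notification) and values are made up for this example and are assumed to be provided to the model in its prompt; they are not taken from our data.

```python
# Hypothetical tools assumed to be provided to the model in its prompt.
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"rain_probability": 0.7}

def send_notification(message: str) -> None:
    print(message)

# JSON-based function calling: the model emits a structured object that the
# client must parse and dispatch itself, e.g.
# {"name": "get_weather", "arguments": {"city": "Istanbul", "unit": "celsius"}}

# Pythonic function calling: the model instead emits executable code that can
# compose calls, keep intermediate results, and use control flow directly:
forecast = get_weather(city="Istanbul", unit="celsius")
if forecast["rain_probability"] > 0.5:
    send_notification("Take an umbrella today.")
```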

The DPAB-α Benchmark

As a follow-up to our Dria-Agent-α models, we have created a new benchmark, DPAB-α, which is a collection of 100 problems synthetically generated & validated with a pipeline very similar to the one used to create the training data for the Dria-Agent-α models. Each dataset row contains the following fields:

  • difficulty: The difficulty of the problem, which is either easy or hard.
  • function_schema_python: The function definitions, with no implementation, in Python.
  • function_schema_json: The function schemas in JSON format.
  • mock_functions: The mock functions, implemented with return values, in Python. These are used to generate and validate the checklist (a hypothetical example is sketched after the checklist below).
  • user_query: The user query, which is a natural language question that the model needs to answer/solve.
  • checklist: The checklist, which is a list of function names and values that need to be in the output of the code execution. An example checklist is shown below:
"checklist": {
    "functions": [
      "identify_large_files"
    ],
    "values": [
      [
        "/dev/projects/project_a/large_file_1.zip",
        "/dev/projects/project_b/large_dataset.csv"
      ]
    ]
}
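For illustration, a mock function consistent with the checklist above might look roughly like the following. The signature, parameters, and docstring here are our own assumptions for this sketch, not the actual dataset contents; the point is that a mock returns fixed values so the execution output is deterministic.

```python
def identify_large_files(directory: str, size_threshold_mb: int = 100) -> list[str]:
    """Mock implementation: returns a fixed list of paths so that the
    execution output is deterministic and can be checked against the checklist."""
    return [
        "/dev/projects/project_a/large_file_1.zip",
        "/dev/projects/project_b/large_dataset.csv",
    ]
```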

This checklist enforces that the model must use the identify_large_files function and that the values ["/dev/projects/project_a/large_file_1.zip", "/dev/projects/project_b/large_dataset.csv"] appear in the execution output.

How do we produce the execution output? We use the execution engine defined in exec-python, a Python package that allows us to execute arbitrary Python code with any number of predefined functions and return the output. The package was developed hand-in-hand with the DPAB-α benchmark. (A simplified sketch of the checklist-scoring idea is shown after the list below.)

Another question you might have is: how do we generate and validate the checklist? We used the methodology described in the Data Validations section of the Dria-Agent-a blog post, a 3-step pipeline for producing a valid checklist:

  • Decision: The validator model decides whether the checklist is valid or not.
  • Justification: The validator model provides a justification for its decision, given the checklist, mock functions, and user query.
  • Revision: The validator model revises the checklist if it is not valid, given the justification.
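Below is a minimal sketch of how checklist-based scoring can work at evaluation time. This is not the exec-python API; it only illustrates the idea under our assumptions: execute the model's code in a namespace containing the mock functions, record which functions are called and what they return, then check the checklist against those records.

```python
import functools

def evaluate(model_code: str, mock_namespace: dict, checklist: dict) -> bool:
    """Illustrative checklist scoring (not the exec-python API)."""
    called, results = set(), []

    def track(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            called.add(name)          # record that the function was used
            out = fn(*args, **kwargs)
            results.append(out)       # record its return value
            return out
        return wrapper

    # Wrap every mock function so calls and return values are recorded,
    # then execute the model-generated code against those wrappers.
    namespace = {name: track(name, fn) for name, fn in mock_namespace.items()}
    exec(model_code, namespace)

    functions_ok = all(name in called for name in checklist["functions"])
    values_ok = all(value in results for value in checklist["values"])
    return functions_ok and values_ok
```

With the mock identify_large_files above and the example checklist, a model response such as `result = identify_large_files("/dev/projects")` would pass under this sketch, since the required function is called and its return value matches the required values.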

Initial Results

Pythonic function calling often outstrips JSON-based function calling in scenarios that require creative or multi-step solutions, reinforcing the premise that it can be a more natural and powerful interface.

We have run the first edition of the DPAB-α benchmark on a range of open- and closed-source models in strict mode, and the results are shown below:

| Model Name | Pythonic | JSON |
|---|---|---|
| **Closed Models** | | |
| Claude 3.5 Sonnet | 87 | 45 |
| o1-preview-2024-09-12 | 55 | 39 |
| o1-mini-2024-09-12 | 59 | 35 |
| gpt-4o-2024-11-20 | 60 | 30 |
| **Open Models** | | |
| *> 100B Parameters* | | |
| DeepSeek V3 (685B) | 63 | 33 |
| MiniMax-01 | 62 | 40 |
| Llama-3.1-405B-Instruct | 60 | 38 |
| *> 30B Parameters* | | |
| Qwen-2.5-Coder-32b-Instruct | 68 | 32 |
| Qwen-2.5-72b-instruct | 65 | 39 |
| Llama-3.3-70b-Instruct | 59 | 40 |
| QwQ-32b-Preview | 47 | 21 |
| *< 20B Parameters* | | |
| Dria-Agent-a-7B | 70 | 38 |
| Qwen2.5-Coder-7B-Instruct | 44 | 39 |
| Dria-Agent-a-3B | 72 | 31 |
| Qwen2.5-Coder-3B-Instruct | 26 | 37 |
| Qwen-2.5-7B-Instruct | 47 | 34 |
| Phi-4 (14B) | 55 | 35 |

Clone the DPAB repo to run evaluations.

Future Work

Alongside the Dria-Agent series of models, we will also improve upon the first edition of DPAB, and release DPAB-β with a new agentic setup and harder problems.

References
