Dria Pythonic Agent Benchmark (DPAB)

Published January 15, 2025

Introduction

The overwhelming majority, if not all (as far as we know), of large language model (LLM) function calling benchmarks work through JSON-based structured output, in which the model emits metadata such as the function name and the argument(s) to be passed to it [1]. This approach is straightforward and easy to implement, and it makes evaluation deterministic, which is great for reproducible benchmarks. However, structured output is neither the only way to do function calling nor, in our view, the best. Earlier this week, we released the first edition of Dria-Agent models, Dria-Agent-α-3B and Dria-Agent-α-7B. These models employ Pythonic Function Calling [2], which prompts the model to output a block of Python code that can be executed to produce the desired output. The motivations for this approach are explained in detail in the Dria-Agent-a blog post.
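To make the contrast concrete, here is a rough, hypothetical illustration of the two output styles. The tool names (get_weather, send_notification) and values are made up for this example and are assumed to be provided to the model in its prompt; they are not taken from our data.

```python
# Hypothetical tools assumed to be provided to the model in its prompt.
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"rain_probability": 0.7}

def send_notification(message: str) -> None:
    print(message)

# JSON-based function calling: the model emits a structured object that the
# client must parse and dispatch itself, e.g.
# {"name": "get_weather", "arguments": {"city": "Istanbul", "unit": "celsius"}}

# Pythonic function calling: the model instead emits executable code that can
# compose calls, keep intermediate results, and use control flow directly:
forecast = get_weather(city="Istanbul", unit="celsius")
if forecast["rain_probability"] > 0.5:
    send_notification("Take an umbrella today.")
```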

The DPAB-α Benchmark

As a follow-up to our Dria-Agent-α models, we have created a new benchmark, DPAB-α, which is a collection of 100 problems synthetically generated & validated with a pipeline very similar to the one used to create the training data for the Dria-Agent-α models. Each dataset row contains the following fields:

  • difficulty: The difficulty of the problem, which is either easy or hard.
  • function_schema_python: The function definitions, with no implementation, in Python.
  • function_schema_json: The function schemas in JSON format.
  • mock_functions: The mock functions, implemented with return values, in Python. These are used to generate and validate the checklist (a hypothetical example is sketched after the checklist below).
  • user_query: The user query, which is a natural language question that the model needs to answer/solve.
  • checklist: The checklist, which is a list of function names and values that need to be in the output of the code execution. An example checklist is shown below:
"checklist": {
    "functions": [
      "identify_large_files"
    ],
    "values": [
      [
        "/dev/projects/project_a/large_file_1.zip",
        "/dev/projects/project_b/large_dataset.csv"
      ]
    ]
}
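For illustration, a mock function consistent with the checklist above might look roughly like the following. The signature, parameters, and docstring here are our own assumptions for this sketch, not the actual dataset contents; the point is that a mock returns fixed values so the execution output is deterministic.

```python
def identify_large_files(directory: str, size_threshold_mb: int = 100) -> list[str]:
    """Mock implementation: returns a fixed list of paths so that the
    execution output is deterministic and can be checked against the checklist."""
    return [
        "/dev/projects/project_a/large_file_1.zip",
        "/dev/projects/project_b/large_dataset.csv",
    ]
```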

This checklist enforces that the model must use the identify_large_files function and that the values ["/dev/projects/project_a/large_file_1.zip", "/dev/projects/project_b/large_dataset.csv"] appear in the execution output.

How do we produce the execution output? We use the execution engine defined in exec-python, a Python package that allows us to execute arbitrary Python code with any number of predefined functions and return the output. The package was developed hand-in-hand with the DPAB-α benchmark. (A simplified sketch of the checklist-scoring idea is shown after the list below.)

Another question you might have is: how do we generate and validate the checklist? We used the methodology described in the Data Validations section of the Dria-Agent-a blog post, a 3-step pipeline for producing a valid checklist:

  • Decision: The validator model decides whether the checklist is valid or not.
  • Justification: The validator model provides a justification for its decision, given the checklist, mock functions, and user query.
  • Revision: The validator model revises the checklist if it is not valid, given the justification.
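Below is a minimal sketch of how checklist-based scoring can work at evaluation time. This is not the exec-python API; it only illustrates the idea under our assumptions: execute the model's code in a namespace containing the mock functions, record which functions are called and what they return, then check the checklist against those records.

```python
import functools

def evaluate(model_code: str, mock_namespace: dict, checklist: dict) -> bool:
    """Illustrative checklist scoring (not the exec-python API)."""
    called, results = set(), []

    def track(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            called.add(name)          # record that the function was used
            out = fn(*args, **kwargs)
            results.append(out)       # record its return value
            return out
        return wrapper

    # Wrap every mock function so calls and return values are recorded,
    # then execute the model-generated code against those wrappers.
    namespace = {name: track(name, fn) for name, fn in mock_namespace.items()}
    exec(model_code, namespace)

    functions_ok = all(name in called for name in checklist["functions"])
    values_ok = all(value in results for value in checklist["values"])
    return functions_ok and values_ok
```

With the mock identify_large_files above and the example checklist, a model response such as `result = identify_large_files("/dev/projects")` would pass under this sketch, since the required function is called and its return value matches the required values.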

Initial Results

Pythonic function calling often outstrips JSON-based function calling in scenarios that require creative or multi-step solutions, reinforcing the premise that it can be a more natural and powerful interface.

We have run the first edition of the DPAB-α benchmark on a range of open- and closed-source models in strict mode, and the results are shown below:

| Model Name | Pythonic | JSON |
|---|---|---|
| **Closed Models** | | |
| Claude 3.5 Sonnet | 87 | 45 |
| o1-preview-2024-09-12 | 55 | 39 |
| o1-mini-2024-09-12 | 59 | 35 |
| gpt-4o-2024-11-20 | 60 | 30 |
| **Open Models** | | |
| *> 100B Parameters* | | |
| DeepSeek V3 (685B) | 63 | 33 |
| MiniMax-01 | 62 | 40 |
| Llama-3.1-405B-Instruct | 60 | 38 |
| *> 30B Parameters* | | |
| Qwen-2.5-Coder-32b-Instruct | 68 | 32 |
| Qwen-2.5-72b-instruct | 65 | 39 |
| Llama-3.3-70b-Instruct | 59 | 40 |
| QwQ-32b-Preview | 47 | 21 |
| *< 20B Parameters* | | |
| Dria-Agent-a-7B | 70 | 38 |
| Qwen2.5-Coder-7B-Instruct | 44 | 39 |
| Dria-Agent-a-3B | 72 | 31 |
| Qwen2.5-Coder-3B-Instruct | 26 | 37 |
| Qwen-2.5-7B-Instruct | 47 | 34 |
| Phi-4 (14B) | 55 | 35 |

Clone the DPAB repo to run evaluations.

Future Work

Alongside the Dria-Agent series of models, we will also improve upon the first edition of DPAB, and release DPAB-β with a new agentic setup and harder problems.

References
