OpenEnv · AgentBeats Phase 2

ToolOrchestratorEnv

ToolOrchestratorEnv teaches LLM agents when a tool call is worth its cost, and when the smarter move is to stop spending.

Andrew Lara — Franklin and Marshall College
Yashaswi Sharma, Defu Cao, Muyan Weng — University of Southern California

The Problem with Free Tools

Modern LLM agents have access to tools — search engines, calculators, code interpreters, databases. In every real deployment, these tools cost something: API fees, latency, rate limits, compute time. But almost every existing RL benchmark treats tools as free and unlimited.

This creates a gap between research and reality. An agent trained on "use whatever tools you want" behaves terribly in production where every call costs money. ToolOrchestratorEnv closes that gap.

The agent is given a fixed budget of 50 cost units to spend across 10 questions. Every tool call deducts from that budget. The agent must decide: which tool is worth calling for this question? How many times should I call tools before committing an answer? Is it worth spending 2.0 on an LLM call, or can a 0.1 calculator solve this?

In one sentence: ToolOrchestratorEnv is a deployed, test-covered OpenEnv environment for studying whether reinforcement learning can teach LLM agents to route across heterogeneous tools under a shared budget, rather than treating every tool call as free.

Where this came from

ToolOrchestratorEnv generalizes SearchEconomicsEnv (Yashaswi Sharma, University of Southern California / Ceramic AI), which posed a simpler version: given a fixed number of search calls, can an RL agent learn to answer HotpotQA questions efficiently? That work showed agents could learn non-trivial search strategies. We ask the harder question: can the same principle scale to multiple tools and multiple domains?

flowchart LR A["Weitzman search economics
information has a cost"] --> B["SearchEconomicsEnv
1 search tool + HotpotQA"] B --> C["ToolOrchestratorEnv
6 tools + 4 domains + shared budget"] C --> D["AgentBeats Phase 2
OpenEnv submission"]
Research lineage: from economic search theory to a deployed multi-tool RL environment.
| | SearchEconomicsEnv | ToolOrchestratorEnv |
|---|---|---|
| Tools | 1 (search) | 6 (search, wiki, calc, code, LLM, commit) |
| Datasets | HotpotQA only | HotpotQA + MATH + GPQA + HumanEval |
| Budget unit | # of search calls | Cost units per tool |
| Core challenge | How many searches? | Which tool, when, under budget pressure? |

Environment Design

An OpenEnv-native sequential MDP where an LLM agent selects from 6 tools with heterogeneous costs under a shared episode budget, across 10 questions from 4 domains.

Episode structure

flowchart TD A["Reset episode
budget = 50 units"] --> B["Sample 10 questions
HotpotQA / MATH / GPQA / HumanEval"] B --> C["Show observation
question + domain + budget + context"] C --> D{"Agent action"} D -->|"tool call"| E["Run selected tool
charge cost + append result"] E --> F{"Budget exhausted
or max 8 steps?"} F -->|"no"| C F -->|"yes"| I["Advance or end episode"] D -->|"commit"| G["Grade answer
Exact Match + token F1"] G --> H["Compute commit reward
quality + efficiency bonus"] H --> I I --> J{"Questions remain
and budget remains?"} J -->|"yes"| C J -->|"no"| K["Episode done"]
One episode is a sequential budget-management problem, not ten isolated QA calls.
START EPISODE
  Budget = 50.0 units
  Draw 10 questions (mix: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval)

  FOR each question:
    Show agent: question text, domain, remaining budget, context window

    LOOP (max 8 steps per question):
      Agent picks a tool + sends a query
      Environment runs the tool, charges the cost, returns results
      Results added to agent's context window

      IF agent calls "commit" → grade answer, compute reward, next question
      IF budget exhausted → episode ends immediately
END EPISODE
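To make the loop concrete, here is a minimal agent-side sketch. The `client` object and its `reset()`/`step()` methods are hypothetical stand-ins for the OpenEnv client interface, and `policy` is any callable that maps observations to structured actions; the actual client shipped with the environment may differ.

```python
# Minimal sketch of the episode loop from the agent's side (hypothetical client API).
def run_episode(client, policy):
    obs = client.reset()                      # fresh episode: budget = 50.0 units
    episode_return, done = 0.0, False
    while not done:
        action = policy(obs)                  # pick a tool + query, or commit an answer
        obs, reward, done = client.step(action)
        episode_return += reward              # tool costs arrive as negative step rewards
    return episode_return
```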

The six tools

| Tool | Cost | What it does | Best for |
|---|---|---|---|
| calculator | 0.1 | Safe AST-based math expression evaluator | MATH arithmetic |
| code_executor | 0.3 | Sandboxed Python exec() with import blocking | HumanEval, complex algebra |
| wiki_lookup | 0.5 | Wikipedia REST API, first paragraph | Entity lookups |
| ceramic_search | 1.0 | Ceramic AI web search API, top-5 results | HotpotQA multi-hop |
| llm_reason | 2.0 | Together AI LLM call (Llama-3-8B), 512 tokens | GPQA graduate-level |
| commit | 0.0 | Submit answer for grading | Always free |
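As a sketch of how this cost table can drive dispatch, the snippet below pairs each tool name with its cost and a handler. Only the names and costs come from the table; `safe_calculator` is a simplified illustration of an AST-based evaluator, and `run_tool` and its placeholder results are hypothetical, not the environment's actual implementation.

```python
# Sketch of a tool registry keyed by name, with the costs from the table above.
import ast, operator

def safe_calculator(expr: str) -> str:
    """Evaluate a plain arithmetic expression via the AST, rejecting anything else."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return ops[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expr, mode="eval").body))

TOOL_COSTS = {"calculator": 0.1, "code_executor": 0.3, "wiki_lookup": 0.5,
              "ceramic_search": 1.0, "llm_reason": 2.0, "commit": 0.0}

def run_tool(name: str, query: str, budget: float):
    cost = TOOL_COSTS[name]
    if cost > budget:
        return None, budget                   # budget exhausted: episode ends
    result = safe_calculator(query) if name == "calculator" else f"<{name} result>"
    return result, budget - cost              # charge the cost, return the remainder
```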

Costs span a 20:1 ratio from calculator to llm_reason. A single LLM reasoning call burns 4% of the entire episode budget. The agent must learn that this is sometimes worth it (GPQA) and sometimes wasteful (simple arithmetic).

flowchart LR Q["Question + domain"] --> R{"Routing decision"} R -->|"arithmetic / symbolic"| C["calculator
0.1"] R -->|"code execution"| X["code_executor
0.3"] R -->|"entity fact"| W["wiki_lookup
0.5"] R -->|"multi-hop factual"| S["ceramic_search
1.0"] R -->|"hard science reasoning"| L["llm_reason
2.0"] C --> A["Context window"] X --> A W --> A S --> A L --> A A --> M{"Confident?"} M -->|"yes"| K["commit
0.0"] M -->|"no"| R
The core policy problem is routing: choose the cheapest tool that is likely to change the answer.

Observation space

At every step, the agent sees: the question text and domain tag, remaining budget and fraction thereof, tool call history and results for the current question, number of questions remaining, and running accuracy. The agent emits a structured action specifying tool selection, query/expression/code, and (for commit) an answer.
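For illustration, here is the observation and action structure above sketched as dataclasses. Field names are illustrative stand-ins; the environment's actual schema may differ.

```python
# Sketch of the observation/action payloads described in the text (hypothetical field names).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Observation:
    question: str                  # current question text
    domain: str                    # "hotpotqa" | "math" | "gpqa" | "humaneval"
    budget_remaining: float        # absolute units left out of 50.0
    budget_fraction: float         # budget_remaining / 50.0
    tool_history: list = field(default_factory=list)   # calls + results for this question
    questions_remaining: int = 0
    running_accuracy: float = 0.0

@dataclass
class Action:
    tool: str                      # one of the six tools, including "commit"
    query: Optional[str] = None    # search query, expression, or code to run
    answer: Optional[str] = None   # final answer, required when tool == "commit"
```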

Four domains

| Domain | Mix | Why it matters for tool selection |
|---|---|---|
| HotpotQA | 40% | Multi-hop factual QA — needs ceramic_search or wiki_lookup (multiple calls) |
| MATH | 30% | Competition math — calculator for arithmetic, code_executor for algebra, llm_reason for proofs |
| GPQA | 20% | Graduate-level science — often requires llm_reason, which costs 2.0 |
| HumanEval | 10% | Code generation — needs code_executor to verify, maybe llm_reason to plan |
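A minimal sketch of drawing the 10-question mix with these weights; the dataset loaders themselves are out of scope here, and only the mixture weights come from the table.

```python
# Sketch of the seeded domain sampler implied by the mix above.
import random

DOMAIN_MIX = {"hotpotqa": 0.4, "math": 0.3, "gpqa": 0.2, "humaneval": 0.1}

def sample_domains(n: int = 10, seed: int = 0) -> list[str]:
    rng = random.Random(seed)                       # seeded for reproducible episodes
    domains, weights = zip(*DOMAIN_MIX.items())
    return rng.choices(domains, weights=weights, k=n)
```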

The Reward Formula — Deep Dive

This is the core intellectual contribution. The reward has two components that create constant pressure to be both correct and frugal.

Part 1: Step reward (every tool call)

$$R_{\text{step}} = -\text{tool\_cost}$$

Every tool call produces a negative reward equal to its cost. This creates a running penalty — the agent pays for every action it takes.

Part 2: Commit reward (on answer submission)

flowchart TD A["Prediction + gold answer"] --> B["Normalize text
lowercase, punctuation, articles"] B --> C["Compute Exact Match"] B --> D["Compute token F1"] C --> E["quality = 1.0 if EM
otherwise token F1"] D --> E E --> F["base = -0.5 + quality * 1.5"] E --> G{"quality >= 0.5?"} G -->|"yes"| H["bonus = 0.1 * remaining_budget_ratio"] G -->|"no"| I["bonus = 0"] F --> J["commit reward = base + bonus"] H --> J I --> J K["tool costs already charged
as step rewards"] --> L["episode return"] J --> L
The reward separates spending pressure from answer quality, then gates efficiency bonus on a useful answer.
$$R_{\text{commit}} = \underbrace{r_{\text{wrong}} + \text{quality} \times (r_{\text{right}} - r_{\text{wrong}})}_{\text{base}} + \underbrace{\eta \cdot \gamma \cdot \frac{B_{\text{remaining}}}{B_{\text{total}}}}_{\text{efficiency bonus}}$$

where:

- quality ∈ [0, 1] is the graded answer score (1.0 on Exact Match, otherwise token F1),
- $r_{\text{wrong}} = -0.5$ and $r_{\text{right}} = +1.0$, so the base term is $-0.5 + 1.5 \times \text{quality}$,
- the efficiency coefficient $\eta \cdot \gamma$ is 0.1 in the shipped implementation, and the bonus is applied only when quality ≥ 0.5,
- $B_{\text{remaining}} / B_{\text{total}}$ is the fraction of the 50-unit episode budget still unspent at commit time.

Worked examples

| Scenario | Tools used | $R_{\text{step}}$ | Quality | $R_{\text{commit}}$ | Total |
|---|---|---|---|---|---|
| A: Right, cheap | 1× calculator (0.1) | −0.1 | 1.0 | +1.10 | +1.00 |
| B: Right, expensive | 3× ceramic_search (3.0) | −3.0 | 1.0 | +1.09 | −1.91 |
| C: Wrong | 1× wiki_lookup (0.5) | −0.5 | 0.0 | −0.50 | −1.00 |
| D: Partial (F1=0.6) | 1× llm_reason (2.0) | −2.0 | 0.6 | +0.49 | −1.51 |

Scenario A is the dream: right answer, cheap tool, big total reward. Scenario B shows that even a correct answer with excessive tool use produces a negative total. The formula makes cost-awareness unavoidable.
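The commit reward can be reproduced with a few lines. This is a sketch using the constants implied by the flowchart and worked examples ($r_{\text{wrong}} = -0.5$, $r_{\text{right}} = +1.0$, bonus coefficient 0.1, quality gate 0.5); the function name is illustrative, not the environment's actual API.

```python
# Sketch of the commit reward, reproducing the worked examples above.
def commit_reward(quality: float, budget_remaining: float, budget_total: float = 50.0,
                  r_wrong: float = -0.5, r_right: float = 1.0,
                  bonus_coef: float = 0.1, quality_gate: float = 0.5) -> float:
    base = r_wrong + quality * (r_right - r_wrong)
    bonus = bonus_coef * (budget_remaining / budget_total) if quality >= quality_gate else 0.0
    return base + bonus

# Scenario A: one calculator call (cost 0.1), exact-match answer.
step_reward = -0.1
total = step_reward + commit_reward(quality=1.0, budget_remaining=49.9)   # ≈ +1.00
```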

Why this formula shape

Decoupling the base term from the efficiency bonus keeps the two pressures legible: the base term depends only on answer quality, while the bonus depends only on unspent budget. Gating the bonus on quality ≥ 0.5 closes the obvious exploit of committing immediately to hoard budget, and capping the bonus at 0.1 keeps frugality from ever outweighing correctness.

Answer grading

Grading produces a quality score in $[0,1]$. The pipeline: (1) extract the answer from the agent's response (JSON parsing → prefix matching → last-line fallback), (2) normalize both prediction and ground truth (lowercase, remove articles and punctuation, tokenize), (3) compute Exact Match (binary) and Token F1 (precision × recall harmonic mean), (4) quality = 1.0 if EM, else F1.
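Steps (2)–(4) of that pipeline are small enough to sketch directly; the snippet below skips the answer-extraction step and uses simplified normalization, so treat it as an illustration of the scoring logic rather than the environment's exact grader.

```python
# Sketch of normalization, Exact Match, and token F1 for the quality score.
import string
from collections import Counter

ARTICLES = {"a", "an", "the"}

def normalize(text: str) -> list[str]:
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [tok for tok in text.split() if tok not in ARTICLES]

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def quality(pred: str, gold: str) -> float:
    if normalize(pred) == normalize(gold):
        return 1.0                            # Exact Match after normalization
    return token_f1(pred, gold)               # otherwise fall back to token F1
```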

Why GRPO

We train with Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath (Shao et al., 2024) and used to train DeepSeek-R1.

No critic model needed. PPO requires a separate value network, doubling memory. GRPO eliminates it by estimating advantages from the relative quality of sampled completions within each batch.

Natural fit for verifiable rewards. Our reward is deterministic arithmetic — no learned reward model needed.

Accessible training. Single-node, significantly less VRAM than PPO.

The GRPO objective

For each prompt $q$, GRPO samples $G$ completions $\{o_1, \ldots, o_G\}$. Each gets reward $r_i$. Advantage is a z-score:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$$

The policy maximizes:

$$J_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \left\{\min\left[\frac{\pi_\theta(a_{i,t} \mid s, a_{i,<t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid s, a_{i,<t})} \hat{A}_{i,t},\; \text{clip}(\cdot, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_{i,t}\right] - \beta D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]\right\}$$

In our setup, each completion is an entire multi-question episode trajectory: the agent's sequence of tool selections, queries, and commits across 10 questions under a shared budget. A trajectory that routes correctly and cheaply gets high reward; one that wastes budget or answers wrong gets low reward.
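The advantage computation itself is just within-group standardization of those episode returns. A minimal sketch, reusing the four worked scenarios purely as an illustrative group:

```python
# Sketch of GRPO's group-relative advantage: z-score episode returns within one group.
import numpy as np

def group_advantages(returns, eps: float = 1e-8):
    r = np.asarray(returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Illustration only: treat the four worked scenarios as one sampled group.
print(group_advantages([1.00, -1.91, -1.00, -1.51]))   # highest return -> largest advantage
```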

```mermaid
sequenceDiagram
    participant GRPO as GRPO trainer
    participant Policy as LLM policy
    participant Client as OpenEnv client
    participant Env as ToolOrchestratorEnv
    participant Tools as Tool registry
    GRPO->>Policy: sample G trajectories
    Policy->>Client: structured tool actions
    Client->>Env: WebSocket /step
    Env->>Tools: dispatch selected tool
    Tools-->>Env: tool result + cost
    Env-->>Client: observation + reward + done
    Client-->>GRPO: completed trajectory returns
    GRPO->>GRPO: z-score rewards into advantages
    GRPO->>Policy: policy update
```
OpenEnv keeps training decoupled from the environment server: the policy learns from complete budgeted trajectories.

Why RL is the right framework

Whether a given tool call is worth its cost depends on the question, the remaining budget, and what the agent has already retrieved; there is no supervised label for that tradeoff. The cost-adjusted reward, however, is cheap to compute and fully verifiable, so the tradeoff can be learned directly from episode returns, which is exactly the setting GRPO targets.

What a Trained Agent Should Learn

A well-trained agent should exhibit these behaviors, none of which are explicitly programmed:

- Route by marginal value: use the 0.1 calculator for arithmetic instead of a 2.0 llm_reason call, and pay for llm_reason only when it is likely to change the answer (e.g., GPQA).
- Stop gathering information once the answer is unlikely to improve, and commit.
- Pace spending across all 10 questions instead of burning the budget on the first few.
- Avoid the free-commit shortcut: never answer immediately just to conserve budget when a cheap tool call would push quality past the gate.

These are the behaviors that baselines cannot exhibit — they require learning from feedback across thousands of episodes.

What We Actually Built

The completed deliverables are the environment itself: an OpenEnv-native server deployed as a HF Space (and runnable in-process or via Docker), the explicit cost model across six tools, the decomposed reward implementation, three scripted baselines, seeded and test-covered episodes, and the submission artifact.

Baselines and Honest Results Status

Three shipped baselines

| Baseline | Policy | What it isolates |
|---|---|---|
| Random tool | Picks a tool uniformly at random; commits after 3 steps with "I don't know" | Absolute floor — any RL agent that can't beat this is broken |
| Cheapest first | Calls tools in ascending cost order: calc → code → wiki → search → LLM | Great budget efficiency, terrible accuracy on factual/science questions |
| Domain oracle | Hardcoded domain→tool mapping (HotpotQA→search, MATH→calc, GPQA→LLM, HumanEval→code) | Performance ceiling for non-learning approaches |
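To illustrate how simple these policies are, here is a sketch of the cheapest-first baseline. The observation and action field names are hypothetical; the shipped `baselines/cheapest_first.py` is the authoritative version.

```python
# Sketch of the cheapest-first baseline: walk the tools in ascending cost, then commit.
CHEAPEST_ORDER = ["calculator", "code_executor", "wiki_lookup", "ceramic_search", "llm_reason"]

def cheapest_first_action(obs: dict) -> dict:
    calls_made = len(obs["tool_history"])             # tool calls already made on this question
    if calls_made < len(CHEAPEST_ORDER):
        return {"tool": CHEAPEST_ORDER[calls_made], "query": obs["question"]}
    last_result = obs["tool_history"][-1]["result"]   # fall back to the most recent tool output
    return {"tool": "commit", "answer": str(last_result)}
```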

Expected performance targets

| Metric | Random | Cheapest-first | Domain oracle | Target: RL-trained |
|---|---|---|---|---|
| Accuracy (avg) | ~20-30% | ~40-50% | ~65-75% | ≥ oracle accuracy |
| Avg budget spent | ~35/50 | ~8/50 | ~25/50 | < oracle, ≥ cheapest |
| Cost-adjusted reward | negative | low-positive | medium | highest |

The core claim fails if the trained policy cannot beat the domain oracle on cost-adjusted reward.

Honest status of the trained policy

⚠ HONEST STATUS: Environment complete, no converged checkpoint claimed

We do not claim a converged, baseline-beating trained checkpoint. The research contribution in this submission is the environment, reward design, baselines, deployment path, and the training-ready interface; a converged policy is the next milestone, not a current claim.

GRPO Training: What This Environment Is Built to Test

The next research step is to train with TRL's GRPO via an OpenEnv-compatible rollout loop. We are careful about the claim: this submission ships the environment and baselines, not a final trained policy. The limiting factor was not the reward design or environment server; it was Env Factory integration. Our environment requires a model to make structured, repeated multi-tool calls across an episode, and we were not able to make that interaction reliable enough inside the current Env Factory path to produce trustworthy training logs before submission. We plan to continue the experiments as Env Factory stabilizes and as more post-training model series become available.

The failure modes below are the concrete behaviors the environment is designed to make measurable once that post-training loop is stable.

flowchart TD A["Training risk"] --> B["Commit immediately
spend nothing, answer poorly"] A --> C["Overuse expensive tools
solve early, lose budget"] A --> D["Collapse to one domain
same tool everywhere"] A --> E["Variable trajectory lengths
harder batching and credit assignment"] B --> F["Measured by quality gate
and wrong-answer penalty"] C --> G["Measured by shared budget
and cumulative step costs"] D --> H["Measured against
domain-oracle baseline"] E --> I["Exposed by multi-step
OpenEnv episodes"]
The benchmark is useful because the common training failures are observable in reward, budget, and baseline comparisons.

1. Reward scale mismatch

Step rewards (−0.1 to −2.0) and commit rewards (−0.5 to +1.1) operate on different scales. A useful trained policy must learn that a costly call can still be rational when it raises answer quality enough to recover the cost.

2. Budget-exhaustion cliff

When the agent exhausts its shared budget, the episode ends. This makes early overspending visible: a policy that solves the first few questions with expensive tools can lose the rest of the episode.

3. Variable-length trajectory handling

Episodes can end after different numbers of tool calls because agents commit, run out of steps, or spend the budget. That makes batching and credit assignment harder than single-turn QA, and it is exactly why a realistic tool-use environment matters.

4. Domain-collapse risk

The domain mix is intentionally uneven: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval. A weak policy can overfit to the most common domain and call the same tool everywhere. The domain-oracle baseline makes that failure easy to detect.

5. The commit-immediately attractor

Because commit is free, an untrained agent can minimize spending by answering immediately. The quality gate blocks the efficiency bonus for poor answers, so a successful policy must learn the marginal value of information rather than simply learning to spend nothing.

Why OpenEnv

OpenEnv provides: (1) a standard WebSocket contract consumable by training clients, (2) per-session state with concurrent session support, and (3) a uniform deployment path — same code runs in-process for tests, as Docker for dev, and as a HF Space for training. The remaining integration work is specifically at the Env Factory/model-control layer: making repeated structured multi-tool calls stable enough for post-training rollouts.
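As a rough sketch of what the step contract looks like from a client, the snippet below sends one action over a WebSocket /step endpoint as in the sequence diagram above. The message schema and URL construction are assumptions for illustration only; consult the repo for the actual client.

```python
# Sketch of one /step exchange over WebSocket (hypothetical message schema).
import asyncio, json, os
import websockets

async def one_step(action: dict) -> dict:
    base = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
    ws_url = base.replace("http", "ws", 1) + "/step"   # http->ws, https->wss
    async with websockets.connect(ws_url) as ws:
        await ws.send(json.dumps(action))
        return json.loads(await ws.recv())              # observation + reward + done

# asyncio.run(one_step({"tool": "calculator", "query": "17 * 23"}))
```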

Prior Work and Foundations

The design sits at the intersection of several lines of work: Weitzman's economics of costly search (Weitzman, 1979), tool-using LLM agents (ReAct; Toolformer), GRPO-style post-training on verifiable rewards (DeepSeekMath; DeepSeek-R1), cost-aware tool planning and end-to-end agent RL (CATP-LLM; Agent-R1), and the four underlying benchmarks (HotpotQA, MATH, GPQA, HumanEval). Full citations are in the references below.

Quick Start

# 1. Run the env locally
pip install -r requirements.txt
python app.py    # FastAPI on port 8000

# 2. Or use the HF Space
export ENV_BASE_URL="https://landrew9-toolorchestratorenv.hf.space"

# 3. Run baselines
python baselines/random_tool.py
python baselines/cheapest_first.py
python baselines/oracle.py

# 4. Train with GRPO (requires TRL + vLLM)
# See training client docs in RESEARCH.md

All episodes are seeded and reproducible. The Ceramic AI fallback client provides deterministic offline results when no API key is set, so the full environment runs without external dependencies.

What We Did Not Do (Yet)

- Train a converged GRPO policy or publish training curves; the Env Factory integration for repeated, structured multi-tool calls is not yet stable enough to produce trustworthy training logs.
- Report an RL-versus-baseline comparison; the performance targets above remain hypotheses to test, not results.

💰 Can an LLM learn that a calculator is worth 0.1 and an LLM call is worth 2.0 — and route accordingly?
Cost-aware tool selection under budget pressure is the test.

Conclusion

ToolOrchestratorEnv reframes a practical engineering problem — "LLM agents waste money on tools" — as a verifiable RL task. Six tools with a 20:1 cost ratio, four domains requiring fundamentally different tool strategies, a shared episode budget, and a decomposed reward that penalizes every tool call while rewarding correct-and-frugal commits. The completed contribution is the environment: a deployed OpenEnv-native benchmark, explicit cost model, reward implementation, baselines, tests, and submission artifact. The research question — can a GRPO-trained LLM beat the domain oracle baseline on cost-adjusted score? — is now ready to evaluate cleanly. Convergence is the next milestone, not a current claim.

References

  1. Weitzman, M. "Optimal Search for the Best Alternative." Econometrica, 1979.
  2. Yao, S., et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
  3. Schick, T., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023.
  4. Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024.
  5. Guo, D., et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
  6. Wu, Y., et al. "CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning." ICCV 2025.
  7. Cheng, M., et al. "Agent-R1: Training Powerful LLM Agents with End-to-End RL." arXiv:2511.14460, 2025.
  8. Yang, Z., et al. "HotpotQA: A Dataset for Diverse, Explainable Multi-hop QA." EMNLP 2018.
  9. Hendrycks, D., et al. "Measuring Mathematical Problem Solving with the MATH Dataset." NeurIPS 2021.
  10. Rein, D., et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022, 2023.
  11. Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.
  12. Schulman, J., et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.