Modern LLM agents have access to tools — search engines, calculators, code interpreters, databases. In every real deployment, these tools cost something: API fees, latency, rate limits, compute time. But almost every existing RL benchmark treats tools as free and unlimited.
This creates a gap between research and reality. An agent trained on "use whatever tools you want" behaves terribly in production where every call costs money. ToolOrchestratorEnv closes that gap.
The agent is given a fixed budget of 50 cost units to spend across 10 questions. Every tool call deducts from that budget. The agent must decide: which tool is worth calling for this question? How many times should I call tools before committing an answer? Is it worth spending 2.0 on an LLM call, or can a 0.1 calculator solve this?
In one sentence: ToolOrchestratorEnv is a deployed, test-covered OpenEnv environment for studying whether reinforcement learning can teach LLM agents to route across heterogeneous tools under a shared budget, rather than treating every tool call as free.
Where this came from
ToolOrchestratorEnv generalizes SearchEconomicsEnv (Yashaswi Sharma, University of Southern California / Ceramic AI), which posed a simpler version: given a fixed number of search calls, can an RL agent learn to answer HotpotQA questions efficiently? That work showed agents could learn non-trivial search strategies. We ask the harder question: can the same principle scale to multiple tools and multiple domains?
```mermaid
flowchart LR
    A["Weitzman search economics: information has a cost"] --> B["SearchEconomicsEnv: 1 search tool + HotpotQA"]
    B --> C["ToolOrchestratorEnv: 6 tools + 4 domains + shared budget"]
    C --> D["AgentBeats Phase 2 OpenEnv submission"]
```
Research lineage: from economic search theory to a deployed multi-tool RL environment.
| | SearchEconomicsEnv | ToolOrchestratorEnv |
| --- | --- | --- |
| Tools | 1 (search) | 6 (search, wiki, calc, code, LLM, commit) |
| Datasets | HotpotQA only | HotpotQA + MATH + GPQA + HumanEval |
| Budget unit | # of search calls | Cost units per tool |
| Core challenge | How many searches? | Which tool, when, under budget pressure? |
Environment Design
ToolOrchestratorEnv is an OpenEnv-native sequential MDP: an LLM agent selects from 6 tools with heterogeneous costs under a shared episode budget, across 10 questions drawn from 4 domains.
Episode structure
```mermaid
flowchart TD
    A["Reset: episode budget = 50 units"] --> B["Sample 10 questions (HotpotQA / MATH / GPQA / HumanEval)"]
    B --> C["Show observation: question + domain + budget + context"]
    C --> D{"Agent action"}
    D -->|"tool call"| E["Run selected tool: charge cost + append result"]
    E --> F{"Budget exhausted or max 8 steps?"}
    F -->|"no"| C
    F -->|"yes"| I["Advance or end episode"]
    D -->|"commit"| G["Grade answer: Exact Match + token F1"]
    G --> H["Compute commit reward: quality + efficiency bonus"]
    H --> I
    I --> J{"Questions remain and budget remains?"}
    J -->|"yes"| C
    J -->|"no"| K["Episode done"]
```
One episode is a sequential budget-management problem, not ten isolated QA calls.
```text
START EPISODE
  Budget = 50.0 units
  Draw 10 questions (mix: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval)
  FOR each question:
    Show agent: question text, domain, remaining budget, context window
    LOOP (max 8 steps per question):
      Agent picks a tool + sends a query
      Environment runs the tool, charges the cost, returns results
      Results added to agent's context window
      IF agent calls "commit" → grade answer, compute reward, next question
      IF budget exhausted → episode ends immediately
END EPISODE
```
The six tools
| Tool | Cost | What it does | Best for |
| --- | --- | --- | --- |
| calculator | 0.1 | Safe AST-based math expression evaluator | MATH arithmetic |
| code_executor | 0.3 | Sandboxed Python exec() with import blocking | HumanEval, complex algebra |
| wiki_lookup | 0.5 | Wikipedia REST API, first paragraph | Entity lookups |
| ceramic_search | 1.0 | Ceramic AI web search API, top-5 results | HotpotQA multi-hop |
| llm_reason | 2.0 | Together AI LLM call (Llama-3-8B), 512 tokens | GPQA graduate-level |
| commit | 0.0 | Submit answer for grading | Always free |
Costs span a 20:1 ratio from calculator to llm_reason. A single LLM reasoning call burns 4% of the entire episode budget. The agent must learn that this is sometimes worth it (GPQA) and sometimes wasteful (simple arithmetic).
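As a concrete sketch, the cost schedule above can be encoded as a lookup plus a charging step. The names below (`TOOL_COSTS`, `charge`) are illustrative, not the environment's actual module layout:

```python
# Illustrative cost schedule and budget charging. Costs match the table above;
# the structure is a sketch, not the shipped code.
TOOL_COSTS = {
    "calculator": 0.1,      # safe AST-based math evaluator
    "code_executor": 0.3,   # sandboxed Python exec()
    "wiki_lookup": 0.5,     # Wikipedia REST API, first paragraph
    "ceramic_search": 1.0,  # Ceramic AI web search, top-5 results
    "llm_reason": 2.0,      # Together AI LLM call, 512 tokens
    "commit": 0.0,          # submitting an answer is always free
}

def charge(budget: float, tool: str) -> float:
    """Deduct a tool's cost from the shared episode budget; fail if exhausted."""
    cost = TOOL_COSTS[tool]
    if budget < cost:
        raise RuntimeError("budget exhausted: episode ends")
    return budget - cost
```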
```mermaid
flowchart LR
    Q["Question + domain"] --> R{"Routing decision"}
    R -->|"arithmetic / symbolic"| C["calculator (0.1)"]
    R -->|"code execution"| X["code_executor (0.3)"]
    R -->|"entity fact"| W["wiki_lookup (0.5)"]
    R -->|"multi-hop factual"| S["ceramic_search (1.0)"]
    R -->|"hard science reasoning"| L["llm_reason (2.0)"]
    C --> A["Context window"]
    X --> A
    W --> A
    S --> A
    L --> A
    A --> M{"Confident?"}
    M -->|"yes"| K["commit (0.0)"]
    M -->|"no"| R
```
The core policy problem is routing: choose the cheapest tool that is likely to change the answer.
Observation space
At every step, the agent sees: the question text and domain tag, remaining budget and fraction thereof, tool call history and results for the current question, number of questions remaining, and running accuracy. The agent emits a structured action specifying tool selection, query/expression/code, and (for commit) an answer.
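A minimal sketch of what that observation and structured action might look like as Python types. Field names here are illustrative assumptions, not the environment's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    question: str
    domain: str                   # "hotpotqa" | "math" | "gpqa" | "humaneval"
    budget_remaining: float
    budget_fraction: float        # budget_remaining / 50.0
    tool_history: list[dict] = field(default_factory=list)  # calls + results so far
    questions_remaining: int = 10
    running_accuracy: float = 0.0

@dataclass
class ToolAction:
    tool: str                     # one of the six tool names
    query: str = ""               # search query, math expression, or code
    answer: str = ""              # only used when tool == "commit"
```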
Four domains
| Domain | Mix | Why it matters for tool selection |
| --- | --- | --- |
| HotpotQA | 40% | Multi-hop factual QA — needs ceramic_search or wiki_lookup (multiple calls) |
| MATH | 30% | Competition math — calculator for arithmetic, code_executor for algebra, llm_reason for proofs |
| GPQA | 20% | Graduate-level science — often requires llm_reason, which costs 2.0 |
| HumanEval | 10% | Code generation — needs code_executor to verify, maybe llm_reason to plan |
The Reward Formula — Deep Dive
This is the core intellectual contribution. The reward has two components that create constant pressure to be both correct and frugal.
Part 1: Step reward (every tool call)
$$R_{\text{step}} = -\text{tool\_cost}$$
Every tool call produces a negative reward equal to its cost. This creates a running penalty — the agent pays for every action it takes.
Part 2: Commit reward (on answer submission)
```mermaid
flowchart TD
    A["Prediction + gold answer"] --> B["Normalize text: lowercase, punctuation, articles"]
    B --> C["Compute Exact Match"]
    B --> D["Compute token F1"]
    C --> E["quality = 1.0 if EM, else token F1"]
    D --> E
    E --> F["base = -0.5 + quality * 1.5"]
    E --> G{"quality >= 0.5?"}
    G -->|"yes"| H["bonus = 0.1 * remaining_budget_ratio"]
    G -->|"no"| I["bonus = 0"]
    F --> J["commit reward = base + bonus"]
    H --> J
    I --> J
    K["tool costs already charged as step rewards"] --> L["episode return"]
    J --> L
```
The reward separates spending pressure from answer quality, then gates efficiency bonus on a useful answer.
Collapsing the flowchart into one formula:

$$R_{\text{commit}} = \underbrace{-0.5 + 1.5 \cdot \text{quality}}_{\text{base}} + \underbrace{\eta \cdot \gamma \cdot \frac{B_{\text{remaining}}}{B_{\text{total}}}}_{\text{efficiency bonus}}$$

where:

- $\text{quality} \in [0,1]$ — computed from Exact Match (1.0) or Token F1 (partial credit)
- $\eta = \mathbb{1}[\text{quality} \geq 0.5]$ — efficiency bonus gate: only awarded if the answer is at least half-right
- $\gamma = 0.1$ — efficiency weight
- $B_{\text{remaining}} / B_{\text{total}}$ — fraction of budget still unspent
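A minimal sketch of this commit reward in code, assuming the base/bonus decomposition above (the function name is illustrative):

```python
def commit_reward(quality: float, budget_remaining: float,
                  budget_total: float = 50.0) -> float:
    """quality in [0, 1]: 1.0 on Exact Match, else token F1."""
    base = -0.5 + 1.5 * quality
    # Efficiency bonus is gated: only a useful answer earns it.
    bonus = 0.1 * (budget_remaining / budget_total) if quality >= 0.5 else 0.0
    return base + bonus
```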
Worked examples
| Scenario | Tools used | $R_{\text{step}}$ | Quality | $R_{\text{commit}}$ | Total |
| --- | --- | --- | --- | --- | --- |
| A: Right, cheap | 1× calculator (0.1) | −0.1 | 1.0 | +1.10 | +1.00 |
| B: Right, expensive | 3× ceramic_search (3.0) | −3.0 | 1.0 | +1.09 | −1.91 |
| C: Wrong | 1× wiki_lookup (0.5) | −0.5 | 0.0 | −0.50 | −1.00 |
| D: Partial (F1=0.6) | 1× llm_reason (2.0) | −2.0 | 0.6 | +0.49 | −1.51 |
Scenario A is the dream: right answer, cheap tool, big total reward. Scenario B shows that even a correct answer with excessive tool use produces a negative total. The formula makes cost-awareness unavoidable.
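The table can be reproduced with the `commit_reward` sketch above, assuming the full 50-unit budget was available at the start of each question:

```python
# Each scenario: (label, quality, budget spent on tools before committing).
for label, quality, spent in [("A", 1.0, 0.1), ("B", 1.0, 3.0),
                              ("C", 0.0, 0.5), ("D", 0.6, 2.0)]:
    r_commit = commit_reward(quality, 50.0 - spent)
    print(f"{label}: R_commit={r_commit:+.2f}, total={-spent + r_commit:+.2f}")
# A: R_commit=+1.10, total=+1.00
# B: R_commit=+1.09, total=-1.91
# C: R_commit=-0.50, total=-1.00
# D: R_commit=+0.50, total=-1.50  (the table truncates 0.496 to +0.49)
```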
Why this formula shape
The efficiency bonus gate ($\eta$): Prevents a degenerate strategy where the agent commits immediately with a random guess to collect efficiency bonus without trying.
Linear quality scaling: Partial credit (via Token F1) provides gradient signal even for close-but-not-exact answers, making learning easier.
Budget-ratio efficiency: As budget drains, each correct answer is worth slightly less bonus, pushing the agent to be consistently frugal across all 10 questions.
Answer grading
Grading produces a quality score in $[0,1]$. The pipeline: (1) extract the answer from the agent's response (JSON parsing → prefix matching → last-line fallback), (2) normalize both prediction and ground truth (lowercase, remove articles and punctuation, tokenize), (3) compute Exact Match (binary) and Token F1 (harmonic mean of token precision and recall), (4) quality = 1.0 if EM, else F1.
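A sketch of steps (2)–(4) under standard SQuAD-style normalization and token F1 (helper names are illustrative; step (1), answer extraction, is omitted):

```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)                    # drop articles
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())                                  # collapse whitespace

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
    if not p or not g or common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)           # harmonic mean

def quality(pred: str, gold: str) -> float:
    return 1.0 if normalize(pred) == normalize(gold) else token_f1(pred, gold)
```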
Why GRPO
We train with Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath (Shao et al., 2024) and used to train DeepSeek-R1.
No critic model needed. PPO requires a separate value network, roughly doubling memory. GRPO eliminates it by estimating advantages from the relative quality of sampled completions within each group.
Natural fit for verifiable rewards. Our reward is deterministic arithmetic — no learned reward model needed.
Accessible training. Single-node, significantly less VRAM than PPO.
The GRPO objective
For each prompt $q$, GRPO samples $G$ completions $\{o_1, \ldots, o_G\}$. Each receives reward $r_i$. The advantage is a z-score within the group:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$
In our setup, each completion is an entire multi-question episode trajectory: the agent's sequence of tool selections, queries, and commits across 10 questions under a shared budget. A trajectory that routes correctly and cheaply gets high reward; one that wastes budget or answers wrong gets low reward.
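The advantage computation itself is a few lines. This is a sketch of the formula above, not TRL's internal implementation:

```python
import numpy as np

def grpo_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score the G episode returns sampled for one prompt group."""
    return (returns - returns.mean()) / (returns.std() + eps)

# Four sampled episode returns (e.g., the worked scenarios): a frugal-and-correct
# trajectory gets positive advantage, wasteful or wrong ones negative.
print(grpo_advantages(np.array([1.00, -1.91, -1.00, -1.51])))
```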
```mermaid
sequenceDiagram
    participant GRPO as GRPO trainer
    participant Policy as LLM policy
    participant Client as OpenEnv client
    participant Env as ToolOrchestratorEnv
    participant Tools as Tool registry
    GRPO->>Policy: sample G trajectories
    Policy->>Client: structured tool actions
    Client->>Env: WebSocket /step
    Env->>Tools: dispatch selected tool
    Tools-->>Env: tool result + cost
    Env-->>Client: observation + reward + done
    Client-->>GRPO: completed trajectory returns
    GRPO->>GRPO: z-score rewards into advantages
    GRPO->>Policy: policy update
```
OpenEnv keeps training decoupled from the environment server: the policy learns from complete budgeted trajectories.
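A hypothetical rollout loop on the client side; the class and method names are assumptions for illustration, not the shipped OpenEnv client API:

```python
def rollout_episode(client, policy) -> float:
    """Run one budgeted episode and return its total reward."""
    obs = client.reset()                  # budget = 50, first question shown
    episode_return, done = 0.0, False
    while not done:
        action = policy.act(obs)          # structured tool choice + query/answer
        obs, reward, done = client.step(action)
        episode_return += reward          # step costs and commit rewards accrue
    return episode_return
```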
Why RL is the right framework
Delayed rewards. You don't know if a tool call was helpful until you commit. The agent must assign credit backwards.
Exploration. The agent must try different tool combinations to discover which work best per domain. No labeled "correct tool sequence" exists.
Multi-step planning. 10 questions share one budget. A good agent plans across the whole episode — spending too much early leaves nothing for later.
What a Trained Agent Should Learn
A well-trained agent should exhibit these behaviors — none of which are explicitly programmed:
Domain routing. Math question → calculator. Factual multi-hop → search. Graduate science → llm_reason. The agent learns the domain→tool mapping from reward signal alone.
Confidence-based committing. If the calculator returns a clean number for an arithmetic question, commit immediately. Don't waste 0.5 on a Wikipedia lookup you don't need.
Budget awareness. In early questions with plenty of budget, use ceramic_search. By question 8 with only 5 units left and 3 questions remaining, switch to calculator-only even for non-math questions.
Failure recovery. If the first tool returns garbage, try a different tool rather than committing a bad answer.
These are the behaviors that baselines cannot exhibit — they require learning from feedback across thousands of episodes.
What We Actually Built
A complete environment server. FastAPI endpoints, WebSocket client support, per-session state, Docker/Hugging Face deployment metadata, and a browser demo route.
A six-tool action space. Ceramic search, Wikipedia lookup, calculator, Python executor, LLM reasoning, and commit, each with explicit costs and normalized error handling.
A verifiable reward function. Every tool call is penalized by cost; commits are scored with Exact Match, token F1, and an efficiency bonus gated by answer quality.
Reference baselines and tests. Random, cheapest-first, and domain-oracle policies ship with unit tests covering the API, tools, sandbox behavior, and reward-facing contracts.
Baselines and Honest Results Status
Three shipped baselines
| Baseline | Policy | What it isolates |
| --- | --- | --- |
| Random tool | Picks a tool uniformly at random; commits after 3 steps with "I don't know" | Absolute floor — any RL agent that can't beat this is broken |
| Cheapest first | Calls tools in ascending cost order: calc → code → wiki → search → LLM | Great budget efficiency, terrible accuracy on factual/science questions |
| Domain oracle | Hardcoded domain→tool routing (e.g., MATH → calculator, HotpotQA → ceramic_search) | The strongest heuristic: the target a trained policy must beat on cost-adjusted reward |
The core claim fails if the trained policy cannot beat the domain oracle on cost-adjusted reward.
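For illustration, the cheapest-first policy is only a few lines, reusing the illustrative `TOOL_COSTS` and `ToolAction` sketches from earlier (`best_guess` is a hypothetical helper, and the real `baselines/cheapest_first.py` may differ):

```python
ASCENDING = ["calculator", "code_executor", "wiki_lookup",
             "ceramic_search", "llm_reason"]

def cheapest_first_action(obs, step: int):
    """Sweep tools from cheapest to dearest, then commit with whatever we have."""
    if step < len(ASCENDING) and obs.budget_remaining >= TOOL_COSTS[ASCENDING[step]]:
        return ToolAction(tool=ASCENDING[step], query=obs.question)
    return ToolAction(tool="commit", answer=best_guess(obs))  # best_guess: hypothetical
```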
Honest status of the trained policy
⚠ HONEST STATUS: Environment complete, no converged checkpoint claimed
We do not claim a converged, baseline-beating trained checkpoint. The research contribution in this submission is the environment, reward design, baselines, deployment path, and the training-ready interface. What we have is:
Environment validated end-to-end. Reset/step API, tool dispatch, answer grading, reward calculation, concurrent session handling, and browser demo are implemented and covered by tests.
Environment deployed and tested. HF Space serves concurrent sessions. WebSocket client connects, steps episodes, returns structured observations.
All three baselines functional. They provide sanity checks for random exploration, low-cost heuristics, and hardcoded domain routing.
No training logs yet. We were unable to complete Env Factory integration during the submission window because the current interface did not support our multi-tool action flow cleanly enough for reliable rollouts.
Training risks identified honestly. See the next section for the concrete failure modes this environment is designed to expose.
GRPO Training: What This Environment Is Built to Test
The next research step is to train with TRL's GRPO via an OpenEnv-compatible rollout loop. We are careful about the claim: this submission ships the environment and baselines, not a final trained policy. The limiting factor was not the reward design or environment server; it was Env Factory integration. Our environment requires a model to make structured, repeated multi-tool calls across an episode, and we were not able to make that interaction reliable enough inside the current Env Factory path to produce trustworthy training logs before submission. We plan to continue the experiments as Env Factory stabilizes and as more post-training model series become available.
The failure modes below are the concrete behaviors the environment is designed to make measurable once that post-training loop is stable.
```mermaid
flowchart TD
    A["Training risk"] --> B["Commit immediately: spend nothing, answer poorly"]
    A --> C["Overuse expensive tools: solve early, lose budget"]
    A --> D["Collapse to one domain: same tool everywhere"]
    A --> E["Variable trajectory lengths: harder batching and credit assignment"]
    B --> F["Measured by quality gate and wrong-answer penalty"]
    C --> G["Measured by shared budget and cumulative step costs"]
    D --> H["Measured against domain-oracle baseline"]
    E --> I["Exposed by multi-step OpenEnv episodes"]
```
The benchmark is useful because the common training failures are observable in reward, budget, and baseline comparisons.
1. Reward scale mismatch
Step rewards (−0.1 to −2.0) and commit rewards (−0.5 to +1.1) operate on different scales. A useful trained policy must learn that a costly call can still be rational when it raises answer quality enough to recover the cost.
2. Budget-exhaustion cliff
When the agent exhausts its shared budget, the episode ends. This makes early overspending visible: a policy that solves the first few questions with expensive tools can lose the rest of the episode.
3. Variable-length trajectory handling
Episodes can end after different numbers of tool calls because agents commit, run out of steps, or spend the budget. That makes batching and credit assignment harder than single-turn QA, and it is exactly why a realistic tool-use environment matters.
4. Domain-collapse risk
The domain mix is intentionally uneven: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval. A weak policy can overfit to the most common domain and call the same tool everywhere. The domain-oracle baseline makes that failure easy to detect.
5. The commit-immediately attractor
Because commit is free, an untrained agent can minimize spending by answering immediately. But a blind commit earns roughly −0.5 (the wrong-answer base penalty), and the quality gate blocks the efficiency bonus for poor answers, so a successful policy must learn the marginal value of information rather than simply learning to spend nothing.
Why OpenEnv
OpenEnv provides: (1) a standard WebSocket contract consumable by training clients, (2) per-session state with concurrent session support, and (3) a uniform deployment path — same code runs in-process for tests, as Docker for dev, and as a HF Space for training. The remaining integration work is specifically at the Env Factory/model-control layer: making repeated structured multi-tool calls stable enough for post-training rollouts.
Prior Work and Foundations
Weitzman (1979) "Optimal Search for the Best Alternative" — foundational search economics. Information has a cost; rational agents should not search beyond expected marginal gain.
SearchEconomicsEnv (Yashaswi Sharma / University of Southern California / Ceramic AI) — direct predecessor. Single-tool (search), single-dataset (HotpotQA), budget-constrained. Proved the principle.
ReAct (Yao et al., 2022) — interleaving reasoning and tool calls. The paradigm our agent operates within.
Toolformer (Schick et al., 2023) — self-supervised tool learning for LLMs.
GRPO / DeepSeekMath (Shao et al., 2024) — group-relative advantages. Our training algorithm.
DeepSeek-R1 (Guo et al., 2025) — GRPO at scale for reasoning.
CATP-LLM (Wu et al., ICCV 2025) — cost-aware tool planning via offline RL. We differ: online GRPO, episode-level budget, broader benchmarks.
Agent-R1 (Cheng et al., 2025) — end-to-end RL for LLM agents. Complementary: capability + our cost-awareness could compose.
OpenEnv (Meta PyTorch) — base types, WebSocket protocol, submission framework.
Quick Start
```bash
# 1. Run the env locally
pip install -r requirements.txt
python app.py                 # FastAPI on port 8000

# 2. Or use the HF Space
export ENV_BASE_URL="https://landrew9-toolorchestratorenv.hf.space"

# 3. Run baselines
python baselines/random_tool.py
python baselines/cheapest_first.py
python baselines/oracle.py

# 4. Train with GRPO (requires TRL + vLLM)
# See training client docs in RESEARCH.md
```
All episodes are seeded and reproducible. The Ceramic AI fallback client provides deterministic offline results when no API key is set, so the full environment runs without external dependencies.
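A hypothetical smoke test against a running server; the `/reset` route and response fields are assumptions, so check the client docs in RESEARCH.md for the real interface:

```python
import os
import requests

base = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
# Assumed route and payload: a seeded reset returning the first observation.
obs = requests.post(f"{base}/reset", json={"seed": 42}).json()
print(obs.get("question"), obs.get("budget_remaining"))  # assumed field names
```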
What We Did Not Do (Yet)
No converged checkpoint. Environment, baselines, tests, and deployment are complete; convergence is the next milestone.
No Env Factory training logs yet. We could not complete a reliable Env Factory integration for repeated multi-tool calls in time for submission. This is planned follow-up work as the Env Factory path and available post-training model series mature.
Fixed cost model. Real API costs are dynamic. Our fixed costs are a useful simplification.
No human grading. All scoring is automated (EM + F1); human evaluation of answer quality is future work.
Single budget per episode. Per-question budgets or adaptive budgets are natural extensions.
Ceramic AI dependency. Live web search requires an API key. Fallback client enables offline training but loses real-world retrieval quality.
💰 Can an LLM learn that a calculator is worth 0.1 and an LLM call is worth 2.0 — and route accordingly?
Cost-aware tool selection under budget pressure is the test.
Conclusion
ToolOrchestratorEnv reframes a practical engineering problem — "LLM agents waste money on tools" — as a verifiable RL task. Six tools with a 20:1 cost ratio, four domains requiring fundamentally different tool strategies, a shared episode budget, and a decomposed reward that penalizes every tool call while rewarding correct-and-frugal commits. The completed contribution is the environment: a deployed OpenEnv-native benchmark, explicit cost model, reward implementation, baselines, tests, and submission artifact. The research question — can a GRPO-trained LLM beat the domain oracle baseline on cost-adjusted score? — is now ready to evaluate cleanly. Convergence is the next milestone, not a current claim.
References
Weitzman, M. "Optimal Search for the Best Alternative." Econometrica, 1979.
Yao, S., et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
Schick, T., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023.
Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024.
Guo, D., et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
Wu, Y., et al. "CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning." ICCV 2025.
Cheng, M., et al. "Agent-R1: Training Powerful LLM Agents with End-to-End RL." arXiv:2511.14460, 2025.
Yang, Z., et al. "HotpotQA: A Dataset for Diverse, Explainable Multi-hop QA." EMNLP 2018.
Hendrycks, D., et al. "Measuring Mathematical Problem Solving with the MATH Dataset." NeurIPS 2021.
Rein, D., et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022, 2023.
Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.
Schulman, J., et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.