Modern LLM agents have access to tools — search engines, calculators, code interpreters, databases. In every real deployment, these tools cost something: API fees, latency, rate limits, compute time. But almost every existing RL benchmark treats tools as free and unlimited.
This creates a gap between research and reality. An agent trained on "use whatever tools you want" behaves terribly in production where every call costs money. ToolOrchestratorEnv closes that gap.
The agent is given a fixed budget of 50 cost units to spend across 10 questions. Every tool call deducts from that budget. The agent must decide: which tool is worth calling for this question? How many times should I call tools before committing an answer? Is it worth spending 2.0 on an LLM call, or can a 0.1 calculator solve this?
In one sentence: ToolOrchestratorEnv is a deployed, test-covered OpenEnv environment for studying whether reinforcement learning can teach LLM agents to route across heterogeneous tools under a shared budget, rather than treating every tool call as free.
Where this came from
ToolOrchestratorEnv generalizes SearchEconomicsEnv (Yashaswi Sharma, University of Southern California / Ceramic AI), which posed a simpler version: given a fixed number of search calls, can an RL agent learn to answer HotpotQA questions efficiently? That work showed agents could learn non-trivial search strategies. We ask the harder question: can the same principle scale to multiple tools and multiple domains?
```mermaid
flowchart LR
    A["Weitzman search economics: information has a cost"] --> B["SearchEconomicsEnv: 1 search tool + HotpotQA"]
    B --> C["ToolOrchestratorEnv: 6 tools + 4 domains + shared budget"]
    C --> D["AgentBeats Phase 2 OpenEnv submission"]
```
Research lineage: from economic search theory to a deployed multi-tool RL environment.
| | SearchEconomicsEnv | ToolOrchestratorEnv |
| --- | --- | --- |
| Tools | 1 (search) | 6 (search, wiki, calc, code, LLM, commit) |
| Datasets | HotpotQA only | HotpotQA + MATH + GPQA + HumanEval |
| Budget unit | # of search calls | Cost units per tool |
| Core challenge | How many searches? | Which tool, when, under budget pressure? |
Environment Design
ToolOrchestratorEnv is an OpenEnv-native sequential MDP: an LLM agent selects from 6 tools with heterogeneous costs under a shared episode budget, across 10 questions drawn from 4 domains.
Episode structure
```mermaid
flowchart TD
    A["Reset: episode budget = 50 units"] --> B["Sample 10 questions (HotpotQA / MATH / GPQA / HumanEval)"]
    B --> C["Show observation: question + domain + budget + context"]
    C --> D{"Agent action"}
    D -->|"tool call"| E["Run selected tool: charge cost + append result"]
    E --> F{"Budget exhausted or max 8 steps?"}
    F -->|"no"| C
    F -->|"yes"| I["Advance or end episode"]
    D -->|"commit"| G["Grade answer: Exact Match + token F1"]
    G --> H["Compute commit reward: quality + efficiency bonus"]
    H --> I
    I --> J{"Questions remain and budget remains?"}
    J -->|"yes"| C
    J -->|"no"| K["Episode done"]
```
One episode is a sequential budget-management problem, not ten isolated QA calls.
```text
START EPISODE
  Budget = 50.0 units
  Draw 10 questions (mix: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval)
  FOR each question:
    Show agent: question text, domain, remaining budget, context window
    LOOP (max 8 steps per question):
      Agent picks a tool + sends a query
      Environment runs the tool, charges the cost, returns results
      Results added to agent's context window
      IF agent calls "commit" → grade answer, compute reward, next question
      IF budget exhausted → episode ends immediately
END EPISODE
```
The six tools
| Tool | Cost | What it does | Best for |
| --- | --- | --- | --- |
| calculator | 0.1 | Safe AST-based math expression evaluator | MATH arithmetic |
| code_executor | 0.3 | Sandboxed Python exec() with import blocking | HumanEval, complex algebra |
| wiki_lookup | 0.5 | Wikipedia REST API, first paragraph | Entity lookups |
| ceramic_search | 1.0 | Ceramic AI web search API, top-5 results | HotpotQA multi-hop |
| llm_reason | 2.0 | Together AI LLM call (Llama-3-8B), 512 tokens | GPQA graduate-level |
| commit | 0.0 | Submit answer for grading | Always free |
Costs span a 20:1 ratio from calculator to llm_reason. A single LLM reasoning call burns 4% of the entire episode budget. The agent must learn that this is sometimes worth it (GPQA) and sometimes wasteful (simple arithmetic).
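As a concrete sketch, the cost schedule above can be encoded as a lookup plus a charging step. The names below (`TOOL_COSTS`, `charge`) are illustrative, not the environment's actual module layout:

```python
# Illustrative cost schedule and budget charging. Costs match the table above;
# the structure is a sketch, not the shipped code.
TOOL_COSTS = {
    "calculator": 0.1,      # safe AST-based math evaluator
    "code_executor": 0.3,   # sandboxed Python exec()
    "wiki_lookup": 0.5,     # Wikipedia REST API, first paragraph
    "ceramic_search": 1.0,  # Ceramic AI web search, top-5 results
    "llm_reason": 2.0,      # Together AI LLM call, 512 tokens
    "commit": 0.0,          # submitting an answer is always free
}

def charge(budget: float, tool: str) -> float:
    """Deduct a tool's cost from the shared episode budget; fail if exhausted."""
    cost = TOOL_COSTS[tool]
    if budget < cost:
        raise RuntimeError("budget exhausted: episode ends")
    return budget - cost
```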
```mermaid
flowchart LR
    Q["Question + domain"] --> R{"Routing decision"}
    R -->|"arithmetic / symbolic"| C["calculator (0.1)"]
    R -->|"code execution"| X["code_executor (0.3)"]
    R -->|"entity fact"| W["wiki_lookup (0.5)"]
    R -->|"multi-hop factual"| S["ceramic_search (1.0)"]
    R -->|"hard science reasoning"| L["llm_reason (2.0)"]
    C --> A["Context window"]
    X --> A
    W --> A
    S --> A
    L --> A
    A --> M{"Confident?"}
    M -->|"yes"| K["commit (0.0)"]
    M -->|"no"| R
```
The core policy problem is routing: choose the cheapest tool that is likely to change the answer.
Observation space
At every step, the agent sees: the question text and domain tag, remaining budget and fraction thereof, tool call history and results for the current question, number of questions remaining, and running accuracy. The agent emits a structured action specifying tool selection, query/expression/code, and (for commit) an answer.
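A minimal sketch of what that observation and structured action might look like as Python types. Field names here are illustrative assumptions, not the environment's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    question: str
    domain: str                   # "hotpotqa" | "math" | "gpqa" | "humaneval"
    budget_remaining: float
    budget_fraction: float        # budget_remaining / 50.0
    tool_history: list[dict] = field(default_factory=list)  # calls + results so far
    questions_remaining: int = 10
    running_accuracy: float = 0.0

@dataclass
class ToolAction:
    tool: str                     # one of the six tool names
    query: str = ""               # search query, math expression, or code
    answer: str = ""              # only used when tool == "commit"
```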
Four domains
| Domain | Mix | Why it matters for tool selection |
| --- | --- | --- |
| HotpotQA | 40% | Multi-hop factual QA — needs ceramic_search or wiki_lookup (multiple calls) |
| MATH | 30% | Competition math — calculator for arithmetic, code_executor for algebra, llm_reason for proofs |
| GPQA | 20% | Graduate-level science — often requires llm_reason, which costs 2.0 |
| HumanEval | 10% | Code generation — needs code_executor to verify, maybe llm_reason to plan |
The Reward Formula — Deep Dive
This is the core intellectual contribution. The reward has two components that create constant pressure to be both correct and frugal.
Part 1: Step reward (every tool call)
$$R_{\text{step}} = -\text{tool\_cost}$$
Every tool call produces a negative reward equal to its cost. This creates a running penalty — the agent pays for every action it takes.
Part 2: Commit reward (on answer submission)
```mermaid
flowchart TD
    A["Prediction + gold answer"] --> B["Normalize text: lowercase, punctuation, articles"]
    B --> C["Compute Exact Match"]
    B --> D["Compute token F1"]
    C --> E["quality = 1.0 if EM, else token F1"]
    D --> E
    E --> F["base = -0.5 + quality * 1.5"]
    E --> G{"quality >= 0.5?"}
    G -->|"yes"| H["bonus = 0.1 * remaining_budget_ratio"]
    G -->|"no"| I["bonus = 0"]
    F --> J["commit reward = base + bonus"]
    H --> J
    I --> J
    K["tool costs already charged as step rewards"] --> L["episode return"]
    J --> L
```
The reward separates spending pressure from answer quality, then gates efficiency bonus on a useful answer.
Collapsing the flowchart into one formula:

$$R_{\text{commit}} = \underbrace{-0.5 + 1.5 \cdot \text{quality}}_{\text{base}} + \underbrace{\eta \cdot \gamma \cdot \frac{B_{\text{remaining}}}{B_{\text{total}}}}_{\text{efficiency bonus}}$$

where:

- $\text{quality} \in [0,1]$ — computed from Exact Match (1.0) or Token F1 (partial credit)
- $\eta = \mathbb{1}[\text{quality} \geq 0.5]$ — efficiency bonus gate: only awarded if the answer is at least half-right
- $\gamma = 0.1$ — efficiency weight
- $B_{\text{remaining}} / B_{\text{total}}$ — fraction of budget still unspent
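A minimal sketch of this commit reward in code, assuming the base/bonus decomposition above (the function name is illustrative):

```python
def commit_reward(quality: float, budget_remaining: float,
                  budget_total: float = 50.0) -> float:
    """quality in [0, 1]: 1.0 on Exact Match, else token F1."""
    base = -0.5 + 1.5 * quality
    # Efficiency bonus is gated: only a useful answer earns it.
    bonus = 0.1 * (budget_remaining / budget_total) if quality >= 0.5 else 0.0
    return base + bonus
```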
Worked examples
| Scenario | Tools used | $R_{\text{step}}$ | Quality | $R_{\text{commit}}$ | Total |
| --- | --- | --- | --- | --- | --- |
| A: Right, cheap | 1× calculator (0.1) | −0.1 | 1.0 | +1.10 | +1.00 |
| B: Right, expensive | 3× ceramic_search (3.0) | −3.0 | 1.0 | +1.09 | −1.91 |
| C: Wrong | 1× wiki_lookup (0.5) | −0.5 | 0.0 | −0.50 | −1.00 |
| D: Partial (F1=0.6) | 1× llm_reason (2.0) | −2.0 | 0.6 | +0.49 | −1.51 |
Scenario A is the dream: right answer, cheap tool, big total reward. Scenario B shows that even a correct answer with excessive tool use produces a negative total. The formula makes cost-awareness unavoidable.
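The table can be reproduced with the `commit_reward` sketch above, assuming the full 50-unit budget was available at the start of each question:

```python
# Each scenario: (label, quality, budget spent on tools before committing).
for label, quality, spent in [("A", 1.0, 0.1), ("B", 1.0, 3.0),
                              ("C", 0.0, 0.5), ("D", 0.6, 2.0)]:
    r_commit = commit_reward(quality, 50.0 - spent)
    print(f"{label}: R_commit={r_commit:+.2f}, total={-spent + r_commit:+.2f}")
# A: R_commit=+1.10, total=+1.00
# B: R_commit=+1.09, total=-1.91
# C: R_commit=-0.50, total=-1.00
# D: R_commit=+0.50, total=-1.50  (the table truncates 0.496 to +0.49)
```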
Why this formula shape
The efficiency bonus gate ($\eta$): Prevents a degenerate strategy where the agent commits immediately with a random guess to collect efficiency bonus without trying.
Linear quality scaling: Partial credit (via Token F1) provides gradient signal even for close-but-not-exact answers, making learning easier.
Budget-ratio efficiency: As budget drains, each correct answer is worth slightly less bonus, pushing the agent to be consistently frugal across all 10 questions.
Answer grading
Grading produces a quality score in $[0,1]$. The pipeline: (1) extract the answer from the agent's response (JSON parsing → prefix matching → last-line fallback), (2) normalize both prediction and ground truth (lowercase, remove articles and punctuation, tokenize), (3) compute Exact Match (binary) and Token F1 (harmonic mean of token precision and recall), (4) quality = 1.0 if EM, else F1.
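A sketch of steps (2)–(4) under standard SQuAD-style normalization and token F1 (helper names are illustrative; step (1), answer extraction, is omitted):

```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)                    # drop articles
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())                                  # collapse whitespace

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
    if not p or not g or common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)           # harmonic mean

def quality(pred: str, gold: str) -> float:
    return 1.0 if normalize(pred) == normalize(gold) else token_f1(pred, gold)
```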
Why GRPO
We train with Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath (Shao et al., 2024) and used to train DeepSeek-R1.
No critic model needed. PPO requires a separate value network, roughly doubling memory. GRPO eliminates it by estimating advantages from the relative quality of sampled completions within each group.
Natural fit for verifiable rewards. Our reward is deterministic arithmetic — no learned reward model needed.
Accessible training. Single-node, significantly less VRAM than PPO.
The GRPO objective
For each prompt $q$, GRPO samples $G$ completions $\{o_1, \ldots, o_G\}$. Each receives reward $r_i$. The advantage is a z-score within the group:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$
In our setup, each completion is an entire multi-question episode trajectory: the agent's sequence of tool selections, queries, and commits across 10 questions under a shared budget. A trajectory that routes correctly and cheaply gets high reward; one that wastes budget or answers wrong gets low reward.
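The advantage computation itself is a few lines. This is a sketch of the formula above, not TRL's internal implementation:

```python
import numpy as np

def grpo_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score the G episode returns sampled for one prompt group."""
    return (returns - returns.mean()) / (returns.std() + eps)

# Four sampled episode returns (e.g., the worked scenarios): a frugal-and-correct
# trajectory gets positive advantage, wasteful or wrong ones negative.
print(grpo_advantages(np.array([1.00, -1.91, -1.00, -1.51])))
```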
```mermaid
sequenceDiagram
    participant GRPO as GRPO trainer
    participant Policy as LLM policy
    participant Client as OpenEnv client
    participant Env as ToolOrchestratorEnv
    participant Tools as Tool registry
    GRPO->>Policy: sample G trajectories
    Policy->>Client: structured tool actions
    Client->>Env: WebSocket /step
    Env->>Tools: dispatch selected tool
    Tools-->>Env: tool result + cost
    Env-->>Client: observation + reward + done
    Client-->>GRPO: completed trajectory returns
    GRPO->>GRPO: z-score rewards into advantages
    GRPO->>Policy: policy update
```
OpenEnv keeps training decoupled from the environment server: the policy learns from complete budgeted trajectories.
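A hypothetical rollout loop on the client side; the class and method names are assumptions for illustration, not the shipped OpenEnv client API:

```python
def rollout_episode(client, policy) -> float:
    """Run one budgeted episode and return its total reward."""
    obs = client.reset()                  # budget = 50, first question shown
    episode_return, done = 0.0, False
    while not done:
        action = policy.act(obs)          # structured tool choice + query/answer
        obs, reward, done = client.step(action)
        episode_return += reward          # step costs and commit rewards accrue
    return episode_return
```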
Why RL is the right framework
Delayed rewards. You don't know if a tool call was helpful until you commit. The agent must assign credit backwards.
Exploration. The agent must try different tool combinations to discover which work best per domain. No labeled "correct tool sequence" exists.
Multi-step planning. 10 questions share one budget. A good agent plans across the whole episode — spending too much early leaves nothing for later.
What a Trained Agent Should Learn
A well-trained agent should exhibit these behaviors — none of which are explicitly programmed:
Domain routing. Math question → calculator. Factual multi-hop → search. Graduate science → llm_reason. The agent learns the domain→tool mapping from reward signal alone.
Confidence-based committing. If the calculator returns a clean number for an arithmetic question, commit immediately. Don't waste 0.5 on a Wikipedia lookup you don't need.
Budget awareness. In early questions with plenty of budget, use ceramic_search. By question 8 with only 5 units left and 3 questions remaining, switch to calculator-only even for non-math questions.
Failure recovery. If the first tool returns garbage, try a different tool rather than committing a bad answer.
These are the behaviors that baselines cannot exhibit — they require learning from feedback across thousands of episodes.
What We Actually Built
A complete environment server. FastAPI endpoints, WebSocket client support, per-session state, Docker/Hugging Face deployment metadata, and a browser demo route.
A six-tool action space. Ceramic search, Wikipedia lookup, calculator, Python executor, LLM reasoning, and commit, each with explicit costs and normalized error handling.
A verifiable reward function. Every tool call is penalized by cost; commits are scored with Exact Match, token F1, and an efficiency bonus gated by answer quality.
Reference baselines and tests. Random, cheapest-first, and domain-oracle policies ship with unit tests covering the API, tools, sandbox behavior, and reward-facing contracts.
Baselines and Honest Results Status
Three shipped baselines
| Baseline | Policy | What it isolates |
| --- | --- | --- |
| Random tool | Picks a tool uniformly at random; commits after 3 steps with "I don't know" | Absolute floor — any RL agent that can't beat this is broken |
| Cheapest first | Calls tools in ascending cost order: calc → code → wiki → search → LLM | Great budget efficiency, terrible accuracy on factual/science questions |
| Domain oracle | Hardcoded domain→tool routing (e.g., MATH → calculator, HotpotQA → ceramic_search) | The strongest heuristic: the target a trained policy must beat on cost-adjusted reward |
The core claim fails if the trained policy cannot beat the domain oracle on cost-adjusted reward.
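For illustration, the cheapest-first policy is only a few lines, reusing the illustrative `TOOL_COSTS` and `ToolAction` sketches from earlier (`best_guess` is a hypothetical helper, and the real `baselines/cheapest_first.py` may differ):

```python
ASCENDING = ["calculator", "code_executor", "wiki_lookup",
             "ceramic_search", "llm_reason"]

def cheapest_first_action(obs, step: int):
    """Sweep tools from cheapest to dearest, then commit with whatever we have."""
    if step < len(ASCENDING) and obs.budget_remaining >= TOOL_COSTS[ASCENDING[step]]:
        return ToolAction(tool=ASCENDING[step], query=obs.question)
    return ToolAction(tool="commit", answer=best_guess(obs))  # best_guess: hypothetical
```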
Honest status of the trained policy
⚠ HONEST STATUS: Environment complete, no converged checkpoint claimed
We do not claim a converged, baseline-beating trained checkpoint. The research contribution in this submission is the environment, reward design, baselines, deployment path, and the training-ready interface. What we have is:
Environment validated end-to-end. Reset/step API, tool dispatch, answer grading, reward calculation, concurrent session handling, and browser demo are implemented and covered by tests.
Environment deployed and tested. HF Space serves concurrent sessions. WebSocket client connects, steps episodes, returns structured observations.
All three baselines functional. They provide sanity checks for random exploration, low-cost heuristics, and hardcoded domain routing.
No training logs yet. We were unable to complete Env Factory integration during the submission window because the current interface did not support our multi-tool action flow cleanly enough for reliable rollouts.
Training risks identified honestly. See the next section for the concrete failure modes this environment is designed to expose.
GRPO Training: What This Environment Is Built to Test
The next research step is to train with TRL's GRPO via an OpenEnv-compatible rollout loop. We are careful about the claim: this submission ships the environment and baselines, not a final trained policy. The limiting factor was not the reward design or environment server; it was Env Factory integration. Our environment requires a model to make structured, repeated multi-tool calls across an episode, and we were not able to make that interaction reliable enough inside the current Env Factory path to produce trustworthy training logs before submission. We plan to continue the experiments as Env Factory stabilizes and as more post-training model series become available.
The failure modes below are the concrete behaviors the environment is designed to make measurable once that post-training loop is stable.
```mermaid
flowchart TD
    A["Training risk"] --> B["Commit immediately: spend nothing, answer poorly"]
    A --> C["Overuse expensive tools: solve early, lose budget"]
    A --> D["Collapse to one domain: same tool everywhere"]
    A --> E["Variable trajectory lengths: harder batching and credit assignment"]
    B --> F["Measured by quality gate and wrong-answer penalty"]
    C --> G["Measured by shared budget and cumulative step costs"]
    D --> H["Measured against domain-oracle baseline"]
    E --> I["Exposed by multi-step OpenEnv episodes"]
```
The benchmark is useful because the common training failures are observable in reward, budget, and baseline comparisons.
1. Reward scale mismatch
Step rewards (−0.1 to −2.0) and commit rewards (−0.5 to +1.1) operate on different scales. A useful trained policy must learn that a costly call can still be rational when it raises answer quality enough to recover the cost.
2. Budget-exhaustion cliff
When the agent exhausts its shared budget, the episode ends. This makes early overspending visible: a policy that solves the first few questions with expensive tools can lose the rest of the episode.
3. Variable-length trajectory handling
Episodes can end after different numbers of tool calls because agents commit, run out of steps, or spend the budget. That makes batching and credit assignment harder than single-turn QA, and it is exactly why a realistic tool-use environment matters.
4. Domain-collapse risk
The domain mix is intentionally uneven: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval. A weak policy can overfit to the most common domain and call the same tool everywhere. The domain-oracle baseline makes that failure easy to detect.
5. The commit-immediately attractor
Because commit is free, an untrained agent can minimize spending by answering immediately. But a blind commit earns roughly −0.5 (the wrong-answer base penalty), and the quality gate blocks the efficiency bonus for poor answers, so a successful policy must learn the marginal value of information rather than simply learning to spend nothing.
Why OpenEnv
OpenEnv provides: (1) a standard WebSocket contract consumable by training clients, (2) per-session state with concurrent session support, and (3) a uniform deployment path — same code runs in-process for tests, as Docker for dev, and as a HF Space for training. The remaining integration work is specifically at the Env Factory/model-control layer: making repeated structured multi-tool calls stable enough for post-training rollouts.
Prior Work and Foundations
Weitzman (1979) "Optimal Search for the Best Alternative" — foundational search economics. Information has a cost; rational agents should not search beyond expected marginal gain.
SearchEconomicsEnv (Yashaswi Sharma / University of Southern California / Ceramic AI) — direct predecessor. Single-tool (search), single-dataset (HotpotQA), budget-constrained. Proved the principle.
ReAct (Yao et al., 2022) — interleaving reasoning and tool calls. The paradigm our agent operates within.
Toolformer (Schick et al., 2023) — self-supervised tool learning for LLMs.
GRPO / DeepSeekMath (Shao et al., 2024) — group-relative advantages. Our training algorithm.
DeepSeek-R1 (Guo et al., 2025) — GRPO at scale for reasoning.
CATP-LLM (Wu et al., ICCV 2025) — cost-aware tool planning via offline RL. We differ: online GRPO, episode-level budget, broader benchmarks.
Agent-R1 (Cheng et al., 2025) — end-to-end RL for LLM agents. Complementary: capability + our cost-awareness could compose.
OpenEnv (Meta PyTorch) — base types, WebSocket protocol, submission framework.
Quick Start
```bash
# 1. Run the env locally
pip install -r requirements.txt
python app.py                 # FastAPI on port 8000

# 2. Or use the HF Space
export ENV_BASE_URL="https://landrew9-toolorchestratorenv.hf.space"

# 3. Run baselines
python baselines/random_tool.py
python baselines/cheapest_first.py
python baselines/oracle.py

# 4. Train with GRPO (requires TRL + vLLM)
# See training client docs in RESEARCH.md
```
All episodes are seeded and reproducible. The Ceramic AI fallback client provides deterministic offline results when no API key is set, so the full environment runs without external dependencies.
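A hypothetical smoke test against a running server; the `/reset` route and response fields are assumptions, so check the client docs in RESEARCH.md for the real interface:

```python
import os
import requests

base = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
# Assumed route and payload: a seeded reset returning the first observation.
obs = requests.post(f"{base}/reset", json={"seed": 42}).json()
print(obs.get("question"), obs.get("budget_remaining"))  # assumed field names
```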
What We Did Not Do (Yet)
No converged checkpoint. Environment, baselines, tests, and deployment are complete; convergence is the next milestone.
No Env Factory training logs yet. We could not complete a reliable Env Factory integration for repeated multi-tool calls in time for submission. This is planned follow-up work as the Env Factory path and available post-training model series mature.
Fixed cost model. Real API costs are dynamic. Our fixed costs are a useful simplification.
No human grading. All scoring is automated (EM + F1); human evaluation of answer quality is future work.
Single budget per episode. Per-question budgets or adaptive budgets are natural extensions.
Ceramic AI dependency. Live web search requires an API key. Fallback client enables offline training but loses real-world retrieval quality.
💰 Can an LLM learn that a calculator is worth 0.1 and an LLM call is worth 2.0 — and route accordingly?
Cost-aware tool selection under budget pressure is the test.
Conclusion
ToolOrchestratorEnv reframes a practical engineering problem — "LLM agents waste money on tools" — as a verifiable RL task. Six tools with a 20:1 cost ratio, four domains requiring fundamentally different tool strategies, a shared episode budget, and a decomposed reward that penalizes every tool call while rewarding correct-and-frugal commits. The completed contribution is the environment: a deployed OpenEnv-native benchmark, explicit cost model, reward implementation, baselines, tests, and submission artifact. The research question — can a GRPO-trained LLM beat the domain oracle baseline on cost-adjusted score? — is now ready to evaluate cleanly. Convergence is the next milestone, not a current claim.
References
Weitzman, M. "Optimal Search for the Best Alternative." Econometrica, 1979.
Yao, S., et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
Schick, T., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023.
Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024.
Guo, D., et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
Wu, Y., et al. "CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning." ICCV 2025.
Cheng, M., et al. "Agent-R1: Training Powerful LLM Agents with End-to-End RL." arXiv:2511.14460, 2025.
Yang, Z., et al. "HotpotQA: A Dataset for Diverse, Explainable Multi-hop QA." EMNLP 2018.
Hendrycks, D., et al. "Measuring Mathematical Problem Solving with the MATH Dataset." NeurIPS 2021.
Rein, D., et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022, 2023.
Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.
Schulman, J., et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.