Multi-Agent RL Environment  ·  Always Live

The AI Agent That Pays Invoices and Catches Fraud

AP Commander trains large language models to navigate enterprise Accounts Payable workflows with the rigor a CFO would require. 24 tasks. 2 AI agents. Rewards with no shortcuts.

24 Tasks · 2 AI Agents · 16-step episodes · 5-component reward

From Invoice to Reward Signal

Every episode is a complete enterprise financial scenario — generated fresh from a seeded RNG. No static dataset. The agent must reason, not memorise.
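A minimal sketch of what seed-driven generation could look like. The `Scenario` fields, the vendor names, and the 50% overcharge rate are hypothetical placeholders, not the environment's actual generator; the only point illustrated is that the same seed always reproduces the same scenario.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    # Hypothetical fields, chosen only for illustration.
    vendor: str
    quantity: int
    po_unit_price: float
    invoice_unit_price: float


def generate_scenario(seed: int) -> Scenario:
    """Build one scenario from a private RNG so the result depends only on the seed."""
    rng = random.Random(seed)  # does not touch the global random state
    vendor = rng.choice(["Vendor A", "Vendor B", "Vendor C"])
    quantity = rng.randint(1, 50)
    po_unit_price = round(rng.uniform(50.0, 500.0), 2)
    # Assumption: roughly half of the scenarios carry a 10% overcharge.
    overcharged = rng.random() < 0.5
    invoice_unit_price = round(po_unit_price * 1.10, 2) if overcharged else po_unit_price
    return Scenario(vendor, quantity, po_unit_price, invoice_unit_price)


# Same seed, same scenario: episodes are reproducible without a static dataset.
assert generate_scenario(42) == generate_scenario(42)
```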

📄
Step 01

Invoice Arrives

The agent receives a structured observation: vendor invoice, purchase orders, goods receipts, paid ledger, and company policy — all randomised per seed.

Invoice PO GRN Policy
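One way to picture the structured observation is as a set of plain data classes. The class and field names below are assumptions made for illustration, not the environment's schema; the values are taken from the demo episode shown further down this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float

    @property
    def total(self) -> float:
        return round(self.quantity * self.unit_price, 2)


@dataclass
class Invoice:
    invoice_id: str
    vendor: str
    lines: List[LineItem]
    freight: float = 0.0

    @property
    def total(self) -> float:
        return round(sum(line.total for line in self.lines) + self.freight, 2)


@dataclass
class PurchaseOrder:
    po_id: str
    vendor: str
    status: str
    lines: List[LineItem]


@dataclass
class GoodsReceipt:
    grn_id: str
    description: str
    quantity_received: int


@dataclass
class Observation:
    # Hypothetical container mirroring the five inputs named above.
    invoice: Invoice
    purchase_orders: List[PurchaseOrder]
    goods_receipts: List[GoodsReceipt]
    paid_ledger: List[str] = field(default_factory=list)
    policy: str = ""


demo = Observation(
    invoice=Invoice(
        "INV-2024-7831",
        "TechProcure Global",
        [LineItem("ThinkPad L15 Gen-4", 12, 385.00)],
        freight=42.00,
    ),
    purchase_orders=[
        PurchaseOrder("PO-2847", "TechProcure Global", "OPEN",
                      [LineItem("ThinkPad L15 Gen-4", 12, 350.00)])
    ],
    goods_receipts=[GoodsReceipt("GRN-1094", "ThinkPad L15 Gen-4", 12)],
    policy="Unit prices must match agreed PO price.",
)
print(demo.invoice.total)  # 4662.0
```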
🔍
Step 02

Agent Investigates

The agent reasons over multiple steps, up to 16 per episode. Intermediate actions trigger simulated workplace actors: a vendor response, a manager escalation, a compliance review.

QUERY_VENDOR ESCALATE HOLD
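The step logic described above can be sketched as a small dispatch table. This is an illustration under stated assumptions: the QUERY_VENDOR and ESCALATE pairings follow the demo episode, the HOLD-to-compliance pairing is a guess, and treating every other decision as terminal is a simplification.

```python
from typing import Optional, Tuple

# Which simulated actor answers each intermediate action.
INTERMEDIATE_ACTORS = {
    "QUERY_VENDOR": "VENDOR",
    "ESCALATE": "MANAGER",
    "HOLD": "COMPLIANCE",  # assumption: the page does not state this pairing
}

MAX_STEPS = 16  # episode length quoted on this page


def step(decision: str, step_index: int) -> Tuple[Optional[str], bool]:
    """Return (responding_actor, episode_done) for a zero-based step index."""
    if decision in INTERMEDIATE_ACTORS:
        # An intermediate action keeps the episode open unless the budget is spent.
        done = step_index + 1 >= MAX_STEPS
        return INTERMEDIATE_ACTORS[decision], done
    # Assumption: any other decision (for example REJECT) ends the episode.
    return None, True


print(step("QUERY_VENDOR", 0))  # ('VENDOR', False)
print(step("REJECT", 2))        # (None, True)
```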
💯
Step 03

Reward Scored

Five-component partial credit: decision accuracy, amount within 1%, reason code, explanation quality, and process bonus for the correct investigative sequence.

Partial credit No shortcuts
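A rough sketch of how partial credit of this kind might be combined. The function names and the equal weights in the usage example are placeholders; the page does not publish the actual weighting, so the number this produces is not the environment's score.

```python
from typing import Dict


def amount_score(approved: float, expected: float, tolerance: float = 0.01) -> float:
    """Full credit when the approved amount is within 1% of the expected amount."""
    if expected == 0:
        return 1.0 if approved == 0 else 0.0
    return 1.0 if abs(approved - expected) / abs(expected) <= tolerance else 0.0


def terminal_reward(components: Dict[str, float],
                    weights: Dict[str, float],
                    process_bonus: float) -> float:
    """Weighted sum of the four graded components plus the process bonus."""
    weighted = sum(weights[name] * components[name] for name in weights)
    return weighted + process_bonus


# Placeholder weights (assumption): equal weighting across the four components.
weights = {"decision": 0.25, "amount": 0.25, "reason_code": 0.25, "explanation": 0.25}
components = {
    "decision": 1.0,
    "amount": amount_score(approved=0.0, expected=0.0),
    "reason_code": 1.0,
    "explanation": 0.9,
}
reward = terminal_reward(components, weights, process_bonus=0.10)
```

Keeping each component as its own score makes partial credit explicit: a correct decision with a weak explanation still earns most of the reward, while the process bonus is only added when the investigative sequence was correct.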

See the Agent Reason

A real episode from the environment — same invoice, same policy, same reward function the model trains against. This is what a well-trained agent learns to do.

long_invoice_dispute · Seed 42 · max 12 steps · difficulty: long-horizon · 3 actions taken
Invoice
INV-2024-7831
TechProcure Global
$4,662.00
Item                 Qty   Unit      Total
ThinkPad L15 Gen-4   12    $385.00   $4,620.00
Freight                              $42.00
Purchase Order
PO-2847 OPEN TechProcure Global
ThinkPad L15 Gen-4 · qty 12 @ $350.00
⚠ Invoice price $385.00 exceeds agreed $350.00 by 10%
Goods Receipt
GRN-1094
ThinkPad L15 Gen-4 · received 12
Company Policy
Unit prices must match agreed PO price. Deviation >1.0% must be queried and rejected until corrected invoice received.
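The policy check for this invoice reduces to a short calculation. The function names below are illustrative, not part of the environment; the prices and the 1.0% threshold come from the invoice, purchase order, and policy above.

```python
def price_deviation_pct(invoice_unit_price: float, po_unit_price: float) -> float:
    """Signed deviation of the invoiced unit price from the agreed PO price, in percent."""
    return (invoice_unit_price - po_unit_price) / po_unit_price * 100.0


def must_query_and_reject(invoice_unit_price: float,
                          po_unit_price: float,
                          threshold_pct: float = 1.0) -> bool:
    """True when the deviation exceeds the policy threshold in either direction."""
    return abs(price_deviation_pct(invoice_unit_price, po_unit_price)) > threshold_pct


deviation = price_deviation_pct(385.00, 350.00)
print(f"{deviation:.1f}%")                    # 10.0%
print(must_query_and_reject(385.00, 350.00))  # True
```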
Step 1 of 3 QUERY_VENDOR
{"decision": "QUERY_VENDOR", "approved_amount": 0.00, "reason_code": "PENDING_CLARIFICATION", "explanation": "Invoice unit price $385.00 exceeds PO agreed price $350.00 by 10%. Querying vendor to document discrepancy before final decision."}
[ VENDOR ] Vendor acknowledges pricing error. Corrected invoice at $350.00 will be reissued.
Step 2 of 3 ESCALATE
{"decision": "ESCALATE", "approved_amount": 0.00, "reason_code": "MANAGER_REVIEW", "explanation": "Vendor acknowledged the $385.00 error. Escalating to Finance Manager to confirm rejection and request corrected invoice."}
[ MANAGER ] Finance Manager confirmed: reject original, request reissued invoice at agreed price $350.00.
Step 3 — Terminal REJECT ✓
"decision": "REJECT", "approved_amount": 0.00, "reason_code": "PRICE_DISCREPANCY", "explanation": "Invoice price $385.00 vs PO agreed $350.00 — 10% deviation exceeds 1.0% threshold. Vendor confirmed error; corrected invoice required per Policy Rule 4."
Accumulated reward: 0.901
Full investigation + correct terminal decision, vs ~0.40 for skipping straight to REJECT
Reward Breakdown — Why 0.901
Component                                       Score
Decision accuracy                               1.00
Amount accuracy                                 1.00
Reason code                                     1.00
Explanation quality                             0.90
Process bonus (correct intermediate sequence)   +0.10
Shortcut REJECT, no investigation: ~0.40
γ^0 × 0.01 + γ^1 × 0.01 + γ^2 × terminal  =  0.01 + 0.9 × 0.01 + 0.81 × terminal  =  0.901  (γ = 0.9)
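The discounted sum above can be checked with a few lines of code. This sketch assumes a discount factor of 0.9 (implied by the 0.9 and 0.81 coefficients) and a 0.01 reward for each of the two intermediate steps; the function names are illustrative. Solving for the terminal term gives roughly 1.089, though the page does not state how the five components combine into that value.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of gamma**t * reward_t over the steps of one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def implied_terminal(total: float,
                     intermediate_rewards: Sequence[float],
                     gamma: float = 0.9) -> float:
    """Back out the terminal reward that yields a given accumulated total."""
    partial = discounted_return(intermediate_rewards, gamma)
    return (total - partial) / gamma ** len(intermediate_rewards)


terminal = implied_terminal(0.901, [0.01, 0.01])
print(round(terminal, 3))                                       # 1.089
print(round(discounted_return([0.01, 0.01, terminal]), 3))      # 0.901
```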
The reward teaches the right process — not just the right answer. Same terminal decision without the investigation sequence scores ~0.40.