AP Commander — Accounts Payable AI Agent

How It Works

From Invoice to Reward Signal

Every episode is a complete enterprise financial scenario — generated fresh from a seeded RNG. No static dataset. The agent must reason, not memorise.

📄

Step 01

Invoice Arrives

The agent receives a structured observation: vendor invoice, purchase orders, goods receipts, paid ledger, and company policy — all randomised per seed.

Invoice PO GRN Policy

🔍

Step 02

Agent Investigates

Multi-step reasoning up to 16 steps. Intermediate actions trigger simulated workplace actors — a vendor response, a manager escalation, a compliance review.

QUERY_VENDOR ESCALATE HOLD

💯

Step 03

Reward Scored

Five-component partial credit: decision accuracy, amount within 1%, reason code, explanation quality, and process bonus for the correct investigative sequence.

Partial credit No shortcuts

Episode Walkthrough

See the Agent Reason

A real episode from the environment — same invoice, same policy, same reward function the model trains against. This is what a well-trained agent learns to do.

long_invoice_dispute

Seed 42

max 12 steps

difficulty: long-horizon

3 actions taken

Invoice

INV-2024-7831

TechProcure Global

$4,662.00

Item	Qty	Unit	Total
ThinkPad L15 Gen-4	12	$385.00	$4,620.00
Freight			$42.00

Purchase Order

PO-2847 OPEN TechProcure Global

ThinkPad L15 Gen-4 · qty 12 @ $350.00

⚠ Invoice price $385.00 exceeds agreed $350.00 by 10%

Goods Receipt

GRN-1094

ThinkPad L15 Gen-4 · received 12

Company Policy

            Unit prices must match agreed PO price.

            Deviation >1.0% must be queried

            and rejected until corrected invoice received.

Step 1 of 3 QUERY_VENDOR

"decision": "QUERY_VENDOR", "approved_amount": 0.00, "reason_code": "PENDING_CLARIFICATION", "explanation": "Invoice unit price $385.00 exceeds PO agreed price $350.00 by 10%. Querying vendor to document discrepancy before final decision."

[ VENDOR ] Vendor acknowledges pricing error. Corrected invoice at $350.00 will be reissued.

↓

Step 2 of 3 ESCALATE

"decision": "ESCALATE", "approved_amount": 0.00, "reason_code": "MANAGER_REVIEW", "explanation": "Vendor acknowledged the $385.00 error. Escalating to Finance Manager to confirm rejection and request corrected invoice."

[ MANAGER ] Finance Manager confirmed: reject original, request reissued invoice at agreed price $350.00.

↓

Step 3 — Terminal REJECT ✓

"decision": "REJECT", "approved_amount": 0.00, "reason_code": "PRICE_DISCREPANCY", "explanation": "Invoice price $385.00 vs PO agreed $350.00 — 10% deviation exceeds 1.0% threshold. Vendor confirmed error; corrected invoice required per Policy Rule 4."

0.901

Accumulated reward
Full investigation + correct terminal decision
vs ~0.40 for skipping straight to REJECT

Reward Breakdown — Why 0.901

Decision accuracy1.00

Amount accuracy1.00

Reason code1.00

Explanation quality0.90

Process bonus (correct intermediate sequence)+0.10

Shortcut REJECT — no investigation~0.40

γ⁰×0.01 + γ¹×0.01 + γ²×terminal = 0.01 + 0.9×0.01 + 0.81×terminal = 0.901
The reward teaches the right process — not just the right answer. Same terminal decision without the investigation sequence scores ~0.40.

The AI Agent ThatPays Invoices— and Catches Fraud

From Invoice to Reward Signal

Invoice Arrives

Agent Investigates

Reward Scored

See the Agent Reason

The AI Agent That
Pays Invoices
— and Catches Fraud