Executive summary
If you adopt GenAI broadly, your cost structure changes. What looks like a “software seat” problem becomes a variable-cost problem: requests drive tokens, tokens drive compute, compute drives dollars.
That’s not bad — it just needs the same discipline you apply to cloud spend, payments, or customer support: unit economics, guardrails, and monitoring.
This post gives you a CFO/COO-friendly model you can drop into a spreadsheet, plus the three biggest margin-protection levers:
- Routing (send each task to the cheapest model that meets quality)
- Caching (don’t pay twice for the same work)
- Caps & quotas (budgetable AI, by team and workflow)
The core model (requests → tokens → $)
At the simplest level:
- Requests: How many AI calls you make (per day/week/month)
- Tokens: How “big” each call is (input + output)
- Cost per token: What the provider charges for the model you used
A basic cost equation:
Monthly AI Cost = Requests × Avg Tokens/Request × Cost/Token
In practice, you’ll model by workflow, because workflows have different volumes and tolerance for latency/quality.
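If you want to sanity-check the math before building the spreadsheet, the equation drops straight into a few lines of Python. The prices below are illustrative placeholders, not any provider's actual rates:

```python
def monthly_cost(requests: int, avg_input_tokens: int, avg_output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Monthly AI Cost = Requests x Avg Tokens/Request x Cost/Token,
    split by input vs. output tokens since providers price them differently."""
    input_cost = requests * avg_input_tokens / 1000 * input_price_per_1k
    output_cost = requests * avg_output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost

# Example: 50,000 requests/month, 1,200 input + 400 output tokens per request,
# at $0.50 / $1.50 per 1K tokens (illustrative prices only).
cost = monthly_cost(50_000, 1_200, 400, 0.50, 1.50)
print(f"${cost:,.0f}/month")  # $60,000/month
```

Separating input and output prices matters: most providers charge several times more per output token than per input token, so a long-context, short-answer workflow prices very differently from a short-prompt, long-draft one.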
A practical worksheet structure
Create a table with rows as workflows (examples below) and columns:
- Workflow name
- Team owner
- Requests/month
- Avg input tokens
- Avg output tokens
- Total tokens/request
- Model used (or route mix)
- Effective cost/token
- Monthly cost
- Success metric (time saved, cycle time, error rate)
Example workflows to start with:
- Sales: inbound lead qualification + draft reply
- Finance: vendor invoice triage + coding suggestion
- Ops: weekly KPI narrative + anomaly explanation
- CS: ticket categorization + suggested response
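Before this lives in a spreadsheet, you can prototype the same table as rows in a script. The volumes and the blended cost per 1K tokens below are placeholders:

```python
# Each row mirrors the worksheet columns; all numbers are made up for illustration.
workflows = [
    {"name": "Sales: lead qualification + draft reply", "team": "Sales",
     "requests": 20_000, "in_tok": 800, "out_tok": 300, "cost_per_1k": 1.0},
    {"name": "Finance: invoice triage + coding", "team": "Finance",
     "requests": 8_000, "in_tok": 1_500, "out_tok": 200, "cost_per_1k": 1.0},
]

for w in workflows:
    # Derived columns: total tokens/request and monthly cost.
    w["total_tok"] = w["in_tok"] + w["out_tok"]
    w["monthly_cost"] = w["requests"] * w["total_tok"] / 1000 * w["cost_per_1k"]

total = sum(w["monthly_cost"] for w in workflows)
print(f"Total: ${total:,.0f}/month")
```

Once the rows exist, adding the success-metric column (hours saved, cycle time, error rate) next to monthly cost is what turns this from a cost report into a unit-economics view.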
Why averages are dangerous (and how to account for variance)
Most teams underestimate variance:
- Some requests are tiny (“rewrite this sentence”).
- Some requests are huge (long context + long output).
- Some requests trigger tool calls or multi-step agent loops.
Two fixes:
- Track p50 and p95 tokens/request per workflow (not just average).
- Add a “loop multiplier” for agentic workflows:
Effective tokens = tokens/request × average steps/run
If an “agent” does 6 model calls per run, your cost is 6× even if each call is modest.
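Both fixes are a few lines of code once you log token counts per request. A sketch using only Python's standard library:

```python
import statistics

def token_percentiles(samples: list[float]) -> tuple[float, float]:
    """p50 and p95 tokens/request from observed per-request token counts."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return statistics.median(samples), cuts[94]  # cuts[94] is the 95th percentile

def effective_tokens(tokens_per_call: float, avg_steps_per_run: float) -> float:
    """Loop multiplier: an agent that makes N model calls per run
    costs N x a single call, even if each call is modest."""
    return tokens_per_call * avg_steps_per_run

# A 1,000-token call inside a 6-step agent loop bills like a 6,000-token request.
print(effective_tokens(1_000, 6))
```

Budgeting to p95 rather than the average is what keeps the occasional long-context, multi-step run from blowing through a cap you sized for the typical request.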
Margin framing: treat GenAI like cloud + labor, not like SaaS
CFO/COO framing that holds up in a board meeting:
- GenAI replaces or augments labor hours.
- It also consumes variable compute.
- Your goal is to ensure the compute line grows slower than the value created.
A useful metric:
$/hour saved = Monthly AI cost / Hours saved
Then compare that to the fully loaded cost of the role(s) impacted, and adjust for quality risk.
If the AI costs $12K/month and saves 600 hours/month, that’s $20/hour saved. If the impacted time is worth $80–$150/hour fully loaded, you have room, but only if quality and control are real.
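As a sanity check, the worked example above in code:

```python
def dollars_per_hour_saved(monthly_ai_cost: float, hours_saved: float) -> float:
    """$/hour saved = Monthly AI cost / Hours saved."""
    return monthly_ai_cost / hours_saved

rate = dollars_per_hour_saved(12_000, 600)
print(f"${rate:.0f}/hour saved")  # $20/hour saved
```

Comparing this rate against the fully loaded cost of the impacted roles gives you the headroom number a board will actually ask for.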
The three biggest levers to protect margin
1) Routing: cheapest model that meets quality
Most companies overspend by sending everything to the best model. Instead, implement a routing policy:
- “Good enough” tasks → cheaper/faster model
- High-stakes tasks → best model
- Sensitive tasks → restricted model/tooling
A simple three-tier policy:
- Tier A (low risk): drafting, summarizing, formatting
- Tier B (medium risk): analysis with citations, structured extraction
- Tier C (high risk): customer-facing commitments, finance approvals, legal language
Routing rules should be explicit and auditable.
Result: you often cut costs 30–70% with minimal quality loss.
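A routing policy can literally be a lookup table plus one rule, which is what makes it explicit and auditable. The model names below are placeholders, not real provider models:

```python
# Tier -> model mapping; substitute your provider's actual model names.
ROUTES = {
    "A": "small-fast-model",  # low risk: drafting, summarizing, formatting
    "B": "mid-tier-model",    # medium risk: analysis w/ citations, extraction
    "C": "frontier-model",    # high risk: commitments, finance, legal language
}

def route(task_tier: str, sensitive: bool = False) -> str:
    """Pick the cheapest model that meets quality for the task's tier.
    Sensitive tasks are pinned to a restricted deployment regardless of tier."""
    if sensitive:
        return "restricted-model"
    return ROUTES[task_tier]
```

Because the policy is data (a dict) rather than scattered if-statements, changing a route is a reviewable one-line diff, which is exactly what change control needs.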
2) Caching: don’t pay twice
Two types of caching matter:
- Response caching: identical prompt + identical context → reuse output
- Embedding / retrieval caching: reuse retrieved context, don’t re-embed unchanged docs
Where caching pays immediately:
- Knowledge-base Q&A
- Policy questions (expense policy, refund policy)
- Standard operating procedures
- Repeatable internal reporting narratives
Caching turns “variable cost” into “mostly fixed” for repeat queries.
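Response caching can start as a hash-keyed dictionary. The sketch below assumes exact-match reuse (identical prompt and identical context) and omits the TTL and document-change invalidation you would want in production:

```python
import hashlib

class ResponseCache:
    """Identical prompt + identical context -> reuse the stored output."""

    def __init__(self):
        self._store = {}
        self.hits = 0  # track hit rate to prove the savings

    def _key(self, prompt: str, context: str) -> str:
        return hashlib.sha256((prompt + "\x00" + context).encode()).hexdigest()

    def get_or_call(self, prompt: str, context: str, call_model):
        key = self._key(prompt, context)
        if key in self._store:
            self.hits += 1          # cache hit: no tokens billed
            return self._store[key]
        result = call_model(prompt, context)  # cache miss: pay once, store
        self._store[key] = result
        return result
```

For the high-repeat workflows listed above (policy Q&A, SOPs), even this naive exact-match cache converts a chunk of variable spend into a one-time cost per unique question.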
3) Caps & quotas: budgetable AI
If GenAI is going to be a real operating capability, it must be budgetable. Set caps at multiple levels:
- Per user (daily token budgets)
- Per team (monthly budgets)
- Per workflow (hard ceilings)
- Per tool (e.g., max calls to the premium model)
Add “graceful degradation”:
- When a cap is hit, route to cheaper model
- Or require approval for premium model
- Or pause non-critical workflows
This avoids the classic failure mode: AI adoption succeeds, and your cost spikes with it.
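A minimal sketch of a cap with graceful degradation, assuming a single monthly token budget per team (in practice you would layer per-user and per-workflow caps the same way):

```python
class TeamBudget:
    """Monthly token budget that degrades to a cheaper model instead of failing."""

    def __init__(self, monthly_token_cap: int):
        self.cap = monthly_token_cap
        self.used = 0

    def choose_model(self, estimated_tokens: int, preferred: str, fallback: str) -> str:
        if self.used + estimated_tokens > self.cap:
            return fallback  # cap hit: route to the cheaper model, don't block work
        self.used += estimated_tokens
        return preferred
```

Fallback usage would be metered against its own (much cheaper) budget; the point is that hitting a cap changes the route, not whether the team can work.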
Governance: the minimum viable controls
You don’t need a bureaucracy, but you do need a control plane. Minimum viable governance for CFO/COO comfort:
- Usage telemetry by workflow/team
- Cost attribution (who/what is spending)
- Quality checks (sampling + clear failure categories)
- Audit logs for agentic actions
- Change control for model upgrades and prompt changes
- Kill switch for runaway workflows
If you can’t answer “what changed?” when cost or quality shifts, you don’t have control.
A rollout pattern that reduces risk
A practical pattern for scaling without surprises:
- Start with 3–5 workflows that are repeatable and measurable.
- Instrument requests/tokens/cost from day 1.
- Set caps early (even if generous).
- Implement routing before you scale.
- Add caching once you see repeat patterns.
- Review weekly for 4–6 weeks; then move to monthly cadence.
What to do next (fast)
If you want this to be real (not a demo), do two things this week:
- Build a workflow cost table (10 rows max) with requests/tokens/cost.
- Decide your authority ladder for AI actions:
  - Read-only
  - Draft-only
  - Execute with approval
  - Execute with caps
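The authority ladder is easy to make explicit in code, which keeps it enforceable rather than aspirational. A sketch:

```python
from enum import Enum

class Authority(Enum):
    READ_ONLY = 1              # can retrieve and summarize only
    DRAFT_ONLY = 2             # can draft; a human sends or files it
    EXECUTE_WITH_APPROVAL = 3  # can act after explicit human sign-off
    EXECUTE_WITH_CAPS = 4      # can act autonomously within hard caps

def allowed(granted: Authority, required: Authority) -> bool:
    """An AI action is permitted only if the granted level covers what it needs."""
    return granted.value >= required.value
```

Assigning each workflow a rung on this ladder, and logging every check, is what lets you raise autonomy gradually as quality data comes in.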
That’s enough to unlock meaningful adoption while keeping margins protected.
If your team wants help building the cost model and the guardrails, CDS can deliver a lightweight “AI unit economics + governance” sprint that results in:
- a unit-econ dashboard,
- routing + quota policies,
- and 1–2 high-ROI workflows running in production.