Executive summary
If you adopt GenAI broadly, your cost structure changes. What looks like a “software seat” problem becomes a variable-cost problem: requests drive tokens, tokens drive compute, compute drives dollars.
That’s not bad — it just needs the same discipline you apply to cloud spend, payments, or customer support: unit economics, guardrails, and monitoring.
This post gives you a CFO/COO-friendly model you can drop into a spreadsheet, plus the three biggest margin-protection levers:
- Routing (send each task to the cheapest model that meets quality)
- Caching (don’t pay twice for the same work)
- Caps & quotas (budgetable AI, by team and workflow)
The core model (requests → tokens → $)
At the simplest level:
- Requests: How many AI calls you make (per day/week/month)
- Tokens: How “big” each call is (input + output)
- Cost per token: What the provider charges for the model you used
A basic cost equation:
Monthly AI Cost = Requests × Avg Tokens/Request × Cost/Token
In practice, you’ll model by workflow, because workflows have different volumes and tolerance for latency/quality.
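If you want to sanity-check the math before building the spreadsheet, the equation drops straight into a few lines of Python. The prices below are illustrative placeholders, not any provider's actual rates:

```python
def monthly_cost(requests: int, avg_input_tokens: int, avg_output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Monthly AI Cost = Requests x Avg Tokens/Request x Cost/Token,
    split by input vs. output tokens since providers price them differently."""
    input_cost = requests * avg_input_tokens / 1000 * input_price_per_1k
    output_cost = requests * avg_output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost

# Example: 50,000 requests/month, 1,200 input + 400 output tokens per request,
# at $0.50 / $1.50 per 1K tokens (illustrative prices only).
cost = monthly_cost(50_000, 1_200, 400, 0.50, 1.50)
print(f"${cost:,.0f}/month")  # $60,000/month
```

Separating input and output prices matters: most providers charge several times more per output token than per input token, so a long-context, short-answer workflow prices very differently from a short-prompt, long-draft one.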
A practical worksheet structure
Create a table with rows as workflows (examples below) and columns:
- Workflow name
- Team owner
- Requests/month
- Avg input tokens
- Avg output tokens
- Total tokens/request
- Model used (or route mix)
- Effective cost/token
- Monthly cost
- Success metric (time saved, cycle time, error rate)
Example workflows to start with:
- Sales: inbound lead qualification + draft reply
- Finance: vendor invoice triage + coding suggestion
- Ops: weekly KPI narrative + anomaly explanation
- CS: ticket categorization + suggested response
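Before this lives in a spreadsheet, you can prototype the same table as rows in a script. The volumes and the blended cost per 1K tokens below are placeholders:

```python
# Each row mirrors the worksheet columns; all numbers are made up for illustration.
workflows = [
    {"name": "Sales: lead qualification + draft reply", "team": "Sales",
     "requests": 20_000, "in_tok": 800, "out_tok": 300, "cost_per_1k": 1.0},
    {"name": "Finance: invoice triage + coding", "team": "Finance",
     "requests": 8_000, "in_tok": 1_500, "out_tok": 200, "cost_per_1k": 1.0},
]

for w in workflows:
    # Derived columns: total tokens/request and monthly cost.
    w["total_tok"] = w["in_tok"] + w["out_tok"]
    w["monthly_cost"] = w["requests"] * w["total_tok"] / 1000 * w["cost_per_1k"]

total = sum(w["monthly_cost"] for w in workflows)
print(f"Total: ${total:,.0f}/month")
```

Once the rows exist, adding the success-metric column (hours saved, cycle time, error rate) next to monthly cost is what turns this from a cost report into a unit-economics view.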
Why averages are dangerous (and how to account for variance)
Most teams underestimate variance:
- Some requests are tiny (“rewrite this sentence”).
- Some requests are huge (long context + long output).
- Some requests trigger tool calls or multi-step agent loops.
Two fixes:
- Track p50 and p95 tokens/request per workflow (not just average).
- Add a “loop multiplier” for agentic workflows:
Effective tokens = tokens/request × average steps/run
If an “agent” does 6 model calls per run, your cost is 6× even if each call is modest.
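Both fixes are a few lines of code once you log token counts per request. A sketch using only Python's standard library:

```python
import statistics

def token_percentiles(samples: list[float]) -> tuple[float, float]:
    """p50 and p95 tokens/request from observed per-request token counts."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return statistics.median(samples), cuts[94]  # cuts[94] is the 95th percentile

def effective_tokens(tokens_per_call: float, avg_steps_per_run: float) -> float:
    """Loop multiplier: an agent that makes N model calls per run
    costs N x a single call, even if each call is modest."""
    return tokens_per_call * avg_steps_per_run

# A 1,000-token call inside a 6-step agent loop bills like a 6,000-token request.
print(effective_tokens(1_000, 6))
```

Budgeting to p95 rather than the average is what keeps the occasional long-context, multi-step run from blowing through a cap you sized for the typical request.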
Margin framing: treat GenAI like cloud + labor, not like SaaS
CFO/COO framing that holds up in a board meeting:
- GenAI replaces or augments labor hours.
- It also consumes variable compute.
- Your goal is to ensure the compute line grows slower than the value created.
A useful metric:
$/hour saved = Monthly AI cost / Hours saved
Then compare that to the fully loaded cost of the role(s) impacted, and adjust for quality risk.
If the AI costs $12K/month and saves 600 hours/month, that’s $20/hour saved. If the impacted time is worth $80–$150/hour fully loaded, you have room, but only if quality and control are real.
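As a sanity check, the worked example above in code:

```python
def dollars_per_hour_saved(monthly_ai_cost: float, hours_saved: float) -> float:
    """$/hour saved = Monthly AI cost / Hours saved."""
    return monthly_ai_cost / hours_saved

rate = dollars_per_hour_saved(12_000, 600)
print(f"${rate:.0f}/hour saved")  # $20/hour saved
```

Comparing this rate against the fully loaded cost of the impacted roles gives you the headroom number a board will actually ask for.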
The three biggest levers to protect margin
1) Routing: cheapest model that meets quality
Most companies overspend by sending everything to the best model. Instead, implement a routing policy:
- “Good enough” tasks → cheaper/faster model
- High-stakes tasks → best model
- Sensitive tasks → restricted model/tooling
A simple three-tier policy:
- Tier A (low risk): drafting, summarizing, formatting
- Tier B (medium risk): analysis with citations, structured extraction
- Tier C (high risk): customer-facing commitments, finance approvals, legal language
Routing rules should be explicit and auditable.
Result: you often cut costs 30–70% with minimal quality loss.
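A routing policy can literally be a lookup table plus one rule, which is what makes it explicit and auditable. The model names below are placeholders, not real provider models:

```python
# Tier -> model mapping; substitute your provider's actual model names.
ROUTES = {
    "A": "small-fast-model",  # low risk: drafting, summarizing, formatting
    "B": "mid-tier-model",    # medium risk: analysis w/ citations, extraction
    "C": "frontier-model",    # high risk: commitments, finance, legal language
}

def route(task_tier: str, sensitive: bool = False) -> str:
    """Pick the cheapest model that meets quality for the task's tier.
    Sensitive tasks are pinned to a restricted deployment regardless of tier."""
    if sensitive:
        return "restricted-model"
    return ROUTES[task_tier]
```

Because the policy is data (a dict) rather than scattered if-statements, changing a route is a reviewable one-line diff, which is exactly what change control needs.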
2) Caching: don’t pay twice
Two types of caching matter:
- Response caching: identical prompt + identical context → reuse output
- Embedding / retrieval caching: reuse retrieved context, don’t re-embed unchanged docs
Where caching pays immediately:
- Knowledge-base Q&A
- Policy questions (expense policy, refund policy)
- Standard operating procedures
- Repeatable internal reporting narratives
Caching turns “variable cost” into “mostly fixed” for repeat queries.
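Response caching can start as a hash-keyed dictionary. The sketch below assumes exact-match reuse (identical prompt and identical context) and omits the TTL and document-change invalidation you would want in production:

```python
import hashlib

class ResponseCache:
    """Identical prompt + identical context -> reuse the stored output."""

    def __init__(self):
        self._store = {}
        self.hits = 0  # track hit rate to prove the savings

    def _key(self, prompt: str, context: str) -> str:
        return hashlib.sha256((prompt + "\x00" + context).encode()).hexdigest()

    def get_or_call(self, prompt: str, context: str, call_model):
        key = self._key(prompt, context)
        if key in self._store:
            self.hits += 1          # cache hit: no tokens billed
            return self._store[key]
        result = call_model(prompt, context)  # cache miss: pay once, store
        self._store[key] = result
        return result
```

For the high-repeat workflows listed above (policy Q&A, SOPs), even this naive exact-match cache converts a chunk of variable spend into a one-time cost per unique question.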
3) Caps & quotas: budgetable AI
If GenAI is going to be a real operating capability, it must be budgetable. Set caps at multiple levels:
- Per user (daily token budgets)
- Per team (monthly budgets)
- Per workflow (hard ceilings)
- Per tool (e.g., max calls to the premium model)
Add “graceful degradation”:
- When a cap is hit, route to cheaper model
- Or require approval for premium model
- Or pause non-critical workflows
This avoids the classic failure mode: AI adoption succeeds, and your cost spikes with it.
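A minimal sketch of a cap with graceful degradation, assuming a single monthly token budget per team (in practice you would layer per-user and per-workflow caps the same way):

```python
class TeamBudget:
    """Monthly token budget that degrades to a cheaper model instead of failing."""

    def __init__(self, monthly_token_cap: int):
        self.cap = monthly_token_cap
        self.used = 0

    def choose_model(self, estimated_tokens: int, preferred: str, fallback: str) -> str:
        if self.used + estimated_tokens > self.cap:
            return fallback  # cap hit: route to the cheaper model, don't block work
        self.used += estimated_tokens
        return preferred
```

Fallback usage would be metered against its own (much cheaper) budget; the point is that hitting a cap changes the route, not whether the team can work.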
Governance: the minimum viable controls
You don’t need a bureaucracy, but you do need a control plane. Minimum viable governance for CFO/COO comfort:
- Usage telemetry by workflow/team
- Cost attribution (who/what is spending)
- Quality checks (sampling + clear failure categories)
- Audit logs for agentic actions
- Change control for model upgrades and prompt changes
- Kill switch for runaway workflows
If you can’t answer “what changed?” when cost or quality shifts, you don’t have control.
A rollout pattern that reduces risk
A practical pattern for scaling without surprises:
- Start with 3–5 workflows that are repeatable and measurable.
- Instrument requests/tokens/cost from day 1.
- Set caps early (even if generous).
- Implement routing before you scale.
- Add caching once you see repeat patterns.
- Review weekly for 4–6 weeks; then move to monthly cadence.
What to do next (fast)
If you want this to be real (not a demo), do two things this week:
- Build a workflow cost table (10 rows max) with requests/tokens/cost.
- Decide your authority ladder for AI actions:
  - Read-only
  - Draft-only
  - Execute with approval
  - Execute with caps
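The authority ladder is easy to make explicit in code, which keeps it enforceable rather than aspirational. A sketch:

```python
from enum import Enum

class Authority(Enum):
    READ_ONLY = 1              # can retrieve and summarize only
    DRAFT_ONLY = 2             # can draft; a human sends or files it
    EXECUTE_WITH_APPROVAL = 3  # can act after explicit human sign-off
    EXECUTE_WITH_CAPS = 4      # can act autonomously within hard caps

def allowed(granted: Authority, required: Authority) -> bool:
    """An AI action is permitted only if the granted level covers what it needs."""
    return granted.value >= required.value
```

Assigning each workflow a rung on this ladder, and logging every check, is what lets you raise autonomy gradually as quality data comes in.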
That’s enough to unlock meaningful adoption while keeping margins protected.
If your team wants help building the cost model and the guardrails, CDS can deliver a lightweight “AI unit economics + governance” sprint that results in:
- a unit-econ dashboard,
- routing + quota policies,
- and 1–2 high-ROI workflows running in production.