A standard PRD fails for LLM features. Not because the format is wrong — because the questions are wrong.
Standard PRDs ask: what does the feature do? LLM PRDs need to ask: what does the feature do when the model is wrong? When the response is slow? When it's accurate but unhelpful? When it's confident but hallucinating?
These questions don't appear in most PRD templates. They should.
Why standard PRDs break for LLM features
Standard PRD sections handle most features fine. What's missing for LLM features:
- Model selection rationale
- Prompt architecture spec
- Output quality definition
- Failure mode taxonomy
- Evaluation framework
- Human review strategy
- Cost and latency constraints
- Feedback loop design
Every LLM product that ships without thinking through these becomes a production incident waiting to happen.
The 7 additional sections for LLM PRDs
Section 1: AI behavior specification
Standard acceptance criteria: "Feature returns a response when user submits a query."
LLM acceptance criteria are probabilistic and multi-dimensional:
Output Quality Criteria:
- Accuracy: Response is factually correct for X% of test cases
- Relevance: Response addresses user intent for X% of cases
- Format adherence: Response matches specified structure in X% of cases
- Hallucination rate: Response contains invented facts in X% of cases (target: <Y%)
- Latency: Response begins streaming within X seconds at p95
Section 2: Model selection rationale
Document why you chose the specific model. Forces clear thinking upfront and protects the decision in retrospect.
Model: [Name + version]
Selection rationale:
- [Capability reason]
- [Cost reason]
- [Latency reason]
- [Context window reason]
Alternatives considered: [Models evaluated and why rejected]
Switching triggers: [Conditions under which we'd migrate]
Section 3: Prompt architecture specification
The prompt is the product. It deserves the same rigor as any other core component.
System prompt role: [What role/persona the model assumes]
Context injection: [What dynamic context is injected per request and from where]
Output format: [Exact schema — JSON preferred]
Constraints: [What the model must not do]
Examples: [Few-shot examples if used]
Prompt version: [Versioning system for prompt iterations]
Section 4: Failure mode taxonomy
Four categories to define before shipping:
Type 1 — Silent failure: Response returned but incorrect/irrelevant
Detection: LLM-as-a-Judge evaluation, user feedback signals
Recovery: Regeneration option, flag for human review
Type 2 — Visible failure: API error, timeout, empty response
Detection: Error monitoring (Sentry, etc.)
Recovery: Retry logic, graceful error state, fallback content
Type 3 — Harmful output: False, harmful, or inappropriate content
Detection: Output filters, human review queue, user reporting
Recovery: Immediate removal, filter update, model instruction update
Type 4 — Cost overrun: Usage generates unexpected API costs
Detection: Cost monitoring dashboard, budget alerts
Recovery: Rate limiting, caching strategy, model downgrade
Section 5: Evaluation framework
Pre-launch:
- Test set: N examples covering key use cases, edge cases, failure modes
- Evaluation method: LLM-as-a-Judge / Human review / Automated checks
- Pass threshold: Minimum scores required across each quality dimension
Post-launch:
- User acceptance signal: Copy rate, regeneration rate, thumbs up/down
- Quality sampling: X% of responses reviewed weekly by [who]
- Regression testing: How model updates are tested before rollout
Section 6: Human review strategy
Async review (post-delivery):
- Low-stakes outputs, high-volume features
- Sample X% for quality monitoring
- Escalation triggers: signals that pull a response into review queue
Sync review (pre-delivery):
- High-stakes outputs (medical, legal, financial)
- Specific user segments (enterprise, regulated industries)
- New feature rollout (first 48 hours)
Section 7: Cost and latency constraints
Target cost per successful interaction: $X
Current estimated cost: $Y at [model] with [context length]
Cost levers: caching, model selection, context pruning
Target latency: Xms to first token, Yms to full response
Latency levers: streaming (default), response length limits, parallel requests
The complete LLM PRD template
# [Feature Name] — LLM Feature PRD
Status: [Draft / Review / Approved]
PM: [Name]
Last updated: [Date]
Prompt version: v[X]
---
## Problem
[1 paragraph: specific user pain, not generic]
## User Stories
- As a [user], I want to [action] so that [outcome]
- As a [user], when the AI response is wrong, I want to [action]
- As a [user], when the AI is slow, I want to [action]
## AI Behavior Specification
Input: [What the user provides]
Output: [What the model returns — include schema if structured]
Quality targets:
Accuracy: X%
Format adherence: X%
Hallucination rate: <X%
p95 latency: <Xs
## Model Specification
Model: [Name + version]
Rationale: [Why this model]
Alternatives considered: [Others and rejection reason]
## Prompt Architecture
Role: [System prompt role]
Context: [Dynamic context sources]
Format: [Output format spec]
Constraints: [Hard no's for the model]
## Failure Modes and Recovery
[Type 1–4 table: detection + recovery for each]
## Evaluation Framework
Pre-launch: [Test set size, evaluation method, pass threshold]
Post-launch: [Monitoring signals, sampling rate, review cadence]
## Human Review Policy
[When required and who does it]
## Cost and Latency Targets
Cost per interaction: $X target
p95 latency: Xms target
Optimization strategy: [caching, model selection, context pruning]
## Success Metrics
Primary: [Business metric]
AI quality: [Output quality metric]
Efficiency: [Cost/latency metric]
User signal: [Acceptance rate, regeneration rate]
## Launch Criteria
- [ ] Test set pass threshold met
- [ ] Failure mode handling implemented
- [ ] Human review queue operational
- [ ] Cost monitoring dashboard live
- [ ] Rollback plan documented
The meta-point
Writing this PRD forces thinking most teams skip. The failure mode taxonomy alone has prevented three production incidents I can point to specifically.
If you can't write the failure modes before launch, you don't understand the feature well enough to ship it.
Sujit Chankhore is an AI Product Manager and founder based in Pune, India.