Why 80% of AI projects don't ship
We see the same pattern again and again: the board demands "deploy AI". The team picks an appealing use case (say, product description generation), runs a demo, everyone applauds, and three months later the board asks "where's the revenue?" The pilot quietly dies.
Reasons we see most often:
- No baked-in metric: "deployed" ≠ "working".
- Use case not tied to revenue: impossible to defend to the CFO.
- No observability: nobody knows how the AI behaves in production.
- Agency delivers without co-delivery: the client team doesn't understand the system, so nobody owns it.
Processes with ROI in 8 weeks
Four candidates where ROI is measurable in 6–8 weeks and visible even to a skeptical CFO. Everything else is R&D, and it should be called R&D, not a pilot.
1. AI PDP personalization
"Similar products" and "frequently bought with" powered by embedding search. KPI: +8–15% PDP conversion, +5–10% AOV. Top e‑com sites get up to 31% of revenue from recommendations (McKinsey, 2024).
2. Semantic catalog search
Elasticsearch + a reranker. "Warm winter jacket" returns parkas even when the word "warm" appears in no SKU. KPI: zero-result rate → ~0, search CR +20%.
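The usual shape of this pipeline is retrieve-then-rerank: a cheap first stage pulls a wide candidate set, an expensive second stage reorders the short list. A sketch of the pattern with toy stand-in scorers (a real deployment would use Elasticsearch kNN or BM25 for stage 1 and a cross-encoder for stage 2):

```python
def retrieve(query: str, docs: list[str], k: int = 50) -> list[str]:
    """Stage 1: cheap recall — keep docs sharing any token with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.lower().split())), d) for d in docs]
    hits = [d for score, d in sorted(scored, reverse=True) if score > 0]
    return hits[:k] or docs[:k]

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Stage 2: expensive scoring on the short list only."""
    def score(doc: str) -> float:
        # Stand-in: token-overlap ratio; a real reranker scores
        # (query, doc) pairs with a cross-encoder model.
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / max(len(q), 1)
    return sorted(candidates, key=score, reverse=True)[:top_k]

docs = ["down parka winter", "cotton t-shirt summer", "winter boots"]
print(rerank("warm winter jacket", retrieve("warm winter jacket", docs)))
```

The point of the split is cost: the reranker only ever sees the top-k candidates, so its per-query latency stays bounded regardless of catalog size.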
3. Admin copilot for content managers
AI assistant inside the Bitrix admin panel: description generation, SEO copy, bulk product-card updates. KPI: −60% time to onboard an SKU (for context: getdx, 2025, measured 4.4 hrs/week saved per engineer by AI assistants).
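The bulk-generation loop behind such a copilot is simple; the value is in the prompt template and the human review step before publishing. A sketch where `call_llm` is a stub for any chat-completion client, and the template wording and field names are illustrative:

```python
# Prompt template is illustrative; tune tone and constraints per catalog.
TEMPLATE = (
    "Write a 2-sentence product description for: {name}. "
    "Attributes: {attrs}. Tone: neutral, no superlatives."
)

def call_llm(prompt: str) -> str:
    """Stub — replace with a real chat-completion client call."""
    return f"[draft for: {prompt[:40]}...]"

def draft_descriptions(skus: list[dict]) -> dict[str, str]:
    """Generate one draft per SKU; a human approves before publish."""
    drafts = {}
    for sku in skus:
        prompt = TEMPLATE.format(name=sku["name"], attrs=sku["attrs"])
        drafts[sku["id"]] = call_llm(prompt)
    return drafts
```

Drafts land in a review queue, not straight on the storefront; that keeps the "−60% time" gain without trading it for quality incidents.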
4. Chatbot with CRM escalation
RAG over FAQ and knowledge base + scenario‑driven escalation to amoCRM / Bitrix24. KPI: 60%+ self‑service, < 30s first reply.
Outcome contract: what to bake in
The point of an outcome contract is that both sides align on the metric, not the hours billed.
- Base fee (60–70%): covers infrastructure, discovery, and integration. Not tied to the KPI.
- Outcome bonus (30–40%): paid when the KPI is hit within a defined window (e.g. +10% PDP CR sustained over 30 days).
- Penalty clause: if the metric drops by more than 5%, the agency refunds a share of the base fee.
- Observability access: the client sees all logs, prompts, latency, and errors. Otherwise the metric can't be trusted.
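The payout arithmetic above is easy to make executable. A sketch with illustrative numbers (the 25% penalty share is an assumption, not a term from the text):

```python
def contract_payout(total: float, base_share: float,
                    kpi_hit: bool, metric_drop_pct: float = 0.0,
                    penalty_share: float = 0.25) -> float:
    """Base fee always paid; bonus only on KPI hit; penalty if the
    metric drops by more than 5%. penalty_share is an assumed term."""
    base = total * base_share
    bonus = total * (1 - base_share) if kpi_hit else 0.0
    penalty = base * penalty_share if metric_drop_pct > 5 else 0.0
    return base + bonus - penalty

print(contract_payout(100_000, 0.65, kpi_hit=True))              # full fee
print(contract_payout(100_000, 0.65, kpi_hit=False))             # base only
print(contract_payout(100_000, 0.65, False, metric_drop_pct=7))  # base minus penalty
```

Writing the formula down before signing forces both sides to agree on what "KPI hit" and "metric drop" mean operationally.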
Open‑source vs API — when to choose what
| Criterion | API (Claude / GPT) | Open‑source (Llama / Qwen) |
|---|---|---|
| Time to launch | Days | Weeks |
| Unit cost | Variable (per‑token) | Fixed (GPU) |
| Compliance (GDPR / 152‑FZ / gov) | Hard (data leaves your perimeter) | Yes (on‑prem) |
| Russian/Kazakh quality | High | Medium (Qwen is the strongest of the open options) |
| Domain fine‑tuning | Limited | Full |
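The "variable vs fixed" cost row reduces to a break-even volume: above a certain monthly token count, a dedicated GPU beats per-token API pricing. A sketch with assumed prices (neither figure is a real quote):

```python
def breakeven_tokens(gpu_monthly_usd: float,
                     api_usd_per_1m_tokens: float) -> float:
    """Monthly token volume above which self-hosting is cheaper
    than per-token API pricing (ignores ops and engineering cost)."""
    return gpu_monthly_usd / api_usd_per_1m_tokens * 1_000_000

# Assumed: $1,200/month for a GPU instance, $4 per 1M API tokens.
print(f"{breakeven_tokens(1200, 4.0):,.0f}")  # 300,000,000
```

Note the comment in the code: the real break-even is higher once you price in the engineers who keep the GPU deployment alive, which is exactly why "Time to launch: weeks" sits in the open-source column.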
Guardrails and observability
- Prompt and response logging with 30‑day TTL.
- Rate limits per user and IP.
- PII filters on input and output (no passport numbers in the logs).
- Hallucination detection: verify answers cite sources.
- Kill‑switch: one command disables the AI layer and traffic falls back to the non‑AI path.
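The kill-switch is the simplest of these to show. A sketch where the flag is an in-process dict standing in for a shared feature-flag store (Redis, a flag service), and both answer paths are stubs:

```python
AI_ENABLED = {"value": True}  # stand-in for a shared feature flag

def ai_answer(query: str) -> str:
    return f"AI answer for: {query}"      # stub for the model call

def fallback_answer(query: str) -> str:
    return "Please contact support."      # deterministic non-AI path

def serve(query: str) -> str:
    if not AI_ENABLED["value"]:
        return fallback_answer(query)
    try:
        return ai_answer(query)
    except Exception:
        # Any model-layer failure also degrades gracefully,
        # so the kill-switch covers crashes, not just bad output.
        return fallback_answer(query)

AI_ENABLED["value"] = False  # the "one command": flip the flag
print(serve("where is my order?"))  # Please contact support.
```

The discipline this buys: the non-AI path must exist and stay maintained, so disabling AI is a config change, not an incident.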
How to measure impact
Never eyeball AI impact. Always run an A/B test: half the users get the AI variant, half the control. Window: minimum 2 weeks and 10,000 sessions per arm. Measure money, not "clicks": CR, AOV, LTV.
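The standard significance check for a conversion A/B is a two-proportion z-test. A sketch using the per-arm session count from the text and made-up conversion numbers:

```python
import math

def conversion_z_test(conv_a: int, n_a: int,
                      conv_b: int, n_b: int) -> float:
    """Two-proportion z-score: control (a) vs AI variant (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 10,000 sessions per arm: 3.0% control CR vs 3.4% AI-variant CR.
z = conversion_z_test(300, 10_000, 340, 10_000)
print(round(z, 2))  # |z| > 1.96 ~ significant at the 95% level
```

In this made-up example z comes out below 1.96, which is the whole argument for the minimum window: a lift that looks impressive on a dashboard can still be noise at 10,000 sessions.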
Rollout without panic
Start with 5–10% of traffic on the AI variant. Ramp every 3 days while metrics hold. Switch to 100% only after a confirmed 2‑week lift window. On any metric drop: kill switch, review, fix, relaunch.
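The ramp can be written down as a schedule generator. In practice each step is gated on metrics holding, which the sketch notes but does not model; the 15-point step size is an assumption:

```python
def ramp_schedule(start_pct: int = 5, step_pct: int = 15,
                  interval_days: int = 3) -> list[tuple[int, int]]:
    """(day, traffic %) pairs until the AI variant reaches 100%.
    Each advance is conditional in practice: step only if metrics held."""
    schedule, day, pct = [], 0, start_pct
    while pct < 100:
        schedule.append((day, pct))
        day += interval_days
        pct = min(pct + step_pct, 100)
    schedule.append((day, 100))
    return schedule

print(ramp_schedule())  # starts at (0, 5), ends at (21, 100)
```

Pre-committing to a written schedule like this removes the most common failure mode: jumping to 100% the first time the dashboard looks good.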