Why 80% of AI projects don't ship
We see the same pattern: the board demands "deploy AI". The team picks a nice use case (product description generation, say) and runs a demo. Everyone applauds. Three months later the board asks "where's the revenue?" and the pilot quietly dies.
Reasons we see most often:
- No baked-in metric: "deployed" ≠ "working".
- Use case not tied to revenue: impossible to defend with the CFO.
- No observability: nobody knows how the AI behaves in production.
- Agency delivers without co-delivery: the client team never learns the system, so nobody owns it after handoff.
Processes with ROI in 8 weeks
Four candidates where ROI is measurable in 6–8 weeks and visible even to a skeptical CFO. Everything else is R&D — and should be called R&D, not a pilot.
1. AI PDP personalization
"Similar products" and "frequently bought with" powered by embedding search. KPI: +8–15% PDP conversion, +5–10% AOV. Top e‑com sites get up to 31% of revenue from recommendations (McKinsey, 2024).
2. Semantic catalog search
Elasticsearch + reranker. "Warm winter jacket" returns parkas even when "warm" isn't in any SKU. KPI: zero‑result share → 0, search CR +20%.
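The zero-result KPI is easy to compute from a search log. A minimal sketch, assuming each log entry records the query and the number of hits returned; the field names `query` and `hits` are hypothetical, not an Elasticsearch format.

```python
def zero_result_share(search_log):
    """Share of searches that returned no hits, in the range 0.0 to 1.0."""
    if not search_log:
        return 0.0
    zero = sum(1 for entry in search_log if entry["hits"] == 0)
    return zero / len(search_log)

# Hypothetical log sample: one of four searches came back empty.
SAMPLE_LOG = [
    {"query": "warm winter jacket", "hits": 0},
    {"query": "parka", "hits": 12},
    {"query": "sneakers 42", "hits": 7},
    {"query": "down jacket", "hits": 3},
]

print(zero_result_share(SAMPLE_LOG))
```

Tracking this number per day before and after launch shows whether semantic search is actually closing the gap.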
3. Admin copilot for content managers
AI assistant inside the Bitrix admin: description generation, SEO copy, bulk card updates. KPI: −60% time to onboard an SKU. For comparison, AI assistants save engineers 4.4 hrs/week (getdx, 2025).
4. Chatbot with CRM escalation
RAG over FAQ and knowledge base + scenario‑driven escalation to amoCRM / Bitrix24. KPI: 60%+ self‑service, < 30s first reply.
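Scenario-driven escalation boils down to a routing rule: answer when retrieval is confident, hand off to a human otherwise. A minimal sketch; the trigger words, the 0.75 threshold and the `Retrieved` type are assumptions for illustration, and the actual CRM handoff (amoCRM / Bitrix24) is left out.

```python
from dataclasses import dataclass

@dataclass
class Retrieved:
    text: str
    score: float  # retrieval similarity, assumed to be in [0, 1]

# Hypothetical scenario triggers that always go to a human.
ESCALATION_KEYWORDS = ("refund", "complaint", "human")
# Assumed confidence threshold; tune it on real traffic.
MIN_SCORE = 0.75

def route(question, passages):
    """Return "answer" for self-service or "escalate" for CRM handoff."""
    lowered = question.lower()
    if any(keyword in lowered for keyword in ESCALATION_KEYWORDS):
        return "escalate"
    if not passages or max(p.score for p in passages) < MIN_SCORE:
        return "escalate"
    return "answer"

print(route("Where is my order?", [Retrieved("Delivery takes 2-3 days.", 0.9)]))
print(route("I want a refund", [Retrieved("Refund policy ...", 0.95)]))
```

The self-service share is then simply the fraction of conversations routed to "answer".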
Outcome contract: what to bake in
The core principle of an outcome contract is aligning both sides on the metric, not the hours.
- Base fee 60–70% — covers infra, discovery, integration. Not KPI‑tied.
- Outcome bonus 30–40% — paid on KPI hit in a defined window (e.g. +10% PDP CR over 30 days).
- Penalty clause — if metric drops > 5%, agency refunds a share of base fee.
- Observability access — client sees all logs, prompts, latency, errors. Otherwise the metric can't be trusted.
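The fee split above is plain arithmetic, sketched here under stated assumptions. The source does not specify the refund share, so `refund_pct` is a hypothetical parameter; the 5% drop threshold and the base/bonus split come from the list.

```python
def agency_payout(contract_value, base_pct, observed_lift, target_lift, refund_pct):
    """Total paid to the agency for one pilot.

    base_pct and refund_pct are whole percentages (e.g. 70, 20);
    lifts are fractions (0.10 means +10%).
    """
    base = contract_value * base_pct / 100
    bonus = contract_value - base
    if observed_lift >= target_lift:
        return base + bonus          # KPI hit: base fee plus outcome bonus
    if observed_lift < -0.05:
        return base - base * refund_pct / 100  # metric dropped > 5%: partial refund
    return base                      # KPI missed, no drop: base fee only

# Hypothetical contract: 100,000 total, 70/30 split, +10% PDP CR target.
print(agency_payout(100_000, 70, observed_lift=0.12, target_lift=0.10, refund_pct=20))
print(agency_payout(100_000, 70, observed_lift=0.03, target_lift=0.10, refund_pct=20))
print(agency_payout(100_000, 70, observed_lift=-0.08, target_lift=0.10, refund_pct=20))
```

Running the three scenarios side by side before signing makes the downside visible to both parties.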
Open‑source vs API — when to choose what
| Criterion | API (Claude / GPT) | Open‑source (Llama / Qwen) |
|---|---|---|
| Time to launch | Days | Weeks |
| Unit cost | Variable (per‑token) | Fixed (GPU) |
| Compliance (GDPR / 152‑FZ / gov) | No | Yes (on‑prem) |
| Russian/Kazakh quality | High | Medium (Qwen ahead of Llama) |
| Domain fine‑tuning | Limited | Full |
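The unit-cost row can be turned into a break-even check: per-token API spend versus a fixed GPU bill. A rough sketch; the prices below are illustrative placeholders, not real vendor pricing, and the model ignores engineering time for running your own inference.

```python
def break_even_tokens_per_day(gpu_cost_per_day, api_price_per_million_tokens):
    """Daily token volume at which API spend equals the fixed GPU cost."""
    return gpu_cost_per_day / api_price_per_million_tokens * 1_000_000

def cheaper_option(tokens_per_day, gpu_cost_per_day, api_price_per_million_tokens):
    """Return "api" or "open-source" for a given daily token volume."""
    api_cost = tokens_per_day / 1_000_000 * api_price_per_million_tokens
    return "api" if api_cost < gpu_cost_per_day else "open-source"

# Illustrative numbers only: 30 per day for a GPU, 30 per 1M API tokens.
print(break_even_tokens_per_day(30.0, 30.0))
print(cheaper_option(200_000, 30.0, 30.0))
print(cheaper_option(5_000_000, 30.0, 30.0))
```

Plug in your own quotes; the crossover point moves a lot with model size and GPU utilization.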
Guardrails and observability
- Prompt and response logging with 30‑day TTL.
- Rate limits per user and IP.
- PII filters in and out (no passports in logs).
- Hallucination detection: verify answers cite sources.
- Kill‑switch: one command disables the AI layer, traffic falls back.
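Two of these guardrails fit in a few lines: PII redaction on the way in and out, and a kill-switch that routes traffic to a fallback. A minimal sketch; the passport pattern (4-digit series plus 6-digit number), the `AI_LAYER_ENABLED` variable and the function names are assumptions, and production PII filtering needs far broader coverage than two regexes.

```python
import os
import re

# Assumed passport format: 4-digit series + 6-digit number, optional space.
PASSPORT_RE = re.compile(r"\b\d{4}\s?\d{6}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text):
    """Replace passport numbers and emails with placeholders."""
    text = PASSPORT_RE.sub("[PASSPORT]", text)
    return EMAIL_RE.sub("[EMAIL]", text)

def ai_enabled(env=os.environ):
    """Kill-switch: setting the hypothetical AI_LAYER_ENABLED=0 disables the AI layer."""
    return env.get("AI_LAYER_ENABLED", "1") != "0"

def answer(question, ai_fn, fallback_fn, log, env=os.environ):
    """Serve the AI reply, or the fallback when the kill-switch is off."""
    if not ai_enabled(env):
        return fallback_fn(question)
    safe_question = redact(question)       # PII filter in
    reply = redact(ai_fn(safe_question))   # PII filter out
    log.append({"prompt": safe_question, "response": reply})
    return reply

log = []
print(answer("My passport is 4510 123456", lambda q: "Got it: " + q, lambda q: "FAQ page", log))
print(log)
```

Log TTL and rate limits belong in the storage and gateway layers, so they are not shown here.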
How to measure impact
Never eyeball AI. Always run A/B: half the users see AI, half see control. Window — minimum 2 weeks and 10,000 sessions per branch. Metrics — not "clicks", but money: CR, AOV, LTV.
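One common way to read an A/B result on conversion rate is a two-proportion z-test. The sketch below uses only the standard library; the session and conversion counts are made up, and the test choice itself is an assumption, since the text only prescribes A/B with a minimum window.

```python
import math

def two_proportion_z(conv_control, n_control, conv_ai, n_ai):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_control = conv_control / n_control
    p_ai = conv_ai / n_ai
    pooled = (conv_control + conv_ai) / (n_control + n_ai)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_ai))
    if se == 0:
        raise ValueError("no variance: conversion rate is 0% or 100% in both branches")
    z = (p_ai - p_control) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical result: 10,000 sessions per branch, 3.0% vs 3.6% conversion.
z, p = two_proportion_z(300, 10_000, 360, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value under your agreed threshold (0.05 is a common convention) says the CR lift is unlikely to be noise; AOV and LTV need their own tests on revenue per session.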
Rollout without panic
Start with 5–10% of traffic on the AI variant. Ramp every 3 days while metrics are stable. Switch to 100% only after a 2-week confirmed lift window. On any metric drop: kill-switch, review, fix, relaunch.
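The ramp rule can be written as a small decision function, called once per 3-day checkpoint. A sketch under assumptions: the text does not fix the step size, so doubling and the 50% cap before confirmation are illustrative choices.

```python
def next_traffic_share(current_pct, metrics_stable, lift_confirmed, cap_pct=50):
    """Traffic share (in percent) for the AI variant at the next checkpoint."""
    if not metrics_stable:
        return 0                      # kill-switch: all traffic to control
    if lift_confirmed:
        return 100                    # 2-week lift confirmed: full rollout
    return min(current_pct * 2, cap_pct)  # assumed step: double, capped

share = 5
for checkpoint in range(4):
    share = next_traffic_share(share, metrics_stable=True, lift_confirmed=False)
    print(f"checkpoint {checkpoint + 1}: {share}%")
```

Keeping the rule in code means nobody ramps by gut feeling on a Friday evening.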
FAQ
What counts as a successful AI pilot?
One where the metric is locked in the contract before kickoff, measured through A/B, and visible to the CFO within 6–8 weeks. "Deployed" without a number is R&D, not a pilot.
How long does a pilot take?
6–8 weeks: 2 weeks of discovery + integration, 2–4 weeks of A/B with at least 10,000 sessions per branch, 2 weeks of confirmed lift. Anything shorter is statistically insufficient.
What if the KPI isn't met?
The penalty clause in the contract refunds a share of the base fee. Next: review the hypothesis, change the use case or parameters. A failed pilot is data, not a disaster.
Do we need GPT / Claude or will open-source work?
Three questions: (1) Volume — API is cheaper below ~1M tokens/day. (2) Compliance — GDPR / 152-FZ / government requires on-prem, so open-source. (3) Speed — API launches in days, open-source in weeks.
How do we protect customer data when using AI APIs?
PII filters in and out, 30-day log TTL, server-side API keys only. For data under GDPR or 152-FZ — on-prem models only.
Which use case should we pick first?
One where ROI is measurable in money, data already exists in the system, and the team can own it after handoff. Best first pilots: PDP personalization and semantic search — ROI is visible within 6–8 weeks.
What happens after the pilot?
Confirmed lift: roll out to 100% of traffic, move to an AI retainer or the next pilot. Failed metric: kill-switch, review, new use case.
Do we need a dedicated AI team?
No. A pilot runs with 1–2 engineers on our side and a product owner on yours. Scaling comes after confirmed lift, not before.