Black Friday Postmortem: Five Log Fields You're Missing

Author: WebGoodPeople

Friday, 18:03. The client's catalog stopped returning data. The server on the dashboard was green.

This is a real incident from our projects. The details are generalized and the company isn't named. But the mechanics repeat in roughly one in five e-com projects we work with. If you run Bitrix or any monolith with search bolted on top, this pattern is almost certainly there too. I'll break it down the way we break it down ourselves, so you can check your own setup before it fires.

Timeline


  • 18:03. A user clicks a filter on a category page. The page loads in 912 ms and returns an empty catalog. No errors.
  • 18:07. A second user. A third. A fifth. All of them see "Nothing found for your query."
  • 18:11. The duty manager writes to us in chat: "The site works, but the cart is empty. No sales."
  • 18:17. We find the problem in the API logs.

Fourteen minutes. Not an hour. Not a day. But at Black Friday traffic that's roughly 600,000 RUB in lost revenue, a conservative estimate based on 500k sessions/mo, 1.2% conversion, and a 4,500 RUB average order value, stretched across the peak window.
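The inputs of that estimate can be turned into a quick sanity check. The sketch below assumes a 30-day month with traffic spread evenly, which is our simplification and not something the incident data states. Under that assumption the shop earns about 27M RUB a month, a flat 14 minutes is worth about 8,750 RUB, and the 600,000 RUB figure implies traffic in the peak window running at roughly 69 times the monthly average.

```python
# Inputs quoted in the estimate above.
sessions_per_month = 500_000
conversion = 0.012
avg_order_value_rub = 4_500
outage_minutes = 14
claimed_loss_rub = 600_000

# Assumption (ours): a 30-day month with traffic spread evenly across it.
minutes_per_month = 30 * 24 * 60

monthly_revenue = sessions_per_month * conversion * avg_order_value_rub
flat_loss = monthly_revenue / minutes_per_month * outage_minutes

# How much denser than the monthly average the peak window must be
# for the claimed loss to hold.
implied_peak_multiplier = claimed_loss_rub / flat_loss

print(f"monthly revenue:         {monthly_revenue:,.0f} RUB")
print(f"flat 14-minute loss:     {flat_loss:,.0f} RUB")
print(f"implied peak multiplier: {implied_peak_multiplier:.1f}x")
```

Plug in your own sessions, conversion, and peak multiplier to get the cost of 14 blind minutes for your shop.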

What the server dashboard showed


Green charts:

  • CPU — 34%
  • RAM — 61%
  • Disk I/O — normal
  • 5xx errors — 0
  • 200 OK response time — 900 ms (a bit slower than usual, but within range)

No alerts. No triggers. A classic "the site works."

What the API log showed


2026-XX-XX 18:03:14  req=a3f1  /api/catalog/filter  200  bytes=127  latency=912ms  index_version=v17-rebuild
2026-XX-XX 18:03:14  req=b4e2  /api/catalog/filter  200  bytes=127  latency=847ms  index_version=v17-rebuild


Two details that changed everything:

bytes=127. That's the response size. 127 bytes is the JSON {"items":[],"total":0,"took":...}. An empty result. HTTP 200, but no content. Standard monitoring doesn't catch this, because the status code is correct.

index_version=v17-rebuild. That's our field. It says the request hit an Elasticsearch index that was rebuilding at that moment (a scheduled catalog reindex after the import from 1C). The partially built index answered queries with zero hits, but it answered.

Search returned "nothing found." The front showed an empty category. The user left.
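Both signals can be pulled out of the log mechanically. Here is a minimal sketch that parses lines in the layout shown above and flags HTTP 200 responses that are suspiciously small. The regular expression, the function names, and the 200-byte threshold are illustrative assumptions, not our production code.

```python
import re

# Matches the layout of the excerpt above: date, time, req=..., path,
# status code, bytes=..., latency=...ms, index_version=...
LINE_RE = re.compile(
    r"^(?P<date>\S+)\s+(?P<time>\S+)\s+req=(?P<req>\S+)\s+(?P<path>\S+)\s+"
    r"(?P<status>\d{3})\s+bytes=(?P<bytes>\d+)\s+latency=(?P<latency_ms>\d+)ms\s+"
    r"index_version=(?P<index_version>\S+)$"
)

# Assumption: anything under 200 bytes on a catalog endpoint is an empty result.
MIN_EXPECTED_BYTES = 200


def parse_line(line):
    """Return the log line as a dict, or None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for key in ("status", "bytes", "latency_ms"):
        record[key] = int(record[key])
    return record


def is_suspicious(record):
    """A successful status with almost no payload: the green-dashboard case."""
    return record["status"] == 200 and record["bytes"] < MIN_EXPECTED_BYTES


sample = [
    "2026-XX-XX 18:03:14  req=a3f1  /api/catalog/filter  200  bytes=127  latency=912ms  index_version=v17-rebuild",
    "2026-XX-XX 18:03:14  req=b4e2  /api/catalog/filter  200  bytes=127  latency=847ms  index_version=v17-rebuild",
]
records = [parse_line(line) for line in sample]
suspicious = [r["req"] for r in records if r is not None and is_suspicious(r)]
print(suspicious)
```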

Why standard monitoring missed it


The tools usually set up in production look at three things:

  1. Server resources (CPU, RAM, disk). An empty response eats no resources.
  2. HTTP codes (4xx, 5xx). Our response is 200.
  3. Latency (p95/p99 response time). 912 ms is no catastrophe.

None of these metrics can say "you answered with emptiness where a product should have been." For that you need to look not at the infrastructure level, but at the business logic level.

This is what we call the green-dashboard blind spot. Every metric reads "all good," while the business result is lost.

What we changed that same evening


1. An alert on a drop in the p95 result count


Not on the absence of results. On a drop relative to baseline.

For each critical endpoint we know how many items it normally returns:

  • /api/catalog/filter — p95 = 48 products
  • /api/search/query — p95 = 12 products
  • /api/recommendations — p95 = 6 products

If p95 drops by more than 80% over a 5-minute window, on-call gets paged. Not "the site is down," but "we're returning less product than usual."

Setup took about an hour across all the critical endpoints.
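The rule itself fits in a few lines. The sketch below is an illustration of the logic, not our alerting config: it assumes you can collect the number of items returned by each response in the last 5 minutes, and the function names and the nearest-rank percentile are our choices for the example.

```python
import math

# Baseline p95 result counts per endpoint, taken from the list above.
BASELINE_P95 = {
    "/api/catalog/filter": 48,
    "/api/search/query": 12,
    "/api/recommendations": 6,
}
DROP_THRESHOLD = 0.80  # page when p95 falls by more than 80%


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def should_page(endpoint, result_counts):
    """result_counts: items returned per response in the last 5-minute window."""
    if not result_counts:
        # Assumption: "no traffic at all" is handled by a separate alert.
        return False
    drop = 1 - p95(result_counts) / BASELINE_P95[endpoint]
    return drop > DROP_THRESHOLD


# Every response in the window came back empty, as during the incident.
print(should_page("/api/catalog/filter", [0] * 50))
# A normal window with the usual spread of result counts.
print(should_page("/api/catalog/filter", [30, 40, 45, 48, 50]))
```

Comparing against a baseline rather than against zero also catches the partial case, where the index returns a handful of products instead of the usual dozens.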

2. A data_version field in every API response


Now every API response carries, in its metadata, the version of the data it was built from. For Elasticsearch that's index_version, for the Redis cache it's cache_epoch, for materialized views it's mv_refreshed_at.

When an anomaly happens, we see right away which data version returned it. Was it a stale cache? A partially rebuilt index? A stuck MV? It all shows up as one filter in Grafana, and it becomes clear immediately.
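As a rough sketch, attaching the version can be a single helper that every handler calls before returning. The `meta.data_version` envelope and the helper name are assumptions made for this example; only the three field names come from our setup.

```python
import json


def with_data_version(payload, *, index_version=None, cache_epoch=None,
                      mv_refreshed_at=None):
    """Return a copy of payload with the data versions it was built from."""
    versions = {
        "index_version": index_version,      # Elasticsearch index
        "cache_epoch": cache_epoch,          # Redis cache
        "mv_refreshed_at": mv_refreshed_at,  # materialized view
    }
    # Keep only the sources that actually took part in building this response.
    used = {name: value for name, value in versions.items() if value is not None}
    return {**payload, "meta": {"data_version": used}}


# The empty result from the incident, now carrying its own explanation.
response = with_data_version({"items": [], "total": 0},
                             index_version="v17-rebuild")
print(json.dumps(response))
```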

3. A one-page runbook


A runbook is the instruction for "what to do when an alert fires." Before, it lived in one developer's head. Now it's in Notion, on a single page, with three blocks:

  • What happened (the first 3 checks: which alert, which endpoint, which data version)
  • How to roll back fast (the rollback button in the admin, switching to an index snapshot)
  • Who to wake next if the first 3 steps didn't work

A runbook is written for an on-call engineer running on 15 minutes of sleep. Not for a CTO with a fresh head. Those are different texts.

The 5 API log fields we won't work without


After this incident we locked in a minimal field set. If you have these five fields, incidents like "green dashboard, empty catalog" get caught in 60 seconds, not 14 minutes.

  1. req_id — a unique request identifier. It travels through every service (front → API → DB → external). It lets you stitch any request across the logs.

  2. endpoint — a logical identifier, not a URL. /api/catalog/filter, not /api/catalog/filter?cat=12&price=100-500. Parameters are stored separately. Otherwise the metrics smear across thousands of "unique" URLs.

  3. status + bytes — both the HTTP status and the response size. HTTP 200 with bytes < 200 on an endpoint that usually returns 10 KB is a signal.

  4. latency — in milliseconds, split into p50/p95/p99. The average is useless (it hides the tails).

  5. data_version — the version of the data the response was built from. An index, a cache, a snapshot, any data model that can go inconsistent.

Everything beyond these five is useful, but not required. These five are must-have.
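A minimal sketch of one log record carrying all five fields could look like this. The function name and the JSON layout are illustrative. Using the URL path as the logical endpoint is a simplification that works for routes without path parameters; for routes like /product/123 you would log the route template instead.

```python
import json
import uuid
from urllib.parse import urlsplit


def build_log_record(url, status, body, latency_ms, data_version, req_id=None):
    """Build one structured log record with the five fields listed above."""
    parts = urlsplit(url)
    return {
        # Reuse the upstream id when there is one so the request can be
        # stitched across services; otherwise start a new one.
        "req_id": req_id or uuid.uuid4().hex,
        "endpoint": parts.path,   # logical endpoint, without the query string
        "params": parts.query,    # parameters are stored separately
        "status": status,
        "bytes": len(body),
        "latency": latency_ms,    # milliseconds
        "data_version": data_version,
    }


record = build_log_record(
    "/api/catalog/filter?cat=12&price=100-500",
    status=200,
    body=b'{"items":[],"total":0}',
    latency_ms=912,
    data_version="v17-rebuild",
    req_id="a3f1",
)
print(json.dumps(record))
```

Percentiles (p50/p95/p99) are then computed from the per-request latency values when the logs are aggregated, not stored in each line.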

What this means for you


If you run e-com on Bitrix or another monolith, it's almost certain you have one or more of these blind spots:

  • Search returns emptiness during reindexing (and no one knows)
  • The cache serves stale data after an update (and no one knows)
  • One endpoint answers with a different latency distribution than the rest (and no one knows)
  • The front's "nothing found" error is indistinguishable from a successful empty result (and no one knows)

Checking yourself is simple: open your production logs. Does every line carry req_id, endpoint, status, bytes, latency, and data_version? If not, you have the same 14 minutes we did.
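If your logs are already JSON lines, the check can be automated. The sketch below assumes one JSON object per line and the field names used in this article; adapt both to your own format.

```python
import json

REQUIRED = ("req_id", "endpoint", "status", "bytes", "latency", "data_version")


def missing_fields(line):
    """Return the required fields absent from one JSON log line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return list(REQUIRED)  # an unparseable line carries none of them
    if not isinstance(record, dict):
        return list(REQUIRED)
    return [field for field in REQUIRED if field not in record]


def coverage(lines):
    """Share of non-blank lines that carry every required field."""
    lines = [line for line in lines if line.strip()]
    if not lines:
        return 0.0
    complete = sum(1 for line in lines if not missing_fields(line))
    return complete / len(lines)


good = json.dumps({"req_id": "a3f1", "endpoint": "/api/catalog/filter",
                   "status": 200, "bytes": 127, "latency": 912,
                   "data_version": "v17-rebuild"})
bad = json.dumps({"status": 200})
print(missing_fields(bad))
print(coverage([good, bad]))
```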

The fix costs 40 minutes of alert setup and 1–2 days of work to add the fields to the API. By our math it pays for itself within a week for any e-com with revenue of 30M RUB/mo or more.


If you want us to look at your API log and point out the blind spots, that's part of our 48-hour audit. Free, no obligations. See our headless Next.js + Elasticsearch audit.
