Elasticsearch Relevance: A 7-Step E-Com Playbook

Author: WebGoodPeople

If you have 1,000+ SKUs and the built-in search from Bitrix, WordPress, or Shopify, your search probably works like this in 80% of cases: a user types a query and gets either "nothing found" or a result list where the first product isn't what they were looking for. They leave. Search drives 15-25% of all sales, and that revenue is leaking out.

This article is a 7-step playbook we apply on every Elasticsearch project. The order isn't arbitrary: each step needs data from the one before it.

Background: why Elasticsearch specifically

The alternatives:

  • Built-in Bitrix search runs on MySQL FULLTEXT: no typo tolerance, no synonyms, relevance is hit or miss.
  • OpenSearch is Amazon's fork of ES 7.10. Roughly 95% of what these steps describe carries over unchanged.
  • Meilisearch and Typesense are good for smaller catalogs (<50k) but limited in custom scoring.
  • Algolia is an excellent product, but it runs $500+/mo for a mid-size e-com.

For catalogs of 50k+ SKUs with real control requirements, use ES or OpenSearch.

Step 1. Analyze current queries and zero-result rate

What we do: collect every search query from the last 30 days. Calculate the share of queries that returned nothing (zero_result_rate).

-- ClickHouse SQL (countIf is ClickHouse-specific); in Loki the same aggregation needs a LogQL query
SELECT 
  query,
  count(*) as cnt,
  countIf(hits = 0) as zero_results,
  countIf(hits = 0) / count(*) as zero_rate
FROM search_logs
WHERE ts >= now() - interval 30 day
GROUP BY query
ORDER BY cnt DESC
LIMIT 200;

What we look for: queries in the top 200 by frequency with zero_rate > 5%. That's a direct loss: the user searched, found nothing, left.

Typical findings: typos (dres instead of dress), synonyms (kicks instead of sneakers), foreign-language variants (an English query on a Russian-language site).

Expected baseline: a well-tuned e-com sits at zero_rate < 3%. A typical untuned one runs 15-25%.
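If the logs are easier to reach from a script than from SQL, the same analysis can be sketched in Python. This is a minimal sketch, not our production tooling: the function name zero_result_report, the record shape (dicts with "query" and "hits"), and the lowercase/strip normalization are all assumptions made for illustration.

```python
from collections import Counter

def zero_result_report(logs, top_n=200, threshold=0.05):
    """Return (overall_zero_rate, flagged) where flagged lists
    (query, count, zero_rate) for frequent queries above the threshold."""
    total = Counter()
    zero = Counter()
    for row in logs:
        # Normalizing case and whitespace is an assumption of this sketch
        q = row["query"].strip().lower()
        total[q] += 1
        if row["hits"] == 0:
            zero[q] += 1
    n = sum(total.values())
    overall = sum(zero.values()) / n if n else 0.0
    # Only look at the top_n most frequent queries, as in the SQL version
    flagged = [
        (q, cnt, zero[q] / cnt)
        for q, cnt in total.most_common(top_n)
        if zero[q] / cnt > threshold
    ]
    return overall, flagged
```

The flagged list is the work queue for steps 3 and 5: each entry is either a typo, a synonym gap, or a genuinely missing product.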

Step 2. Ship search queries to a dedicated index

What we do: log every search query with its metadata:

{
  "ts": "2026-05-19T10:00:00Z",
  "user_id": "u42",
  "query": "red dress",
  "hits": 24,
  "first_click_position": 3,
  "converted": false
}

We call the index search_queries. Retention is 90 days, which is enough for analytics.

Why: without this index, every step that follows is shooting at targets you can't see.
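One way to get these events into search_queries is the Elasticsearch bulk API, which takes newline-delimited JSON: an action line followed by the document. The sketch below only builds the request body; the helper names search_event and bulk_body are hypothetical, and how the body is sent (HTTP client, queue, log shipper) is left open.

```python
import json
from datetime import datetime, timezone

def search_event(user_id, query, hits, first_click_position=None,
                 converted=False, ts=None):
    """Build one log document in the shape shown above."""
    ts = ts or datetime.now(timezone.utc)
    return {
        "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),  # assumes ts is in UTC
        "user_id": user_id,
        "query": query,
        "hits": hits,
        "first_click_position": first_click_position,
        "converted": converted,
    }

def bulk_body(events, index="search_queries"):
    """Serialize events as an NDJSON body for POST /_bulk."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))                          # document line
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"
```

Batching events and flushing every few seconds keeps logging off the hot path of the search request itself.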

Step 3. Synonyms (manual + ML-derived dictionary)

What we do: build a synonym dictionary. Two layers:

Manual (30-50 pairs, fast):

sneakers, kicks, trainers
dress, dres, gown
sweater, jumper

ML-derived (from logs): run Word2Vec or fastText on the query log to find word pairs that often appear in semantically similar contexts.

Elasticsearch takes this as:

{
  "settings": {
    "analysis": {
      "filter": {
        "en_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms-en.txt"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "en_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_synonyms", "english_stemmer"]
        }
      }
    }
  }
}

Expected effect: zero_rate drops 30-50% on the first iteration.
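The synonyms file itself uses the Solr format: comma-separated terms are treated as equivalent, and "a => b" is a one-way mapping. A small script can keep that file tidy. The sketch below is an illustration only: render_synonyms is a hypothetical helper, and writing typos as one-way mappings (dres => dress) rather than listing them in an equivalence group is our assumption here, a variation on the manual list above.

```python
def render_synonyms(groups, corrections=None):
    """Render synonym groups and one-way corrections as Solr-format lines."""
    lines = []
    for group in groups:
        terms = []
        for term in group:
            term = term.strip().lower()
            if term and term not in terms:  # drop blanks and duplicates
                terms.append(term)
        if len(terms) >= 2:                 # a group of one is not a synonym rule
            lines.append(", ".join(terms))
    for wrong, right in sorted((corrections or {}).items()):
        # Explicit mapping: the left side is rewritten to the right side only
        lines.append(f"{wrong.strip().lower()} => {right.strip().lower()}")
    return "\n".join(lines) + "\n"
```

For example, render_synonyms([["sneakers", "kicks", "trainers"]], {"dres": "dress"}) produces two lines that can be written to analysis/synonyms-en.txt.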

Step 4. Field boosting (what matters more, title or description)

What we do: set field weights in the multi_match query:

{
  "query": {
    "multi_match": {
      "query": "red dress",
      "fields": [
        "title^5",
        "brand^3",
        "categories^2",
        "description^1",
        "attributes.*^0.5"
      ],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}

Why these weights: in e-com, users search by product name in 70% of cases. Brand is a strong signal when it's there. Description is weaker, since it's stuffed with keywords by the SEO team.

Important: these weights are a starting point. The final ones come after step 7 (A/B testing).
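Because the weights will change after testing, it helps to keep them in one place and generate the query from them. A minimal sketch, assuming a hypothetical helper build_multi_match and a plain dict of field weights:

```python
def build_multi_match(query, weights, fuzziness="AUTO"):
    """Build the multi_match request body from a {field: weight} mapping."""
    # "title^5" is the standard field^boost syntax; :g drops trailing zeros
    fields = [f"{name}^{weight:g}" for name, weight in weights.items()]
    return {
        "query": {
            "multi_match": {
                "query": query,
                "fields": fields,
                "type": "best_fields",
                "fuzziness": fuzziness,
            }
        }
    }

# Starting weights from this step; expect to revise them after A/B testing
STARTING_WEIGHTS = {
    "title": 5,
    "brand": 3,
    "categories": 2,
    "description": 1,
    "attributes.*": 0.5,
}
```

With the weights in a dict, an A/B variant is just a second dict, which keeps the two scoring versions easy to diff.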

Step 5. Fuzziness (typo tolerance)

What we do: turn on fuzzy matching, but carefully. "fuzziness": "AUTO" gives you:

  • 0 typos for words of 1-2 characters
  • 1 typo for words of 3-5 characters
  • 2 typos for words of 6+ characters

That's usually enough. For brands and SKU codes, turn fuzziness off, so that iPhone 15 doesn't match iPhone 16.

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "{{query}}",
            "fields": ["title^5", "description^1"],
            "fuzziness": "AUTO"
          }
        },
        {
          "term": { "sku": "{{query}}" }  // exact match for SKU
        }
      ]
    }
  }
}
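To reason about which queries fuzzy matching will and won't rescue, the AUTO thresholds can be written out as a tiny function. This is a sketch of the documented default (AUTO behaves like AUTO:3,6), not Elasticsearch code; auto_max_edits is a name made up for this example.

```python
def auto_max_edits(term, low=3, high=6):
    """Maximum edit distance allowed for a term under fuzziness AUTO:low,high."""
    length = len(term)
    if length < low:
        return 0   # very short terms must match exactly
    if length < high:
        return 1   # 3-5 characters: one edit
    return 2       # 6+ characters: two edits
```

So "dres" (4 characters) is allowed one edit and reaches "dress", while a 2-character token has to match exactly.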

Step 6. Function scoring: popularity, stock, recency

What we do: rank not only by text relevance, but by business signals too.

{
  "query": {
    "function_score": {
      "query": { /* multi_match from step 4 */ },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity_score",
            // log2p rather than log1p: log1p of 0 is 0, which zeroes the whole score under "multiply"
            "modifier": "log2p",
            "missing": 1
          }
        },
        {
          "filter": { "term": { "in_stock": true }},
          "weight": 2
        },
        {
          "exp": {
            "created_at": {
              "origin": "now",
              "scale": "90d",
              "decay": 0.5
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}

What each function does:

  • field_value_factor on popularity_score pushes popular products higher. popularity_score is computed separately (for example, views over the last 30 days).
  • filter on in_stock gives in-stock items a 2x boost. Out-of-stock items sink.
  • exp decay on created_at gives new arrivals a bonus that fades over 90 days.

Expected effect: the average first-click position moves up from 4-5 to 1-2.
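Before shipping a function_score, it is worth computing a few scores by hand to see how the factors interact. The sketch below mirrors the arithmetic for the three functions above with score_mode and boost_mode both set to multiply. The function names are hypothetical; the formulas follow the Elasticsearch documentation (log1p is log10(1 + value), log2p is log10(2 + value), and exp decay uses lambda = ln(decay) / scale).

```python
import math

MODIFIERS = {
    "log1p": lambda v: math.log10(1 + v),
    "log2p": lambda v: math.log10(2 + v),
}

def exp_decay(age_days, scale_days=90.0, decay=0.5):
    """exp decay with origin "now": 1.0 for new items, `decay` at `scale_days`."""
    lam = math.log(decay) / scale_days
    return math.exp(lam * max(0.0, age_days))

def business_score(text_score, popularity, in_stock, age_days, modifier):
    """Final score with score_mode=multiply and boost_mode=multiply."""
    factors = [MODIFIERS[modifier](popularity), exp_decay(age_days)]
    if in_stock:
        factors.append(2.0)  # the filter function only applies to matching docs
    return text_score * math.prod(factors)
```

One thing this makes visible: with log1p, a product whose popularity_score is 0 gets a factor of 0 and a final score of 0, however well the text matches. With log2p the factor stays above zero.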

Step 7. A/B test relevance with interleaving

What we do: compare two scoring versions with team-draft interleaving. The user sees a mixed result list (half from A, half from B), and we count which side gets more clicks.

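import random

def interleave(results_a, results_b, n=20):
    """Team-draft interleaving. Results must be hashable (e.g. product IDs).
    Returns a list of (doc, team) pairs."""
    out, seen = [], set()
    picks = {'A': 0, 'B': 0}
    ranked = {'A': results_a, 'B': results_b}
    while len(out) < n:
        # Docs each team has not yet placed in the merged list (no duplicates)
        remaining = {t: [d for d in ranked[t] if d not in seen] for t in ranked}
        if not remaining['A'] and not remaining['B']:
            break
        # The team with fewer picks goes next; a coin flip breaks ties
        if picks['A'] != picks['B']:
            turn = min(picks, key=picks.get)
        else:
            turn = random.choice(['A', 'B'])
        if not remaining[turn]:  # this team is exhausted, let the other pick
            turn = 'B' if turn == 'A' else 'A'
        doc = remaining[turn][0]
        out.append((doc, turn))
        seen.add(doc)
        picks[turn] += 1
    return out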

Why interleaving: classic A/B needs large samples (4-6 weeks for statistical significance). Interleaving is 10x faster (2-3 days).
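Counting "which side gets more clicks" can be sketched as a per-session vote followed by a sign test. This is one simple way to do it, not the only one: session_winner and sign_test_p are hypothetical helper names, sessions with equal clicks on both sides are treated as ties and dropped, and the test is a two-sided binomial test against a 50/50 split.

```python
from math import comb

def session_winner(interleaved, clicked):
    """interleaved: list of (doc, team) pairs; clicked: set of clicked docs.
    Returns 'A', 'B', or None for a tie."""
    a = sum(1 for doc, team in interleaved if team == "A" and doc in clicked)
    b = sum(1 for doc, team in interleaved if team == "B" and doc in clicked)
    if a == b:
        return None
    return "A" if a > b else "B"

def sign_test_p(wins_a, wins_b):
    """Two-sided p-value for the null hypothesis that A and B win equally often."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n for a fair coin, doubled for two sides
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Tally session_winner over all sessions, then call sign_test_p on the two win counts; a small p-value suggests the preference is unlikely to be noise.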

Expected outcome after 7 steps

On a 50k+ SKU catalog with zero relevance tuning:

  • zero_result_rate: from 20% to 2%
  • Average first-click position: from 5 to 1.5
  • Search conversion: +30-60%
  • Search latency (p95): 20-60 ms per query (correct ES index tuning delivers this even at 500k SKUs)
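The first three numbers can be tracked straight from the search_queries documents logged in step 2. A minimal sketch, assuming events are dicts in that shape and that a missing or null first_click_position means the user did not click; search_kpis is a name made up for this example.

```python
def search_kpis(events):
    """Compute zero-result rate, average first-click position, and conversion."""
    n = len(events)
    if n == 0:
        return {"zero_result_rate": 0.0,
                "avg_first_click_position": None,
                "conversion_rate": 0.0}
    # Only searches that ended in a click contribute to the position average
    positions = [e["first_click_position"] for e in events
                 if e.get("first_click_position") is not None]
    return {
        "zero_result_rate": sum(1 for e in events if e["hits"] == 0) / n,
        "avg_first_click_position": (sum(positions) / len(positions)
                                     if positions else None),
        "conversion_rate": sum(1 for e in events if e.get("converted")) / n,
    }
```

Running this on the same 30-day window before and after each step shows which change actually moved the numbers.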

What we don't do

  • We don't add ML-based ranking on the first iteration. Learning-to-rank makes sense only after the baseline setup works.
  • We don't tune 100+ fields at once. 5-7 key fields with the right weights get you 80% of the win.
  • We don't do "personalization" without data. Personalization works at 100+ sessions per user. Most e-coms don't have that depth.

If you want it faster

A 48-hour search audit: we look at your ES index and hand you a concrete query template plus a synonym dictionary built for your catalog. Free, no strings attached.

