Elasticsearch Relevance: A 7-Step E-Com Playbook
Author: WebGoodPeople
If you have 1,000+ SKUs and the built-in search from Bitrix, WordPress, or Shopify, your search probably works like this in 80% of cases: a user types a query and gets either "nothing found" or a result list where the first product isn't what they were looking for. They leave. Search drives 15-25% of all sales, and that revenue is leaking out.
This article is a 7-step playbook we apply on every Elasticsearch project. The order isn't arbitrary: each step needs data from the one before it.
Background: why Elasticsearch specifically
The alternatives:
- Built-in Bitrix search runs on MySQL FULLTEXT: no typo tolerance, no synonyms, relevance is hit or miss.
- OpenSearch is Amazon's fork of ES 7.x. It's 95% compatible with these steps.
- Meilisearch and Typesense are good for smaller catalogs (<50k) but limited in custom scoring.
- Algolia is an excellent product, but it runs $500+/mo for a mid-size e-com.
For catalogs of 50k+ SKUs with real control requirements, use ES or OpenSearch.
Step 1. Analyze current queries and zero-result rate
What we do: collect every search query from the last 30 days. Calculate the share of queries that returned nothing (zero_result_rate).
-- ClickHouse SQL (countIf is ClickHouse-specific; logs kept in Loki need an equivalent LogQL query)
SELECT
    query,
    count(*) AS cnt,
    countIf(hits = 0) AS zero_results,
    countIf(hits = 0) / count(*) AS zero_rate
FROM search_logs
WHERE ts >= now() - INTERVAL 30 DAY
GROUP BY query
ORDER BY cnt DESC
LIMIT 200;
What we look for: queries in the top 200 by frequency with zero_rate > 5%. That's a direct loss: the user searched, found nothing, left.
Typical findings: typos (dres instead of dress), synonyms (kicks instead of sneakers), foreign-language variants (an English query on a Russian-language site).
Expected baseline: a well-tuned e-com sits at zero_rate < 3%. A typical untuned one runs 15-25%.
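If the logs are easier to reach from a script than from SQL, the same report can be sketched in Python. This is a minimal sketch, not our production tooling: the function names are hypothetical, the records are assumed to be dicts with `query` and `hits` keys, and lowercasing plus trimming the query is our assumption about how to group variants.

```python
from collections import defaultdict

def zero_result_report(logs, top_n=200, threshold=0.05):
    """Return (query, count, zero_rate) for frequent queries above the threshold."""
    counts = defaultdict(int)
    zeros = defaultdict(int)
    for rec in logs:
        # Assumption: case and surrounding whitespace do not distinguish queries.
        q = rec["query"].strip().lower()
        counts[q] += 1
        if rec["hits"] == 0:
            zeros[q] += 1
    # Top queries by frequency, ties broken alphabetically for stable output.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return [(q, n, zeros[q] / n) for q, n in top if zeros[q] / n > threshold]

def overall_zero_rate(logs):
    """Share of all logged searches that returned no hits."""
    logs = list(logs)
    if not logs:
        return 0.0
    return sum(1 for rec in logs if rec["hits"] == 0) / len(logs)
```

Calling `zero_result_report(logs)` with the defaults mirrors the rule above: top 200 by frequency, filtered to zero_rate > 5%.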
Step 2. Ship search queries to a dedicated index
What we do: log every search query with its metadata:
{
  "ts": "2026-05-19T10:00:00Z",
  "user_id": "u42",
  "query": "red dress",
  "hits": 24,
  "first_click_position": 3,
  "converted": false
}
We call the index search_queries. Retention is 90 days, which is enough for analytics.
Why: without this index, every step that follows is shooting at targets you can't see.
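One way to ship these documents is the Elasticsearch `_bulk` endpoint, which accepts newline-delimited JSON: an action line followed by the document line. The sketch below only builds the request body; `make_log_doc` and `to_bulk_body` are hypothetical helper names, and how you send the body (HTTP client, queue, log shipper) is left open.

```python
import json
from datetime import datetime, timezone

INDEX_NAME = "search_queries"  # index name used in this article

def make_log_doc(user_id, query, hits, first_click_position=None,
                 converted=False, ts=None):
    """Build one search-log document with the fields shown above."""
    # Assumption: ts, when given, is a timezone-aware datetime.
    ts = (ts or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user_id": user_id,
        "query": query,
        "hits": hits,
        "first_click_position": first_click_position,
        "converted": converted,
    }

def to_bulk_body(docs, index=INDEX_NAME):
    """Serialize documents as an NDJSON body for the _bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

Batching a few hundred documents per request keeps logging off the hot path of the search request itself.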
Step 3. Synonyms (manual + ML-derived dictionary)
What we do: build a synonym dictionary. Two layers:
Manual (30-50 pairs, fast):
sneakers, kicks, trainers
dress, dres, gown
sweater, jumper
ML-derived (from logs): run Word2Vec or fastText on the query log to find word pairs that often appear in semantically similar contexts.
Elasticsearch takes this as:
{
  "settings": {
    "analysis": {
      "filter": {
        "en_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms-en.txt"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "en_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_synonyms", "english_stemmer"]
        }
      }
    }
  }
}
Expected effect: zero_rate drops 30-50% on the first iteration.
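To show the idea behind the ML-derived layer without pulling in a training library, here is a toy stand-in. It is not Word2Vec or fastText: it builds plain co-occurrence count vectors from the query log and compares them with cosine similarity. The function names and the 0.8 threshold are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def context_vectors(queries):
    """For each word, count the other words that share a query with it."""
    vectors = defaultdict(Counter)
    for q in queries:
        tokens = q.lower().split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def synonym_candidates(queries, min_similarity=0.8):
    """Return (word1, word2, similarity) for words used in similar contexts."""
    vectors = context_vectors(queries)
    pairs = []
    for w1, w2 in combinations(sorted(vectors), 2):
        sim = cosine(vectors[w1], vectors[w2])
        if sim >= min_similarity:
            pairs.append((w1, w2, round(sim, 3)))
    return sorted(pairs, key=lambda p: -p[2])
```

On a log like "red sneakers", "red kicks", "white sneakers", "white kicks", this pairs kicks with sneakers, but it also pairs red with white, because colors share contexts too. That is why candidates from any distributional method go through a human review before they reach the synonyms file.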
Step 4. Field boosting (what matters more, title or description)
What we do: set field weights in the multi_match query:
{
  "query": {
    "multi_match": {
      "query": "red dress",
      "fields": [
        "title^5",
        "brand^3",
        "categories^2",
        "description^1",
        "attributes.*^0.5"
      ],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}
Why these weights: in e-com, users search by product name in 70% of cases. Brand is a strong signal when it's there. Description is weaker, since it's stuffed with keywords by the SEO team.
Important: these weights are a starting point. The final ones come after step 7 (A/B testing).
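Since the weights will change after testing, it helps to keep them in one place and generate the query from them. A minimal sketch, assuming a hypothetical `build_multi_match` helper and a plain dict of weights:

```python
# Starting weights from this step; expect to revise them after A/B testing.
FIELD_WEIGHTS = {
    "title": 5,
    "brand": 3,
    "categories": 2,
    "description": 1,
    "attributes.*": 0.5,
}

def build_multi_match(query_text, weights=FIELD_WEIGHTS, fuzziness="AUTO"):
    """Build a multi_match request body using the field^boost syntax."""
    # The :g format prints 5 as "5" and 0.5 as "0.5".
    fields = [f"{field}^{weight:g}" for field, weight in weights.items()]
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": fields,
                "type": "best_fields",
                "fuzziness": fuzziness,
            }
        }
    }
```

With this in place, an A/B variant is just a second weights dict passed to the same function.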
Step 5. Fuzziness (typo tolerance)
What we do: turn on fuzzy matching, but carefully. "fuzziness": "AUTO" gives you:
- 0 typos for words of 1-2 characters
- 1 typo for words of 3-5 characters
- 2 typos for words of 6+ characters
That's usually enough. For brands and SKU codes, turn fuzziness off, so that iPhone 15 doesn't match iPhone 16. In the query below, the term clause on sku is an exact match and is not subject to fuzziness:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "{{query}}",
            "fields": ["title^5", "description^1"],
            "fuzziness": "AUTO"
          }
        },
        {
          "term": { "sku": "{{query}}" }
        }
      ]
    }
  }
}
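To see which typos AUTO will and won't forgive, the length thresholds documented by Elasticsearch (0-2 characters: exact, 3-5: one edit, longer: two edits) can be simulated locally. This is an illustration only: the function names are hypothetical, and it uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition of two adjacent characters as a single edit.

```python
def auto_fuzziness(term):
    """Maximum edits allowed for a query term under "fuzziness": "AUTO"."""
    n = len(term)
    if n <= 2:
        return 0
    if n <= 5:
        return 1
    return 2

def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def fuzzy_match(query_term, indexed_term):
    """True if indexed_term is within the AUTO edit budget of query_term."""
    return levenshtein(query_term, indexed_term) <= auto_fuzziness(query_term)
```

For example, `fuzzy_match("dres", "dress")` is true (one edit, four-character term), while a two-character query like "tv" must match exactly.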
Step 6. Function scoring: popularity, stock, recency
What we do: rank not only by text relevance, but by business signals too.
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "red dress",
          "fields": ["title^5", "brand^3", "categories^2", "description^1", "attributes.*^0.5"],
          "type": "best_fields",
          "fuzziness": "AUTO"
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity_score",
            "modifier": "log1p",
            "missing": 1
          }
        },
        {
          "filter": { "term": { "in_stock": true } },
          "weight": 2
        },
        {
          "exp": {
            "created_at": {
              "origin": "now",
              "scale": "90d",
              "decay": 0.5
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}
What each function does:
- field_value_factor on popularity_score pushes popular products higher. popularity_score is computed separately (for example, views over the last 30 days).
- filter on in_stock gives in-stock items a 2x boost. Out-of-stock items sink.
- exp decay on created_at gives new arrivals a bonus that fades over 90 days.
Expected effect: first-click position rises from 4-5 to 1-2.
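Before shipping a function_score config, it is worth working through the arithmetic on a few sample products. The sketch below reproduces the three functions offline, assuming the documented formulas: log1p is log10(1 + value), and exp decay is exp(ln(decay) / scale * age) with no offset. The function names are hypothetical, and the sample values are made up.

```python
import math

def field_value_factor_log1p(value, missing=1.0):
    """log1p modifier: log10(1 + value), using `missing` when the field is absent."""
    v = missing if value is None else value
    return math.log10(1 + v)

def exp_decay(age_days, scale_days=90.0, decay=0.5):
    """Exponential decay: 1.0 at origin, `decay` at `scale_days`, lower beyond."""
    lam = math.log(decay) / scale_days
    return math.exp(lam * max(0.0, age_days))

def final_score(text_score, popularity, in_stock, age_days):
    """Combine the functions as score_mode=multiply, then boost_mode=multiply."""
    functions = [field_value_factor_log1p(popularity), exp_decay(age_days)]
    if in_stock:
        functions.append(2.0)  # the weight applies only when the filter matches
    return text_score * math.prod(functions)
```

One thing this makes visible: a product with popularity_score equal to 0 gets log10(1) = 0 from log1p, and under multiply that zeroes its whole score regardless of text relevance. If new products start at zero popularity, check how your data behaves before relying on this combination.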
Step 7. A/B test relevance with interleaving
What we do: compare two scoring versions with team-draft interleaving. The user sees a mixed result list (half from A, half from B), and we count which side gets more clicks.
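import random

def interleave(results_a, results_b, n=20):
    """Merge two rankings by alternating picks; tag each item with its source."""
    # Assumes results are hashable document IDs.
    out, seen = [], set()
    pools = {'A': list(results_a), 'B': list(results_b)}
    pos = {'A': 0, 'B': 0}
    turn = random.choice(['A', 'B'])  # coin flip decides who picks first
    while len(out) < n:
        # Skip items that are already in the merged list
        for side in 'AB':
            while pos[side] < len(pools[side]) and pools[side][pos[side]] in seen:
                pos[side] += 1
        if pos[turn] >= len(pools[turn]):
            turn = 'B' if turn == 'A' else 'A'  # this side is exhausted
            if pos[turn] >= len(pools[turn]):
                break  # both sides are exhausted
        item = pools[turn][pos[turn]]
        out.append((item, turn))
        seen.add(item)
        turn = 'B' if turn == 'A' else 'A'
    return out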
Why interleaving: classic A/B needs large samples (4-6 weeks for statistical significance). Interleaving is 10x faster (2-3 days).
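Counting "which side gets more clicks" can be sketched as follows. We assume the interleaved list is a sequence of (doc_id, team) pairs and that clicks arrive as a set of doc IDs per session. The two-sided sign test is one simple way to judge the result; it is our choice for this illustration rather than the only option, and all function names are hypothetical.

```python
from math import comb

def credit_clicks(interleaved, clicked_ids):
    """Count clicks per team for one session."""
    wins = {"A": 0, "B": 0}
    for doc_id, team in interleaved:
        if doc_id in clicked_ids:
            wins[team] += 1
    return wins

def session_winner(interleaved, clicked_ids):
    """Return 'A', 'B', or None when the session is a tie."""
    wins = credit_clicks(interleaved, clicked_ids)
    if wins["A"] > wins["B"]:
        return "A"
    if wins["B"] > wins["A"]:
        return "B"
    return None

def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test over non-tied sessions (null: each side wins 50%)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Tied sessions are dropped before the test. A small p-value says the preference for one ranker is unlikely to be chance; it does not say how large the improvement is.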
Expected outcome after 7 steps
On a 50k+ SKU catalog with zero relevance tuning:
- zero_result_rate: from 20% to 2%
- Average first-click position: from 5 to 1.5
- Search conversion: +30-60%
- Search latency (p95): 20-60 ms per query (correct ES index tuning delivers this even at 500k SKUs)
What we don't do
- We don't add ML-based ranking on the first iteration. Learning-to-rank makes sense only after the baseline setup works.
- We don't tune 100+ fields at once. 5-7 key fields with the right weights get you 80% of the win.
- We don't do "personalization" without data. Personalization works at 100+ sessions per user. Most e-coms don't have that depth.
If you want it faster
A 48-hour search audit: we look at your ES index and hand you a concrete query template plus a synonym dictionary built for your catalog. Free, no strings attached.