Every search request in OpenSearch travels through two distinct phases before results return to the user. The query phase finds matching documents and scores them. The fetch phase retrieves the actual document content. This split is not an implementation detail - it is a fundamental design choice that shapes latency, bandwidth consumption, and result accuracy. Understanding both phases at the shard level is critical for tuning production clusters that handle millions of queries per day.
The Two-Phase Split: Why OpenSearch Does Not Return Documents Immediately
When a client sends a search request, the coordinating node does not simply broadcast a "find and return everything" command to all shards. Instead, it issues a lightweight query phase request that asks each shard to return only the top-N document IDs plus their scores. This design decision solves a specific problem: if you have 100 shards and request the top 10 results, retrieving full documents from all 100 shards would waste enormous bandwidth. Most of those documents will never make it into the final result set.
The query phase returns a minimal payload per shard: document IDs, scores, sort values, and any requested doc values. The coordinating node merges these shard-level results, re-ranks globally, and then asks only the winning shards to fetch the actual _source content in the second phase. This two-phase approach reduces network transfer by orders of magnitude for large shard counts.
The trade-off is latency. Two network round-trips means the query phase completion time and fetch phase completion time add up. For a cluster with 50 shards across 10 nodes, the query phase might take 15ms and the fetch phase another 10ms, giving a total of 25ms before the coordinating node can respond. This is why size and from parameters matter so much - a larger size means more documents to fetch, and a deeper from means more documents to skip and re-score.
Query Phase Deep Dive: Scoring, Aggregation, and Top-N Selection Per Shard
At the shard level, the query phase executes in a single thread per shard (or multiple threads if search_type is dfs_query_then_fetch with parallel execution enabled). The query arrives at the shard, is parsed into a Lucene Query object, and is executed against the current index reader. The index reader sees a point-in-time view of the index - any documents indexed after the search started will not appear in results until the next refresh.
How the Query Phase Executes on a Single Shard
The query phase follows this sequence inside each shard:
-
Parse the query - The coordinating node sends the query DSL JSON, which the shard parses into a Lucene
Querytree. Boolean queries wrap term queries, range queries, and any custom query implementations registered by plugins. -
Create the weight and scorer - Lucene creates a
Weightobject that holds scoring context, and aScorerthat walks matching documents. The scorer is the heart of the query phase: it produces a stream of(doc, score)pairs. -
Collect top-N results - A
TopDocsCollector(orTopFieldCollectorfor sorting) consumes the scorer stream and maintains a priority queue of the best results. The default queue size isfrom + size. - Compute aggregations - If the request includes aggregations, the shard runs them during the same document iteration. This is efficient because aggregations see the same document stream as the scorer.
-
Return to coordinator - The shard serializes the top
(from + size)document IDs, scores, sort values, and aggregation results into the response.
For a query that matches 1 million documents on a single shard, the scorer still only touches each matching document once. But the priority queue must evaluate every match to determine if it belongs in the top (from + size). This is why deep pagination (from: 10000, size: 10) is expensive - the queue must be 10010 entries deep, and every matching document must be compared against the queue's worst entry.
The Role of the Index Reader and Segment Visibility
The query phase reads from the shard's current IndexReader, which is a composite view of all segments visible at the moment the search started. A refresh operation creates a new reader that includes recently indexed documents. This means:
- Documents indexed after the search began are invisible to that search.
- Documents deleted after the search began still appear until the search completes (Lucene's snapshot semantics).
- The index reader is reference-counted and remains valid for the entire query execution, even if a refresh happens during the search.
This snapshot behavior is crucial for consistency. If a refresh occurred mid-search, the query would see a mix of old and new segments, producing incoherent results. OpenSearch prevents this by pinning the reader for the search lifetime.
Fetch Phase Deep Dive: Retrieving Stored Fields, Highlighting, and Source
After the coordinating node merges query-phase results from all shards, it determines which shards hold the actual documents that made the global top-N. The fetch phase then sends targeted requests to only those shards, asking for the full document content.
What the Fetch Phase Actually Retrieves
The fetch phase is responsible for:
-
Stored fields (
_source,_id, and any explicitly stored fields) - Highlighting - Computing snippet fragments for matched terms using the Fast Vector Highlighter, Unified Highlighter, or Plain Highlighter.
- Script fields - Executing Painless scripts to compute derived values.
- Doc values - Retrieving columnar data for fields that are not stored but have doc values enabled.
- Inner hits - For nested documents and parent-child queries, retrieving the matching inner documents.
Each of these operations requires random I/O into the stored fields file or doc values data structures. Unlike the query phase, which walks documents in a streaming manner through the scorer, the fetch phase performs random access lookups by document ID. This is why the fetch phase can be slower than the query phase even though it processes fewer documents.
Stored Fields Retrieval: The _source Field Lookup
The _source field is stored as a compressed binary blob in the .fdt (stored fields) file. When the fetch phase requests _source for a document, Lucene performs:
- A lookup in the
FieldInfosto confirm the field is stored. - A seek to the document's position in the
.fdtfile using thedocBase+docIDoffset. - Decompression of the stored field chunk (Lucene stores fields in compressed blocks, typically 64-128 documents per block).
- Parsing the decompressed JSON back into the
_sourcemap structure.
For a single document, this is fast. But if the fetch phase requests 100 documents from a shard, and those documents are scattered across many blocks, the decompression overhead multiplies. This is why doc_values are preferred for fields that are frequently used in sorting, aggregation, or scripting - they are columnar and avoid the stored-field decompression penalty.
Highlighting: Why It Requires the Query Phase Context
Highlighting cannot be done in the query phase because the query phase does not have the original term positions. The fetch phase receives the original query along with the document IDs, and re-executes a specialized query that records term positions and offsets. The highlighter then uses these positions to extract the best-matching fragments from the _source.
This design has a subtle cost: the fetch phase effectively runs a second query for highlighting purposes. For requests with highlight.fields set to *, this can double the query execution time. The fragment_size and number_of_fragments parameters control how much text is analyzed, but the fundamental cost remains - every highlighted field requires a re-analysis pass.
How size and from Interact with the Two Phases
The size and from parameters are the primary controls for pagination, but their interaction with the two-phase architecture is not always obvious.
Query Phase Behavior with from and size
In the query phase, each shard computes from + size results. If the global request is from: 90, size: 10, each shard must return the top 100 results to the coordinator. This is because the coordinator cannot know which shard's 91st document might outrank another shard's 10th document. Every shard must over-deliver so the coordinator can perform a global merge.
The consequence is that deep pagination increases load on every shard, not just the coordinator. A request with from: 10000, size: 10 forces each shard to maintain a priority queue of 10010 entries and return all of them. For a cluster with 50 shards, this means the coordinator must merge 500,500 document entries to find the true global top 10. This is why search_after and scroll APIs exist - they avoid the deep from penalty by using a live cursor or a point-in-time search context.
Fetch Phase Behavior with from and size
After the coordinator determines the global top from + size documents, it only asks the relevant shards to fetch the documents that actually made the cut. However, the fetch phase still retrieves size documents per shard if multiple shards contributed to the global top-N. For example, if the global top 10 consists of 5 documents from shard A and 5 from shard B, the fetch phase sends 5-document requests to both shards.
The fetch phase does not honor from - it only fetches the documents that the coordinator needs. This is why the total fetch cost is proportional to size (the number of documents returned to the user), not from + size. The query phase bears the from cost; the fetch phase bears the size cost.
Performance Trade-offs: When the Two-Phase Model Helps and Hurts
The two-phase model is excellent for typical search workloads where size is small (10-100) and the number of matching documents is large. It minimizes network transfer and allows the coordinator to perform a global merge. But there are scenarios where the model becomes a bottleneck.
When the Two-Phase Model Is Optimal
- High match counts, low result size - A query that matches 1 million documents but only returns the top 10. The query phase returns 10 document IDs per shard, and the fetch phase retrieves only 10 full documents. Network transfer is minimal.
- Balanced shard sizes - When shards have roughly equal document counts, each shard's local top-N is a good approximation of the global top-N. The coordinator's merge work is straightforward.
-
Simple queries without heavy aggregation - Queries that only need scoring and basic filtering. The query phase runs efficiently, and the fetch phase has little work beyond
_sourceretrieval.
When the Two-Phase Model Creates Bottlenecks
-
Large
sizerequests - If a user requestssize: 1000, the coordinator must merge 1000 results per shard. For 50 shards, this is 50,000 entries in the merge heap. Memory usage on the coordinating node spikes, and the fetch phase must retrieve 1000 documents from each contributing shard. -
Heavy aggregation across many shards - Aggregations are computed per shard in the query phase, and the coordinator must merge aggregation buckets. For a
termsaggregation with 1000 buckets across 50 shards, the coordinator receives 50,000 buckets and must deduplicate and merge them. This is whyshard_sizeexists - it limits the number of buckets per shard before merging. - Skewed shard sizes - If one shard has 10x more documents than others, it dominates the query phase timing. The coordinator waits for the slowest shard, and the fast shards' results sit idle in memory.
- Highlighting on many fields - The fetch phase's re-query for highlighting can take longer than the original query phase. For 10 highlighted fields across 100 documents, the fetch phase runs 1000 mini-queries.
Production Tuning: Controlling Phase Behavior
Several settings and API choices influence how the two phases behave in production.
search_type: Query Then Fetch vs DFS Query Then Fetch
The default search_type is query_then_fetch. In this mode, each shard scores documents using its local term statistics (document frequency, term frequency). This means scores are relative to the shard's own data, not the global index. For a term that appears in 100 documents on shard A and 10 documents on shard B, the same term will produce different scores on each shard. This is usually fine for most use cases, but it can cause relevance issues when shards are unbalanced.
dfs_query_then_fetch adds a pre-query phase where the coordinator collects global term statistics from all shards. Each shard then uses these global statistics for scoring, producing consistent scores across the cluster. The cost is an extra round-trip: DFS phase, query phase, fetch phase. For balanced shards, the relevance improvement is often negligible. For heavily skewed shards, it can prevent the "hot shard" problem where a shard with few documents over-ranks rare terms.
query_cache and request_cache: Caching Phase Results
The query cache (shard-level request cache) stores the complete query phase results for a given request. The cache key is a combination of the query, size, from, and sort parameters. If the same request arrives twice, the shard can return cached results without re-executing the query. However, the cache is invalidated on every refresh, so it is only useful for indices that are read-heavy and rarely updated.
The request cache stores the final merged response from the coordinator. This is useful for repeated identical requests but is also invalidated on index changes. For time-series data where indices are rolled over daily, yesterday's index is effectively immutable, and the request cache becomes highly effective.
size and terminate_after: Limiting Phase Work
terminate_after is a query-level parameter that tells each shard to stop after finding a specified number of matches. Unlike size, which affects the priority queue, terminate_after is a hard stop. The shard returns whatever it has found so far, and the coordinator works with incomplete results. This is useful for exploration queries where approximate results are acceptable, but it must not be used for pagination or critical business logic.
fetch_source and _source Filtering
The fetch_source parameter controls whether the fetch phase retrieves _source at all. If set to false, the fetch phase only returns document IDs and metadata. This is useful for workflows that only need the IDs for subsequent lookups or when _source is large and not needed. For example, a query that feeds document IDs into a downstream processing pipeline can skip _source retrieval entirely, saving significant fetch phase time.
_source filtering (_source: ["field1", "field2"]) is applied at the fetch phase after the full _source is decompressed. This means the stored field file is still read and decompressed, but only the requested fields are returned to the client. The I/O cost is not reduced, but the network transfer is. For true I/O reduction, use stored_fields to retrieve only the fields that are explicitly stored, or use doc_values for fields that do not need the original _source structure.
Common Pitfalls and Edge Cases
Several subtle behaviors in the two-phase model catch operators off guard in production.
The from + size Limit and the index.max_result_window Setting
By default, index.max_result_window is set to 10,000. This means from + size cannot exceed 10,000 for standard searches. This limit exists because the query phase's priority queue becomes unmanageably large for deep pagination. Attempting from: 10000, size: 10 will fail unless the index setting is increased. However, increasing it is generally a mistake - deep pagination should use search_after or point_in_time APIs instead.
Why scroll API Bypasses the Two-Phase Model
The scroll API maintains a point-in-time search context that keeps the index reader pinned. The first request executes the query phase and stores the results in the context. Subsequent scroll requests retrieve the next batch from the stored results without re-executing the query phase. The fetch phase still runs per batch, but the query phase cost is paid only once. This makes scroll efficient for exporting large result sets, though it is deprecated in favor of search_after with point_in_time.
search_after: The Modern Replacement for Deep Pagination
search_after uses a sort key (the last document's sort values from the previous page) to find the next page. It avoids the from penalty entirely by starting the query phase from the sort key position rather than from the beginning. Each shard uses the sort key to skip documents that are known to be before the cursor, effectively doing a "seek" into the sorted stream. This is the recommended approach for deep pagination in modern OpenSearch.
Relevance Tuning and the Query Phase
Because the query phase computes scores per shard, any operation that changes shard boundaries can affect relevance. Reindexing with a different number of shards, force-merging to a single segment, or adding new nodes that trigger rebalancing - all of these can change the local term statistics and produce different scores for the same query. This is why dfs_query_then_fetch exists, but it is not a complete solution: it fixes scoring consistency but adds latency.
For applications where score stability is critical (e.g., consistent ranking for e-commerce), consider using constant_score queries for filters and function_score for business logic, both of which are less sensitive to term statistics. Alternatively, use rescore queries at the coordinator level to re-rank the top-N results using a more expensive but globally consistent scoring model.
Conclusion
The query phase and fetch phase split in OpenSearch is a deliberate architectural choice that optimizes for the common case: finding a small number of best-matching documents from a large index spread across many shards. The query phase handles the heavy lifting of matching, scoring, and aggregation per shard. The fetch phase handles the targeted retrieval of full documents only for the winners.
Production operators must understand the cost model: the query phase scales with from + size, the fetch phase scales with size, and both are sensitive to shard count, shard balance, and the presence of highlighting or heavy aggregations. Tuning search_type, max_result_window, and pagination strategy based on this understanding is the difference between a cluster that handles peak load gracefully and one that collapses under deep pagination requests.
For read-heavy workloads, leverage the request cache. For write-heavy workloads, tune the refresh interval to balance query-phase visibility with indexing throughput. And for any workload that requires deep pagination, abandon from and adopt search_after with a point_in_time context.
About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.








