LuceneSIM: A Merge Policy Simulator for Fast Policy Comparison

Merge policy tuning in Lucene is one of those problems where every decision is a trade-off between write amplification, read latency, and segment count. The default TieredMergePolicy works well for most workloads, but if you have a write-heavy index with frequent deletes, or a read-heavy index where you want fewer segments, you might want to compare how LogByteSizeMergePolicy or a tuned TieredMergePolicy behaves. The problem is that running a full indexing benchmark to compare merge policies takes hours. You need to index documents, wait for merges, measure, change the policy, reindex, and repeat.

I wrote LuceneSIM to solve this specific problem. It is a small Java utility that simulates merge policy decisions without running a real index. The source code is available at github.com/iprithv/lucenesim. It creates fake segments with realistic sizes and document counts, feeds them to Lucene's actual MergePolicy.findMerges() method, and executes the merges by creating new fake segments with prorated sizes. The entire simulation runs in under a second, and you can compare multiple policies across different workload shapes in minutes rather than hours.

This is not a replacement for real benchmarks. It is a tool for fast iteration when you are trying to understand why one policy produces fewer segments than another, or whether increasing segmentsPerTier actually reduces write amplification for your workload. The tool uses Lucene's real merge policy code, so if Lucene changes how TieredMergePolicy works, the simulator picks up the new behavior as soon as its bundled Lucene JAR is updated. But it does not model I/O, it does not index real documents, and it distributes deletes randomly rather than targeting them by term. Use it for policy comparison and parameter tuning, not for predicting absolute production performance.

The Problem: Why Merge Policy Benchmarking Is Slow

A Lucene index is a collection of segments. Each segment is an immutable set of documents with its own inverted index, stored fields, doc values, and points. When you add documents, they are written to new segments. As segments accumulate, IndexWriter asks the merge policy which of them to combine into a new, larger segment. The merge policy is the algorithm that decides which segments to merge and when.

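The default TieredMergePolicy organizes segments into tiers of roughly similar size. From the total index size, the floor segment size, and segmentsPerTier it derives a budget of allowed segments; when the index exceeds that budget, it picks the best-scoring merge, favoring segments of similar size and merges that reclaim more deletes. This keeps the number of segments roughly logarithmic in the index size while limiting write amplification. LogByteSizeMergePolicy, the older default, uses a simpler approach: it groups segments into levels by the logarithm of their byte size and merges mergeFactor adjacent segments whenever a level fills up, creating a logarithmic merge tree.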
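To see why a tiered budget keeps the segment count roughly logarithmic, here is a minimal sketch of a budget calculation. It is loosely modeled on the idea behind TieredMergePolicy and is not Lucene's actual code; the class name, method name, and parameters are hypothetical, and the real policy handles more cases (deletes, maxMergeAtOnce, and so on).

```java
/** Simplified tier budget; an illustration only, not Lucene's implementation. */
final class TierBudgetSketch {

    /**
     * Returns how many segments an index of the given size is "allowed" to have
     * when each tier holds up to segmentsPerTier segments and each tier's
     * segments are segmentsPerTier times larger than the previous tier's.
     * Sizes may be in any unit as long as all three use the same one.
     */
    static int allowedSegments(long indexSize, long floorSize,
                               long maxMergedSize, int segmentsPerTier) {
        long levelSize = floorSize;   // segment size of the current tier
        long sizeLeft = indexSize;    // index size not yet assigned to a tier
        int allowed = 0;
        while (true) {
            double segsAtLevel = sizeLeft / (double) levelSize;
            // Last tier: either it is not full, or segments cannot grow further.
            if (segsAtLevel < segmentsPerTier || levelSize >= maxMergedSize) {
                return allowed + (int) Math.ceil(segsAtLevel);
            }
            allowed += segmentsPerTier;
            sizeLeft -= (long) segmentsPerTier * levelSize;
            levelSize = Math.min(maxMergedSize, levelSize * segmentsPerTier);
        }
    }
}
```

With a floor of 2 MB, a 5,000 MB cap, and 10 segments per tier, a 1,000 MB index gets a budget of 24 segments and a 10,000 MB index gets 34: ten times the data, only ten more segments.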

The problem is that comparing these policies requires a full indexing run. You need to:

Create an IndexWriter with the policy under test.
Index a realistic number of documents (millions for a large index).
Wait for background merges to complete.
Measure the final segment count, merge count, and write amplification.
Repeat for each policy variant.

Each run can take tens of minutes to hours depending on the workload. If you want to compare six workload shapes across three policies, you are looking at a full day of benchmarking. This is fine for a final validation, but it is impractical for iterative exploration. You cannot quickly answer questions like: "What happens if I increase segmentsPerTier from 10 to 15?" or "How does a 10% delete rate affect LogByteSizeMergePolicy compared to TieredMergePolicy?"

The Approach: Fake Segments, Real Merge Logic

LuceneSIM takes a different approach. Instead of indexing real documents and building real segments, it creates fake SegmentCommitInfo objects with the size and document count properties that a real segment would have. These fake segments are fed to Lucene's MergePolicy.findMerges() method, which runs the actual merge policy code. The policy returns a MergeSpecification listing the merges it wants to run. The simulator then executes each merge by creating a new fake segment whose size is the sum of the merged segments' sizes, scaled down by the fraction of deleted documents.

The Core Simulation Loop

The simulation runs as a discrete event loop:

1. Generate flushes - The workload specifies how many flushes, how many documents per flush, and the average segment size per flush. The simulator creates a fake segment for each flush.
2. Apply deletes - After each flush, the simulator applies deletes randomly across existing segments. The delete rate is configurable (e.g., 5% or 10% of total documents).
3. Run merge policy - The simulator calls MergePolicy.findMerges() with the current set of segments. The policy runs its real code and returns a MergeSpecification.
4. Execute merges - The simulator executes each merge by creating a new segment whose size is the sum of the merged segment sizes, scaled down by the fraction of deleted documents. The source segments are then removed from the live segment list.
5. Repeat - Steps 1-4 repeat until all flushes are processed.
6. Collect metrics - At the end, the simulator reports the write amplification factor (WAF), final segment count, number of merges executed, and peak segment count.

The entire loop runs in memory. No files are written, no documents are indexed, no analyzers are run. The only real Lucene code executed is the merge policy itself.
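To make the loop concrete, here is a minimal, self-contained sketch. It is not LuceneSIM's code. The pickMerge() method is a hypothetical stand-in for MergePolicy.findMerges() that merges mergeFactor segments of the same size class, deletes are modeled as a fixed ratio reclaimed at merge time instead of being applied randomly, and "bytes merged" is counted as the size of each merge output, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy version of the flush / merge / measure loop; all names are hypothetical. */
final class ToyMergeSim {
    private final int mergeFactor;
    private final double deleteRatio;
    private final List<Long> segments = new ArrayList<>(); // live segment sizes in bytes
    long flushedBytes = 0;
    long mergedBytes = 0;
    int merges = 0;
    int peakSegments = 0;

    ToyMergeSim(int mergeFactor, double deleteRatio) {
        this.mergeFactor = mergeFactor;
        this.deleteRatio = deleteRatio;
    }

    /** Size class: how many times the size can be divided by mergeFactor. */
    private int level(long sizeBytes) {
        int level = 0;
        while (sizeBytes >= mergeFactor) {
            sizeBytes /= mergeFactor;
            level++;
        }
        return level;
    }

    /** Stand-in policy: mergeFactor segments sharing a level, or an empty list. */
    private List<Long> pickMerge() {
        Map<Integer, List<Long>> byLevel = new TreeMap<>();
        for (long s : segments) {
            byLevel.computeIfAbsent(level(s), k -> new ArrayList<>()).add(s);
        }
        for (List<Long> group : byLevel.values()) {
            if (group.size() >= mergeFactor) {
                return group.subList(0, mergeFactor);
            }
        }
        return new ArrayList<>();
    }

    void run(int flushCount, long flushBytes) {
        for (int i = 0; i < flushCount; i++) {
            segments.add(flushBytes);                       // flush a new segment
            flushedBytes += flushBytes;
            peakSegments = Math.max(peakSegments, segments.size());
            List<Long> merge;
            while (!(merge = pickMerge()).isEmpty()) {      // ask the policy
                long sum = 0;
                for (long s : merge) {
                    sum += s;
                    segments.remove(Long.valueOf(s));       // retire a source segment
                }
                long out = Math.round(sum * (1.0 - deleteRatio)); // reclaim deletes
                segments.add(out);
                mergedBytes += out;
                merges++;
            }
        }
    }

    double waf() {
        return (double) mergedBytes / flushedBytes;
    }

    int segmentCount() {
        return segments.size();
    }
}
```

Running new ToyMergeSim(10, 0.0).run(100, 1L << 20) performs 11 merges, ends with a single segment, peaks at 19 segments, and gives a WAF of 2.0. The real simulator replaces pickMerge() with Lucene's own policy code, which is the whole point.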

Why This Works

Merge policies in Lucene make decisions based on segment metadata: size, document count, delete count, and whether the segment is in a compound file. They do not look at the actual document content, the inverted index structure, or the field schema. A TieredMergePolicy decides to merge two segments because they are similar in size and fit within a tier budget, not because of what is inside them. This means a simulator that provides accurate segment metadata can produce the same merge decisions as a real index, even without real documents.

The key assumption is that segment size scales with the number of live documents. When segments are merged, the new segment's size is approximately the sum of the merged segments' sizes, minus the space reclaimed from deleted documents. LuceneSIM models this by prorating the merged size by the delete ratio. If two segments of 100MB each are merged, and each has 10% deleted documents, the resulting segment is approximately 180MB (200MB × 0.9).

What the Simulator Does Not Model

This is where the limitations matter. The simulator is honest about what it does not do:

No I/O modeling - The simulator does not measure disk writes, SSD wear, or I/O latency. It reports write amplification in terms of bytes merged relative to bytes flushed, but it cannot tell you whether your NVMe drive can keep up with the merge rate.
No actual documents - The simulator does not index, analyze, or store documents. It does not measure indexing throughput, analyzer overhead, or memory usage during indexing.
No query performance - The simulator reports segment count, which correlates with query performance, but it does not measure query latency, cache behavior, or searcher warming time.
Random delete distribution - Deletes are distributed randomly across segments, whereas a real IndexWriter deletes by term or query. If your workload deletes documents by date range (e.g., all documents older than 30 days), the deletes tend to concentrate in the oldest segments instead of spreading evenly, and the simulator's results may not match reality.
No concurrent indexing - The simulator runs a single thread. It does not model concurrent flushes, concurrent merges, or the backpressure that occurs when merges cannot keep up with flushes.

These limitations are intentional. The simulator is designed for one specific task: comparing merge policy decisions under different workloads. If you need to know how long indexing takes on your hardware, you still need a real benchmark. If you need to know query latency with 50 segments versus 10 segments, you still need a real search benchmark. LuceneSIM is a pre-filtering tool. It helps you narrow down which policies and parameters are worth testing in a full benchmark.

The Workload Model: Describing Your Indexing Pattern

The simulator uses a workload builder that defines the shape of the indexing pattern. A workload consists of:

Flush count - How many flushes occur during the simulation. Each flush creates a new segment.
Documents per flush - How many documents are added per flush. Combined with the per-document size estimate, this determines the average segment size.
Flush size - The average size of each flushed segment. The simulator can derive this from doc count and per-doc size, or you can specify it directly.
Delete rate - The percentage of documents that are deleted after each flush. Deletes are distributed randomly across all existing segments.
Seed - The random seed for delete distribution, ensuring reproducible results.

Here is how to define a workload in code:

WorkloadSource workload = SyntheticWorkloadBuilder.create()
    .flushCount(100)
    .flushDocCount(10_000)
    .flushSizeMB(1.0)
    .deleteRate(0.05)
    .seed(42)
    .build();

This workload simulates 100 flushes, each adding 10,000 documents with a 1MB flush size, and a 5% delete rate applied after each flush. The seed of 42 ensures that the random delete distribution is the same across runs, so comparing two policies on the same workload gives deterministic results.
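The role of the seed can be illustrated with a small stand-alone sketch. This is not LuceneSIM's implementation; SeededDeletes and distribute() are hypothetical names, and the weighting (each live document is equally likely to be deleted) is an assumption about what "random" means here.

```java
import java.util.Random;

/** Spreads deletes over segments with a seeded RNG; an illustration only. */
final class SeededDeletes {

    /**
     * Returns the number of documents deleted from each segment. Every live
     * document is equally likely to be picked, so larger segments receive
     * proportionally more deletes.
     */
    static int[] distribute(int[] liveDocsPerSegment, int deletes, long seed) {
        Random rng = new Random(seed);          // same seed, same sequence
        int[] remaining = liveDocsPerSegment.clone();
        int[] deleted = new int[remaining.length];
        int totalLive = 0;
        for (int n : remaining) {
            totalLive += n;
        }
        int toDelete = Math.min(deletes, totalLive);
        for (int d = 0; d < toDelete; d++) {
            int pick = rng.nextInt(totalLive);  // index of one live document
            int seg = 0;
            while (pick >= remaining[seg]) {    // find the segment holding it
                pick -= remaining[seg];
                seg++;
            }
            remaining[seg]--;
            deleted[seg]++;
            totalLive--;
        }
        return deleted;
    }
}
```

Calling distribute(new int[] {10_000, 10_000, 10_000}, 1_500, 42L) twice returns the same per-segment counts both times, which is what makes a policy-versus-policy comparison on the same workload repeatable.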

The Sweep Mode: Comparing Six Workload Shapes

The simulator includes a sweep mode that runs six predefined workload shapes and compares the three policies across all of them. The workloads vary in flush count, document count per flush, and delete rate, covering a range from light indexing to heavy write-and-delete patterns. This is useful for understanding how a policy behaves under different conditions without writing custom workload definitions.

To run the sweep:

./gradlew run --args="--sweep"

The output is a table showing WAF, segment count, merge count, and peak segments for each policy on each workload shape. This is the fastest way to get a broad view of policy behavior.

The Scheduler: Serial vs Concurrent Merge Execution

The simulator includes two merge schedulers:

SerialSimScheduler - Executes merges one at a time, in the order returned by the merge policy. This is the simplest model and matches the behavior of a single-threaded merge scheduler such as Lucene's SerialMergeScheduler.
ConcurrentSimScheduler - Simulates concurrent merge execution with a configurable limit on the number of merges in flight. This is useful for understanding how concurrent merges affect peak segment count, which is the maximum number of segments that exist at any point during the simulation.

The concurrent scheduler does not model actual I/O concurrency. It assumes that merges run in parallel with infinite I/O capacity and only limits the number of concurrent merges. This means it can tell you whether allowing 3 concurrent merges instead of 1 reduces peak segment count, but it cannot tell you whether your disk can sustain 3 concurrent large merges without saturating.
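A toy model shows why the concurrency limit alone can move peak segment count. The sketch below is not ConcurrentSimScheduler; it assumes one flush per tick, a fixed duration for every merge regardless of how many run at once (the "infinite I/O" assumption), and merge outputs that are never merged again. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy concurrency-limited merge scheduler; an illustration only. */
final class ToyConcurrentScheduler {

    static int peakSegments(int flushes, int mergeFactor,
                            int mergeTicks, int maxConcurrent) {
        int idle = 0;     // flushed segments not yet picked for a merge
        int merged = 0;   // outputs of finished merges
        Deque<Integer> running = new ArrayDeque<>(); // finish tick per in-flight merge
        int peak = 0;
        for (int tick = 0; tick < flushes; tick++) {
            // Finish merges whose time is up: inputs vanish, one output appears.
            while (!running.isEmpty() && running.peekFirst() <= tick) {
                running.pollFirst();
                merged++;
            }
            idle++; // one flush per tick
            // Start new merges while a slot is free and enough segments wait.
            while (idle >= mergeFactor && running.size() < maxConcurrent) {
                idle -= mergeFactor;
                running.addLast(tick + mergeTicks);
            }
            // Inputs of in-flight merges still exist until the merge finishes.
            int total = idle + running.size() * mergeFactor + merged;
            peak = Math.max(peak, total);
        }
        return peak;
    }
}
```

In this model, comparing peakSegments(100, 10, 25, 1) with peakSegments(100, 10, 25, 3) shows a lower peak with three slots, because flushed segments no longer pile up behind a single slow merge.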

The Metrics: What the Simulator Measures

The simulator collects four primary metrics:

Write Amplification Factor (WAF) - The total bytes merged divided by the total bytes flushed. A WAF of 4.0 means the merge process wrote 4 bytes to disk for every 1 byte of flushed data. Lower WAF is better for write-heavy workloads because it reduces disk wear and indexing latency.
Final Segment Count - The number of segments remaining after all flushes and merges are complete. Fewer segments generally means better query performance, but the relationship is not linear. An index with 5 segments might query similarly to an index with 10 segments, but an index with 100 segments will be noticeably slower.
Number of Merges - The total number of merge operations executed. More merges mean more CPU and I/O work, but they also produce a more compact index.
Peak Segment Count - The maximum number of segments that existed at any point during the simulation. This matters for search performance during indexing, because a search query must check every segment. A high peak segment count means queries during heavy indexing will be slower.

Here is a sample output comparing three policies on a 100-flush, 10K-document, 1MB-flush, 5%-delete workload:

Workload: 100 flushes x 10K docs, 1.0 MiB flush, 5% deletes

+--------------------------------------------+--------+----------+--------+----------+-------------+
| Policy                                     | WAF    | Segments | Merges | Peak Segs| Flush Bytes |
+--------------------------------------------+--------+----------+--------+----------+-------------+
| TieredMergePolicy                          | 4.90   | 6        | 70     | 11       | 500.0 MiB   |
| LogByteSizeMergePolicy                     | 12.54  | 14       | 74     | 25       | 500.0 MiB   |
| TieredMergePolicy(segmentsPerTier=10.0)    | 5.21   | 9        | 54     | 13       | 500.0 MiB   |
+--------------------------------------------+--------+----------+--------+----------+-------------+

TieredMergePolicy achieves the lowest WAF (4.90) and the fewest segments (6), with 70 merges. LogByteSizeMergePolicy produces more segments (14) and a significantly higher WAF (12.54), meaning it writes roughly 2.6 times as much data to disk during merging. Setting segmentsPerTier to 10.0 increases WAF slightly (5.21) and leaves more segments (9 final, 13 peak), but reduces the merge count to 54, which might be desirable if merge CPU cost is a concern.
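Because WAF is defined here as bytes merged divided by bytes flushed, the table can be turned back into merge volume with one multiplication. The helper below is a hypothetical convenience, not part of LuceneSIM; it only restates that definition.

```java
/** Converts a WAF from the table back into merged volume; illustration only. */
final class WafArithmetic {

    /** WAF = merged / flushed, so merged = WAF * flushed. */
    static double mergedMiB(double waf, double flushedMiB) {
        return waf * flushedMiB;
    }

    public static void main(String[] args) {
        double flushed = 500.0; // "Flush Bytes" column of the sample output
        System.out.printf("TieredMergePolicy:      %.0f MiB%n", mergedMiB(4.90, flushed));
        System.out.printf("LogByteSizeMergePolicy: %.0f MiB%n", mergedMiB(12.54, flushed));
        System.out.printf("Tiered, 10 per tier:    %.0f MiB%n", mergedMiB(5.21, flushed));
    }
}
```

For the sample run that works out to about 2,450 MiB, 6,270 MiB, and 2,605 MiB of merge writes respectively.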

Why This Matters: The Real Value of Fast Simulation

The value of LuceneSIM is not in the absolute numbers. It is in the speed of comparison. When you can run three policies across six workloads in under a minute, you can explore the parameter space in a way that is impossible with real benchmarks. You can answer questions like:

How sensitive is TieredMergePolicy to delete rate? Does a 10% delete rate change WAF by 10% or 50%?
Does LogByteSizeMergePolicy ever produce fewer segments than TieredMergePolicy, or is it always worse for this workload class?
What is the optimal segmentsPerTier for a workload with 1MB flushes and no deletes? Is it 10, 15, or 20?
How does peak segment count scale with flush count? If I double my flushes, does the peak segment count double or grow logarithmically?

These are design questions that need fast answers. A full benchmark can validate the final choice, but it cannot explore the space. LuceneSIM fills that gap. It is a prototyping tool, not a production performance predictor.

The Implementation: How Fake Segments Feed Real Merge Policies

The simulator's core trick is creating SegmentCommitInfo objects that satisfy the merge policy's expectations without requiring a real index. The implementation uses a vendored lucene-core-11.0.0-SNAPSHOT.jar as its only dependency, ensuring that the merge policy code is the same code that runs in a real Lucene index.

Fake SegmentCommitInfo Creation

A SegmentCommitInfo in Lucene, together with the SegmentInfo it wraps, carries:

Segment name and generation
Document count
Delete count
Segment size (in bytes)
Whether the segment is in a compound file (CFS)
Generation numbers for live docs, field infos, and doc values updates, plus other metadata

The simulator creates a minimal SegmentCommitInfo with only the fields that the merge policy actually reads. For TieredMergePolicy, this is primarily the segment size, document count, and delete count. The simulator populates these fields with the workload-defined values and feeds the segment list to findMerges().

Merge Execution

When the merge policy returns a MergeSpecification, the simulator executes each merge by:

Summing the sizes of the segments to be merged.
Subtracting the prorated delete overhead. If the merged segments have a combined 10% deleted documents, the new segment size is 90% of the sum.
Creating a new fake SegmentCommitInfo with the combined document count (minus deleted docs), the computed size, and no deleted documents (because the merge reclaims deletes).
Retiring the source segments, that is, removing them from the live segment list so they are no longer visible to the merge policy.

This is a simplified model of what IndexWriter does during a real merge. The simplification is that the simulator does not write files, does not build a new inverted index, and does not re-sort documents. It only updates the segment metadata. For the merge policy's purposes, this is sufficient because the policy only cares about segment metadata when making the next merge decision.
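The metadata update described in the steps above can be sketched in a few lines. FakeSegment is a hypothetical plain-Java stand-in, not a Lucene class and not LuceneSIM's actual type; it only shows the arithmetic, using the combined delete ratio of the source segments.

```java
import java.util.List;

/** Hypothetical holder for the metadata a fake segment carries. */
final class FakeSegment {
    final String name;
    final long sizeBytes;
    final int maxDoc;     // documents in the segment, including deleted ones
    final int delCount;   // deleted documents

    FakeSegment(String name, long sizeBytes, int maxDoc, int delCount) {
        this.name = name;
        this.sizeBytes = sizeBytes;
        this.maxDoc = maxDoc;
        this.delCount = delCount;
    }

    /** Builds the merged segment's metadata from its sources. */
    static FakeSegment merge(String newName, List<FakeSegment> sources) {
        long sumBytes = 0;
        long totalDocs = 0;
        long totalDeleted = 0;
        for (FakeSegment s : sources) {
            sumBytes += s.sizeBytes;
            totalDocs += s.maxDoc;
            totalDeleted += s.delCount;
        }
        long liveDocs = totalDocs - totalDeleted;
        // Prorate the summed size by the share of live documents.
        // Long arithmetic is fine for a sketch; very large inputs could overflow.
        long mergedBytes = totalDocs == 0 ? sumBytes : sumBytes * liveDocs / totalDocs;
        // The merge reclaims deletes, so the new segment starts with none.
        return new FakeSegment(newName, mergedBytes, (int) liveDocs, 0);
    }
}
```

Merging two 100 MB segments of 10,000 documents each, both with 1,000 deletes, yields a segment of 180 MB with 18,000 documents and no deletes, matching the 10% example given earlier.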

The Gold Test: Validating Against Real IndexWriter

The simulator includes a GoldTest that validates the simulation against a real IndexWriter. The test runs the same workload on both the simulator and a real index (held in an in-memory Directory to avoid disk I/O; on current Lucene that means ByteBuffersDirectory, since RAMDirectory was removed in Lucene 9) and compares the final segment counts. The simulator is considered valid if the segment count matches within ±1 segment. This test ensures that the fake segment model is accurate enough for policy comparison.

The test exists because the simulator makes assumptions about segment size growth and delete distribution. If these assumptions diverge from reality, the simulator's results become misleading. The GoldTest catches this divergence by comparing against the ground truth of a real IndexWriter.

Limitations and Honest Assessment

This is a small utility. It is not a contribution to Lucene core, it is not a benchmarking framework, and it is not a production tool. It is a personal project that solved a specific problem I had while trying to understand merge policy behavior. The code is straightforward, the scope is narrow, and the value is in the speed of iteration, not in the depth of analysis.

Lucene already has excellent benchmarking tools. The official benchmark module can run full indexing and search workloads with realistic document collections. LuceneSIM does not compete with that. It is a complement: a quick way to compare policies before running the real benchmark. If you need to know absolute performance, use the official tools. If you need to know why one policy produces 6 segments and another produces 14, LuceneSIM can tell you in seconds.

The project is also limited by its assumptions. Random delete distribution is a simplification. Real workloads often have temporal delete patterns (delete all documents from last week), which the simulator does not model. The concurrent scheduler assumes infinite I/O, which real hardware does not provide. The size estimation is a heuristic, not a measurement. These are not bugs; they are trade-offs made to keep the simulation fast and focused.

Getting Started

LuceneSIM requires Java 21. The build uses Gradle, and the only runtime dependency is the vendored Lucene core JAR. Clone the repository and run:

./gradlew run              # default: 100 flushes, 10K docs, 1MB flush, no deletes
./gradlew run --args="--sweep"  # compare across 6 workload shapes
./gradlew test             # run GoldTest, WAFMathTest, CalibrationTest, ConcurrentSchedulerTest

The source code is Apache 2.0 licensed with standard ASF headers. It is a small codebase, easy to read, and easy to modify if you want to add custom workloads or merge policies.

Conclusion

Merge policy tuning is a slow process if you rely on full benchmarks for every iteration. LuceneSIM provides a faster path by simulating merge decisions using fake segments and real policy code. It is not a replacement for real benchmarks, but it is a useful pre-filter. The simulator runs in under a second, supports any merge policy that Lucene supports, and validates its results against real IndexWriter behavior.

The tool is honest about its limitations. It does not model I/O, it does not index real documents, and it makes simplifying assumptions about delete distribution. But for the specific task of comparing merge policies and tuning parameters, it is fast enough to enable exploration that would be impractical with real benchmarks. If you are trying to understand why TieredMergePolicy behaves differently from LogByteSizeMergePolicy on your workload, LuceneSIM can give you an answer in the time it takes to read this paragraph.

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and open source enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.