This is the seventh article in the Exeris Kernel series.
TL;DR: Direct ByteBuffer gives you zero-copy but defers cleanup to the GC.
Arena gives you deterministic cleanup but scopes ownership to a single region.
Neither models a buffer shared across threads with a lifetime longer than any
single fork. LoanedBuffer is the third option Exeris needed: explicit
reference counting, try-with-resources for the boring case, and EX-MEM-1003
when discipline fails. The cost is honest - the compiler will not catch a
missed retain() before a fork().
You can move TLS off the heap. You can ban ThreadLocal
and replace it with ScopedValue. You can structure your concurrency with
StructuredTaskScope. You can
push the TLS boundary into Panama FFM
so that cipher operations no longer allocate. And you will still find allocation
pressure on the hot path - because somewhere, something is moving bytes through a
byte[].
This article is about that boundary.
It is also where one specific JVM-era assumption finally breaks: that direct
ByteBuffer is "the off-heap one." Direct ByteBuffer solves zero-copy. It does
not solve ownership. In a runtime where memory lifecycle is supposed to be part of
the architecture, those are not the same problem.
The Constraint
When I started designing the memory subsystem for Exeris, I had two constraints
that had to hold simultaneously on the request hot path:
1. Zero copy. Bytes coming off a socket, through TLS, through HTTP framing, into a request handler - none of those steps may allocate a new array and copy.
2. Deterministic cleanup. When work on a buffer is done, native memory must be released now, not whenever the GC notices.
Most JVM memory abstractions give you exactly one of these.
byte[] gives you neither. Heap allocation, GC-managed lifecycle, and a copy every
time you cross a native boundary.
ByteBuffer.wrap(byte[]) is a heap buffer with a different name. Same problems.
ByteBuffer.allocateDirect(n) gives you (1) but not (2). The buffer's storage is off-heap, so
crossing into native code does not require a copy. But the underlying memory is freed
when a Cleaner thread observes that the ByteBuffer reference has become unreachable.
Under load, this means buffers can survive arbitrarily long after the work is done.
You do not control when. You cannot ask. There is no close().
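A small check makes the distinction concrete. This is only a sketch using standard java.nio calls; the class name BufferKinds and its helper methods are made up for illustration.

```java
import java.nio.ByteBuffer;

final class BufferKinds {
    /** True when the buffer sits on an accessible heap array rather than native memory. */
    static boolean isHeapBacked(ByteBuffer buffer) {
        return buffer.hasArray() && !buffer.isDirect();
    }

    /** ByteBuffer is not AutoCloseable, so there is no close() to call on it. */
    static boolean hasDeterministicClose() {
        return AutoCloseable.class.isAssignableFrom(ByteBuffer.class);
    }

    public static void main(String[] args) {
        ByteBuffer wrapped = ByteBuffer.wrap(new byte[64]);
        ByteBuffer direct = ByteBuffer.allocateDirect(64);
        System.out.println("wrap() heap-backed:   " + isHeapBacked(wrapped));
        System.out.println("direct heap-backed:   " + isHeapBacked(direct));
        System.out.println("direct isDirect():    " + direct.isDirect());
        System.out.println("ByteBuffer closeable: " + hasDeterministicClose());
    }
}
```

The last check is the one that matters for this article: the type offers no hook for releasing the native memory at a point of your choosing.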
Arena.allocate(layout) from Panama FFM gives you (2) but with a coarser ownership
model. An Arena owns a region of memory; closing the arena releases everything in
it. That is fine for a request lifetime. It is less fine when a buffer is shared
between threads, or transferred from one task to another and released by the second,
or part of an in-flight queue.
I needed both. And I needed them composable.
ByteBuffer Solves Half the Problem
What stopped me from using ByteBuffer directly was not API ergonomics. It was
that ByteBuffer does not have an ownership model.
It has an access model - position, limit, capacity, slice, duplicate - but
nothing that answers the question "who is responsible for releasing this memory,
and when?". Direct buffers defer that question to the GC. Heap-backed buffers
defer it to the GC twice - once for the buffer object, once for the array it wraps.
That works fine when allocations are infrequent and lifetimes are obvious. It
breaks when bytes are flowing through 1-VT-per-stream concurrency at network
speed and a Cleaner thread is the only thing standing between you and a slow,
silent native heap leak.
The standard JVM workaround is buffer pooling: keep a ConcurrentLinkedQueue of
direct ByteBuffer instances, hand them out, and trust callers to return them.
This works in practice and underpins frameworks like Netty. It also reintroduces
the exact problem the GC was trying to solve: explicit lifecycle management, with
the additional twist that forgetting to return is now silent - the buffer just
sinks into orphan memory without a Cleaner event to notice.
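To make the "forgetting to return is silent" point tangible, here is a deliberately naive pool. It is a sketch of the general pooling idea only, not how Netty or Exeris implement it; NaiveDirectPool and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Naive pool of direct buffers; nothing forces a borrower to call release(). */
final class NaiveDirectPool {
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();
    private final int bufferSize;

    NaiveDirectPool(int buffers, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < buffers; i++) {
            free.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    /** Hands out a pooled buffer, or allocates a fresh one when the pool is empty. */
    ByteBuffer acquire() {
        ByteBuffer pooled = free.poll();
        return pooled != null ? pooled : ByteBuffer.allocateDirect(bufferSize);
    }

    /** Returning is voluntary: the pool cannot tell a forgotten buffer from a busy one. */
    void release(ByteBuffer buffer) {
        buffer.clear();
        free.add(buffer);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        NaiveDirectPool pool = new NaiveDirectPool(2, 4096);
        ByteBuffer a = pool.acquire();
        ByteBuffer b = pool.acquire();   // never released below
        pool.release(a);
        System.out.println("available after one forgotten release: " + pool.available());
    }
}
```

The pool shrinks by one and no error, log line, or event marks the spot where the second buffer went missing.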
What I wanted was the lifecycle discipline of an Arena combined with the
flexibility of a pooled buffer - and a way to share ownership across threads
without either a lock or a leak.
That is what LoanedBuffer is. The rest of this article is what it cost.
The Loan Pattern
The basic shape is unsurprising. LoanedBuffer implements AutoCloseable. You
allocate via the SPI. You use try-with-resources:
    try (LoanedBuffer buffer = allocator.allocate(AllocationHint.MEDIUM)) {
        buffer.writeBytes(payload, 0, payload.length);
        transport.send(buffer);
    }
Three things are happening here that ByteBuffer does not give you.
First, the allocator is injected via SPI. The application code does not know
whether the underlying allocator is a slab pool, a partitioned arena, or a
specialized native pool optimized for a specific transport. Implementation
blindness is preserved - Core operates exclusively on the MemoryAllocator
contract, resolved at bootstrap via ServiceLoader:
    MemoryAllocator allocator = ServiceLoader.load(MemoryAllocator.class)
        .findFirst()
        .orElseThrow(() -> new KernelBootstrapException(KernelErrorCodes.EX_BOOT_0002));
Second, AllocationHint is a typed enum, not a raw byte count. The hint tells
the allocator which size class is wanted (SMALL, MEDIUM, LARGE,
NETWORK_FRAME). The allocator picks a slab from the matching pool. There is
no math at the call site, no rounding, no "did I just trigger a slow path?".
Third, close() is deterministic and immediate. When the try block exits,
the slab returns to its pool now. The watermark manager updates now. There
is no Cleaner thread, no PhantomReference, no waiting for GC.
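The single-owner shape can be reduced to a toy model. The sketch below is not the kernel's code: LoanSketch, SizeHint, Loan, and SlabPool are hypothetical stand-ins, the byte sizes are invented, and the pool is single-threaded to keep the example short.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.EnumMap;
import java.util.Map;

final class LoanSketch {
    /** Hypothetical size classes; the byte sizes are illustrative only. */
    enum SizeHint {
        SMALL(256), MEDIUM(4096), LARGE(65536);
        final int bytes;
        SizeHint(int bytes) { this.bytes = bytes; }
    }

    /** Single-owner loan: close() hands the slab straight back to its pool. */
    static final class Loan implements AutoCloseable {
        private final SlabPool pool;
        private final SizeHint hint;
        private ByteBuffer slab;

        Loan(SlabPool pool, SizeHint hint, ByteBuffer slab) {
            this.pool = pool;
            this.hint = hint;
            this.slab = slab;
        }

        ByteBuffer bytes() {
            if (slab == null) throw new IllegalStateException("loan already closed");
            return slab;
        }

        @Override
        public void close() {
            if (slab == null) throw new IllegalStateException("loan closed twice");
            pool.giveBack(hint, slab);   // immediate return, no GC involved
            slab = null;
        }
    }

    /** Toy single-threaded pool with one free list per size class. */
    static final class SlabPool {
        private final Map<SizeHint, ArrayDeque<ByteBuffer>> free = new EnumMap<>(SizeHint.class);

        Loan allocate(SizeHint hint) {
            ByteBuffer slab = free.computeIfAbsent(hint, h -> new ArrayDeque<>()).poll();
            if (slab == null) slab = ByteBuffer.allocateDirect(hint.bytes);
            return new Loan(this, hint, slab);
        }

        void giveBack(SizeHint hint, ByteBuffer slab) {
            slab.clear();
            free.computeIfAbsent(hint, h -> new ArrayDeque<>()).push(slab);
        }

        int freeCount(SizeHint hint) {
            ArrayDeque<ByteBuffer> queue = free.get(hint);
            return queue == null ? 0 : queue.size();
        }
    }
}
```

With this shape, the free list for a size class grows by one the moment a try-with-resources block over a Loan exits.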
That is the boring, single-owner case. The pattern earns its name in the next
case - the one ByteBuffer cannot model at all.
Reference Counting with VarHandle
Inside the kernel, a buffer often has more than one logical owner. The transport
layer wants to hold it while the request handler is reading. The handler wants to
hold it while async work is in flight. The async work might want to retain it for
a follow-up operation.
LoanedBuffer solves this with explicit reference counting. Allocation starts
the count at one. retain() increments. close() decrements. When the count
reaches zero, the slab returns to the pool.
The implementation uses VarHandle for the CAS path. No synchronized, no
AtomicInteger allocation per buffer, no monitor inflation. The reference count
is a primitive int field on the buffer itself, accessed through a
class-level VarHandle:
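    // Adds one logical owner to the buffer.
    public final void retain() {
        REF_COUNT_HANDLE.getAndAdd(this, 1);
    }

    // Drops one owner; the caller that takes the count from 1 to 0 runs the close actions.
    public final void close() {
        int previous = (int) REF_COUNT_HANDLE.getAndAdd(this, -1);
        if (previous == 1) {
            fireCloseActions();
        }
    }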
Calling close() more times than the buffer was retained is a fatal contract violation.
Calling retain() on a non-owning view - for example, a peek() slice that
exposes a memory region without transferring ownership - is also a contract
violation. The kernel emits EX-MEM-1003 (Peek View Ownership Misuse) as a
glass-box telemetry event when this happens, with the calling method captured in
rawArgs[0]. The call itself is a no-op: it neither increments the count nor
returns silently. It is logged and refused.
The point of refusing is not to be punitive. It is that an unobserved
retain()-on-view bug becomes a use-after-free somewhere else, on a different
thread, at an unpredictable time. Failing fast and loudly at the misuse site
makes the bug local instead of distributed.
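For readers who want to experiment with the idea, here is a self-contained sketch of a VarHandle-based count that refuses misuse at the call site. It swaps the getAndAdd calls for a compare-and-set loop so that an over-close or a retain after release can be detected before the count changes, and it throws instead of emitting telemetry. All names (RefCountSketch, onRelease) are hypothetical; this is not the kernel's AbstractLoanedBuffer.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Sketch of a CAS-based reference count on a plain int field. */
final class RefCountSketch {
    private static final VarHandle REF_COUNT;

    static {
        try {
            REF_COUNT = MethodHandles.lookup()
                    .findVarHandle(RefCountSketch.class, "refCount", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private volatile int refCount = 1;   // allocation starts the count at one
    private final Runnable onRelease;

    RefCountSketch(Runnable onRelease) {
        this.onRelease = onRelease;
    }

    /** Adds an owner; refuses to resurrect a buffer whose count already reached zero. */
    void retain() {
        int current;
        do {
            current = (int) REF_COUNT.getVolatile(this);
            if (current <= 0) {
                throw new IllegalStateException("retain() after release");
            }
        } while (!REF_COUNT.compareAndSet(this, current, current + 1));
    }

    /** Drops an owner; whoever moves the count from 1 to 0 runs the release action. */
    void close() {
        int current;
        do {
            current = (int) REF_COUNT.getVolatile(this);
            if (current <= 0) {
                throw new IllegalStateException("close() without a matching retain()");
            }
        } while (!REF_COUNT.compareAndSet(this, current, current - 1));
        if (current == 1) {
            onRelease.run();
        }
    }

    int refCount() {
        return refCount;
    }
}
```

The trade-off in this variant is a retry loop under contention in exchange for never letting the count go negative or climb back up from zero.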
The Async Transfer Problem
This is the case that motivated the design. I never had to debug it in
production - I caught it on paper while sketching the ownership model, and
the design followed from there.
A request arrives. The handler reads it into a buffer. The handler then forks two
async sub-tasks under a StructuredTaskScope: one to validate, one to enrich.
Both sub-tasks need to read the same buffer. The handler joins both, then
serializes the response.
In the standard JVM model, this is a sharing problem with no good answer. If you
pass a ByteBuffer to two virtual threads, you have just created an aliasing
hazard with no concurrency model. If you copy the buffer twice, you have just
defeated zero-copy.
In the LoanedBuffer model, sharing is explicit:
    try (var scope = StructuredTaskScope.open(Joiner.awaitAllSuccessfulOrThrow())) {
        try (LoanedBuffer buffer = allocator.allocate(AllocationHint.NETWORK_FRAME)) {
            buffer.retain();
            scope.fork(() -> {
                try {
                    return processAsync(buffer);
                } finally {
                    buffer.close();
                }
            });
            scope.join();
        }
    }
The pattern is:
- The allocator returns the buffer with refCount = 1.
- Before forking, the parent calls retain(). Count is now 2.
- The parent forks the sub-task. The sub-task runs concurrently.
- The sub-task closes the buffer when done. Count drops to 1.
- The parent's try-with-resources closes when the outer block exits. Count drops to 0. Slab returns to the pool.
This is dependency-safe because retain() happened before the fork. If a
caller forgets the retain(), the parent's close() can race the sub-task's
read, and the sub-task observes a slab that has already been recycled. The kernel
catches this in its TCK suite, but the contract is the caller's to honor - there
is no automatic retain on fork. I considered making scope.fork() automatically
retain the buffer if a special wrapper type was passed in, but the cost was
introducing a parallel API surface for what is fundamentally a discipline issue.
The current design keeps the rule visible at the call site: if you fork it, you
retain it first.
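The same rule can be exercised without preview APIs. The sketch below replaces StructuredTaskScope with a plain ExecutorService and uses a hypothetical Shared class with an AtomicInteger count, so it illustrates the retain-before-handoff discipline rather than the kernel's actual types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class RetainBeforeForkSketch {
    /** Hypothetical refcounted buffer; AtomicInteger keeps the sketch short. */
    static final class Shared implements AutoCloseable {
        private final AtomicInteger refs = new AtomicInteger(1);
        private final AtomicInteger releases = new AtomicInteger();
        private final byte[] payload = {1, 2, 3, 4};

        void retain() { refs.incrementAndGet(); }

        @Override
        public void close() {
            if (refs.decrementAndGet() == 0) {
                releases.incrementAndGet();   // stands in for "slab returns to the pool"
            }
        }

        int sum() {
            int total = 0;
            for (byte b : payload) total += b;
            return total;
        }

        int refs() { return refs.get(); }
        int releases() { return releases.get(); }
    }

    /** Retains once per sub-task before the handoff; each sub-task closes in finally. */
    static int fanOut(Shared buffer, int subTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(subTasks);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < subTasks; i++) {
                buffer.retain();                  // on the parent thread, before submit()
                futures.add(pool.submit(() -> {
                    try {
                        return buffer.sum();
                    } finally {
                        buffer.close();           // the sub-task gives back its own reference
                    }
                }));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

After fanOut returns, the count is back to the parent's single reference, and the parent's own close() is the one that triggers the release.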
This is also the place where the STS-bootstrap article's pattern pays off
again. There, the structured scope owned a startup round - a bounded unit of
work with a clear lifetime. Here, the structured scope owns a fan-out unit with
the same clarity, but with an additional resource - the buffer - whose lifetime
is longer than any single fork and shorter than the enclosing scope. The
ownership model has to support that.
StructuredTaskScope does not. LoanedBuffer does.
The JMM contract underneath this is worth stating directly because it is easy
to get wrong. There are no explicit memory fences in the close-action path. The
visibility of close-action slots written by the allocating thread to the
releasing thread is guaranteed by safe publication of the buffer reference itself.
When the parent passes the buffer into scope.fork(), the JDK's structured-scope
implementation publishes the reference safely - that publication is what makes
all the buffer's fields visible to the sub-task, including the close-action
chain. If you bypassed scope.fork() and handed the buffer to another thread
through, say, a non-volatile field, the model breaks.
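The visibility argument can be illustrated with any handoff that carries a happens-before edge. The sketch below uses a BlockingQueue instead of scope.fork(), relying on the documented java.util.concurrent guarantee that actions before placing an object into a concurrent collection happen-before actions after its removal in another thread. Holder and closeActions are hypothetical names standing in for the buffer and its close-action slots.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class PublicationSketch {
    /** Plain (non-volatile, non-final) state, written only by the allocating thread. */
    static final class Holder {
        int closeActions;
    }

    /** Hands a Holder to another thread through a queue and returns what that thread saw. */
    static int handOffThroughQueue() {
        BlockingQueue<Holder> queue = new ArrayBlockingQueue<>(1);
        int[] seen = new int[1];
        Thread consumer = new Thread(() -> {
            try {
                seen[0] = queue.take().closeActions;   // ordered after the producer's put()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        Holder holder = new Holder();
        holder.closeActions = 3;        // plain write, no fence of its own
        try {
            queue.put(holder);          // the queue supplies the happens-before edge
            consumer.join();            // join() makes seen[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }
}
```

Replacing the queue with a plain static field would remove that edge, and the consumer would no longer be guaranteed to see the write to closeActions.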
This is also why the Community transport's allocator uses shared-arena semantics
for all allocations rather than Arena.ofConfined(). The carrier thread
allocates, but the per-stream Virtual Thread closes - different threads, same
buffer. A confined arena would refuse the cross-thread close(). Shared arena
allows it, with the JMM safely-published buffer reference carrying visibility.
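The confined-versus-shared difference is easy to reproduce with the standard FFM API. This sketch assumes JDK 22 or later, where java.lang.foreign is final; ArenaCloseSketch and closeOnOtherThread are hypothetical names.

```java
import java.lang.foreign.Arena;
import java.util.concurrent.atomic.AtomicReference;

final class ArenaCloseSketch {
    /** Closes the arena on a different thread and returns the failure, or null on success. */
    static Throwable closeOnOtherThread(Arena arena) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread closer = new Thread(() -> {
            try {
                arena.close();
            } catch (RuntimeException e) {
                failure.set(e);   // a confined arena rejects a foreign thread here
            }
        });
        closer.start();
        try {
            closer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Arena confined = Arena.ofConfined();
        confined.allocate(64);
        System.out.println("confined: " + closeOnOtherThread(confined));
        confined.close();   // the owner thread can still close it

        Arena shared = Arena.ofShared();
        shared.allocate(64);
        System.out.println("shared:   " + closeOnOtherThread(shared));
    }
}
```

A confined arena reports a WrongThreadException for the cross-thread close, while the shared arena closes cleanly from any thread.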
Watermarks and the Pressure Boundary
LoanedBuffer solves per-buffer ownership. It does not solve aggregate pressure.
When the slab pools start running low, the kernel needs to know - and decide
what to do about it - before an EX-MEM-1001 (Off-heap Exhausted) gets
thrown on a request hot path. That is the job of WatermarkManager.
The manager exposes four threshold levels:
| Level | Off-heap utilization | ResourceArbiter decision |
|---|---|---|
| NORMAL | < 70% | ALLOW - allocations proceed |
| WARNING | 70–85% | THROTTLE - large allocations rejected |
| CRITICAL | 85–95% | REJECT - only essential traffic |
| SHEDDING | ≥ 95% | SHED_LOAD - EX-MEM-1001 thrown |
The ResourceArbiter reads the current level on each allocation request:
    public LoanedBuffer tryAllocate(AllocationHint hint) {
        if (watermark.isHighWatermarkBreached()) {
            throw new MemoryExhaustedException(hint.bytes(), watermark.availableBytes());
        }
        return allocator.allocate(hint);
    }
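The four thresholds translate into a small mapping. The sketch below is an assumption-laden illustration: WatermarkSketch, Level, and Decision are hypothetical names, and it treats each lower bound as inclusive because the exact boundary handling is not spelled out here.

```java
final class WatermarkSketch {
    /** Threshold levels, named after the four levels the manager exposes. */
    enum Level { NORMAL, WARNING, CRITICAL, SHEDDING }

    /** Arbiter decisions, one per level. */
    enum Decision { ALLOW, THROTTLE, REJECT, SHED_LOAD }

    /** Maps off-heap usage to a level; lower bounds are inclusive in this sketch. */
    static Level levelFor(long usedBytes, long capacityBytes) {
        if (capacityBytes <= 0) {
            throw new IllegalArgumentException("capacityBytes must be positive");
        }
        double utilization = (double) usedBytes / capacityBytes;
        if (utilization >= 0.95) return Level.SHEDDING;
        if (utilization >= 0.85) return Level.CRITICAL;
        if (utilization >= 0.70) return Level.WARNING;
        return Level.NORMAL;
    }

    /** One decision per level, so callers react to a typed signal rather than raw numbers. */
    static Decision decide(Level level) {
        return switch (level) {
            case NORMAL -> Decision.ALLOW;
            case WARNING -> Decision.THROTTLE;
            case CRITICAL -> Decision.REJECT;
            case SHEDDING -> Decision.SHED_LOAD;
        };
    }
}
```

Keeping the level and the decision as separate enums lets other subsystems read the level without being tied to the arbiter's policy.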
This is where LoanedBuffer connects forward to the next architectural layer.
The watermark levels are not just internal accounting - they expose pressure as a
typed signal (WatermarkLevel) that the rest of the kernel - scheduling,
admission, business logic - can react to without inspecting GC counters or
parsing JFR events at runtime. How the transport edge uses that signal to shed
load before work hits the kernel is a separate decision and belongs to its own
article.
Leak Detection: When Discipline Fails
The Loan pattern relies on discipline. Every allocation must be paired with a
close(). Every retain() must be paired with another close(). There is no
GC fallback.
In production, that discipline is enforced by the API surface - try-with-resources,
sealed types, the Glass Box telemetry of EX-MEM-1003. In development and
testing, it is enforced by LeakTracker, which integrates java.lang.ref.Cleaner
to detect buffers that became unreachable without being closed.
When LeakTracker runs in PARANOID mode and observes a LoanedBuffer whose
reference count is non-zero at GC time, it emits EX-MEM-1002 (Arena Leak
Detected):
| Code | Meaning | Glass-Box Payload (rawArgs) |
|---|---|---|
| EX-MEM-1001 | Off-heap Exhausted | [0] long requestedBytes, [1] long availableBytes |
| EX-MEM-1002 | Arena Leak Detected | [0] long segmentAddress, [1] long segmentByteSize |
| EX-MEM-1003 | Peek View Ownership Misuse | [0] String callerMethod |
The leak is logged with the segment address and size, and a JFR event is
emitted. In a long-running test, this turns "I forgot a close somewhere" into a
specific, actionable signal with a stack trace.
This does not catch all leaks. A reference held by a long-lived data structure
will not be GC'd, and LeakTracker will not fire. The terminalStateCatalog
discipline I described in the Flow article
applies here too: long-lived in-memory caches need their own bounded retention
policy. The pattern catches forgotten references, not intentionally retained
ones.
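The Cleaner-based mechanism can be sketched in a few lines. This is an illustration of the general technique with java.lang.ref.Cleaner, not the kernel's LeakTracker: LeakTrackerSketch, State, TrackedBuffer, and the LEAKS counter are hypothetical, and a counter stands in for the telemetry event.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class LeakTrackerSketch {
    private static final Cleaner CLEANER = Cleaner.create();
    static final AtomicInteger LEAKS = new AtomicInteger();

    /** Cleanup state; it must not reference the buffer, or the buffer never becomes unreachable. */
    static final class State implements Runnable {
        final AtomicBoolean closed = new AtomicBoolean();
        final long sizeBytes;

        State(long sizeBytes) {
            this.sizeBytes = sizeBytes;
        }

        @Override
        public void run() {
            if (!closed.get()) {
                LEAKS.incrementAndGet();   // a real tracker would emit telemetry here
            }
        }
    }

    static final class TrackedBuffer implements AutoCloseable {
        private final State state;
        private final Cleaner.Cleanable cleanable;

        TrackedBuffer(long sizeBytes) {
            this.state = new State(sizeBytes);
            this.cleanable = CLEANER.register(this, state);
        }

        @Override
        public void close() {
            state.closed.set(true);
            cleanable.clean();   // runs the action now, at most once; it sees closed == true
        }
    }
}
```

If a TrackedBuffer becomes unreachable without close(), the Cleaner eventually runs State.run() with closed still false, which is the leak signal. When that happens depends on the GC, which is one reason to treat this as a development aid.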
The trade-off is honest. LeakTracker is a development and staging tool, not a
production safety net. In production, the API surface and code review are the
primary defense. In development, PARANOID mode is the difference between
"there is a leak somewhere in 50k LOC" and "the leak is in OrderHandler.java
line 142, allocated from NetworkFrameSlabPool, 4096 bytes."
What Still Remains True
A few things stay true even after this model is in place. Some of them are the
reasons not to adopt it.
ByteBuffer is still the right answer for most Java applications. If you are
building a normal HTTP service and your bottleneck is not allocation pressure
on the request hot path, the Loan Pattern is over-engineering. It costs
cognitive load on every read path, every fork, every cross-thread handoff. That
cost is justified by the constraint, not by aesthetics.
Arena is still useful for request-lifetime allocations that do not need
sharing. Inside a single Virtual Thread, with a clear scope boundary, an
Arena.ofConfined() is simpler than a refcounted buffer. The kernel uses both
patterns where each fits.
GC is still your friend for object graphs. Nothing in the Loan Pattern says
"never allocate on the heap." The pattern is specific to off-heap memory on
the request hot path. Domain objects, plan instances, log records - all of
those still live where Java has always put them.
The pattern does not solve cross-process IPC. If a buffer leaves the process
- shipped over a network, written to a shared memory file, handed to a
different JVM - the reference count stops being meaningful. The
LoanedBuffer model is correct only for in-process lifetime. The handoff to a
different process is its own boundary problem with its own ownership semantics.
Finally, the Loan Pattern does not soften the discipline cost. Every fork must
retain. Every share must retain/close. Every async path must close in finally.
The compiler will not catch the omissions for you. The code review will. The
TCK will. LeakTracker will, in development. The runtime will not.
I considered making this less explicit - a wrapper type that auto-retained on
escape from a method, an annotation that made the compiler enforce paired
close(). Each of those would have either added runtime overhead, added a
parallel API, or relied on a static analyzer that did not exist. The current
design accepts the discipline cost as the price of the architectural model.
The thing I keep coming back to is that this is not a clever data structure.
It is a contract - who owns this memory, who is allowed to extend its
lifetime, and what is observable when someone gets it wrong. The implementation
is unremarkable: a VarHandle, an int, a close-action chain. The work is
deciding that ownership belongs in the type system at all, and accepting that
the discipline cost is the price of the model.
The next architectural decision in the series is what the kernel does with the
pressure signal once it has one - how WatermarkLevel becomes a shed decision
at the network edge, and the single place in the kernel where unstructured
Virtual Threads are deliberately allowed.
Explore the Exeris Kernel - zero-allocation architecture in running code:
🔗 exeris-systems/exeris-kernel
The Memory subsystem lives in exeris-kernel-spi (MemoryAllocator,
LoanedBuffer, AllocationHint) and exeris-kernel-core
(AbstractLoanedBuffer, WatermarkManager, ResourceArbiter, LeakTracker).
If you want to see how the refcount / retain() / close() contract behaves
under cross-thread fork-and-join, the TCK suite in exeris-kernel-tck is the
fastest way in.












