150
Points
57
Comments
justinweiss
Author

Top Comments

kstenerud May 29
The problem is that this breaks down once you try to use SIMD instructions. I developed a similar approach to encoding integers (and IEEE 754 floats) a couple of years ago (the first byte encodes the length and the first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.

But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.

The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.

i2talics May 29
Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.

The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!

The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for a 32-bit LEB128 value), so that when the references are patched during linking, no code is moved around. Any integer can be converted to a non-canonical 5-byte form that is "wasteful", but it's a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.
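
A minimal Python sketch of that padding trick, assuming unsigned LEB128 and a 5-byte slot for a 32-bit value. The function names are made up for illustration and are not taken from any DWARF or WASM toolchain:

```python
def encode_uleb128(value: int) -> bytes:
    """Canonical (shortest) unsigned LEB128 encoding."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_uleb128_padded(value: int, width: int = 5) -> bytes:
    """Non-canonical unsigned LEB128, always exactly `width` bytes long."""
    if not 0 <= value < (1 << (7 * width)):
        raise ValueError("value does not fit in the requested width")
    out = bytearray()
    for i in range(width):
        byte = (value >> (7 * i)) & 0x7F
        if i < width - 1:
            byte |= 0x80  # force continuation on every byte except the last
        out.append(byte)
    return bytes(out)


def decode_uleb128(data: bytes) -> int:
    """Decode one unsigned LEB128 value; accepts padded forms too."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return result
    raise ValueError("truncated LEB128 value")


# A placeholder and its patched replacement occupy the same 5 bytes,
# so nothing around them has to move.
placeholder = encode_uleb128_padded(0)
patched = encode_uleb128_padded(0x12345)
```

A standard decoder reads both forms to the same value, which is exactly why the encoding is non-canonical: `encode_uleb128(3)` is 1 byte and `encode_uleb128_padded(3)` is 5 bytes.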

stebalien May 29
I've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).

The downside is the encoding size. LEB128 grows to 2 bytes quickly (at 128), but stays at 2 bytes all the way up to 2^14. This is important if you're using these numbers as tags/identifiers, as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 numbers that fit in 2 bytes or fewer.
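
For the LEB128 side of that comparison, the encoded size follows directly from the bit length of the value. This is a small sketch with a hypothetical helper name; it does not model bijou64:

```python
def uleb128_size(value: int) -> int:
    """Number of bytes unsigned LEB128 needs: 7 payload bits per byte."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    # Zero has bit_length() == 0 but still takes one byte.
    return max(1, (value.bit_length() + 6) // 7)


# Boundaries of the 1-byte and 2-byte ranges, plus the largest uint64.
sizes = {n: uleb128_size(n) for n in (127, 128, 2**14 - 1, 2**14, 2**64 - 1)}
```

Under this rule the 2-byte range runs from 128 through 2^14 - 1, and the largest uint64 needs the 10th byte mentioned above.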

[1]: https://github.com/multiformats/multicodec

boricj May 29
I'm working on a C++ library at work that binds wire data and application data through token and model layers, which includes among other things a fair amount of tokenizers/composers for various formats (JSON, CBOR, BSON, CSV...).

This looks neat, but if encoding/decoding performance is important, payload size isn't, and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
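
As a rough illustration of that fixed-size alternative, here is a sketch using Python's `struct` module. The 64-bit width and little-endian byte order are assumptions for the example, not something the comment specifies:

```python
import struct

U64_MAX = 0xFFFFFFFFFFFFFFFF


def encode_u64_fixed(value: int) -> bytes:
    """Always 8 bytes, little-endian; no length logic at all."""
    if not 0 <= value <= U64_MAX:
        raise ValueError("value out of range for an unsigned 64-bit integer")
    return struct.pack("<Q", value)


def decode_u64_fixed(data: bytes, offset: int = 0) -> int:
    """Read an unsigned 64-bit little-endian integer at `offset`."""
    return struct.unpack_from("<Q", data, offset)[0]
```

Every value costs 8 bytes, but the decoder never has to inspect the data to find out how long it is.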

LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.

I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.

omoikane May 29
UTF-8 has the same issue ("overlong encoding"), where multiple representations are possible for the same code point. Someone proposed a similar tweak to remove the overlapping ranges by adjusting the base offset for byte sequences that are longer than 1. That was discussed here:

https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)

This "corrected UTF-8" has other problems, but I thought it was interesting how the shifted-offset idea carries over.
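
To make the overlong problem concrete, here is a small sketch of a 2-byte decoder that skips the overlong check, next to a shifted-offset variant. Both functions are hypothetical illustrations; the offset variant only shows the general idea and is not the actual "corrected UTF-8" proposal:

```python
def _raw_2byte(data: bytes) -> int:
    """Extract the 11 payload bits of a 110xxxxx 10xxxxxx sequence."""
    lead, cont = data
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a 2-byte UTF-8 style sequence")
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def naive_decode_2byte(data: bytes) -> int:
    """Standard bit layout, but WITHOUT rejecting overlong forms."""
    return _raw_2byte(data)


def offset_decode_2byte(data: bytes) -> int:
    """Shifted-offset idea: 2-byte sequences start counting at 0x80,
    so they cannot overlap the values covered by 1-byte sequences."""
    return 0x80 + _raw_2byte(data)


# b"\xc1\x81" is an overlong spelling of "A" (U+0041) for the naive decoder.
overlong_a = naive_decode_2byte(b"\xc1\x81")
```

Python's own decoder refuses the overlong form: `b"\xc1\x81".decode("utf-8")` raises `UnicodeDecodeError`, because real UTF-8 fixes the ambiguity by validation rather than by shifting offsets.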

MarkusQ May 29
> This causes problems for signed data

Given that the context up to this point had been the representation of integers, I initially tripped on this. :)

wahern May 29
This reminded me of ISO 7816-4 BER-TLV encodings, which use the format defined in ISO/IEC 8825-1. Length values of 0-127 are encoded in 1 byte. If the high bit of the first byte is set, then its low 7 bits tell you the number of subsequent octets. So there's no offsetting involved, making it slightly less compact, but also dead simple.
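
A sketch of that definite-length scheme in Python, assuming general BER rules (short form below 128, long form with a big-endian length). The function names are illustrative, and any tighter limits a particular profile imposes on the number of length octets are not modeled:

```python
from typing import Tuple


def encode_ber_length(length: int) -> bytes:
    """Encode a BER definite length using the fewest octets."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length < 0x80:
        return bytes([length])  # short form: one byte, high bit clear
    body = length.to_bytes((length.bit_length() + 7) // 8, "big")
    if len(body) > 126:
        raise ValueError("length too large (0xFF is reserved)")
    return bytes([0x80 | len(body)]) + body  # long form


def decode_ber_length(data: bytes) -> Tuple[int, int]:
    """Return (length, bytes consumed) for a definite-form length."""
    first = data[0]
    if first < 0x80:
        return first, 1
    count = first & 0x7F  # number of subsequent length octets
    if count == 0:
        raise ValueError("indefinite form (0x80) is not handled here")
    if count == 0x7F:
        raise ValueError("0xFF is reserved")
    if len(data) < 1 + count:
        raise ValueError("truncated length field")
    return int.from_bytes(data[1:1 + count], "big"), 1 + count
```

Because there is no offsetting, a decoder like this also accepts padded forms such as `b"\x82\x00\x05"` for the value 5, which is the same non-canonical behaviour discussed elsewhere in the thread.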

conaclos May 29
This is pretty close to SQLite's varints [0]

[0]: https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki

Source
inkandswitch.com
Posted
May 29, 2026 at 03:03 PM

