What I Built
STRIDE is a deterministic, field-aware integer analysis engine revived from the abandoned glyph-v8 prototype.
Not a general compressor. A precision primitive that does one thing no existing tool does: profile binary protocol data field by field, build per-field entropy models, and identify exactly where compression gains are possible.
General compressors like zstd see a byte stream. STRIDE sees structure.
The Problem
Binary protocols move billions of messages daily — Protobuf, MessagePack, Thrift. Their integer fields are not random:
• Timestamps delta from the previous value
• Status codes are almost always 200
• IDs increment monotonically
• Enums repeat from a tiny set
zstd doesn’t know this. It compresses the whole stream as if every byte were unpredictable. STRIDE knows field boundaries — and that changes everything about what’s compressible.
Demo
Repository: github.com/yasha1971-coder/glyph-v8
Live benchmark: enwik8 (100,000,000 bytes, OVH EPYC server)
$ stride container-bytefreq enwik8.stridebin --top 5
Total bytes processed: 100,000,000
32 0x20 13,519,824 (13.52%) ← space dominates
101 0x65 8,001,205 (8.00%)
116 0x74 6,154,908 (6.15%)
97 0x61 5,712,026 (5.71%)
105 0x69 5,227,649 (5.23%)
$ stride container-hotspots enwik8.stridebin --top 3
Chunk 635 Entropy: 5.685 ← highest information density
Chunk 634 Entropy: 5.609
Chunk 636 Entropy: 5.534
$ stride container-headersketch enwik8.stridebin --size 8
Bucket 15: 0.574
Bucket 33: 0.663
Bucket 41: 0.605
Bucket 48: 0.660
Timing on 100MB corpus:
| Module | Time | Output |
|---|---|---|
| ByteFreq | 1.97s | 256-byte histogram |
| Hotspots | 4.17s | Entropy map across 1,526 chunks |
| HeaderSketch | 4.40s | 64-slot structural profile |
| Fingerprint | 71.6s | 128 MinHash values (known: O(n·k) rolling hash) |
Proof with SHA256 verification: github.com/yasha1971-coder/glyph-v8/blob/main/proof/enwik8_benchmark.txt
Before → After
Before (glyph-v8, 3 months abandoned):
• Experimental L0-index with minimizer indexing
• No documentation, no architecture, no clear purpose
• Code sitting unused on an OVH server
• hit_rate 87.6% on old version, 99.8% on new — but no one knew
After (STRIDE v0):
• Full CLI with 10 commands
• Deterministic corpus analysis on any binary data
• Real benchmark on enwik8 100MB with SHA256-verified proof
• stride/ package installable via pip install -e .
• Structured container format (STRIDE01 magic, chunked layout)
• Cross-platform: Linux + OVH EPYC verified
Architecture
RAW CORPUS
↓
STRIDE Container (.stridebin)
[MAGIC: STRIDE01][corpus_size][chunk_size][data...]
↓
Analysis Layer:
container-bytefreq → byte frequency histogram
container-hotspots → entropy per chunk
container-fingerprint → 128-value MinHash
container-headersketch → 64-slot structural sketch
↓
Model Output (model.json):
timestamp_field → Delta coding
status_field → Dictionary coding
id_field → Rice coding
↓
STRIDE v1 (planned): Encoder + Decoder
What Makes STRIDE Different
| grep | zstd | Elasticsearch | STRIDE | |
|---|---|---|---|---|
| Field-aware | ❌ | ❌ | ❌ | ✅ |
| Per-field entropy model | ❌ | ❌ | ❌ | ✅ |
| Deterministic output | ✅ | ✅ | ❌ | ✅ |
| Schema-aware analysis | ❌ | ❌ | partial | ✅ |
| SHA256-verified proof | ❌ | ❌ | ❌ | ✅ |
Honest Benchmark Status
STRIDE v0 is a corpus analyzer, not a codec. It does not yet produce compressed output.
The encoder/decoder is planned for STRIDE v1. Theoretical compression gains (6-8x vs zstd on integer-heavy data) are derived from the entropy models STRIDE builds — not from measured compression results.
This is intentional. STRIDE v0 establishes the measurement foundation. STRIDE v1 builds on it.
How GitHub Copilot Helped
The original glyph-v8 was a pile of experimental scripts with no coherent design. Copilot helped:
• Reconstruct the project from scattered OVH files
• Design the StrideContainer format and reader
• Build the CLI dispatch architecture (argparse + subcommands)
• Implement all five analysis modules
• Write the benchmark pipeline with SHA256 verification
• Structure this submission
Without Copilot the gap between “abandoned prototype” and “installable system with proof” would have taken weeks. It took days.
Project Family
STRIDE is the third primitive in a deterministic systems family:
ACEAPEX — parallel LZ77 decode
9,903 MB/s on EPYC 9575F (64 cores). 2.5x faster than zstd. Merged into lzbench.
GLYPH — byte-exact substring retrieval
6,888x faster than grep on repeated queries. 1,138 organic git clones in 14 days with zero promotion.
STRIDE — field-aware integer analysis
Profiles binary protocol data. Builds per-field entropy models. Foundation for a codec that knows what zstd doesn’t.
Same philosophy across all three: deterministic, exact, measurable.
What’s Next
• STRIDE v1: encoder + decoder
• Full benchmark suite vs zstd, LZ4, Brotli
• Protobuf schema-aware field extraction
• MessagePack and Thrift adapters
• Publish as standalone Python package on PyPI
Inspired by Perelman’s geometrization — the idea that complex structures simplify under the right flow. Every project in this family is an attempt to find that flow.







