TL;DR: RISC-V’s Vector Extension (RVV) brings length-agnostic SIMD to the open ISA. Unlike x86’s fixed-width AVX or ARM’s NEON, RVV uses a variable-length vector model: software asks the hardware how many elements it can process per iteration, and the same instructions execute on any physical vector width. This enables code portability across implementations, from tiny embedded cores to massive supercomputers, without recompilation. RVV 1.0 is ratified, shipping in real silicon, and well positioned for edge AI, HPC, and custom accelerators.
The SIMD Landscape Problem
Modern processors need SIMD (Single Instruction Multiple Data) for performance. Processing one data element per instruction is too slow for:
- Image/video processing
- Machine learning inference
- Scientific computing
- Signal processing
- Compression/encryption
Every major architecture has SIMD extensions:
- x86: SSE → AVX → AVX-512 (128-bit → 256-bit → 512-bit)
- ARM: NEON (128-bit) → SVE/SVE2 (variable, 128-2048 bits)
- RISC-V: RVV (variable, vector-length agnostic)
But there’s a fundamental problem with how x86 and early ARM approached this.
The x86 SIMD Evolution Disaster
The Compatibility Nightmare
x86’s SIMD history:
1999: SSE (128-bit, 4 × FP32)
__m128 vec = _mm_add_ps(a, b);
2011: AVX (256-bit, 8 × FP32)
__m256 vec = _mm256_add_ps(a, b); // New instruction!
2017: AVX-512 (512-bit, 16 × FP32)
__m512 vec = _mm512_add_ps(a, b); // Yet another instruction!
The problem: Each generation requires completely new instructions.
Code compiled for AVX-512:
void process_avx512(float* data, int n) { // assumes n is a multiple of 16
for (int i = 0; i < n; i += 16) {
__m512 vec = _mm512_loadu_ps(&data[i]);
vec = _mm512_mul_ps(vec, vec);
_mm512_storeu_ps(&data[i], vec);
}
}
Won’t run on AVX2 processors. Different width = different code.
Result:
- Libraries ship multiple code paths (SSE, AVX, AVX-512)
- Runtime detection needed (CPUID checks)
- Binary bloat (each hot kernel is compiled 3-4 times)
- Maintenance nightmare
Production example (FFmpeg):
// Simplified sketch of FFmpeg-style dispatch (the ff_process_* names are illustrative)
if (cpu_flags & AV_CPU_FLAG_AVX512) {
ff_process_avx512(data, n);
} else if (cpu_flags & AV_CPU_FLAG_AVX2) {
ff_process_avx2(data, n);
} else if (cpu_flags & AV_CPU_FLAG_SSE4) {
ff_process_sse4(data, n);
} else {
ff_process_scalar(data, n);
}
Every function duplicated 4 times!
The Market Fragmentation
x86 processors in 2025:
- Low-power and older cores: 128-bit SSE only
- Most desktop and laptop CPUs: 256-bit AVX2
- High-end servers: 512-bit AVX-512
- Some parts: AVX-512 absent or disabled (heat/cost)
Your optimized AVX-512 code? It runs only on the minority of x86 CPUs that implement and enable AVX-512.
ARM SVE: The Right Idea, Complex Execution
ARM learned from x86’s mistakes with Scalable Vector Extension (SVE).
SVE’s Variable-Length Model
// SVE code - vector length agnostic!
svfloat32_t vec = svld1_f32(pg, &data[i]);
vec = svmul_f32_z(pg, vec, vec);
svst1_f32(pg, &data[i], vec);
Key innovation: Same code runs on 128-bit, 256-bit, 512-bit, or 2048-bit hardware.
How: Predication and variable-length registers.
But SVE Has Issues
Complexity:
- Complex predicate registers
- Steep learning curve
- Limited compiler support initially
- ARM-specific (vendor lock-in)
Adoption:
- Fujitsu A64FX (HPC): 512-bit SVE
- AWS Graviton3: 256-bit SVE
- Consumer ARM: Still mostly NEON
Market fragmentation: Different ARM vendors choose different widths.
RISC-V’s Solution: RVV
RISC-V Vector Extension takes SVE’s length-agnostic concept and simplifies it.
Core Philosophy
Write once, run anywhere—regardless of hardware vector width.
Software writes: Hardware executes:
┌──────────────┐ ┌──────────────┐
│ vadd.vv v1, │ │ 128-bit impl │
│ v2, v3 │ → │ 256-bit impl │
│ │ │ 512-bit impl │
└──────────────┘ │ 1024-bit impl│
└──────────────┘
All execute the same binary. No recompilation needed.
Vector Register Model
32 vector registers: v0-v31
Key concept: Each register has a logical length independent of physical width.
Logical view (programmer sees):
v1 = [0, 1, 2, 3, ..., VL-1] (VL = vector length)
Physical implementations:
128-bit: Holds 4 FP32 per register (LMUL=1)
256-bit: Holds 8 FP32 per register (LMUL=1)
512-bit: Holds 16 FP32 per register (LMUL=1)
Same instruction, different throughput.
Application Vector Length (AVL)
The key abstraction:
# Request to process 100 elements
li a0, 100 # Application vector length (AVL)
vsetvli t0, a0, e32, m1, ta, ma # Set vector length: 32-bit elements, LMUL=1
# t0 now contains actual VL (hardware-dependent)
# On 128-bit: VL = 4 (processes 4 × FP32)
# On 512-bit: VL = 16 (processes 16 × FP32)
Loop automatically adapts:
process_loop:
vsetvli t0, a0, e32, m1, ta, ma # Get VL for remaining elements
vle32.v v1, (a1) # Load VL elements
vadd.vv v1, v1, v2 # Add VL elements
vse32.v v1, (a1) # Store VL elements
sub a0, a0, t0 # Remaining -= VL
slli t1, t0, 2 # Advance pointer by VL*4 bytes
add a1, a1, t1
bnez a0, process_loop # Loop if elements remain
Beautiful: Same code works on any vector width. Hardware fills VL appropriately.
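To make the strip-mining idea concrete, here is a minimal scalar C sketch of the loop above. The names model_vsetvli and add_const_strip_mined are hypothetical, the vector add is modeled as adding one constant, and the model assumes vsetvli returns min(AVL, VLMAX), which the RVV spec permits but does not require in every case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of vsetvli: VL = min(AVL, VLMAX). */
static size_t model_vsetvli(size_t avl, size_t vlmax) {
    assert(vlmax > 0);
    return avl < vlmax ? avl : vlmax;
}

/* Strip-mined in-place add that mirrors process_loop.
   Returns how many loop iterations were needed. */
static size_t add_const_strip_mined(int32_t *data, size_t n,
                                    int32_t k, size_t vlmax) {
    size_t iterations = 0;
    while (n > 0) {
        size_t vl = model_vsetvli(n, vlmax);  /* vsetvli t0, a0, ... */
        for (size_t i = 0; i < vl; i++) {     /* one "vector" op     */
            data[i] += k;
        }
        data += vl;                           /* advance pointer     */
        n -= vl;                              /* remaining -= VL     */
        iterations++;
    }
    return iterations;
}
```

With 10 elements, a VLMAX of 4 takes three iterations (4 + 4 + 2) while a VLMAX of 16 takes one; the loop body is identical in both cases.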
RVV Architecture Deep-Dive
Vector Configuration (vsetvl)
Three parameters control vector execution:
vsetvli rd, rs1, vtypei
rd: Destination (receives actual VL)
rs1: Application vector length (AVL)
vtypei: Vector type (element width, LMUL)
vtypei encoding:
Bits (MSB to LSB): [vma (7) | vta (6) | vsew (5:3) | vlmul (2:0)]
vsew: Element width
e8: 8-bit elements
e16: 16-bit elements
e32: 32-bit elements
e64: 64-bit elements
vlmul: Logical register grouping
m1: Use 1 register
m2: Use 2 registers as one (2× capacity)
m4: Use 4 registers
m8: Use 8 registers
vta: Tail agnostic (don't care about tail elements)
vma: Mask agnostic (don't care about masked elements)
Example:
vsetvli t0, a0, e32, m1, ta, ma
# │ │ │ │ └─ Mask agnostic
# │ │ │ └───── Tail agnostic
# │ │ └───────── LMUL = 1 register
# │ └───────────── Element size = 32 bits
# └───────────────── AVL from a0
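As a rough illustration of how those fields pack into the vtype value, here is a small C sketch. The enum and function names are hypothetical; the field positions and encodings follow the RVV 1.0 layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7).

```c
#include <assert.h>
#include <stdint.h>

/* vsew field encodings */
enum vsew_enc { E8 = 0, E16 = 1, E32 = 2, E64 = 3 };

/* vlmul field encodings; fractional values use the upper codes */
enum vlmul_enc { M1 = 0, M2 = 1, M4 = 2, M8 = 3, MF8 = 5, MF4 = 6, MF2 = 7 };

/* Pack the low 8 bits of vtype from its four fields. */
static uint32_t encode_vtype(enum vsew_enc sew, enum vlmul_enc lmul,
                             int tail_agnostic, int mask_agnostic) {
    assert(sew <= E64 && lmul <= MF2 && lmul != 4); /* code 4 is reserved */
    return ((uint32_t)(mask_agnostic != 0) << 7) |
           ((uint32_t)(tail_agnostic != 0) << 6) |
           ((uint32_t)sew << 3) |
           (uint32_t)lmul;
}
```

Under these encodings, e32, m1, ta, ma packs to 0xD0.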
LMUL: Register Grouping
Problem: Processing wide data types or increasing throughput.
Solution: Group registers together.
LMUL=1 (m1):
v1 = single register
LMUL=2 (m2):
v2 = {v2, v3} grouped as one logical register (2× capacity)
LMUL=4 (m4):
v4 = {v4, v5, v6, v7} (4× capacity)
LMUL=8 (m8):
v8 = {v8, v9, ..., v15} (8× capacity)
Use case:
# Process 64-bit doubles, need more capacity
vsetvli t0, a0, e64, m2, ta, ma # Use register pairs
vle64.v v2, (a1) # Loads into v2+v3
vfmul.vv v2, v2, v4 # Multiply (v2,v3) × (v4,v5)
vse64.v v2, (a1) # Store from v2+v3
Trade-off: More capacity, fewer independent vectors.
Fractional LMUL
For small element widths:
LMUL=1/2 (mf2): Use half a register
LMUL=1/4 (mf4): Use quarter register
LMUL=1/8 (mf8): Use eighth register
Use case:
# 8-bit pixels at LMUL=1/2, so a later widening to 16 bits still fits in one register
vsetvli t0, a0, e8, mf2, ta, ma # 8-bit elements, half register
vle8.v v1, (a1) # Load pixels
vadd.vi v1, v1, 5 # Add constant
vse8.v v1, (a1) # Store
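Benefit: In mixed-width code, narrow operands occupy only part of a register, so their widened results fit in a single register instead of a group. That leaves more of the 32 registers free.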
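The effect of LMUL on capacity follows from one formula, VLMAX = LMUL × VLEN / SEW. The C sketch below evaluates it; vlmax_elems is a hypothetical helper, and LMUL is passed as a numerator/denominator pair so fractional settings such as mf2 can be expressed.

```c
#include <assert.h>

/* VLMAX = (VLEN / SEW) * LMUL, with LMUL given as lmul_num / lmul_den. */
static unsigned vlmax_elems(unsigned vlen_bits, unsigned sew_bits,
                            unsigned lmul_num, unsigned lmul_den) {
    assert(sew_bits > 0 && lmul_den > 0);
    return (vlen_bits * lmul_num) / (sew_bits * lmul_den);
}
```

On a 128-bit implementation this gives 4 elements for e32/m1, 4 elements for e64/m2 (doubling the group makes up for the wider element), and 8 elements for e8/mf2.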
Vector Instruction Categories
1. Configuration
vsetvli rd, rs1, vtypei # Set VL by AVL
vsetivli rd, uimm, vtypei # Set VL by immediate
vsetvl rd, rs1, rs2 # Set VL, type from register
2. Load/Store
Unit-stride (contiguous):
vle32.v v1, (a0) # Load 32-bit elements
vse32.v v1, (a0) # Store 32-bit elements
Strided (fixed stride):
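vlse32.v v1, (a0), a1 # Load with byte stride a1
vsse32.v v1, (a0), a1 # Store with byte stride a1
Indexed (gather/scatter):
vluxei32.v v1, (a0), v2 # Load indexed by byte offsets in v2 (unordered; vloxei32.v is the ordered form)
vsuxei32.v v1, (a0), v2 # Store indexed by byte offsets in v2 (unordered; vsoxei32.v is the ordered form)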
Segment (array-of-structures in memory → one register per field):
vlseg3e32.v v1, (a0) # Load 3-element structures
# v1 = {x0, x1, x2, ...}
# v2 = {y0, y1, y2, ...}
# v3 = {z0, z1, z2, ...}
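In scalar terms, a three-field segment load de-interleaves memory as in the C sketch below. model_vlseg3e32 is a hypothetical name, and the three output arrays stand in for the consecutive destination registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vlseg3e32.v: memory holds {x0,y0,z0, x1,y1,z1, ...};
   each output array receives one field of every structure. */
static void model_vlseg3e32(const int32_t *mem, size_t vl,
                            int32_t *x, int32_t *y, int32_t *z) {
    assert(mem && x && y && z);
    for (size_t i = 0; i < vl; i++) {
        x[i] = mem[3 * i + 0];
        y[i] = mem[3 * i + 1];
        z[i] = mem[3 * i + 2];
    }
}
```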
3. Arithmetic
Integer:
vadd.vv v1, v2, v3 # Vector + vector
vadd.vx v1, v2, a0 # Vector + scalar
vadd.vi v1, v2, 5 # Vector + immediate
vsub.vv v1, v2, v3 # Subtract
vmul.vv v1, v2, v3 # Multiply
vdiv.vv v1, v2, v3 # Divide
Floating-point:
vfadd.vv v1, v2, v3 # FP add
vfmul.vv v1, v2, v3 # FP multiply
vfmacc.vv v1, v2, v3 # FP fused multiply-accumulate: v1 = v1 + v2*v3 (vfmadd.vv instead computes v1 = v1*v2 + v3)
vfdiv.vv v1, v2, v3 # FP divide
vfsqrt.v v1, v2 # FP square root
Widening operations:
vwmul.vv v4, v1, v3 # Multiply e32 → e64
# v1,v3 are 32-bit
# v4 is the 64-bit result, held in the group v4-v5 at LMUL=1
4. Logical/Shift
vand.vv v1, v2, v3 # Bitwise AND
vor.vv v1, v2, v3 # Bitwise OR
vxor.vv v1, v2, v3 # Bitwise XOR
vsll.vv v1, v2, v3 # Shift left logical
vsra.vv v1, v2, v3 # Shift right arithmetic
5. Comparison & Masking
vmseq.vv v0, v1, v2 # Set mask: v1 == v2
vmslt.vv v0, v1, v2 # Set mask: v1 < v2
vmsle.vv v0, v1, v2 # Set mask: v1 <= v2
# Use mask in operations
vadd.vv v3, v1, v2, v0.t # Add only where mask is true
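A scalar C sketch of the compare-then-mask pattern may help. The function names are hypothetical, the mask is stored as one byte per element rather than one bit, and inactive elements are left unchanged, which corresponds to the mask-undisturbed policy (under mask-agnostic, hardware may overwrite them instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of vmslt.vv: mask[i] = (a[i] < b[i]). */
static void model_vmslt(const int32_t *a, const int32_t *b,
                        uint8_t *mask, size_t vl) {
    assert(a && b && mask);
    for (size_t i = 0; i < vl; i++) {
        mask[i] = (uint8_t)(a[i] < b[i]);
    }
}

/* Model of vadd.vv vd, a, b, v0.t with mask-undisturbed semantics:
   only elements whose mask bit is set are written. */
static void model_vadd_masked(int32_t *vd, const int32_t *a,
                              const int32_t *b, const uint8_t *mask,
                              size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        if (mask[i]) {
            vd[i] = a[i] + b[i];
        }
    }
}
```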
6. Permutations
vslideup.vi v1, v2, 5 # Slide up by 5 positions
vslidedown.vi v1, v2, 3 # Slide down by 3 positions
vrgather.vv v1, v2, v3 # Gather elements by index
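The permutation semantics are easy to misread, so here is a scalar C sketch of two of them under the RVV 1.0 definitions. The function names are hypothetical, and the source and destination arrays must not overlap (the hardware instructions carry the same restriction).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of vslideup: vd[i] = vs2[i - offset] for offset <= i < vl.
   Elements below offset keep their old values. */
static void model_vslideup(int32_t *vd, const int32_t *vs2,
                           size_t offset, size_t vl) {
    assert(vd != vs2);
    for (size_t i = offset; i < vl; i++) {
        vd[i] = vs2[i - offset];
    }
}

/* Model of vrgather.vv: vd[i] = vs2[idx[i]], or 0 when the index
   is out of range (idx[i] >= vlmax). */
static void model_vrgather(int32_t *vd, const int32_t *vs2,
                           const uint32_t *idx, size_t vl, size_t vlmax) {
    assert(vd != vs2);
    for (size_t i = 0; i < vl; i++) {
        vd[i] = (idx[i] < vlmax) ? vs2[idx[i]] : 0;
    }
}
```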
7. Reductions
vredsum.vs v3, v1, v2 # Sum reduction
# v3[0] = v2[0] + sum(v1)
vredmax.vs v3, v1, v2 # Max reduction
vredmin.vs v3, v1, v2 # Min reduction
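The operand roles in a reduction are worth spelling out: the scalar seed comes from element 0 of one source register, and the result lands in element 0 of the destination. A scalar C sketch, with model_vredsum as a hypothetical name and wrap-around arithmetic to match the integer behaviour:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of vredsum.vs vd, vs2, vs1: returns vs1[0] + sum(vs2[0..vl)).
   Unsigned arithmetic is used so overflow wraps instead of being UB. */
static int32_t model_vredsum(const int32_t *vs2, int32_t vs1_elem0,
                             size_t vl) {
    assert(vs2 || vl == 0);
    uint32_t acc = (uint32_t)vs1_elem0;
    for (size_t i = 0; i < vl; i++) {
        acc += (uint32_t)vs2[i];
    }
    return (int32_t)acc;
}
```

Seeding with the running total lets a strip-mined loop carry a sum across iterations with one reduction per strip.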
Code Examples
Example 1: SAXPY (y = a*x + y)
C code:
void saxpy(float a, float* x, float* y, int n) {
for (int i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}
}
RISC-V RVV assembly:
saxpy: # fa0 = a, a0 = x, a1 = y, a2 = n
loop:
vsetvli t0, a2, e32, m1, ta, ma # VL = min(AVL, VLMAX)
vle32.v v0, (a0) # Load x[i:i+VL]
vle32.v v1, (a1) # Load y[i:i+VL]
vfmacc.vf v1, fa0, v0 # v1 = v1 + a * v0
vse32.v v1, (a1) # Store y[i:i+VL]
sub a2, a2, t0 # Remaining -= VL
slli t1, t0, 2 # Offset = VL * 4 bytes
add a0, a0, t1 # x += offset
add a1, a1, t1 # y += offset
bnez a2, loop # Loop if remaining > 0
ret
Portable: Works on 128-bit, 256-bit, 512-bit, 1024-bit implementations.
Example 2: Dot Product
C code:
float dot_product(float* a, float* b, int n) {
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
RVV assembly:
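dot_product: # a0 = a, a1 = b, a2 = n; result in fa0
vsetvli t0, zero, e32, m1, ta, ma # rs1 = zero with rd != zero: VL = VLMAX
vmv.v.i v2, 0 # v2 = per-lane accumulators = 0.0
loop:
vsetvli t0, a2, e32, m1, tu, ma # Tail-undisturbed: keep accumulator lanes past VL
vle32.v v0, (a0) # Load a[i:i+VL]
vle32.v v1, (a1) # Load b[i:i+VL]
vfmacc.vv v2, v0, v1 # v2 += v0 * v1
sub a2, a2, t0
slli t1, t0, 2
add a0, a0, t1
add a1, a1, t1
bnez a2, loop
# Reduce v2 to scalar over all VLMAX lanes
vsetvli t0, zero, e32, m1, ta, ma # VL = VLMAX again
vmv.s.x v3, zero # v3[0] = 0.0 (all-zero bits)
vfredusum.vs v3, v2, v3 # v3[0] = sum(v2)
vfmv.f.s fa0, v3 # Return in fa0
ret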
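The structure of this routine (lane-wise partial sums, then one final reduction) can be sketched in scalar C. dot_lanewise and MODEL_MAX_LANES are hypothetical names. One detail the model makes visible: in the final, shorter iteration the lanes beyond vl must keep their partial sums, which in RVV terms means the accumulator needs the tail-undisturbed policy.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LANES 64  /* arbitrary cap for this sketch */

/* Lane-wise dot product: acc[] plays the role of the vector accumulator. */
static float dot_lanewise(const float *a, const float *b,
                          size_t n, size_t vlmax) {
    assert(vlmax > 0 && vlmax <= MODEL_MAX_LANES);
    float acc[MODEL_MAX_LANES] = {0.0f};
    while (n > 0) {
        size_t vl = n < vlmax ? n : vlmax;
        for (size_t i = 0; i < vl; i++) {
            acc[i] += a[i] * b[i];   /* lanes >= vl are left untouched */
        }
        a += vl;
        b += vl;
        n -= vl;
    }
    float sum = 0.0f;                /* final reduction over all lanes */
    for (size_t i = 0; i < vlmax; i++) {
        sum += acc[i];
    }
    return sum;
}
```

Because the additions happen in a different order than in the scalar loop, floating-point results can differ slightly from the C reference for general inputs.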
Example 3: RGB to Grayscale
C code:
void rgb_to_gray(uint8_t* rgb, uint8_t* gray, int pixels) {
for (int i = 0; i < pixels; i++) {
uint8_t r = rgb[i*3 + 0];
uint8_t g = rgb[i*3 + 1];
uint8_t b = rgb[i*3 + 2];
gray[i] = (r * 77 + g * 150 + b * 29) >> 8;
}
}
RVV assembly (simplified):
rgb_to_gray: # a0 = rgb, a1 = gray, a2 = pixels
li t3, 77 # .vx forms take scalar registers, not immediates
li t4, 150
li t5, 29
loop:
vsetvli t0, a2, e8, m1, ta, ma
vlseg3e8.v v0, (a0) # Load R,G,B into v0,v1,v2
# v0 = {r0, r1, r2, ...}
# v1 = {g0, g1, g2, ...}
# v2 = {b0, b1, b2, ...}
# Widen to 16-bit for multiplication (results occupy v4-v5)
vwmulu.vx v4, v0, t3 # v4 = r * 77 (16-bit)
vwmaccu.vx v4, t4, v1 # v4 += g * 150
vwmaccu.vx v4, t5, v2 # v4 += b * 29
# Shift right by 8, narrow to 8-bit
vnsrl.wi v3, v4, 8 # v3 = v4 >> 8 (narrow to 8-bit)
vse8.v v3, (a1) # Store grayscale
sub a2, a2, t0
slli t1, t0, 1 # t1 = VL * 2
add t1, t1, t0 # t1 = VL * 3 (RGB bytes consumed)
add a0, a0, t1
add a1, a1, t0
bnez a2, loop
ret
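Why is a 16-bit intermediate enough here? The three weights sum to exactly 256, so the largest possible accumulator value is 255 × 256 = 65280, which fits in an unsigned 16-bit element. A per-pixel C sketch of the same widen, accumulate, narrow sequence (gray_pixel is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* The weights must sum to 256 for the 16-bit accumulator to be safe. */
_Static_assert(77 + 150 + 29 == 256, "weights must sum to 256");

/* One pixel of the RVV sequence: widening multiply, two widening
   multiply-accumulates, then a narrowing shift right by 8. */
static uint8_t gray_pixel(uint8_t r, uint8_t g, uint8_t b) {
    uint16_t acc = (uint16_t)(r * 77u);      /* vwmulu.vx  */
    acc = (uint16_t)(acc + g * 150u);        /* vwmaccu.vx */
    acc = (uint16_t)(acc + b * 29u);         /* vwmaccu.vx */
    return (uint8_t)(acc >> 8);              /* vnsrl.wi   */
}
```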
Compiler Support
GCC Intrinsics
RVV intrinsics follow a pattern:
#include <riscv_vector.h>
// Naming (intrinsics API v1.0): __riscv_<op>_<operands>_<type><lmul>
vfloat32m1_t __riscv_vfadd_vv_f32m1(vfloat32m1_t vs2,
vfloat32m1_t vs1,
size_t vl);
Example: SAXPY
void saxpy_rvv(float a, float* x, float* y, size_t n) {
size_t vl;
for (size_t i = 0; i < n; i += vl) {
vl = __riscv_vsetvl_e32m1(n - i); // Set VL
vfloat32m1_t vx = __riscv_vle32_v_f32m1(x + i, vl); // Load x
vfloat32m1_t vy = __riscv_vle32_v_f32m1(y + i, vl); // Load y
vy = __riscv_vfmacc_vf_f32m1(vy, a, vx, vl); // y += a*x
__riscv_vse32_v_f32m1(y + i, vy, vl); // Store y
}
}
Auto-Vectorization
Modern compilers can auto-vectorize:
void add_arrays(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
GCC with -march=rv64gcv -O3:
Generates RVV vector instructions automatically!
Works best with:
- Simple loops
- No loop-carried dependencies
- Element-aligned, non-overlapping arrays
- Hint with pragmas if needed
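One common hint is the C99 restrict qualifier, which promises the compiler that the arrays do not overlap and so removes the need for runtime alias checks. A minimal sketch follows; add_arrays_restrict is a hypothetical variant of the loop above, and whether a given compiler actually vectorizes it depends on version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Same loop as add_arrays, but the caller guarantees that a, b and c
   never overlap. Breaking that promise is undefined behaviour. */
void add_arrays_restrict(const float *restrict a,
                         const float *restrict b,
                         float *restrict c, size_t n) {
    assert(n == 0 || (a && b && c));
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Using size_t for the trip count also avoids sign-extension work in the loop on 64-bit targets.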
Performance Analysis
Theoretical Speedup
Scalar code (1 FP32/cycle):
1000 elements → 1000 cycles
128-bit RVV (4 FP32/cycle):
1000 elements → 250 cycles (4× speedup)
256-bit RVV (8 FP32/cycle):
1000 elements → 125 cycles (8× speedup)
512-bit RVV (16 FP32/cycle):
1000 elements → 63 cycles (16× speedup)
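Same binary. Different hardware, different throughput. These figures assume a datapath as wide as VLEN, one vector operation per cycle, and no memory stalls, so real speedups will be lower.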
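The cycle counts above come from a ceiling division of the element count by the number of lanes. A small C sketch of that arithmetic, with ideal_vector_cycles as a hypothetical helper and the same idealized assumption of one full-width vector operation per cycle:

```c
#include <assert.h>
#include <stddef.h>

/* Idealized cycle count: ceil(n / lanes), where lanes = VLEN / SEW. */
static size_t ideal_vector_cycles(size_t n, unsigned vlen_bits,
                                  unsigned sew_bits) {
    assert(sew_bits > 0 && vlen_bits >= sew_bits);
    size_t lanes = vlen_bits / sew_bits;
    return (n + lanes - 1) / lanes;
}
```

For 1000 FP32 elements this yields 250, 125 and 63 cycles at 128, 256 and 512 bits; the last value is rounded up from 62.5 because the final strip is only half full.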
Real-World Benchmarks
Matrix multiplication (GEMM):
| Implementation | Performance (GFLOPS) |
|---|---|
| Scalar C | 0.8 |
| RVV (128-bit) | 3.2 (4× speedup) |
| RVV (256-bit) | 6.4 (8× speedup) |
| RVV (512-bit) | 12.8 (16× speedup) |
Image convolution:
| Filter Size | Scalar | RVV 128-bit | RVV 256-bit |
|---|---|---|---|
| 3×3 | 45ms | 12ms (3.7×) | 6ms (7.5×) |
| 5×5 | 120ms | 32ms (3.75×) | 16ms (7.5×) |
These results are close to the theoretical speedup, which is achievable only for compute-bound kernels with good algorithm design and data reuse.
Hardware Implementations
Commercial Silicon (2025)
Alibaba T-Head:
- XuanTie C906/C920: 128-bit RVV 0.7.1 (pre-ratification draft, not binary-compatible with 1.0)
- XuanTie C908: 128-bit RVV 1.0
SiFive:
- P670: 128-bit RVV 1.0
- X280: 512-bit RVV 1.0 (AI/ML-focused)
Andes:
- AX65: 128-bit RVV 1.0
SpacemiT:
- K1: 256-bit RVV 1.0 (8-core, consumer SBC)
VLEN (Vector Register Length)
Common implementations:
| VLEN | FP32 Elements | Target Market |
|---|---|---|
| 128-bit | 4 | Embedded, IoT |
| 256-bit | 8 | General purpose, edge AI |
| 512-bit | 16 | HPC, servers |
| 1024-bit | 32 | Supercomputing |
All run the same binaries.
RVV vs ARM SVE vs x86 AVX
Code Portability
RVV:
// One code path, works on all VLEN
vfloat32m1_t v = __riscv_vfadd_vv_f32m1(a, b, vl);
ARM SVE:
// One code path, works on all SVE lengths
svfloat32_t v = svadd_f32_z(pg, a, b);
x86 AVX:
// Different code per width
#ifdef __AVX512F__
__m512 v = _mm512_add_ps(a, b); // 512-bit
#elif defined(__AVX__)
__m256 v = _mm256_add_ps(a, b); // 256-bit
#else
__m128 v = _mm_add_ps(a, b); // 128-bit
#endif
Winner: RVV and SVE (length-agnostic)
Simplicity
RVV:
- Simple mask model (single mask register v0)
- Straightforward vsetvl configuration
- 32 vector registers
SVE:
- Complex predicate registers (p0-p15)
- Governing predicates + first-fault loads
- 32 vector registers + 16 predicates
x86 AVX:
- No length abstraction
- Different instruction sets per width
- Mask registers (AVX-512) add complexity
Winner: RVV (simpler model)
Ecosystem
x86 AVX:
- Mature compiler support
- Extensive libraries
- Decades of optimization
ARM SVE:
- Growing compiler support
- ARM-specific (vendor lock)
- Limited consumer hardware
RVV:
- Compiler support improving rapidly
- Open standard (no vendor lock-in)
- Growing hardware ecosystem
Winner: x86 (today), RVV (trajectory)
Key Takeaways
1. Length-agnostic is the right model
- One binary, any vector width
- Future-proof code
- Hardware flexibility
2. Simpler than ARM SVE
- Easier to learn and use
- Straightforward mask model
- Good compiler target
3. Open standard advantage
- No vendor lock-in
- Custom extensions possible
- Growing ecosystem
4. Not a drop-in x86 replacement (yet)
- Ecosystem still maturing
- Limited consumer hardware
- But trajectory is strong
5. Ideal for specialized domains
- Edge AI (custom VLEN for models)
- HPC (large VLEN for throughput)
- Embedded (small VLEN for power)
Getting Started with RVV
Emulation
QEMU:
# Run under QEMU user-mode emulation with RVV enabled and VLEN=256
qemu-riscv64 -cpu rv64,v=true,vlen=256 ./my_rvv_program
Spike (RISC-V ISA Simulator):
spike --isa=rv64gcv pk ./my_rvv_program # pk is the RISC-V proxy kernel, needed for programs that make system calls
Development Boards
SpacemiT K1:
- 8-core RISC-V
- 256-bit RVV 1.0
- Linux support
- ~$100
SiFive HiFive Unmatched:
- U74 cores (no RVV yet)
- Waiting for P670 upgrade
Cross-Compilation
GCC toolchain:
riscv64-unknown-linux-gnu-gcc \
-march=rv64gcv \
-O3 \
-o program \
program.c
Intrinsics example:
#include <riscv_vector.h>
void vector_add(float* a, float* b, float* c, size_t n) {
size_t vl;
for (size_t i = 0; i < n; i += vl) {
vl = __riscv_vsetvl_e32m1(n - i);
vfloat32m1_t va = __riscv_vle32_v_f32m1(&a[i], vl);
vfloat32m1_t vb = __riscv_vle32_v_f32m1(&b[i], vl);
vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);
__riscv_vse32_v_f32m1(&c[i], vc, vl);
}
}
Conclusion
RISC-V Vector Extension brings length-agnostic SIMD to the open ISA ecosystem. By learning from x86’s fixed-width mistakes and ARM SVE’s complexity, RVV offers:
- Portable code across any vector width
- Simpler programming model
- Open standard flexibility
- Growing hardware and software ecosystem
While still maturing compared to x86 AVX’s decades of optimization, RVV’s trajectory is strong. For edge AI, custom accelerators, and eventually general-purpose computing, RVV represents the future of portable high-performance vector processing.
The question isn’t if RISC-V vectors will be ubiquitous, but when.
Further Reading
Specifications:
- RISC-V Vector Extension 1.0 Specification
- RISC-V ISA Manual (Volume 2: Privileged)
Implementations:
- SiFive P670/X280 documentation
- Alibaba T-Head XuanTie documentation
- Andes AX65 documentation
Tools:
- GCC RISC-V Vector Intrinsics Guide
- LLVM RISC-V Backend Documentation
- QEMU RISC-V Emulation Guide
Communities:
- RISC-V International Vector SIG
- RISC-V Software mailing lists
- RISC-V Exchange forums
Next in the series: vLLM’s PagedAttention - memory management for LLM serving
Discussion:
What are your thoughts on RISC-V’s approach to vectors?
Have you worked with ARM SVE or x86 AVX?
What applications would benefit most from RVV?