The Mica backend does not try to make the emitter clever at every single point of code generation. It takes a different route: emit correct assembly with uniform rules first, then clean up the local waste in a dedicated pass.

That final cleanup stage is the peephole optimizer. It runs after x86-64 assembly has been emitted and before the final text is handed off to the assembler. The optimizer is deliberately conservative. It works on short instruction windows, applies only rewrites that are locally defensible, and stops as soon as control flow or aliasing uncertainty makes a transformation questionable.

If you want the full compiler context around this stage, start with The Mica Compiler — A Technical Portrait. This article zooms in on the optimizer itself.

Why Mica Uses a Peephole Pass

Mica’s code generator is designed around clarity and predictable lowering. That keeps the emitter manageable, but it also creates recurring local patterns:

  • values staged through stack temporaries even when a direct form would do
  • loads that immediately reuse a value that is already available in a register
  • call setup code that stores to a temp only to reload the same value into an argument register
  • stack adjustments and save/restore pairs that become unnecessary after earlier rewrites

Trying to bake every one of those cases directly into the emitter would make the backend harder to reason about. The peephole pass keeps the division of labor clean:

  • the emitter focuses on correctness and uniform lowering
  • the optimizer removes the obvious local waste that falls out of that strategy
  • later global optimization work can be built on top without entangling the emitter in special cases

This is why the pass is intentionally local. It is not trying to perform full SSA-based optimization, loop analysis, or interprocedural reasoning. It is there to make already-correct machine code smaller, cleaner, and less noisy.

How The Current Optimizer Is Structured

In Mica 4.5, the optimizer applies 17 passes and repeats them until an entire iteration produces no further changes. That fixed-point style matters because one local rewrite often creates another:

  • forwarding can create redundant mov instructions
  • propagation can make earlier temp stores dead
  • stack or compare cleanup can expose new adjacent patterns
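That fixed-point loop is easy to sketch. The Python below shows a minimal driver with a single stand-in pass; the pass list, instruction representation, and function names are illustrative, not Mica's actual code:

```python
# Minimal sketch of a fixed-point peephole driver (not Mica's actual code).
# Each pass takes a list of instruction strings and returns (new_list, changed);
# the driver repeats the whole pass list until a full sweep changes nothing.

def remove_self_moves(instrs):
    """Stand-in pass: drop `mov X, X` no-ops and report whether anything changed."""
    out = []
    for ins in instrs:
        parts = ins.replace(",", " ").split()
        if parts[:1] == ["mov"] and len(parts) == 3 and parts[1] == parts[2]:
            continue  # self-move: delete it
        out.append(ins)
    return out, len(out) != len(instrs)

PASSES = [remove_self_moves]  # Mica 4.5 runs 17 such passes.

def optimize(instrs):
    changed = True
    while changed:  # one full sweep with no rewrites = fixed point reached
        changed = False
        for run_pass in PASSES:
            instrs, pass_changed = run_pass(instrs)
            changed = changed or pass_changed
    return instrs
```

The important property is the outer loop: a rewrite made by one pass can expose a pattern for another, so the sweep repeats until nothing fires.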

The passes are arranged in three groups.

Group 1: Temp Materialization And Operand Folding

This first group attacks the “store to temp, load from temp, use value” patterns that appear naturally in a uniform emitter.

  • Literal propagation: collapse immediate-through-temp chains
  • Memory copy propagation: redirect later loads back to the original source
  • String descriptor forwarding: bypass temporary reloads for 16-byte string descriptors
  • Arithmetic temp folding: feed arithmetic directly from memory, immediates, or constant-pool entries
  • Boolean temp folding: fold byte-sized boolean temporaries into their consumers

Group 2: Redundancy Cleanup And Call-Adjacent Rewrites

This is the largest group. It removes round-trips that are semantically harmless but mechanically expensive.

  • Redundant register move elimination: remove mov Rx, Rx style no-ops
  • Call argument forwarding: skip temp staging when a value only feeds an ABI argument register
  • Call return forwarding: reuse return registers instead of storing to a temp and reloading
  • Adjacent store-load forwarding: forward a value across a strictly adjacent store/load pair
  • Windowed store-load forwarding: do the same across a short bounded gap
  • Redundant load elimination: reuse an earlier register value instead of reloading memory
  • Load-store forwarding: remove pointless write-backs of values that were just loaded
  • Dead store elimination: remove stores that are overwritten before any read

Group 3: Compare And Stack Cleanup

The last group handles the small structural artifacts that remain after the value-flow rewrites have simplified the stream.

  • Compare temp folding: fold temp loads directly into cmp or test consumers
  • Compare cleanup: remove back-to-back redundant flag-setting instructions
  • Stack adjust folding: combine adjacent add/sub rsp updates
  • Push/pop elimination: remove adjacent push/pop pairs with no net effect

Representative Rewrites

The easiest way to understand the optimizer is to look at the kind of assembly it shortens.

Literal Propagation

Uniform lowering often stages a literal through a temporary location before writing it to the real destination:

; Before
mov [temp], 42
mov rax, [temp]
mov [dest], rax

; After
mov [dest], 42

The point is not that the emitter is “wrong”. The point is that the peephole pass can remove the detour once the detour is visible in concrete assembly.

Store-Load Forwarding

If a value is stored and immediately reloaded from the exact same address, the second instruction does not need memory at all:

; Before
mov [rbp-8], rax
mov rbx, [rbp-8]

; After
mov [rbp-8], rax
mov rbx, rax

Mica has both an adjacent version of this rewrite and a bounded-window version that can look past a few non-interfering instructions.
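The bounded-window variant can be sketched as follows. This is Python over a toy tuple representation; the window size, tuple shapes, and barrier set are illustrative, and the real pass also reasons about register families and flags:

```python
# Sketch of windowed store-load forwarding (hypothetical IR, not Mica's code).
# After ('store', addr, reg), a later ('load', dst, addr) within a small window
# becomes a register-to-register move, provided nothing in the gap redefines
# the stored address or the source register.

WINDOW = 4  # bounded search keeps the reasoning local and predictable

def forward_store_load(instrs):
    out = list(instrs)
    changed = False
    for i, ins in enumerate(out):
        if ins[0] != "store":
            continue
        _, addr, reg = ins
        for j in range(i + 1, min(i + 1 + WINDOW, len(out))):
            op = out[j]
            if op[0] == "load" and op[2] == addr:
                out[j] = ("movr", op[1], reg)  # forward the register value
                changed = True
                break
            # Conservatively stop at anything that could invalidate the match:
            # another store, a control-flow barrier, or a write to `reg`.
            writes_reg = op[0] in ("load", "movr") and op[1] == reg
            if op[0] in ("store", "call", "label", "jump") or writes_reg:
                break
    return out, changed
```

A `call` or label in the gap simply ends the search, which is the conservative choice the article describes.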

Arithmetic Temp Folding

x86-64 can often encode a memory operand directly inside an arithmetic instruction, so a temporary register load becomes unnecessary:

; Before
mov rax, [tempA]
mov rcx, [tempB]
add rax, rcx
mov [tempA], rax

; After
mov rax, [tempA]
add rax, [tempB]
mov [tempA], rax

The same idea also applies to boolean operations and compare/test sequences.

Call Return Forwarding

Call results frequently arrive in the right place already. Storing them to a stack temporary just to read them back is needless traffic:

; Before
call foo
mov [temp], rax
mov rcx, [temp]

; After
call foo
mov rcx, rax

This pass is ABI-aware. It knows which registers carry integer and floating return values and which registers a call may clobber.
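The ABI knowledge this depends on amounts to a small table. The register sets below are standard System V x86-64 facts; the table layout and the `survives_call` helper are illustrative, not Mica's API:

```python
# Sketch of the ABI facts call-adjacent passes consult (System V x86-64).
# The structure is hypothetical; the register sets are standard ABI rules.
SYSV_X64 = {
    "int_return": ["rax", "rdx"],      # integer/pointer return registers
    "float_return": ["xmm0", "xmm1"],  # floating-point return registers
    "caller_saved": {"rax", "rcx", "rdx", "rsi", "rdi",
                     "r8", "r9", "r10", "r11"},
}

def survives_call(reg, abi=SYSV_X64):
    """A cached register value survives a call only if the register is
    callee-saved under the active ABI."""
    return reg not in abi["caller_saved"]
```

Any forwarding across a call has to consult a table like this: a value cached in rax does not survive the call, while one in rbx does.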

Dead Store Elimination

Some stores disappear simply because a later store overwrites the same location before any read can observe the first value:

; Before
mov [rbp-8], r10
mov ecx, [rbp-16]
mov [rbp-8], r11

; After
mov ecx, [rbp-16]
mov [rbp-8], r11
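The scan behind this can be sketched in a few lines. This is Python over a toy tuple representation; it sidesteps aliasing by treating distinct address strings as disjoint, which only stands in for the real pass's exact-matching and disjointness checks:

```python
# Sketch of dead store elimination over a linear window (hypothetical IR).
# A store is dead if the same location is stored again before any read,
# call, or control-flow barrier could observe the first value.

def eliminate_dead_stores(instrs):
    out = []
    for i, ins in enumerate(instrs):
        if ins[0] == "store":
            dead = False
            for later in instrs[i + 1:]:
                if later[0] == "store" and later[1] == ins[1]:
                    dead = True  # overwritten before any read
                    break
                if later[0] == "load" and later[2] == ins[1]:
                    break  # the value is read: the store is live
                if later[0] in ("call", "label", "jump"):
                    break  # memory may be observed elsewhere: keep it
            if dead:
                continue
        out.append(ins)
    return out
```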

Stack Cleanup

Late cleanup is also a good place to simplify frame noise:

; Before
sub rsp, 32
add rsp, 16

; After
sub rsp, 16

Or, when the two instructions cancel exactly, both can disappear.
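Folding two adjacent adjustments reduces to signed arithmetic on the rsp deltas. A minimal sketch (Python; encoding `sub rsp, N` as a negative delta and `add rsp, N` as a positive one is an assumption of this sketch):

```python
# Sketch of stack-adjust folding: merge two adjacent rsp updates into one
# instruction, or drop both when the net adjustment is zero.

def fold_stack_adjust(a, b):
    """a, b: signed rsp deltas (sub rsp, N => -N; add rsp, N => +N).
    Returns the folded instruction list."""
    net = a + b
    if net == 0:
        return []  # the pair cancels exactly: both instructions disappear
    if net < 0:
        return [f"sub rsp, {-net}"]
    return [f"add rsp, {net}"]
```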

Why These Rewrites Are Safe

The optimizer’s main rule is simple: when it cannot prove a rewrite safe, it does not rewrite. Several constraints enforce that policy.

  • Exact memory matching. Two memory operands are treated as the same location only when base register, symbol, offset, and size all match.
  • Register-family awareness. rax, eax, ax, al, and ah are not independent; a write to one view can invalidate assumptions about the others.
  • ABI boundaries. Calls are treated according to the active ABI, including caller-saved and callee-saved register rules.
  • Control-flow barriers. Labels, jumps, returns, and calls terminate many local searches because they break the simple linear model the peephole pass relies on.
  • Bounded windows. Non-adjacent rewrites search only a small number of instructions, which keeps reasoning local and predictable.
  • Flag preservation. Rewrites that would change observable flag behavior are rejected unless the optimizer can show the replacement is equivalent for the later consumer.
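Two of these predicates are simple enough to sketch directly. The Python below is hypothetical in its data layout, but the register-family and exact-match rules it encodes are the ones described above:

```python
# Sketch of two safety predicates (hypothetical representation).

# Views of the same architectural register: a write to any one of them
# invalidates cached knowledge about all the others.
REG_FAMILIES = [
    {"rax", "eax", "ax", "al", "ah"},
    {"rbx", "ebx", "bx", "bl", "bh"},
    # ... one family per architectural register
]

def same_family(a, b):
    return any(a in fam and b in fam for fam in REG_FAMILIES)

# Two memory operands are "the same location" only on an exact match of
# every component; any mismatch means "assume they may differ".
def same_location(m1, m2):
    return (m1["base"], m1["symbol"], m1["offset"], m1["size"]) == \
           (m2["base"], m2["symbol"], m2["offset"], m2["size"])
```

Note the asymmetry: exact matching is used to prove two operands are the same, never to prove they are different, which is what keeps the policy conservative.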

This is why the optimizer is best described as conservative rather than aggressive. It intentionally leaves some opportunities untouched.

What The Optimizer Measures

The pass reports both the total instruction count before and after cleanup and per-pattern counters for the rewrites it applies. That makes the backend measurable instead of anecdotal. It is possible to tell whether a new pass is actually carrying its weight, and it is possible to detect when a change in earlier compilation stages starts producing worse assembly.

The statistics are also useful for development discipline. A peephole pass is easy to add; a justified peephole pass needs evidence.

What This Optimizer Is Not

The current peephole pass is not Mica’s end-state optimizer. It does not attempt:

  • whole-function SSA optimization
  • global value numbering
  • loop transforms
  • interprocedural analysis
  • register allocation strategy changes

Those are later stages on Mica’s roadmap. The peephole optimizer is the local cleanup layer that already ships today, and it provides a solid, testable base for the more ambitious optimizer work planned next.

Where It Goes Next

Mica’s roadmap after 4.5 includes SSA transformation and broader global optimization work. When that arrives, the peephole pass does not disappear. It continues to matter because backend lowering still creates machine-specific local artifacts that higher-level analysis does not see.

That is the lasting role of a peephole optimizer in this compiler: not a substitute for global optimization, but a disciplined last pass that turns uniformly emitted assembly into cleaner final machine code.