Principles of Mechanical Sympathy
Source: martinfowler.com | Author: Caer Sanders | Published: 2026-04-07 | Category: pattern | Credibility: high
Executive Summary
- Mechanical sympathy — writing software aligned with hardware constraints — can be distilled into four actionable principles: predictable memory access, cache-line padding, the single-writer principle, and natural batching.
- The article demonstrates each principle with latency comparisons and architectural examples, arguing these techniques apply from in-memory systems to distributed data platforms.
- Sanders emphasizes measurement before optimization: define SLIs, SLOs, and SLAs first, because “you can’t improve what you can’t measure.”
Critical Analysis
Claim: “Linear/sequential memory access significantly outperforms random access because CPUs prefetch and cache contiguous memory”
- Evidence quality: benchmark
- Assessment: This is a well-established fact of computer architecture. CPU hardware prefetchers are specifically designed to detect and accelerate stride-1 (sequential) access patterns. L1/L2 cache hit latency (~1–4ns) versus RAM latency (~60–100ns) is a ~25x difference documented across hardware vendors. The article references this directionally but does not quote specific numbers for its ETL example; a minimal sketch of the effect appears below.
- Counter-argument: The benefit depends heavily on data structure choice and workload pattern. Column-oriented storage (Parquet, ClickHouse) already exploits this by design — teams already on modern data infrastructure may get this for free without explicit algorithmic effort. Random access is sometimes unavoidable for lookup-heavy workloads (hash tables, sparse graphs), where the advice to “use sequential algorithms” is impractical.
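To make the prefetching claim concrete, here is a minimal Java sketch (not from the article; the array size, seed, and hand-rolled timing are illustrative assumptions) that sums the same array once in index order and once through a shuffled index, defeating the stride-1 prefetcher:

```java
import java.util.Random;

// Sums the same large array sequentially and via a shuffled index.
// Two 64M-int arrays need ~512 MB of heap (run with e.g. -Xmx1g).
// Sizes are illustrative; use JMH for real measurements to avoid
// JIT warm-up and dead-code-elimination artifacts.
public class AccessPatternDemo {
    static final int N = 64 * 1024 * 1024;

    public static void main(String[] args) {
        int[] data = new int[N];
        int[] order = new int[N];
        for (int i = 0; i < N; i++) { data[i] = i; order[i] = i; }

        Random rnd = new Random(42);           // Fisher-Yates shuffle of the
        for (int i = N - 1; i > 0; i--) {      // index array turns the reads
            int j = rnd.nextInt(i + 1);        // of `data` into random access
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        System.out.println("sequential: " + sumTimed(data, null) + " ms");
        System.out.println("shuffled:   " + sumTimed(data, order) + " ms");
    }

    // Sums `data` either in index order (order == null) or via `order`.
    static long sumTimed(int[] data, int[] order) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[order == null ? i : order[i]];
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum " + sum); // keep the sum observably live
        return elapsedMs;
    }
}
```

On typical hardware the shuffled pass runs several times slower even though it performs the same number of additions, which is the entire content of the claim.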
Claim: “False sharing causes near-linear latency increase as thread count grows; padding keeps latency nearly constant”
- Evidence quality: benchmark
- Assessment: The claim is supported by well-replicated benchmarks across multiple independent sources. A Baeldung benchmark shows false-shared counters producing ~1 billion L1 cache misses versus ~120 million with padding, roughly an 8x reduction in cache misses. A 3.2x wall-clock speedup from padding alone has been independently measured. The “near-linear increase” language is consistent with the cache coherency protocol overhead (MESI/MOESI) being additive per thread.
- Counter-argument: Java’s `@Contended` annotation (available since Java 8 as `sun.misc.Contended`, promoted to `jdk.internal.vm.annotation.Contended` in Java 9+) automates this mitigation, reducing the need for manual padding in JVM workloads. The article does not distinguish between languages; C/C++ developers must still pad manually. Overuse of padding wastes cache space and can hurt performance when the working set no longer fits in cache. A manual-padding sketch follows below.
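To illustrate the padding technique the benchmarks describe, a minimal manual-padding sketch in Java (class and field names are hypothetical; HotSpot does not guarantee field layout, which is exactly why `@Contended` exists as the supported route):

```java
// Two threads each increment their own counter. Without padding, the two
// longs typically land on the same 64-byte cache line and the line
// ping-pongs between cores; with padding, each counter gets its own line.
public class FalseSharingDemo {
    static class Shared {             // v1 and v2 likely share a cache line
        volatile long v1, v2;
    }
    static class Padded {             // 7 longs (56 bytes) of padding push v2
        volatile long v1;             // onto its own line on 64-byte-line CPUs;
        long p1, p2, p3, p4, p5, p6, p7; // not guaranteed: HotSpot may reorder
        volatile long v2;
    }

    static final long ITERS = 100_000_000L;

    public static void main(String[] args) throws InterruptedException {
        Shared s = new Shared();
        Padded p = new Padded();
        System.out.println("shared: " + run(() -> s.v1++, () -> s.v2++) + " ms");
        System.out.println("padded: " + run(() -> p.v1++, () -> p.v2++) + " ms");
    }

    // Times two threads hammering their respective counters. The volatile
    // increments are not atomic; only the wall-clock timing matters here.
    static long run(Runnable a, Runnable b) throws InterruptedException {
        long start = System.nanoTime();
        Thread t1 = new Thread(() -> { for (long i = 0; i < ITERS; i++) a.run(); });
        Thread t2 = new Thread(() -> { for (long i = 0; i < ITERS; i++) b.run(); });
        t1.start(); t2.start(); t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Production JVM code should prefer `@Contended` (with `-XX:-RestrictContended`) over this manual layout, since the JVM is free to reorder fields and silently undo hand-written padding.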
Claim: “The Single Writer Principle eliminates mutex contention by dedicating one thread to all writes, enabling asynchronous message-based interaction”
- Evidence quality: case-study
- Assessment: This is a well-documented pattern originating from Martin Thompson’s work on the LMAX Disruptor (2011), where queue-based actor model prototypes underperformed because thread management overhead exceeded business logic cost. The principle is sound: eliminating CAS (compare-and-swap) or mutex lock contention removes cache coherency traffic on the write path. The article’s text embedding example is illustrative but not benchmarked.
- Counter-argument: The single-writer constraint can create a throughput bottleneck if the dedicated thread becomes a hot path itself. This moves contention to the message queue feeding the writer thread rather than eliminating it. Martin Thompson’s own blog notes that naive queue-based actor implementations (e.g., classic Erlang-style mailboxes backed by linked lists) can underperform mutex approaches because of allocation and GC pressure on the queue data structure. The Disruptor solves this with a pre-allocated ring buffer, but standard actor frameworks (Akka, Pekko) may not. A minimal sketch of the pattern follows below.
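A minimal Java sketch of the single-writer principle (names like `SingleWriterStore` and `Command` are illustrative, not from the article): all mutation flows through one dedicated thread fed by a message queue, so the state itself needs no lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Many producers enqueue commands; only one thread ever mutates `state`,
// so the map needs no synchronization on the write path. Reads from other
// threads would need their own message or snapshot path (omitted here).
public class SingleWriterStore {
    record Command(String key, long value) {}   // Java 16+ record

    private final BlockingQueue<Command> inbox = new ArrayBlockingQueue<>(65_536);
    private final Map<String, Long> state = new HashMap<>(); // writer-only

    public void start() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Command c = inbox.take();           // the only mutation path
                    state.merge(c.key(), c.value(), Long::sum);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();     // clean shutdown
            }
        }, "single-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Callable from any thread; contention moves into the queue hand-off,
    // not into the data structure, as the counter-argument above notes.
    public void submit(String key, long delta) throws InterruptedException {
        inbox.put(new Command(key, delta));
    }
}
```

Note that `ArrayBlockingQueue` itself takes an internal lock on every hand-off, which is precisely the counter-argument’s point: a Disruptor-style pre-allocated ring buffer replaces this queue to make the hand-off lock-free and allocation-free.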
Claim: “Natural batching achieves 100–200µs latency versus 200–400µs for timeout-based batching, assuming 100µs batch overhead”
- Evidence quality: vendor-sponsored
- Assessment: The latency comparison is a theoretical model constructed from an assumed 100µs batch-overhead constant, not an empirical measurement. The math holds within the model’s assumptions: natural batching waits only for the queue to drain (one batch overhead), while timeout batching always pays at least one timeout period (often 100–200µs on Linux without a real-time kernel). The direction of the claim is independently corroborated by Martin Thompson’s 2011 “Smart Batching” blog post (the same technique, renamed). However, the specific numbers (100–200µs) are derived from the article’s own model, not from profiling real systems.
- Counter-argument: The 100µs overhead assumption is presented as given rather than derived. In GPU inference workloads (the article’s motivating example with ONNX/TensorFlow), kernel launch overhead and host-device transfer latency often dominate batch processing time and are highly variable, so teams should model the actual tail-latency distribution before adopting natural batching. For extremely low-latency use cases (<10µs), natural batching may still introduce unacceptable jitter because the thread must check queue emptiness, which is itself a memory access. A sketch of the natural-batching consumer loop follows below.
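The consumer-side shape of natural batching (Thompson’s “smart batching”) is small enough to sketch; this is illustrative Java, not the article’s code. The key move is blocking only when the queue is truly empty and otherwise draining whatever has already accumulated, so no timeout is ever paid:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Natural batching: batch size adapts to load. Under light load the
// consumer processes single items immediately; under heavy load batches
// grow, amortizing the fixed per-batch overhead across more items.
public class NaturalBatcher<T> implements Runnable {
    private final BlockingQueue<T> queue = new ArrayBlockingQueue<>(4096);
    private final List<T> batch = new ArrayList<>();

    public void submit(T item) throws InterruptedException {
        queue.put(item);
    }

    @Override public void run() {
        try {
            while (true) {
                batch.clear();
                batch.add(queue.take());   // block only if truly idle
                queue.drainTo(batch);      // grab everything already queued
                process(batch);            // pay the batch overhead once
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void process(List<T> items) { /* e.g., one GPU inference call */ }
}
```

A timeout-based batcher would instead loop on `poll(timeout, unit)` until the batch fills or the timeout fires; that always-paid wait is exactly where the model’s extra 100–200µs comes from.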
Claim: “LMAX Architecture processes millions of events per second on a single Java thread”
- Evidence quality: benchmark
- Assessment: The LMAX Disruptor has been publicly benchmarked at over 25 million messages/second with sub-50ns latency on commodity hardware. The LMAX exchange itself was built on this architecture and went live in 2010. The claim is independently corroborated by multiple third-party benchmarks. The “single thread” qualifier is accurate: the Disruptor achieves this through lock-free ring buffers and cache-friendly memory layout rather than parallelism.
- Counter-argument: “Millions of events per second on a single Java thread” is the peak throughput benchmark for the Disruptor pattern’s inter-thread messaging path, not necessarily for the full business logic pipeline. Real-world LMAX systems use multiple threads for different pipeline stages; the claim may mislead readers into thinking single-threaded processing is sufficient for complex event processing. A minimal wiring sketch of the Disruptor follows below.
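For readers unfamiliar with the pattern behind the benchmark, a minimal wiring sketch against the open-source Disruptor DSL (assuming the `com.lmax:disruptor` 3.x API; the `TradeEvent` type and values are invented for illustration):

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// One handler thread consumes events from a pre-allocated ring buffer;
// producers claim a slot, mutate the event in place, and publish it.
public class DisruptorDemo {
    static class TradeEvent { long price; }   // mutable, pre-allocated slot

    public static void main(String[] args) {
        Disruptor<TradeEvent> disruptor = new Disruptor<>(
                TradeEvent::new,              // fills the ring up front: no
                1024,                         // GC on the hot path; ring size
                DaemonThreadFactory.INSTANCE); // must be a power of two

        EventHandler<TradeEvent> handler =
                (ev, seq, endOfBatch) -> System.out.println("price " + ev.price);
        disruptor.handleEventsWith(handler);
        disruptor.start();

        RingBuffer<TradeEvent> ring = disruptor.getRingBuffer();
        long seq = ring.next();               // claim the next slot
        try {
            ring.get(seq).price = 42;         // mutate the pre-allocated event
        } finally {
            ring.publish(seq);                // make it visible to the handler
        }
    }
}
```

The benchmarked throughput comes from the pre-allocated ring (no allocation or GC on the hot path) and sequence counters padded against false sharing, which ties this claim back to the earlier principles.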
Credibility Assessment
- Author background: Caer Sanders has direct production experience building data and AI infrastructure at Wayfair and Thoughtworks and robotics platforms at Google; Sanders founded the With Caer studio and holds a patent in cryptographic consensus protocols. This background is consistent with the article’s blend of systems programming and distributed systems knowledge. The article explicitly notes that Claude Opus 4.6 was used for draft review but not content generation.
- Publication bias: martinfowler.com is one of the most credible independent software architecture publications. Articles are peer-reviewed by the editorial team. No vendor sponsorship is disclosed.
- Verdict: high. The four principles are grounded in established hardware architecture literature, and the author’s background supports the practical claims. The latency numbers for natural batching are model-derived rather than empirically measured, a minor limitation flagged in the analysis above.