Dwarkesh Podcast • May 22, 2026 • 1:20:18 • Interview

Chip design from the bottom up – Reiner Pope

From The Dwarkesh Patel Podcast

Dwarkesh Patel (Podcast host, analyst, and angel investor in…, The Dwarkesh Patel Podcast) • Ron Minsky (Guest) • Reiner Pope (CEO, MatX)

Executive Summary

Concerns Raised

The high cost and energy consumption of data movement relative to computation.
Physical limitations on clock speed due to logic path delays across the chip.
The quadratic scaling of multiplier circuit area with increased bit precision.
The inherent trade-off between the efficiency of large, specialized compute units and the flexibility of smaller, general-purpose ones.

Opportunities Identified

Developing novel architectures like 'splittable systolic arrays' to combine the benefits of large and small compute units.
Increasing throughput by optimizing hardware for lower-precision arithmetic like FP4 and FP8.
Designing specialized ASICs that can achieve an order of magnitude better cost and performance for specific AI workloads.
Improving the compute-to-communication ratio through tighter integration of memory and logic, as seen in the evolution of Tensor Cores.

Research Findings (12)

NVIDIA's B300 and later generation chips specify that FP4 computation is 3 times faster than FP8, a change from the previous 2x performance scaling.

MatX is developing a technology called a 'splittable systolic array,' which is designed to allow large systolic arrays to also function as smaller, independent ones.

The introduction of Tensor Cores in NVIDIA's Volta architecture was motivated by the need to solve the high data movement cost from register files, which constituted approximately 7/8ths of the circuit cost.
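
The data-movement argument can be illustrated with a rough count of register operand reads per multiply-accumulate. This is a minimal sketch under stated assumptions: the function names and the tile shapes are illustrative, not taken from the episode, and the model counts only operand reads, so it does not reproduce the 7/8ths figure.

```python
from fractions import Fraction


def operand_reads_per_mac_scalar() -> Fraction:
    # A scalar fused multiply-add d = a * b + c reads three register
    # operands (a, b, c) to perform a single MAC.
    return Fraction(3, 1)


def operand_reads_per_mac_tile(m: int, n: int, k: int) -> Fraction:
    # A matrix instruction D[m,n] = A[m,k] @ B[k,n] + C[m,n] reads each
    # input element once and performs m * n * k MACs.
    reads = m * k + k * n + m * n
    macs = m * n * k
    return Fraction(reads, macs)


if __name__ == "__main__":
    print("scalar FMA:", operand_reads_per_mac_scalar())
    for size in (4, 8, 16):
        print(f"{size}x{size}x{size} tile:", operand_reads_per_mac_tile(size, size, size))
```

In this toy model, reads per MAC fall as 3/N for an N x N x N tile, which is one way to see why moving from scalar instructions to matrix instructions reduces register-file traffic per unit of compute.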

Google's TPUs use a scratchpad memory architecture where software explicitly issues different instructions to access on-chip versus off-chip memory, unlike a traditional CPU cache which is managed by hardware.
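
The difference between a hardware-managed cache and a software-managed scratchpad can be sketched with two toy classes. Everything here is hypothetical: `HardwareCache`, `Scratchpad`, `dma_copy_in`, and `read_onchip` are illustrative names, not TPU instructions, and the LRU policy is an assumption chosen for simplicity.

```python
from collections import OrderedDict


class HardwareCache:
    """Toy cache: software issues one kind of load; hardware decides placement (LRU)."""

    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def load(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # hit: mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[addr] = self.backing[addr]
        return self.lines[addr]


class Scratchpad:
    """Toy scratchpad: software issues separate off-chip copy and on-chip read operations."""

    def __init__(self, backing, capacity):
        self.backing = backing
        self.slots = [None] * capacity
        self.offchip_copies = 0

    def dma_copy_in(self, slot, addr):
        # Explicit off-chip to on-chip transfer, scheduled by software.
        self.slots[slot] = self.backing[addr]
        self.offchip_copies += 1

    def read_onchip(self, slot):
        # Explicit on-chip access; never touches off-chip memory.
        return self.slots[slot]


memory = list(range(100, 108))

cache = HardwareCache(memory, capacity=2)
total_cache = sum(cache.load(addr) for addr in (0, 1, 0, 1))

pad = Scratchpad(memory, capacity=2)
pad.dma_copy_in(0, 0)
pad.dma_copy_in(1, 1)
total_pad = sum(pad.read_onchip(slot) for slot in (0, 1, 0, 1))
```

Both versions read the same values, but in the scratchpad version the program itself decides when off-chip traffic happens, which is the distinction the finding describes.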

In NVIDIA GPUs, the ability to use the same circuits for different precisions like FP4 and FP8 is limited, and the allocation of die area to each is a primary design choice.

The area required for multiplication circuits scales quadratically with the bit width of the numbers being processed.
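
One way to see the quadratic scaling is to count the 1-bit partial products in a schoolbook multiplier. The sketch below is a simplified model, assuming unsigned integer operands and one AND gate per partial product; real floating-point multipliers add exponent handling, rounding, and adder trees that it ignores. The function name is hypothetical.

```python
def schoolbook_multiply(a: int, b: int, bit_width: int):
    """Multiply two unsigned integers bit by bit.

    Returns (product, and_gate_count), where and_gate_count is the number of
    1-bit partial products formed, a rough proxy for multiplier area.
    """
    product = 0
    and_gates = 0
    for i in range(bit_width):
        for j in range(bit_width):
            partial = ((a >> i) & 1) & ((b >> j) & 1)  # one AND gate
            and_gates += 1
            product += partial << (i + j)
    return product, and_gates


if __name__ == "__main__":
    for width in (4, 8, 16):
        _, gates = schoolbook_multiply(0, 0, width)
        print(f"{width}-bit operands: {gates} partial products")
```

Doubling the operand width from 4 to 8 bits quadruples the partial-product count from 16 to 64 in this model.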

Historically, NVIDIA GPUs up to the B100 and B200 generations doubled their computational throughput (flop count) each time the numerical precision was halved.
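
The two scaling rules reported in these findings (2x per halving of precision historically, and FP4 at 3x FP8 from the B300 onward) can be compared with simple arithmetic. This is a sketch only: normalising FP16 to 1.0 is an arbitrary choice for illustration, and the function name and parameter are hypothetical.

```python
def relative_throughput(bits: int, fp4_over_fp8: float = 2.0) -> float:
    """Throughput relative to FP16 = 1.0 (an arbitrary baseline).

    fp4_over_fp8 is the FP4-to-FP8 speedup: 2.0 under the historical
    doubling rule, 3.0 under the B300-style specification.
    """
    table = {16: 1.0, 8: 2.0}
    table[4] = table[8] * fp4_over_fp8
    return table[bits]


if __name__ == "__main__":
    for label, ratio in (("historical 2x rule", 2.0), ("FP4 = 3x FP8", 3.0)):
        row = ", ".join(f"FP{b}: {relative_throughput(b, ratio)}" for b in (16, 8, 4))
        print(f"{label}: {row}")
```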

In pre-Volta NVIDIA GPUs, the circuitry for moving data from the register file to the arithmetic logic unit (ALU) was many times more expensive in terms of area than the ALU itself.

In a SemiAnalysis report ranking nearly 100 GPU clouds, Crusoe was one of only five providers to achieve the 'gold tier'.

A GPU's architecture consists of many small, nearly identical units (Streaming Multiprocessors or SMs) tiled across the chip.

A TPU's architecture is composed of a few large, coarse-grained units, such as large matrix units and a central vector unit.

The data movement bandwidth between vector and matrix units is much higher in a GPU than in a TPU because the wiring is distributed across many small SMs rather than concentrated between a few large blocks.

Processed May 22, 2026 • Daily intelligence brief → yt-dlp + mlx-whisper + Gemini

Key Themes

The Primacy of Multiply-Accumulate (MAC)

Compute vs. Communication Trade-off

Architectural Divergence: GPU vs. TPU

The Physics of Chip Design
