
TL;DR / Key results
- Throughput shaped in cache: CHIME’s GPU correlator path ingests UDP from FPGAs and, via DPDK poll-mode + DDIO, parses in L3 and writes non-temporal to exact DRAM offsets, pre-arranged for GPU math.
- Memory ops halved: The design targets ~2 host-memory operations per byte delivered to GPUs (DRAM write, then GPU read), avoiding extra reorder passes.
- Feeding GPUs at line-rate: Legacy CHIME nodes sustained ~25.6 Gb/s per CPU; current upgrades target ~100 Gb/s per NUMA with distributor cores.
- Commodity CPUs, fewer cores: 6-core hosts handle capture/placement because the CPU mostly copies; DPDK minimizes per-packet cycles.
- Portable framework: A single pipeline framework (“Kotekan”) abstracts DPDK boilerplate; different telescopes plug in stages and YAML pipelines.
“DPDK let us look at the header while it was still in L3 and write the payload exactly where the GPU expects it. We needed the cluster to ingest over 6.4 Tb/s without major CPU resources.” – Andre Renard
Opening
Instruments break not with loud bangs but with slow math: a firehose of packets the CPU can’t place, ring buffers that miss by a cacheline, correlators that stall because a matrix never quite arrived in order. When the Canadian Hydrogen Intensity Mapping Experiment (CHIME) started seeing the sky as streams of UDP from thousands of digitizers, Andre Renard had one job that mattered: get every packet where the GPU expects it, on time, without roasting host memory bandwidth.
CHIME’s design bet early on GPUs for correlation: cheap FLOPs, tensor cores on the horizon, rapid iteration. That created a different bottleneck: host memory. Traditional paths (kernel sockets, two-pass reorders, GPU-side reshuffles) burned cycles and DRAM bandwidth they didn’t have. Renard’s team started looking at DPDK.
The move was pragmatic. Poll-mode to avoid context switches; DDIO to inspect headers while the bytes are still in LLC; non-temporal writes to land payloads directly at precomputed strides. One pass across cache, one write to DRAM, one GPU read.
The Human Story
Andre Renard (University of Toronto / CHIME Collaboration) joined CHIME as project staff: a computer scientist embedded in a physicist-led experiment. “It’s definitely not a solo project,” he says. Multiple institutions, from UBC to Perimeter to McGill, share software development; 5-10 engineers contribute at any time across telescopes. Renard took the network path: FPGAs push UDP; GPUs correlate; the host makes it look easy.
“I’m proud we made the world’s largest radio correlator of its time actually work, bandwidth, antennas, the whole thing, and that our piece of the pipeline held up.”
Industry Consensus / Problem Identification
By the time CHIME began building, GPUs for radio astronomy had moved from a curiosity to a credible option. FPGAs and ASICs still dominated the front end, but matrix-heavy, low-bit-depth math made GPUs attractive and cost-effective. CHIME’s architecture took advantage of that:
- F-engine (FPGA): Digitize and channelize. Split broadband into thousands of narrow frequency channels; perform the corner-turn so each downstream node sees all inputs for a subset of frequencies.
- X-engine (GPU): Perform cross-correlation across all inputs (outer products → Hermitian matrices), then hand results to post-processing and imaging.
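The X-engine’s core operation is small enough to sketch. Below is a toy Python version of one accumulation step, using built-in complex floats for clarity; the real kernels run low-bit-depth integer math on GPUs, and input counts are in the thousands rather than four.

```python
# Toy X-engine step: accumulate one time sample into the visibility
# matrix V[i][j] += x[i] * conj(x[j]).  Floats here for clarity; the
# real kernels use low-bit-depth integers on GPUs.

def accumulate(V, x):
    n = len(x)
    for i in range(n):
        for j in range(n):
            V[i][j] += x[i] * x[j].conjugate()
    return V

n = 4  # toy input count; the real instrument has thousands
V = [[0j] * n for _ in range(n)]
x = [1 + 1j, 2 - 1j, 3j, -1 + 0j]  # one sample per input
accumulate(V, x)

# The result is Hermitian: V[i][j] == conj(V[j][i]), so only the
# upper triangle needs to be stored in practice.
assert V[0][1] == V[1][0].conjugate()
assert V[0][0].imag == 0  # diagonal = |x_i|^2, purely real
```

The Hermitian structure is why downstream stages can halve storage by keeping only one triangle of each matrix.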
The catch was scale. The project moves UDP packets at over 6.4 Tb/s across point-to-point links from F-engines to X-engine GPU nodes. The canonical approaches in similar systems (split headers/payloads with verbs, land payloads somewhere, then reorder in a second pass on CPU or GPU) double-touch DRAM and overuse cores.
“We hit host memory bandwidth early. That was our wall, more than PCIe or GPU FLOPs.”
The idea that “the kernel can take it” was a non-starter. Even older CHIME nodes ran ~25.6 Gb/s per CPU, and upgrades now target ~100 Gb/s per NUMA. That mandates kernel bypass and ruthless avoidance of extra passes.
Technical Challenge
Make a UDP firehose look like a tidy, GPU-ready matrix without:
- Using kernel sockets or copy-heavy paths
- Performing a reorder pass in CPU DRAM
- Wasting GPU global memory to reorder there
- Spinning too many cores on per-packet overhead
CHIME’s additional constraint: they maintain a RAM ring buffer of incoming baseband (raw) data. If an event (e.g., FRB) triggers, they pull the raw segment from RAM. SSDs can’t keep up (endurance and bandwidth), and spinning disks are out of the question at these rates. That rules out “NIC→GPU only” paths: the data must pass through host DRAM anyway.
“The dream of NIC DMA straight into GPU is nice, but our science needs a full-rate copy in host RAM.”
So the path had to both feed the GPU and preserve a DRAM copy, with minimal memory traffic.
The Unconventional Approach
The team leaned into three ideas:
- Poll-mode everywhere (DPDK): Avoid context switches and per-packet kernel overhead; dedicate cores; treat the CPU as a very fast, very predictable copier.
- DDIO locality: Receive into LLC; inspect headers while they’re still in L3; decide final destinations before touching DRAM.
- Non-temporal scatter-writes: From L3, perform NT stores into multiple DRAM offsets per packet, arranged so the GPU sees exactly the matrix tiles it expects.
This flips the usual reorder pattern. Instead of landing payloads “somewhere,” sorting later, and writing again, the RX path places each packet once where it belongs in the final GPU-consumable layout. Then the GPU reads once, and math begins.
“We can even scatter/gather: same packet payload written into multiple precomputed strides so the final matrix shape is perfect for the kernel.”
That last part matters because correlation is outer products over many inputs and channels. Arranging memory in the right order translates directly into higher GPU occupancy and simpler kernels.
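The single-pass placement idea can be sketched in miniature. This is a Python sketch under stated assumptions: the header fields (`freq`, `input`) and the `[freq][time][input]` tile layout are illustrative, not CHIME’s actual packet format.

```python
# Single-pass placement sketch: the RX path computes each payload
# byte's final address from header fields, so data lands once, in
# the order the GPU kernel expects.  Header fields and layout are
# illustrative assumptions, not CHIME's actual format.

SAMPLES = 8    # time samples per packet (toy value)
N_INPUTS = 4   # inputs interleaved per frequency tile (toy value)
N_FREQ = 2

buf = bytearray(N_FREQ * SAMPLES * N_INPUTS)

def place(freq, inp, payload):
    # payload holds SAMPLES consecutive time samples for one input;
    # scatter them with stride N_INPUTS so inputs interleave per time.
    base = freq * SAMPLES * N_INPUTS + inp
    for t in range(SAMPLES):
        buf[base + t * N_INPUTS] = payload[t]

# Two packets, same frequency, different inputs:
place(0, 0, bytes([10] * SAMPLES))
place(0, 1, bytes([20] * SAMPLES))

# Inputs now interleave per time sample -- no second reorder pass.
assert buf[0] == 10 and buf[1] == 20
assert buf[N_INPUTS] == 10 and buf[N_INPUTS + 1] == 20
```

In the real path the per-byte loop is replaced by non-temporal stores of whole cachelines, but the address arithmetic is the same kind of header-driven stride computation.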
Cultural Translation
CHIME sits at the intersection of astronomy, HPC, and network systems. Each community brings different mental models:
- Astronomers speak in beams, baselines, and FRBs. The requirement is scientific: don’t drop packets; preserve baseband; map the sky.
- HPC/GPU folks want coalesced reads, tensor core throughput, and tile shapes.
- Network engineers obsess over queues, NUMA locality, and cachelines.
CHIME’s software framework, Kotekan, bridges the gap. It hides DPDK boilerplate (NIC init, RX queue mapping, core pinning) behind base classes and YAML pipeline descriptions. Teams across instruments can implement a new “stage” without learning every DPDK nuance or pthread trick.
“One binary can run different telescopes by swapping the YAML pipeline. In some limited cases, you can build a new instrument mostly by writing a new config.”
What It Actually Does
At the packet path level:
- F-engines send UDP frames containing channelized samples.
- DPDK poll-mode RX cores dequeue packets while they’re still in L3 (DDIO).
- The code parses a custom header (still hot) to compute target offsets.
- It performs non-temporal stores to scatter the payload into DRAM addresses computed from header details.
- A GPU stage then DMA-reads those regions and launches correlation kernels (outer products → Hermitian matrices).
- In parallel, a baseband ring buffer in host RAM retains a rolling window of raw data for later retrieval if a trigger fires.
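The baseband retention step can be modeled as a simple ring buffer. A toy Python sketch follows; frame granularity, window size, and the trigger mechanism are illustrative, not CHIME’s actual implementation.

```python
# Rolling-window retention sketch: host RAM keeps the newest N frames
# of raw baseband; on a trigger (e.g. an FRB candidate), the retained
# segment is dumped before it is overwritten.

class RingBuffer:
    def __init__(self, nframes):
        self.frames = [None] * nframes
        self.head = 0      # next write slot
        self.count = 0     # frames currently retained

    def push(self, frame):
        self.frames[self.head] = frame
        self.head = (self.head + 1) % len(self.frames)
        self.count = min(self.count + 1, len(self.frames))

    def dump(self):
        # Oldest-to-newest view of the retained window.
        n = len(self.frames)
        start = (self.head - self.count) % n
        return [self.frames[(start + i) % n] for i in range(self.count)]

rb = RingBuffer(4)
for i in range(6):               # write 6 frames into a 4-frame window
    rb.push(i)
assert rb.dump() == [2, 3, 4, 5]  # only the newest 4 survive
```

The real buffer holds a full-rate copy of the stream, which is why node RAM is sized in terabytes rather than gigabytes.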
Scope & limits (explicit):
- Scope: Host-side packet → DRAM placement optimized for GPU consumption; baseband retention in RAM; portable across several telescopes via a shared framework.
- Limits: UDP ingress tolerates out-of-order packets and gaps but assumes very low loss; the path remains host-DRAM mediated (no NIC→GPU direct placement), by design.
Addressing Concerns
“Isn’t verbs/RDMA the modern way?”
Renard’s team considered verbs-based split and reorder. The challenge: extra passes. Either a CPU second pass to reorder or a GPU reorder that burns global memory and adds complexity. Their constraint, full-rate baseband in RAM, means NIC→GPU doesn’t remove the DRAM trip. DPDK minimizes it to one write.
“Poll-mode wastes cores.”
They run on 6-core CPUs in many nodes, intentionally small, because the CPU’s job is mostly copy/placement with few cycles per packet. DPDK’s low overhead made that feasible. On newer 100 Gb/s per NUMA nodes, they add distributor cores; the model still holds.
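The distributor-core model can be sketched as a classification step: one core inspects headers and fans packets out to placement workers. This is a toy single-threaded Python model of the fan-out decision only; the hash-by-frequency rule is an assumption, not CHIME’s actual policy.

```python
# Distributor sketch: route each packet to a placement worker based
# on a header field, keeping related destination tiles on one worker.

N_WORKERS = 4  # toy value; real counts depend on per-NUMA budget

def distribute(hdr):
    # Keep all packets for one frequency on one worker so its
    # destination tiles stay NUMA- and cache-local (assumed policy).
    return hdr["freq"] % N_WORKERS

assert distribute({"freq": 0}) == 0
assert distribute({"freq": 5}) == 1
```

The point is that the distributor does only cheap header math; the expensive copy work stays on the workers.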
“Kernel bypass is dated; smart NICs can fix this.”
Smart NICs or programmable NIC pipelines could help, but economics and programmability matter. Commodity NICs plus DPDK delivered, repeatedly, across multiple instruments. The hardware dream Andre sketches, programmable on-NIC address calculation from custom headers, remains compelling if it arrives as a commodity surface.
“The bet is simple: one pass across cache, one write to DRAM, one GPU read. Anything extra costs bandwidth you don’t have.”
Real-World Impact
- CHIME correlator: At build time, largest radio correlator by bandwidth × antennas. The DPDK-based path is a critical link in sustained operations.
- Throughput milestones: Legacy nodes around 25.6 Gb/s per CPU; upgrades targeting 100 Gb/s per NUMA with distributor cores.
- Multi-site operations: Software and framework used across ~6 sites and by external users who download and adapt stages.
- Science enabled: Mapping 21-cm neutral hydrogen to probe baryon acoustic oscillations; pulsar timing; prolific fast radio burst (FRB) detection with outrigger stations for precise localization.
- Maintainable deployments: Preference for Ubuntu-bundled DPDK eases adoption across collaborations without bespoke build hurdles.
Reproduce It (Engineering Notes)
Goal: Land UDP payloads into GPU-ready DRAM tiles in a single pass.
Environment (representative):
- NIC: Commodity 10/25/100 GbE supporting DDIO on host platform
- CPU: 1–2 sockets; ensure NUMA-local RX queues; 6 cores workable at ~25 Gb/s; add distributor cores at 100 Gb/s/NUMA
- GPU: 4× per node typical; correlation kernels tuned for int4/int8/tensor cores
- RAM: Large (e.g., ≥1.5 TB per node) to hold baseband ring buffer
- DPDK: Use distro-packaged (Ubuntu) for reproducibility across sites; pin lcores via YAML/pipeline config in Kotekan
Build/Run sketch (framework-agnostic pseudocode):
// Pseudocode: single-pass placement
while (rx_dequeue(pkts, RX_BURST)) {
    for (pkt in pkts) {
        hdr = parse_header(pkt);  // still in LLC via DDIO
        // Compute one or more target offsets for scatter
        for (t in layout_targets(hdr)) {
            nt_store(t.dst, pkt->payload, t.len);  // non-temporal write to DRAM
        }
    }
}
// GPU stage DMA-reads the arranged tiles and launches corr kernels.
Config checklist:
- Map RX queues to NUMA-local cores and target DRAM on the same socket.
- Disable interrupt moderation; poll-mode only.
- Use hugepages for DPDK mbufs; align scatter destinations to GPU-friendly strides.
- Validate LLC hit rates and memory ops with Intel PCM (or analogous counters).
- At 100 Gb/s, add a distributor core fan-out to multiple placement workers per NUMA.
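A minimal host-prep sketch for the hugepage and NIC-binding items above, assuming a stock Ubuntu host with the distro DPDK package. Page counts and the PCI address are placeholders; adjust for your hardware and NUMA layout.

```shell
# Reserve 2 MiB hugepages on each NUMA node (counts are illustrative)
echo 2048 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 2048 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
mount -t hugetlbfs nodev /mnt/huge

# Bind the capture NIC to a DPDK-compatible driver.
# dpdk-devbind.py ships with the DPDK package; the PCI address
# below is a placeholder for your NIC.
dpdk-devbind.py --status
dpdk-devbind.py --bind=vfio-pci 0000:3b:00.0
```

Interrupt moderation and RX queue-to-core mapping are then configured per NIC/driver; the poll-mode application never takes an RX interrupt.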
Sanity checks:
- Zero-loss on long runs at target line rate (synthetic F-engine traffic OK).
- PCM shows ~2 memory ops/byte path (DRAM write, then GPU read).
- GPU kernels see expected tile shapes without an internal reorder step.
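The ~2 memory ops/byte budget is easy to sanity-check with back-of-envelope arithmetic. The sketch below uses the 100 Gb/s per-NUMA upgrade target quoted earlier; it shows why an extra reorder pass is expensive rather than merely untidy.

```python
# Back-of-envelope DRAM traffic at the 100 Gb/s-per-NUMA target.
line_rate_gbps = 100
ingest_GBps = line_rate_gbps / 8     # 12.5 GB/s of payload

single_pass = ingest_GBps * 2        # one DRAM write + one GPU read
with_reorder = ingest_GBps * 4       # adds an extra read + write pass

assert single_pass == 25.0           # GB/s of DRAM traffic
assert with_reorder == 50.0          # a reorder pass doubles it
```

Against a fixed per-socket DRAM bandwidth, that factor of two is the difference between headroom and the wall the team hit early.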
Trade-offs
- Host RAM dependency is intentional (for baseband capture); NIC→GPU bypass would under-deliver CHIME’s needs.
- Poll-mode demands dedicated cores; it buys predictability and low tail latency at the cost of idle power.
- Scatter-write complexity shifts logic to RX; it simplifies GPU kernels and reduces total memory traffic.
Community Impact
The correlator work sits alongside broader radio astronomy efforts: teams exploring NIC→GPU placement, terabit-class ingress, and tensor-core-tailored kernels. Renard calls out ASTRON work (e.g., John Romein) exploring DPDK for GPU memory regions and extreme bandwidth. While CHIME’s current path stays DRAM-centric by design, these lines of work converge on the same question: how do we feed accelerators at scale without melting host resources?
“Long term, everyone faces the same problem: feeding GPUs without burning CPUs or DRAM.”
Future & Next Steps
- CHIME X-engine upgrade: Modern GPUs, tensor-core kernels, updated Kotekan pipelines; sustained 100 Gb/s/NUMA paths.
- CHORD (sister telescope): Dish-based array next to CHIME; newer FPGAs; similar DPDK path via a switch fabric.
- HIRAX (South Africa): Sister project targeting similar 100 Gb/s/NUMA ingest with Kotekan stages.
- Wishlist for NICs + APIs:
- Bulk enqueue semantics akin to verbs: “Next N packets land at base + stride S”
- Programmable address calculators on NICs: turn custom headers into DMA addresses (and scatter lists)
- A commodity path for FPGA→RDMA encapsulation that’s feasible without massive RTL investments
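The first wishlist item amounts to a very small contract. A hypothetical sketch follows; this API does not exist in DPDK or verbs today, which is exactly the point of the wishlist.

```python
# Hypothetical "bulk enqueue" semantics: tell the NIC once that the
# next N packets land at base + k * stride, instead of posting one
# descriptor per packet.  Sketch of the desired contract only.

def bulk_enqueue(base, stride, n):
    # Returns the destination address the NIC would use for each of
    # the next n packets.
    return [base + k * stride for k in range(n)]

addrs = bulk_enqueue(base=0x1000, stride=0x400, n=4)
assert addrs == [0x1000, 0x1400, 0x1800, 0x1C00]
```

One descriptor amortized over N packets would cut per-packet host work to nearly zero for streams with predictable layout, which is what F-engine traffic is.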
How to Contribute
- DPDK stages in Kotekan: New packet processors for alternative F-engine formats; distributor-core strategies for 100G.
- Performance tooling: Portable PCM-like sampling, NUMA heatmaps, and cache residency metrics integrated into pipelines.
- GPU kernels: int4/int8 correlation kernels tuned for new tensor cores; memory-layout co-design with host scatter logic.
- Reliability: Long-haul, zero-loss regression harnesses; packet gap simulation; time-sync checks across multi-site deployments.
Onboarding path:
- Start with docs/tests in Kotekan: run a synthetic F-engine generator → verify placement maps and GPU tiles.
- Implement a toy stage: parse a minimal header, scatter to two destinations; validate with a tile checker.
- Add metrics hooks (per-queue drops, L3 hit rate proxies, DRAM BW, GPU DMA time).
- Join the mailing lists; discuss NUMA layouts and YAML pipelines before touching hot paths.
- Only then propose core changes to shared DPDK abstractions.
Project links:
- https://github.com/kotekan/kotekan
- https://chime-experiment.ca/
- https://www.chord-observatory.ca/
- https://hirax.ukzn.ac.za/
Closing
Ask Renard what he’d change in DPDK and he first demurs: “It does what we need.” Then the engineer resurfaces: bulk enqueue semantics, on-NIC programmable address transforms, a commodity way for FPGAs to produce RDMA-placeable streams without heroic RTL. None of that contradicts CHIME’s DRAM-first reality. It simply opens options for the next instruments.
“I’d love a commodity NIC where I upload a tiny program: here’s my header, here’s the formula, put the packet exactly there.”
If you’re a developer who enjoys cachelines, NUMA maps, and the satisfaction of shaving one more pass off a hot path, CHIME’s approach shows the shape of the work: make placement decisions earlier; touch memory fewer times. Bring that energy to DPDK, to Kotekan, and to the telescopes that still need to be made real.
Get Involved
Review your first DPDK patch today: https://www.dpdk.org/review-your-first-patch/
About CHIME
The Canadian Hydrogen Intensity Mapping Experiment (CHIME) is a fixed, wide-field radio telescope located at the Dominion Radio Astrophysical Observatory near Penticton, British Columbia. It uses four stationary, 100-meter-long cylindrical reflectors in a drift-scan configuration: as Earth rotates, CHIME continuously maps a narrow north–south strip of the sky. Its science focuses on three pillars: 21-cm cosmology (tracing large-scale structure via neutral hydrogen and baryon acoustic oscillations), pulsar timing (including gravitational-wave–related studies), and fast radio bursts (FRBs), with outrigger stations added for high-precision FRB localization.
On the compute side, CHIME pairs FPGA “F-engine” front ends (digitization and channelization with a corner-turn) with GPU “X-engine” correlators that perform massive outer-product math to form visibilities and images. The collaboration spans multiple institutions, including the Dominion Radio Astrophysical Observatory and McGill University, with shared software frameworks that enable related instruments (e.g., sister arrays in Canada and South Africa) to reuse pipeline components and configurations.

