Your DGX Spark Is Cooking Itself

We run 13 AI agents around the clock on two ASUS Ascent GX10 units. MiniMax M2.7, tensor parallel across both nodes, 196K context, fp8 KV cache. The agents do investment research, competitive intelligence, proposal drafting, claim verification. They never stop.

Last night one of our nodes just vanished. SSH dropped, no ping response, telemetry gap. An hour later it came back on its own, GPU at 39°C, sitting in idle. Classic thermal shutdown.

We pulled the temperature logs from our monitoring database and the picture was bad. Not “occasionally spikes” bad. 62 out of 73 hours over the previous three days had max GPU temps at or above 85°C. The node was running in the danger zone nonstop, and we just hadn’t noticed because it never actually stopped working. Until it did.
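A per-hour rollup like the one behind that 62-out-of-73 figure can be sketched in a few lines. This is a minimal sketch, not our actual query: the table name `gpu_samples`, its columns, and the timestamp format are assumptions for illustration, and the dates in the demo are made up.

```python
import sqlite3

HOT_THRESHOLD_C = 85  # the "at or above 85°C" line used in the text


def count_hot_hours(conn, node, threshold_c=HOT_THRESHOLD_C):
    """Return (hot_hours, total_hours) for one node.

    An hour counts as hot when its maximum GPU temperature is at or above
    threshold_c. Assumes a hypothetical table gpu_samples with ISO-style
    'YYYY-MM-DD HH:MM:SS' timestamps in the ts column.
    """
    rows = conn.execute(
        """
        SELECT strftime('%Y-%m-%d %H', ts) AS hour, MAX(temp_c) AS max_temp
        FROM gpu_samples
        WHERE node = ?
        GROUP BY hour
        """,
        (node,),
    ).fetchall()
    hot = sum(1 for _, max_temp in rows if max_temp >= threshold_c)
    return hot, len(rows)


def demo():
    """Build a tiny in-memory database and count hot hours for gx10-1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gpu_samples "
        "(node TEXT, ts TEXT, temp_c REAL, power_w REAL, util_pct REAL)"
    )
    conn.executemany(
        "INSERT INTO gpu_samples (node, ts, temp_c) VALUES (?, ?, ?)",
        [
            ("gx10-1", "2026-01-01 23:57:00", 78),  # made-up dates
            ("gx10-1", "2026-01-02 00:02:00", 87),
            ("gx10-1", "2026-01-02 00:04:00", 85),
            ("gx10-2", "2026-01-02 00:02:00", 68),
        ],
    )
    return count_hot_hours(conn, "gx10-1")
```

Grouping by hour and taking the maximum is deliberately pessimistic: a single hot sample marks the whole hour, which is what you want when the failure mode is a hard shutdown.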

The telemetry told the full story

Our WPMC dashboard polls nvidia-smi on both nodes every 10 seconds and stores it in SQLite. Here’s what the hours before the crash looked like on gx10-1:

23:57  78°C  45W  96% util
00:02  87°C  73W  96% util  ← power spike
00:03  86°C  74W  96% util
00:04  85°C  74W  96% util  ← last reading
[1 hour gap — node offline]
01:04  39°C  10W   0% util  ← came back after cooldown
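For anyone who wants the same kind of log, here is a minimal sketch of a poller, assuming a hypothetical `gpu_samples` table; it is not the WPMC code. The `--query-gpu` fields are standard nvidia-smi ones, but whether every field reports a value on GB10 is something to check on your own unit, so unreported fields are stored as NULL.

```python
import sqlite3
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line such as '87, 73.12, 96' into a dict.

    Fields nvidia-smi cannot report (for example '[N/A]') become None.
    """
    def to_float(field):
        try:
            return float(field.strip())
        except ValueError:
            return None

    temp_c, power_w, util_pct = (to_float(f) for f in line.split(","))
    return {"temp_c": temp_c, "power_w": power_w, "util_pct": util_pct}


def poll_once(conn, node):
    """Read the first GPU once and append a row to gpu_samples (hypothetical schema)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    sample = parse_sample(out.splitlines()[0])
    conn.execute(
        "INSERT INTO gpu_samples (node, ts, temp_c, power_w, util_pct) "
        "VALUES (?, datetime('now'), ?, ?, ?)",
        (node, sample["temp_c"], sample["power_w"], sample["util_pct"]),
    )
    conn.commit()
    return sample
```

Calling `poll_once` in a loop with `time.sleep(10)` gives the 10-second cadence described above. The last row before a gap is your crash signature.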

The interesting part: gx10-2, running the exact same workload on identical hardware, was sitting at 68°C the whole time. Same model, same inference, same number of requests. Ten to fifteen degrees cooler.

That temperature gap had been there for weeks. gx10-1 came back from an ASUS repair in March and has run hot ever since. We assumed it was fine because it kept working. It wasn’t fine. It was accumulating heat faster than it could dissipate, and every power spike from a batch of simultaneous prefills pushed it closer to the edge.

Why DGX Spark overheats under MoE inference

The GB10 SoC has a 140W TDP crammed into a 1.13 liter chassis with firmware-controlled cooling that you cannot adjust. There is no fan speed control. There is no power limit command. nvidia-smi -pl returns N/A. The cooling solution was designed for desktop workloads and burst inference, not 13 concurrent agents holding the GPU at 96% utilization around the clock.

MoE models like MiniMax M2.5 and M2.7 make this worse because of how inference works. Most of the time the GPU is doing memory-bound decode (generating tokens one at a time). That’s relatively cool, maybe 45W. But when a new request starts, the full prompt has to be prefilled, and that’s a compute-heavy operation. If multiple agents start new mandates at the same time, you get a burst of concurrent prefills that spike power draw from 45W to 75W in seconds. That’s what happened at 00:02 in our logs. Multiple agents finished their previous tasks and got re-dispatched simultaneously.

We checked the NVIDIA forums and found at least ten threads about the same problem. Some users were seeing shutdowns at even lower temps, pointing to power delivery instability rather than pure thermal limits. One forensic analysis found kernel driver memory allocation failures at only 60°C. The Spark chassis is the bottleneck, not the GPU itself. The MSI EdgeXpert, which uses the same GB10 chip with a better vapor chamber cooler, runs 10-15°C lower and never throttles.

The fix is one command

sudo nvidia-smi -lgc 300,2200

The -lgc flag locks the GPU clock to a min,max range, here 300 to 2200 MHz, so the clock tops out at 2200 MHz instead of the default 3003 MHz. Our results:

Before: 78-88°C, sustained 96% util, periodic thermal shutdowns. After: 65-69°C, sustained 96% util, zero shutdowns.

Performance impact: nearly zero. MoE inference on DGX Spark is bottlenecked by memory bandwidth (273 GB/s per unit), not compute clock speed. The GPU spends most of its time waiting for weights to load from the unified LPDDR5x memory, not crunching them. Dropping from 3003 to 2200 MHz reduces peak compute throughput but doesn’t touch memory bandwidth at all. Our agents complete mandates in the same time, produce the same output, score the same on our eval suite.
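The back-of-envelope version of that argument, as a minimal sketch. The 273 GB/s figure and the two clock speeds come from above; the active parameter count and the one-byte-per-weight figure are hypothetical placeholders, not specs for any particular model. The bound also ignores KV cache reads and tensor-parallel splitting, and it assumes peak compute scales linearly with clock.

```python
def decode_tokens_per_second_bound(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Upper bound on single-stream decode speed.

    Assumes every generated token streams the active weights from memory
    exactly once, so the bound depends on bandwidth only, not on GPU clock.
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9) / bytes_per_token


def peak_compute_fraction(capped_mhz, default_mhz):
    """Fraction of peak compute left after the cap, assuming linear scaling with clock."""
    return capped_mhz / default_mhz


# Hypothetical model: 10B active parameters at 1 byte each (placeholder values).
bound = decode_tokens_per_second_bound(273, 10, 1)
fraction = peak_compute_fraction(2200, 3003)

print(f"decode bound: {bound:.1f} tokens/s per stream, at any clock")
print(f"peak compute after cap: {fraction:.0%} of default")
```

The clock never appears in the first function, which is the whole point: the cap gives up roughly a quarter of peak compute, and that only matters during prefill bursts, which are exactly the moments that were tripping the thermal limit.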

Multiple users on the NVIDIA forums confirmed this. One went from constant crashes to zero shutdowns at 2200 MHz. Another found 2100 MHz worked. The sweet spot seems to be 2000-2300 MHz depending on your ambient temperature and specific unit.
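Whatever number you settle on, it is worth confirming under load that the cap is actually holding. A minimal sketch, assuming the standard `clocks.gr` query field reports a value on your unit; the 15 MHz tolerance is our own guess to allow for clock step rounding, not a documented figure.

```python
import subprocess

CLOCK_QUERY = ["nvidia-smi", "--query-gpu=clocks.gr", "--format=csv,noheader,nounits"]


def parse_clock_mhz(output):
    """Return the first GPU's current graphics clock in MHz, or None if unreported."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    try:
        return int(float(lines[0].strip()))
    except ValueError:
        return None  # e.g. '[N/A]'


def cap_is_holding(current_mhz, cap_mhz, tolerance_mhz=15):
    """True when a clock reading exists and sits at or below the cap (plus tolerance)."""
    return current_mhz is not None and current_mhz <= cap_mhz + tolerance_mhz


def check_cap(cap_mhz=2200):
    """Query nvidia-smi once and report whether the cap appears to be in effect."""
    out = subprocess.run(CLOCK_QUERY, capture_output=True, text=True, check=True).stdout
    return cap_is_holding(parse_clock_mhz(out), cap_mhz)
```

Run the check while the agents are busy. An idle GPU sits at a low clock regardless, so a passing result at 0% utilization tells you nothing.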

Making it stick

The clock cap doesn’t persist across reboots. If your node hits a thermal shutdown and comes back, it boots at the full 3003 MHz and immediately starts cooking again. We solved this with a systemd oneshot service on each node:

[Unit]
Description=Apply GPU clock cap for thermal protection
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -lgc 300,2200
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

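Save it as /etc/systemd/system/gpu-clock-cap.service and enable it with sudo systemctl enable gpu-clock-cap.service. It then runs on every boot once nvidia-persistenced is up. That ordering alone doesn’t guarantee the cap lands before inference starts, so we also apply the same cap in our vLLM deployment script as a belt-and-suspenders measure.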
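The deployment-script half of that can be sketched as below. This is a minimal illustration rather than our actual script: the function names are hypothetical, the two nvidia-smi invocations are the same ones the unit file uses, and the process needs root for them to succeed.

```python
import subprocess


def clock_cap_commands(max_mhz=2200, min_mhz=300):
    """Return the nvidia-smi invocations used by the systemd unit, as argv lists."""
    return [
        ["nvidia-smi", "-pm", "1"],                      # enable persistence mode
        ["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"],  # lock GPU clock range
    ]


def apply_clock_cap(max_mhz=2200, min_mhz=300, run=subprocess.run):
    """Apply the cap before launching the inference server.

    The run parameter exists only so the function can be exercised without a GPU.
    Raises ValueError on an inverted range and CalledProcessError if
    nvidia-smi exits non-zero.
    """
    if min_mhz > max_mhz:
        raise ValueError("min_mhz must not exceed max_mhz")
    for cmd in clock_cap_commands(max_mhz, min_mhz):
        run(cmd, check=True)
```

Calling `apply_clock_cap()` as the first step of the launch path means a failed cap stops the deployment instead of letting the node start serving at full clock.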

What we learned

The DGX Spark is remarkable hardware for the price. Two units give you 256GB of unified LPDDR5x memory (128GB each) and a 200Gbps RDMA link for running frontier MoE models locally. But NVIDIA designed the cooling for burst workloads, not sustained 24/7 inference at 96% utilization. If you’re running agents or serving models continuously, you will hit this wall eventually.

The fix costs nothing. One command, no measurable performance loss, and a temperature drop of roughly 13 to 19 degrees in our case. If you’re running a Spark and haven’t set a clock cap, check your temps. You might be closer to a shutdown than you think.

A few other things we picked up along the way. ASUS firmware v0103 (released March 2026) reportedly drops temps another 8-10°C and improves ConnectX-7 link speed for multi-node setups. Some units have a bug where fans don’t spin in headless boot mode, so if you’re running without a monitor, verify your fans are actually running. And if your node gets stuck in low power mode after a thermal event (P8 state, 5W, 0% util), a full power cycle with the battery disconnected for five minutes usually clears it.

We run all of this on sovereign hardware because we believe the future of AI is local. No API keys, no rate limits, no vendor lock-in. Just your models, your data, your agents. The DGX Spark makes that possible at a price point that would have been unthinkable two years ago. You just have to keep it cool.