Cluster & infrastructure

A data center built from leftovers

2026-06-11 · ~5 min read

Pro teams train their chess AIs on server farms. For us, it started with a single desktop that first ran for hours, then for days. Heat, noise and power draw were tolerable for short runs, but not in continuous operation - and a single machine quickly hits its performance limit.

So the idea was born to put the idle hardware lying around to work and grow the pool piece by piece. It quickly turned out that these devices not only run quieter, but also deliver real throughput, both individually and in sum.

Out of that grew a distributed datagen cluster that produces around 60 million training positions in two to three days. How it is connected, hardened against heat and power loss, and spread across three operating systems: here are the workshop notes.

The hardware - and how it comes together

The mix is deliberately heterogeneous: several Android smartphones (via the Linux environment Termux), Linux laptops (Debian, some straight off a USB stick), a Windows control node and a Linux VPS. All devices are linked over SSH; a central control node starts the work, monitors it and collects the results. That way any device lying around becomes a compute node in minutes.
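A minimal sketch of how a control node could fan one command out to all nodes over SSH. The host names, the timeout values and the helper names `ssh_argv` and `run_on_all` are assumptions for illustration, not the project's actual scripts; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess
from typing import Dict, List

# Hypothetical node names; the real hosts are not given in this post.
NODES = ["phone-1", "laptop-1", "vps-1"]


def ssh_argv(host: str, remote_cmd: str) -> List[str]:
    """Build a non-interactive ssh command line for one node."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never stop to ask for a password
        "-o", "ConnectTimeout=10",  # give up quickly on unreachable nodes
        host,
        remote_cmd,
    ]


def run_on_all(remote_cmd: str, timeout_s: float = 60.0) -> Dict[str, int]:
    """Run the same command on every node and collect the exit codes."""
    results = {}
    for host in NODES:
        try:
            proc = subprocess.run(
                ssh_argv(host, remote_cmd),
                capture_output=True,
                timeout=timeout_s,
            )
            results[host] = proc.returncode
        except (subprocess.TimeoutExpired, OSError):
            results[host] = -1  # unreachable node or ssh not installed
    return results
```

Key-based login is assumed here, since `BatchMode=yes` makes ssh fail rather than prompt.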

Days on end: temperature, battery, cores

A run takes days, and that forgives no half-baked safeguards. Two things decide whether a node holds up: heat and power.

Temperature. A "governor" script continuously computes how many cores (threads) to use from the battery level and chip temperature, and restarts the datagen process when the count changes. Important lesson: you have to read all thermal zones, not just the surface sensors. Example: a Dimensity-9300+ smartphone throttles at 7 threads - 6 is the stable optimum, and with active cooling it runs for days.
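The governor logic can be sketched roughly like this. It is not the actual script: the thresholds, the one-thread step and the function names are assumptions, and the sysfs path is the usual Linux location for thermal zones, which normally report millidegrees Celsius (some devices may differ).

```python
from pathlib import Path
from typing import Optional


def read_max_temp_c(root: str = "/sys/class/thermal") -> Optional[float]:
    """Return the hottest reading across ALL thermal zones, in deg C."""
    temps = []
    for zone in Path(root).glob("thermal_zone*/temp"):
        try:
            # Assumption: the value is in millidegrees Celsius.
            temps.append(int(zone.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # some zones are unreadable or report garbage
    return max(temps) if temps else None


def choose_threads(temp_c: float, battery_pct: int, current: int,
                   max_threads: int = 6, hot_c: float = 75.0,
                   cool_c: float = 65.0, low_battery_pct: int = 30) -> int:
    """Pick the next thread count; all thresholds are illustrative."""
    if temp_c >= hot_c or battery_pct < low_battery_pct:
        return max(1, current - 1)   # back off one thread, never below 1
    if temp_c <= cool_c and current < max_threads:
        return current + 1           # cool enough: step back up
    return min(current, max_threads) # in the gap between cool_c and hot_c: hold
```

The gap between `cool_c` and `hot_c` is deliberate: without it the governor would flip between two thread counts and restart datagen constantly.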

Battery & power. Critical power-saving actions are disabled (no auto-shutdown on low battery), lid switches and sleep on the laptops are off, and the devices stay on the charger. Still needed: a battery watcher that stops devices which drain despite the charger - because four cores at full load draw more power than some chargers supply.
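The core check of such a battery watcher might look like the sketch below. The sample format, the 3-point threshold and the function name are assumptions; how the readings are obtained on each platform is left out.

```python
from typing import Sequence, Tuple


def drains_on_charger(samples: Sequence[Tuple[int, bool]],
                      min_drop_pct: int = 3) -> bool:
    """Detect a device that loses charge although it is plugged in.

    samples: (battery_percent, plugged_in) readings in time order.
    """
    if len(samples) < 2:
        return False  # not enough data to see a trend
    if not all(plugged for _, plugged in samples):
        return False  # unplugged drain is a different problem
    first, last = samples[0][0], samples[-1][0]
    return first - last >= min_drop_pct


# Usage: three readings taken a few minutes apart, always on the charger.
readings = [(80, True), (78, True), (76, True)]
if drains_on_charger(readings):
    print("charger cannot keep up: stop datagen on this node")
```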

The surprise: smartphones in the laptop class

The biggest aha moment: modern flagship smartphones deliver datagen throughput on par with older x86 laptops. The top performer in the cluster is the AVX-512 VPS - about 2.2× as fast as an ultrabook. And one laptop turned out to be limited purely by its power budget, not by temperature: a physical wall, not a defect.

Splitting work across three operating systems

The work breaks into three roles that fit the strengths of the devices:

  • Datagen - generate self-play games (many cores per device; ideal for the smartphones).
  • Rescore - re-evaluate and label the generated positions with a stronger "teacher", Stockfish (laptops + AVX-512 VPS).
  • Training - train the neural network (NNUE) on the GPU (dedicated machine; more on that below).

Pipeline: datagen generates positions, Stockfish rescores them as a teacher, bullet trains the net from them, and the net is embedded into the engine.

"Seti@home" for chess

So that nothing is computed twice, the rescoring runs like a classic distributed computing project: a master node holds a queue of work units. Each worker grabs a unit atomically (reservation via atomic move + lease), rescores it locally, pushes the result back and confirms it atomically (from "in progress" to "done"). If a device fails, its lease runs out and the unit is simply reassigned - no unit is lost, and finished work is not repeated.
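The reservation scheme can be sketched with plain directories and renames. This is an illustration under stated assumptions, not the project's code: the `todo` / `in_progress` / `done` layout, the one-hour lease and all function names are hypothetical, and the rename is only atomic when all three directories sit on the same filesystem.

```python
import os
import time
from pathlib import Path
from typing import Optional

LEASE_SECONDS = 3600  # assumption: a unit not finished within 1 h is reassigned


def claim(queue: Path, worker: str) -> Optional[Path]:
    """Reserve one work unit; returns its path in in_progress/ or None.

    Worker names must not contain a dot (used as the separator below).
    """
    for unit in sorted((queue / "todo").iterdir()):
        target = queue / "in_progress" / (unit.name + "." + worker)
        try:
            os.rename(unit, target)  # atomic move: exactly one worker wins
        except FileNotFoundError:
            continue                 # another worker was faster, try the next
        os.utime(target)             # the lease starts now (mtime = now)
        return target
    return None


def complete(queue: Path, claimed: Path) -> Path:
    """Confirm a unit: move it from in_progress/ to done/."""
    done = queue / "done" / claimed.name.rsplit(".", 1)[0]
    os.rename(claimed, done)
    return done


def requeue_expired(queue: Path, now: Optional[float] = None) -> int:
    """Put units with an expired lease back into todo/; returns the count."""
    now = time.time() if now is None else now
    count = 0
    for unit in (queue / "in_progress").iterdir():
        if now - unit.stat().st_mtime > LEASE_SECONDS:
            os.rename(unit, queue / "todo" / unit.name.rsplit(".", 1)[0])
            count += 1
    return count
```

A worker calls `claim`, rescores the unit, then `complete`; the master periodically calls `requeue_expired` so that units held by dead devices return to the queue.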

The training-data format helps: every record has the same fixed size, so files can simply be concatenated. You collect the partial files and append them - fault-tolerant and without a complicated database.
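Why a fixed record size makes merging robust can be shown in a few lines. The 32-byte record size and the function names are assumptions for this sketch; the idea is only that a file length divided by the record size tells you how many complete records exist, so a torn trailing record from an interrupted write can be skipped.

```python
import os
from typing import Iterable

RECORD_SIZE = 32  # assumption: bytes per position record


def whole_records(path: str) -> int:
    """Number of complete records in a file (a torn tail is ignored)."""
    return os.path.getsize(path) // RECORD_SIZE


def append_parts(parts: Iterable[str], out_path: str) -> int:
    """Append all complete records from each part file to out_path."""
    total = 0
    with open(out_path, "ab") as out:
        for part in parts:
            remaining = whole_records(part) * RECORD_SIZE
            copied = 0
            with open(part, "rb") as src:
                while remaining > 0:
                    chunk = src.read(min(remaining, 1 << 20))  # up to 1 MiB
                    if not chunk:
                        break  # file shrank while reading; stop early
                    out.write(chunk)
                    remaining -= len(chunk)
                    copied += len(chunk)
            total += copied // RECORD_SIZE
    return total
```

The same division also gives a resume point: after an interruption, the number of whole records already written is the index of the next record to produce.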

GPU training: from corpus to net

The positions gathered across the cluster and re-evaluated by the Stockfish teacher - about 200 million per training candidate - flow into the open-source trainer bullet (GPU/CUDA). Out comes a neural evaluation net with the architecture kb4×768 (king-bucketed, 768-wide hidden layer); evaluating a position takes under a microsecond on the CPU.

The recipe in numbers: around 120 epochs (each a full pass over the ~200M positions), a stepped learning-rate schedule with warmup and a final drop, a constant win/draw/loss blend factor - in total about 24 billion training samples per net.
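The sample count follows directly from the two figures above, and the shape of the schedule can be sketched as a small function. Only the 120 epochs and ~200M positions come from the recipe; the base learning rate, warmup length, step size, decay factor and final-drop values are placeholders chosen for illustration, and this is plain Python rather than bullet's own configuration.

```python
EPOCHS = 120
POSITIONS = 200_000_000
TOTAL_SAMPLES = EPOCHS * POSITIONS  # 24_000_000_000, i.e. about 24 billion


def lr_at(epoch: int, base_lr: float = 1e-3, warmup: int = 5,
          step: int = 40, gamma: float = 0.3,
          final_epochs: int = 10, final_factor: float = 0.1) -> float:
    """Learning rate for a 0-based epoch; all defaults are hypothetical."""
    if epoch < warmup:
        # Linear warmup: ramp from base_lr / warmup up to base_lr.
        return base_lr * (epoch + 1) / warmup
    # Stepped decay: multiply by gamma every `step` epochs after warmup.
    lr = base_lr * gamma ** ((epoch - warmup) // step)
    if epoch >= EPOCHS - final_epochs:
        lr *= final_factor  # final drop for the last few epochs
    return lr


# Usage: print the schedule at a few points of the run.
for e in (0, 4, 5, 45, 119):
    print(e, lr_at(e))
```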

Multi-seed against self-deception. Same dataset, several random seeds - this makes the natural random variance visible. What's selected is not the prettiest training curve, but the best playing strength, confirmed by a 400-game SPRT against the live baseline. A "lucky seed" that happens to look good doesn't survive the SPRT.
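For readers unfamiliar with the SPRT, here is a rough sketch of the decision rule, using the common normal approximation of the log-likelihood ratio over win/draw/loss counts. The Elo bounds (0 and 5), the error rates and the function names are assumptions; the actual test settings and tooling are not described here.

```python
import math
from typing import Tuple


def expected_score(elo: float) -> float:
    """Expected score per game for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt(wins: int, draws: int, losses: int,
         elo0: float = 0.0, elo1: float = 5.0,
         alpha: float = 0.05, beta: float = 0.05) -> Tuple[float, str]:
    """Return (LLR, decision) for H0: elo = elo0 versus H1: elo = elo1."""
    n = wins + draws + losses
    if n == 0:
        return 0.0, "continue"
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean  # per-game score variance
    if var <= 0.0:
        return 0.0, "continue"  # all games identical: no usable estimate yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    # Normal approximation of the log-likelihood ratio.
    llr = n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return llr, "accept H1"  # candidate is stronger
    if llr <= lower:
        return llr, "accept H0"  # no gain over the baseline
    return llr, "continue"
```

A candidate that only looks good in training ends up on the "accept H0" side here, whatever its loss curve says.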

Why the training metric can lie. The validation loss (reference ~0.0077) checks for overfitting and reproducibility - but a lower value does not mean "stronger". On a test set skewed by opening coverage, the signal can even run the other way. That's why in the end only the Elo SPRT decides.

Two stories from the workshop on this:

  • "Val loss lies." Two training runs showed a better validation loss than the baseline - and still lost clearly in the SPRT. The test set was weighted by openings, not by the search frontier; a net that "memorizes" this set loses the tactical middlegame. Only the board decides.
  • "Reboot at 2:42." A nightly Windows-update restart hit in the middle of an hours-long evaluation run. The data format saved it: records have a fixed size and progress is preserved, so the run resumed exactly at the interrupted spot, with zero data loss.

Pitfalls across OS boundaries

Three operating systems mean three kinds of surprise. The ones that cost us time:

  • Process detection on Android/Termux returns false positives - solution: exact match of the program name plus a check via file growth instead of the process list.
  • Android kills Termux sessions after about two hours - an auto-restart monitor from a stable node keeps them alive.
  • Never put program files on volatile RAM disks - after a reboot they're gone, and a crash loop ensues.
  • Character sets: a Windows script read UTF-8 wrong and aborted silently on a special character - since then, strict ASCII discipline in the control scripts.
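The file-growth check from the first pitfall can be sketched as follows. The function name, the 30-second window and the injectable `sleep` parameter are assumptions; the point is that a growing output file is direct evidence of a working datagen process, whereas a process-list match is not.

```python
import os
import time
from typing import Callable


def output_is_growing(path: str, wait_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """True if the file at `path` got larger within `wait_s` seconds."""
    try:
        before = os.path.getsize(path)
    except OSError:
        return False  # no output file yet: treat the node as not running
    sleep(wait_s)
    try:
        after = os.path.getsize(path)
    except OSError:
        return False  # file vanished in between
    return after > before
```

The window has to be longer than the slowest node needs to write one batch of positions, otherwise a healthy but slow device is reported as dead.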

The nicest bug to close with: a smartphone Wi-Fi with "client isolation" enabled made the devices invisible to each other - they ran flawlessly but were unreachable. Solution: a dedicated Wi-Fi network just for datagen.

What remains

Serious machine learning doesn't necessarily need a data center - it needs discipline about heat, power and fault tolerance. The rest is orchestration. More on the big picture on the project page; the engine plays live as @clrsrc_lc0, and the source code is on GitHub.

