Caught and contained
After rolling out a major engine improvement, our Lichess bot @clrsrc_lc0 suddenly stopped playing: every attempt to challenge an opponent bounced back with "Too many requests". This is a short workshop note on attentive monitoring, a look into open-source code and a three-layer fix.
The symptom
Our monitoring fired: the bot was online but idle. A glance at the log showed the pattern at once – challenges rejected minute by minute, one after another.
The chain of causes
The investigation surfaced not a single cause but a whole chain:
- The restart had wiped the in-memory list of opponents that had already hit their daily game limit.
- In the mornings, an especially large number of bots are maxed out – so our bot re-challenged exactly those already-blocked opponents in droves.
- The sheer volume of these requests eventually tripped the platform's account-wide request limit.
- And because the bot kept knocking at a fixed interval, it kept the lock alive all by itself – a limit of this kind only recovers if you leave it alone.
We confirmed the mechanism by reading Lichess's open-source server code – one of the nice perks when the other side is open source.
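The last link in the chain can be illustrated with a toy model. This is a minimal sketch and not Lichess's actual limiter: it assumes a locked limiter accepts a request only after a quiet period of `cooldown` seconds, and that every earlier request restarts that period. The function name and all numbers are made up for illustration.

```python
def attempts_until_accepted(waits, cooldown):
    """Count rejected attempts before one is accepted; None if none ever is.

    Toy rule (an assumption, not the platform's real logic): a locked
    limiter accepts a request only if at least `cooldown` seconds have
    passed since the previous request. Any earlier request is rejected
    and restarts the quiet period.
    """
    rejected = 0
    for wait in waits:
        if wait >= cooldown:
            return rejected
        rejected += 1
    return None


fixed = [60] * 1000                          # knock every 60 s, again and again
escalating = [60 * 2**n for n in range(10)]  # 60 s, 120 s, 240 s, ...

print(attempts_until_accepted(fixed, cooldown=600))       # prints None
print(attempts_until_accepted(escalating, cooldown=600))  # prints 4
```

Under these assumptions the fixed interval never gets through, while the doubling waits exceed the quiet period after four rejections.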
The fix, in three layers
- Persistent block list. The list of maxed-out opponents now survives a restart – so there's no request surge right after startup.
- Escalating backoff. Instead of knocking at a rigid interval, the bot now waits progressively longer after rejections – so a single lock can no longer perpetuate itself.
- Immediate pause. As a direct countermeasure, we paused the bot completely for a short time so the limit could recover at all.
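The first two layers could look roughly like the sketch below. It is not the bot's actual code: the class names, the JSON file format, and the default of 60 seconds doubling up to one hour are assumptions chosen for illustration.

```python
import json
import time
from pathlib import Path


class PersistentBlockList:
    """Maps opponent name -> Unix time until which it should be skipped."""

    def __init__(self, path):
        self.path = Path(path)
        try:
            self.until = json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            self.until = {}  # first start or damaged file: begin empty

    def block(self, name, seconds, now=None):
        now = time.time() if now is None else now
        self.until[name] = now + seconds
        self.path.write_text(json.dumps(self.until))  # save on every change

    def is_blocked(self, name, now=None):
        now = time.time() if now is None else now
        return self.until.get(name, 0) > now


class Backoff:
    """Doubles the wait after each rejection, up to a cap; resets on success."""

    def __init__(self, base=60, cap=3600):
        self.base, self.cap, self.failures = base, cap, 0

    def on_rejected(self):
        wait = min(self.cap, self.base * 2**self.failures)
        self.failures += 1
        return wait  # seconds to sleep before the next attempt

    def on_accepted(self):
        self.failures = 0
```

In this sketch, a freshly started process that constructs `PersistentBlockList` on the same file sees the entries written before the restart, so it can skip maxed-out opponents from the first challenge on. `Backoff.on_rejected()` returns 60, 120, 240, ... seconds until the cap is reached.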
On top of that, we properly documented the platform's rate-limit rules – so the next rollout respects them from the start.
Caught and contained – through attentive monitoring, a look into open source and a multi-layer fix. You can watch the bot live on the Live page; the source code is on GitHub.