Operations & monitoring

Caught and contained

2026-06-15 · ~3 min read

After we rolled out a major engine improvement, our Lichess bot @clrsrc_lc0 suddenly stopped playing: every attempt to challenge an opponent bounced back with "Too many requests". A short workshop note on attentive monitoring, a look into open source, and a three-layer fix.

The symptom

Our monitoring fired: the bot was online but idle. A glance at the log showed the pattern at once – challenges rejected minute by minute, one after another.

The chain of causes

The investigation didn't surface a single cause, but a whole chain:

The restart had wiped the in-memory list of opponents that had already hit their daily game limit.
Mornings are when a particularly large number of bots are maxed out – so our bot re-challenged exactly those already-blocked opponents in droves.
The sheer volume of these requests eventually tripped the platform's account-wide request limit.
And because the bot kept knocking at a fixed interval, it was keeping the lock alive on its own – a limit of this kind only recovers if you leave it alone.
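
To make the last link of the chain concrete, here is a toy model in Python. It is a simplification for illustration only, not Lichess's actual limiter: we assume a lock with a fixed cooldown that restarts whenever a request arrives while the lock is active. The function name, the 60-second cooldown and the 30-second interval are all made up for the example.

```python
def first_accepted(delays, cooldown=60.0):
    """Return the time of the first accepted request, or None if all are rejected.

    Toy model (an assumption, not the platform's real rules): the lock starts
    at t=0 and lasts `cooldown` seconds. Every request that arrives while the
    lock is active is rejected and restarts the cooldown.
    """
    locked_until = cooldown
    t = 0.0
    for delay in delays:
        t += delay
        if t >= locked_until:
            return t  # the lock had expired, so this request goes through
        locked_until = t + cooldown  # rejected: the lock is extended
    return None


# Knocking every 30 s against a 60 s cooldown never lets the lock expire.
fixed = [30.0] * 20
# Doubling the wait after each rejection eventually outlasts the cooldown.
escalating = [30.0 * 2 ** i for i in range(20)]

print(first_accepted(fixed))       # None
print(first_accepted(escalating))  # 90.0
```

In this model a fixed interval shorter than the cooldown stays locked out indefinitely, while any schedule whose waits grow past the cooldown gets through again.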

We confirmed the mechanism by reading Lichess's open-source server code – one of the nice perks when the other side is open source.

The fix, in three layers

Persistent block list. The list of maxed-out opponents now survives a restart – so there's no request surge right after startup.
Escalating backoff. Instead of knocking at a rigid interval, the bot now waits progressively longer after rejections – so a single lock can no longer perpetuate itself.
Immediate pause. As a direct countermeasure, we imposed a short full pause so that the limit could recover in the first place.
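
The first two layers can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the bot's actual code: the class names `BlockList` and `Backoff`, the JSON file format, the 24-hour expiry and the 60-second base delay are all hypothetical.

```python
import json
import time
from pathlib import Path


class BlockList:
    """Maxed-out opponents, persisted as JSON so a restart does not forget them."""

    def __init__(self, path):
        self.path = Path(path)
        self.expiry = {}  # opponent name -> unix time at which the entry expires
        if self.path.exists():
            self.expiry = json.loads(self.path.read_text())

    def block(self, name, ttl_seconds=24 * 3600, now=None):
        # The 24 h default is an assumption; adjust to the platform's real reset time.
        now = time.time() if now is None else now
        self.expiry[name] = now + ttl_seconds
        self.path.write_text(json.dumps(self.expiry))

    def is_blocked(self, name, now=None):
        now = time.time() if now is None else now
        return self.expiry.get(name, 0) > now


class Backoff:
    """Wait progressively longer after each consecutive rejection."""

    def __init__(self, base=60.0, factor=2.0, cap=3600.0):
        self.base, self.factor, self.cap = base, factor, cap
        self.failures = 0

    def on_rejected(self):
        # Delay grows geometrically with the number of rejections, up to `cap`.
        delay = min(self.cap, self.base * self.factor ** self.failures)
        self.failures += 1
        return delay

    def on_accepted(self):
        # A successful challenge resets the schedule.
        self.failures = 0
```

A challenge loop would skip any opponent for which `is_blocked` returns true, sleep for the value returned by `on_rejected` after a rejection, and call `on_accepted` once a challenge goes through. Because the block list is read from disk in the constructor, a freshly restarted bot starts with the same knowledge it had before the restart.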

On top of that, we documented the platform's rate-limit rules cleanly – so the next rollout respects them from the start.

Caught and contained – through attentive monitoring, a look into open source and a multi-layer fix. You can watch the bot live on the Live page; the source code is on GitHub.
