Boring on purpose: building a monitor you can trust

2025 was a humbling year for infrastructure. A major cloud region went dark for the better part of a day over a DNS failure and dragged a chunk of the consumer internet down with it. A large CDN had two separate global wobbles inside a single month. If the biggest, best-staffed platforms on earth can have a day like that, the rest of us should be honest about the odds.

That is the uncomfortable backdrop to building a monitoring product. The thing that is supposed to notice when your stuff breaks cannot itself break quietly. A monitor that goes down at the same moment your site does is worse than no monitor, because you were counting on it. So the guiding instinct for Tallwatch is not to be clever. It is to be boring in exactly the places that matter.

Here is what boring means in practice.

Keep the moving parts few

Distributed systems get hard fastest at the seams, the spots where separate systems have to agree. Every seam is one more way for things to disagree at 3am. So we keep the count of moving parts deliberately low and lean on one well-understood system of record instead of stitching five fashionable ones together because the architecture diagram looked impressive.

Fewer parts is not the exciting choice. It is the one that lets you actually reason about what happens when a piece falls over, which is the only moment the reasoning matters.

Assume every step gets interrupted

Networks drop packets. Processes restart. Machines vanish mid-task. The only safe assumption is that any step can be cut off halfway and then run again, so we build each one to produce the same result whether it runs once or three times. A retry after a hiccup must never turn into a duplicate page or a half-written incident.

When every step is safe to repeat, recovery stops being a special emergency mode. It is just the system doing its ordinary thing after a stumble.
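To make "safe to repeat" concrete, here is a minimal sketch of one common way to get there: give each step a deterministic key and refuse to redo work already recorded under it. This is an illustration, not Tallwatch's actual code. The `PageLog` class, the `send_page_once` method, and the key format are all hypothetical, and the in-memory dictionary stands in for a durable store with a uniqueness constraint.

```python
class PageLog:
    """In-memory stand-in for a durable store with a unique-key constraint (assumption)."""

    def __init__(self) -> None:
        self._sent: dict[str, str] = {}

    def send_page_once(self, incident_id: str, step: str, message: str) -> bool:
        """Record a page under a deterministic key; return True only the first time."""
        key = f"{incident_id}:{step}"
        if key in self._sent:
            # A retry of a step that already completed: do nothing.
            return False
        self._sent[key] = message
        return True

    def pages(self) -> list[str]:
        return list(self._sent.values())


log = PageLog()
# The same step runs three times, as it might after two interrupted attempts.
results = [log.send_page_once("inc-42", "first-page", "api is down") for _ in range(3)]
print(results)
print(log.pages())
```

Run once or three times, the log ends up holding exactly one page. The retries are harmless because the key is derived from the work itself, not from the attempt.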

Build workers you can kill without flinching

The pieces that do the work are meant to be stopped and restarted without ceremony. If one falls over halfway through a task, another picks it up, and nothing is lost in the gap. That property is what lets us deploy in the middle of the afternoon, scale up under load, and come back from a crash without holding our breath. A monitoring tool you are afraid to restart is a monitoring tool you do not really trust.
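One familiar way to get workers like that is a lease: a worker claims a task for a limited time, and if it dies before finishing, the lease lapses and someone else can claim it. The sketch below shows the idea under stated assumptions. `LeaseQueue`, `Task`, and the 30-second lease are hypothetical names and numbers chosen for illustration, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    done: bool = False
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0


class LeaseQueue:
    """Hypothetical in-memory queue; a real one would live in the system of record."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self.tasks: dict[str, Task] = {}

    def add(self, task_id: str) -> None:
        self.tasks.setdefault(task_id, Task(task_id))

    def claim(self, worker: str, now: float) -> Optional[Task]:
        """Hand out the first unfinished task whose lease is free or has expired."""
        for task in self.tasks.values():
            if not task.done and task.lease_expires_at <= now:
                task.lease_owner = worker
                task.lease_expires_at = now + self.lease_seconds
                return task
        return None

    def complete(self, task_id: str, worker: str, now: float) -> bool:
        """Only the current, unexpired lease holder may mark a task done."""
        task = self.tasks[task_id]
        if task.lease_owner != worker or task.lease_expires_at <= now:
            return False
        task.done = True
        return True


queue = LeaseQueue(lease_seconds=30)
queue.add("check-homepage")
first = queue.claim("worker-a", now=0)      # worker-a takes the task, then is killed
blocked = queue.claim("worker-b", now=10)   # lease still held, nothing to claim
second = queue.claim("worker-b", now=31)    # lease expired, worker-b picks it up
finished = queue.complete("check-homepage", "worker-b", now=35)
```

Nobody has to notice that worker-a died. The task simply becomes claimable again once the lease runs out, which is why killing a worker costs at most one lease period of delay.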

Let the probes reach in, not out

The checks run from regions out at the edge, and they reach in to ask for work rather than needing anything to reach out and find them. Because every connection starts from the probe, nothing on the probe side has to accept an inbound connection. That keeps the setup friendly to firewalls and removes an entire category of "could not connect to the checker" failures before it can exist.
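The shape of a pull-based probe can be sketched in a few lines. Everything here is an assumption made for illustration: `ControlPlane`, `fetch_work`, `report`, and `run_probe` are hypothetical names, and the control plane is an in-process object where a real deployment would put a network call that the probe initiates.

```python
from collections import deque
from typing import Callable, Optional


class ControlPlane:
    """Stand-in for the central service the probes call into (assumption)."""

    def __init__(self, targets: list[str]) -> None:
        self._pending = deque(targets)
        self.results: dict[str, bool] = {}

    def fetch_work(self, region: str) -> Optional[str]:
        # This sketch hands any pending target to whichever region asks first.
        return self._pending.popleft() if self._pending else None

    def report(self, region: str, target: str, healthy: bool) -> None:
        self.results[target] = healthy


def run_probe(region: str, plane: ControlPlane, check: Callable[[str], bool]) -> int:
    """Pull work until none is left; every call is initiated by the probe."""
    handled = 0
    while (target := plane.fetch_work(region)) is not None:
        plane.report(region, target, check(target))
        handled += 1
    return handled


plane = ControlPlane(["https://example.com", "https://example.org"])
# A fake check keeps the sketch offline; a real probe would make the request here.
handled = run_probe("eu-west", plane, check=lambda url: url.endswith(".com"))
```

Note that `ControlPlane` never holds a reference to a probe or an address for one. It only answers when asked, so there is no path by which it could fail to reach a checker.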

Boring is the compliment

You will never see any of this, and that is the point. You should not have to wonder whether the monitor is healthy, any more than you wonder about the smoke detector on the ceiling until the battery chirps. The promise under every Tallwatch alert is a quiet, unglamorous one: the thing watching your systems was built to be the steadiest part of them. In this corner of software, boring is the highest praise I have.
