How Tallwatch stops false alerts

A monitor can be wrong in exactly two ways. It can wake you for an outage that isn't happening, or it can stay silent through one that is. This piece is about the first kind, which is both the more common and the more corrosive, because every false page quietly lowers your trust in the next real one.

The fix is not a cleverer algorithm guessing harder from the same thin evidence. It is more evidence.

One check is one opinion, formed in one place

When a single machine fails to reach your site once, you have one data point: one network, one moment, one route. It might mean your service is down. It might mean that machine's upstream had a bad thirty seconds. Nothing in that single failure tells you which, and a tool that pages on it is just forwarding you its own uncertainty at full volume.

The August 2025 Cloudflare incident is the cleanest example I know. For a few hours, the links between Cloudflare and AWS's us-east-1 region were badly congested. If your origin lived in us-east-1, some paths to it looked dead. Cloudflare went out of its way to say this was regional and that its global network was fine. Read that carefully: whether a monitor "saw an outage" that afternoon came down to where it happened to be standing. A single vantage point does not get to call an outage, because a single vantage point cannot tell its own bad day apart from yours.

Make the regions vote

Tallwatch checks each target from several regions and counts each result as a vote. An incident opens only when a configurable majority agree the target failed inside a short window.

One region seeing a failure is a rumor. Several regions seeing the same failure in the same short window is a fact. Tallwatch pages on facts. That is the whole trick, and most of its value is in how unexciting it is: there is no model to mistune and no anomaly score to second-guess, just a quorum that a blip confined to one vantage point cannot reach on its own.
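To make the voting rule concrete, here is a minimal sketch in Python. It is not Tallwatch's actual code: the names `Vote`, `should_open_incident`, `window_s`, `quorum`, and `min_voters` are all made up for illustration, and the 60-second window and strict-majority threshold are assumed defaults, not documented ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One region's latest opinion about one target (hypothetical shape)."""
    region: str
    failed: bool
    at: float  # timestamp of the check, in seconds


def should_open_incident(votes, now, window_s=60.0, quorum=0.5, min_voters=3):
    """Return True only if a strict majority of recent voters saw a failure.

    Only each region's most recent vote inside the window counts, so a
    region cannot outvote the others by checking more often.
    """
    latest = {}
    for v in votes:
        age = now - v.at
        if 0 <= age <= window_s:
            prev = latest.get(v.region)
            if prev is None or v.at > prev.at:
                latest[v.region] = v

    # Too few opinions is not a quorum, however unanimous they are.
    if len(latest) < min_voters:
        return False

    failures = sum(1 for v in latest.values() if v.failed)
    return failures / len(latest) > quorum
```

With three regions voting, one failure is a third of the vote and stays a rumor; two failures clear the threshold. The `min_voters` guard is a design choice in this sketch: without it, a lone region reporting in the window would count as a unanimous majority, which is exactly the single-vantage-point problem the vote exists to avoid.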

The part that is easy to get wrong

Naive voting has a failure mode, and it is the interesting one.

Sometimes a whole region degrades and starts failing checks for thousands of unrelated sites at once. Count those votes and one cloud region having a rough hour pages half your customers for outages that were never theirs. Several of 2025's loudest incidents had exactly this shape: a regional problem that, from the wrong angle, looked like everyone's problem.

So Tallwatch watches the regions themselves. When one starts failing across many unrelated targets, its vote is set aside until it recovers, and the call is left to the regions that are demonstrably healthy. This is the line between checking from many places and deciding from many places. Twenty probes are worthless if a single bad region can drag them all down together. The defense is to notice that the region is the thing that broke, and stop listening to it.
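The idea of setting a sick region's vote aside can be sketched the same way. Again, this is an illustration under assumptions, not the product's implementation: `healthy_regions`, `decide`, and the 30% `max_failure_ratio` cutoff are hypothetical, and a real system would presumably track region health over time rather than from a single snapshot.

```python
def healthy_regions(failures_by_region, targets_checked, max_failure_ratio=0.3):
    """Return the regions whose failures look like the targets' fault.

    failures_by_region maps a region name to the set of target ids that
    failed from that region in the current window. A region failing a
    large share of unrelated targets is treated as the thing that broke.
    """
    if targets_checked <= 0:
        return set()
    return {
        region
        for region, failed in failures_by_region.items()
        if len(failed) / targets_checked <= max_failure_ratio
    }


def decide(target, failures_by_region, targets_checked, quorum=0.5, min_voters=2):
    """Vote on one target using only the regions that pass the health check."""
    voters = healthy_regions(failures_by_region, targets_checked)
    if len(voters) < min_voters:
        return False  # not enough trustworthy opinions to call it
    failing = sum(1 for region in voters if target in failures_by_region[region])
    return failing / len(voters) > quorum
```

If one region is failing 60 of 100 targets, it is dropped from the electorate, and a site that only it considers down stays quiet. A site that the remaining healthy regions also see failing still opens an incident, so quarantining a region does not hide real outages.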

What it costs, and what it buys

It costs a little speed. Requiring agreement means Tallwatch will not page you on the very first failed check the way a single-probe tool will, because the first failed check is the one most likely to be noise. In exchange, the pages you do get are ones you can act on without first wondering, "Is it really down, or is it the monitor again?"

That trade is the entire product. A pager is only useful if you believe it, and belief is built one true alert at a time and demolished one false one at a time. Tallwatch is an argument that you should spend your evidence on being right rather than on being first.

