How Tallwatch stops false alerts

Most teams say the majority of their alerts are false positives. Here is the thinking that fixes that, in plain terms, without the database internals.

Nabin Khair · Founder

Here is a number that should bother anyone who runs on call. In a 2025 survey of DevOps and SRE teams, 67% of engineers said they had ignored or dismissed an alert without investigating it. Not because they were careless. Because most of the alerts were not worth investigating: in the same research, 85% of teams reported that the majority of their alerts were false positives.

Read that back. The majority. At most companies, the default state of on-call is an alarm that is usually wrong.

That is the problem Tallwatch is built to solve, and the fix turns out to be structural rather than clever.

One check is one opinion

When a single machine, on a single network, fails to reach your site once, you have exactly one data point, formed at one moment, from one place. It might mean your service is down. It might mean that machine's upstream had a bad route for thirty seconds. From that one failure you genuinely cannot tell which.

A tool built on single-location checks has to guess. Alert on every failed check and you get noise. Wait and re-check before alerting and you get the bad kind of latency, where a real outage sits unreported while the tool second-guesses itself. Neither is what you want at 2am.

The August 2025 Cloudflare incident is a clean illustration. For a few hours, traffic between Cloudflare and AWS's us-east-1 region got badly congested. If your origin lived in us-east-1, some paths to you looked broken. Cloudflare was careful to point out that this was a regional problem and that its global network was not affected. Put plainly, whether a monitor "saw an outage" that afternoon depended entirely on where it happened to be checking from. One vantage point does not get to make that call.

Make the regions vote

Tallwatch checks each target from several regions and treats each result as a vote. An incident opens only when a configurable majority of those regions report, within the same short window, that the target failed.
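
For readers who think in code, here is a minimal sketch of that rule. It is an illustration, not Tallwatch's implementation: the `CheckResult` type, the `should_open_incident` function, the 60-second window, and the strict-majority default are all assumptions made for the example. It also assumes the vote is counted over regions that reported inside the window, not over every configured region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    region: str       # hypothetical region label, e.g. "eu-west"
    ok: bool          # True if the check reached the target
    timestamp: float  # seconds since epoch when the check ran


def should_open_incident(results, now, window_seconds=60.0, quorum=0.5):
    """Return True when more than `quorum` of the regions that reported
    inside the window saw a failure. All defaults are illustrative."""
    # Keep only the most recent result per region inside the window.
    latest = {}
    for r in results:
        if now - r.timestamp > window_seconds:
            continue  # stale vote, ignore it
        prev = latest.get(r.region)
        if prev is None or r.timestamp > prev.timestamp:
            latest[r.region] = r
    if not latest:
        return False  # no fresh votes, so no decision
    failures = sum(1 for r in latest.values() if not r.ok)
    return failures / len(latest) > quorum


now = 1_000.0
votes = [
    CheckResult("us-east", ok=False, timestamp=990.0),
    CheckResult("eu-west", ok=False, timestamp=995.0),
    CheckResult("ap-south", ok=True, timestamp=998.0),
]
print(should_open_incident(votes, now))  # True: 2 of 3 fresh votes failed
```

With only one of the three regions failing, the same call returns False, which is the one-region blip that never reaches a phone.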

Quorum is the entire idea. One region seeing a failure is a rumor. A majority of independent regions seeing the same failure at the same moment is a fact. We page on facts. The thresholds are tunable, but the shape never changes: enough independent places have to agree before anyone's phone goes off.
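
A rough calculation shows why the word "independent" carries so much weight. The sketch below assumes each region has the same chance of reporting a failure that is not real, and that regions fail independently of each other. The 1% figure, the five regions, and the `false_quorum_probability` helper are all made up for illustration, not measured from any real system.

```python
from math import comb


def false_quorum_probability(n_regions, votes_needed, p_false):
    """Probability that at least `votes_needed` of `n_regions` independent
    regions report a failure that is not real (binomial tail)."""
    return sum(
        comb(n_regions, k) * p_false**k * (1 - p_false) ** (n_regions - k)
        for k in range(votes_needed, n_regions + 1)
    )


# Illustrative numbers only: p_false = 0.01 is an assumption, not a measurement.
single = false_quorum_probability(1, 1, 0.01)
quorum = false_quorum_probability(5, 3, 0.01)
print(f"one region alone: {single:.6f}")
print(f"3 of 5 regions:   {quorum:.8f}")
```

Under those assumptions, a single region raises a false alarm about one check in a hundred, while three of five regions agreeing on a false failure happens roughly once in a hundred thousand. The arithmetic only holds while the regions really are independent, and correlated failures break it.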

Regions have bad days too

There is a subtler trap that a naive vote walks straight into. Sometimes a whole region degrades and starts failing checks for many unrelated sites at once. Count those votes blindly and one cloud region having a rough hour pages half your customers for outages that were never theirs. Several of 2025's loudest incidents were exactly this shape: a regional problem that looked like everyone's outage from the wrong angle.

So when a region starts failing across a lot of unrelated targets, Tallwatch sets its votes aside until it recovers, and the decision stays with the regions that are actually healthy. This is the difference between checking from many places and deciding from many places. Only the second one keeps someone else's bad afternoon off your pager.
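
One way to picture that step, again as a sketch under stated assumptions and not Tallwatch's code: the `degraded_regions` and `target_is_down` functions, the 50% failing-share threshold, and the three-target minimum are hypothetical choices for the example. Each region's recent results are summarized as a map from target to pass or fail.

```python
def degraded_regions(recent, max_failing_share=0.5, min_targets=3):
    """`recent` maps region -> {target: ok}. A region counts as degraded when
    more than `max_failing_share` of the targets it checked are failing."""
    degraded = set()
    for region, by_target in recent.items():
        if len(by_target) < min_targets:
            continue  # too few targets to judge the region itself
        failing = sum(1 for ok in by_target.values() if not ok)
        if failing / len(by_target) > max_failing_share:
            degraded.add(region)
    return degraded


def target_is_down(recent, target, quorum=0.5):
    """Vote on one target using only regions not flagged as degraded."""
    bad = degraded_regions(recent)
    votes = [
        by_target[target]
        for region, by_target in recent.items()
        if region not in bad and target in by_target
    ]
    if not votes:
        return False  # nobody healthy left to vote; this sketch stays quiet
    failures = sum(1 for ok in votes if not ok)
    return failures / len(votes) > quorum


recent = {
    "us-east": {"a": False, "b": False, "c": False, "d": False},
    "eu-west": {"a": True, "b": True, "c": True, "d": True},
    "ap-south": {"a": True, "b": True, "c": True, "d": True},
}
print(degraded_regions(recent))     # {'us-east'}
print(target_is_down(recent, "a"))  # False: the two healthy regions see "a" up
```

The sketch has an obvious gap: if every region is flagged at once, nobody is left to vote and it reports nothing. A production system would need a deliberate policy for that case.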

What it adds up to

You never see any of this when you use the product. You see the result. A page arrives when the outage is real. A one-region blip stays invisible. A bad hour for a single cloud region does not become a wall of false incidents on your phone.

The goal is narrow and specific. When your phone buzzes, the right reaction should be to reach for the laptop, not to roll over and assume it is nothing. Earning that reflex back is worth a great deal of careful, invisible work, and most of the engineering in Tallwatch is in service of that one sentence.