
How to measure false positives in fraud systems that hide them

In this blog series, Chen Zamir takes on an (almost) existential fraud problem: how to reduce false positives. In part 1, he covers five practical ways fraud teams quantify false declines when the data doesn’t volunteer answers.

Everyone agrees false positives are bad, but almost no one can quantify them. 

Ask a fraud leader for last month’s chargeback count and you’ll get a precise number. Ask how many good customers were blocked by mistake, and you’ll get a shrug, or worse, ‘None.’

It’s not because people don’t care. It’s because the way we build fraud systems makes false positives invisible by design.

If you approve a transaction, the world gives you feedback. If it was fraud, you’ll hear about it. If it was good, you’ll see the customer come back, spend more, log in again, do normal customer things.

But if you block something, that signal goes dark.

Best case: a customer complains. But even then, that complaint will rarely make its way back into your data. It usually dies in a silo, buried in a support ticket, stuck in an inbox, or fading into the memory of an angry phone call. 

It almost never becomes a structured label attached to a specific event, the kind that feeds your model-training pipelines or powers your dashboards.

So, before we talk about reducing false positives, we have to talk about labeling them.

Part 1 in this series tackles this paradox: How do you measure an error that the system is designed to hide? 

There is no silver bullet. Measuring false positives is always an exercise in triangulation. You must combine several imperfect methods until the picture becomes sharp enough for decision-making.

Let’s get down to it.

5 ways to measure false positives in fraud systems

  • Simulation: Backtest rules on historical data to identify false declines.
  • Manual review: Audit declined transactions to spot mistakes.
  • Behavioral linking: Track blocked users who return with good activity.
  • Control groups: Allow a test group through to measure what you're missing.
  • User feedback: Survey users about declined transactions.

Method 1 - Simulating your rules and workflows

Teams naturally gravitate toward simulation because it’s easy: you simply backtest your logic against historical data.

Take a rule, or a ruleset, or a workflow, and run it against the last 6 months of traffic. Look at all the events it would have declined, then cross-reference them against their final status (fraud or good).

Crucially, don’t forget to account for fraud maturation. Since chargebacks take time to materialize, recent data is effectively ‘unlabeled’. A 30-day buffer is the minimum; a 90-day window ensures your ‘clean’ users are actually clean. 

Any transaction labeled 'clean' that your rule would have blocked is a theoretical false positive. 

The appeal here is accessibility: you can run this in SQL or your fraud platform without deploying any new infrastructure.
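To make the mechanics concrete, here is a minimal Python sketch of such a backtest, with the maturation buffer applied before counting theoretical false positives. The event fields (`amount`, `new_device`, `label`) and the rule itself are illustrative assumptions, not a real schema:

```python
from datetime import datetime, timedelta

# Hypothetical event records; in practice these come from your warehouse.
events = [
    {"id": "t1", "ts": datetime(2024, 1, 5), "amount": 900, "new_device": True,  "label": "fraud"},
    {"id": "t2", "ts": datetime(2024, 1, 9), "amount": 950, "new_device": True,  "label": "clean"},
    {"id": "t3", "ts": datetime(2024, 4, 2), "amount": 120, "new_device": False, "label": "clean"},
]

def candidate_rule(e):
    """The rule under test: decline large amounts from new devices."""
    return e["amount"] > 500 and e["new_device"]

def backtest(events, now, maturation_days=90):
    """Replay the rule on matured history; count theoretical false positives."""
    cutoff = now - timedelta(days=maturation_days)
    matured = [e for e in events if e["ts"] <= cutoff]  # drop still-'unlabeled' recent traffic
    would_decline = [e for e in matured if candidate_rule(e)]
    false_positives = [e for e in would_decline if e["label"] == "clean"]
    return len(would_decline), len(false_positives)

declines, fps = backtest(events, now=datetime(2024, 7, 1))
print(declines, fps)  # -> 2 1 (two matured declines; t2 was actually clean)
```

The same logic translates directly to SQL if that is where your event data lives.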

But simulation comes with important limitations. You are testing one rule in isolation, but in production, that rule lives inside a complex 'decision stack' with different rules, model thresholds, AI agents, manual reviews, third-party decisions, and so on. 

In reality, a rule doesn’t operate alone. It might be preempted by an upstream block or overruled by a manual review. Your simulation assumes your rule is the final authority; in production, it is often just one voice in a choir. 

You also cannot backtest operational friction. Simulation assumes perfect code execution but misses bugs, integration issues, or the behavior of upstream actors like issuers and processors. All of those can create false positives, and none of them are captured in a simple “If this rule had been live, what would it have done?” exercise.

So yes, simulate. It’s a good start. Just don’t confuse “This rule has a low false positive rate” with “This is the whole picture.”

Method 2 - Manual review of declined events

The second method is older, slower, and much more powerful: You manually review declined traffic.

Instead of looking at the logic, you look directly at the impact. You take a sample of blocked events, either across the board or for a specific solution, and you ask an experienced fraud investigator to label them manually.

This is essentially the same process you use to design new rules or hunt for new fraud patterns, just applied in reverse.

The limitations are obvious: It doesn’t scale, it’s time-consuming, and it’s expensive operationally. You also need high-expertise analysts, and even then a percentage of events will remain inconclusive 'gray area' cases. 

On the flip side, because this is an offline audit rather than a live decision, case-specific accuracy matters less. You also don’t need to label everything.

All you need is a sense of where you stand and where the worst offenders are. After all, it doesn’t really matter if you conclude that a rule has an accuracy of 45% or 48%.
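To see why that precision doesn’t matter, here is a small Python sketch for turning a reviewed sample into an estimated false positive rate with a rough normal-approximation confidence interval (the function name and sample figures are made up for illustration):

```python
import math

def fp_rate_estimate(reviewed, false_positives, z=1.96):
    """Point estimate and ~95% normal-approximation interval for the
    false positive rate among declines, from a random reviewed sample."""
    p = false_positives / reviewed
    half_width = z * math.sqrt(p * (1 - p) / reviewed)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# E.g. an analyst reviews 200 random declines and flags 90 as good customers:
p, lo, hi = fp_rate_estimate(200, 90)
print(f"{p:.0%} ({lo:.0%}-{hi:.0%})")  # ~45% with a roughly +/-7-point interval
```

With a 200-case sample, the interval alone spans about 14 points, which is exactly why the difference between 45% and 48% is immaterial.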

It’s also possible to combine the first two methods. This is especially useful when dealing with a new fraud pattern you want to block and cannot afford to wait for it to “mature”. Simply run the logic on the last month and then review a sample of the results to assess performance.

Method 3 - Linking declines to later good behavior

The third method is the one I consider the most underrated: Linking.

The idea is simple: In many cases, good users show up multiple times on your platform before, or even after, they get blocked. They may try again with a different card. They may reattempt onboarding with a different email address. They may transact again from the same device with an IP address you don’t find risky.

If you can link those events together and see the full sequence, you suddenly get an extremely strong signal that the original decline was a false positive.

This linking can be straightforward:

  • Same user ID, different card.
  • Same device, different email.
  • Same IP+Name, different device.

Or it can be more probabilistic (fuzzier), like a blocked user, followed by a successful registration of their spouse in the same household.

Once you have the basic infrastructure in place, the method scales beautifully. You simply run over your blocked population periodically and try to associate them with good events that happened before or after.

The catch is that you need that infrastructure. You need to research a heuristic that would perform well. You need some kind of entity resolution. You need to be able to pull sequences of events over time. That is non-trivial work.

But the ROI is massive. You generate a continuous stream of high-precision false positive labels. It won’t catch every false positive, but the ones it does catch will be very reliable.
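A toy version of that periodic linking pass might look like the following Python sketch. The event schema, identifiers, and any-shared-key heuristic are assumptions for illustration; a production system would sit on top of proper entity resolution:

```python
from collections import defaultdict

# Hypothetical events, each carrying the identifiers you can link on.
blocked = [
    {"id": "b1", "device": "dev-42", "email": "a@x.com"},
    {"id": "b2", "device": "dev-99", "email": "b@y.com"},
]
good = [
    {"id": "g1", "device": "dev-42", "email": "a2@x.com"},  # same device, new email
    {"id": "g2", "device": "dev-77", "email": "c@z.com"},
]

def link_false_positives(blocked, good, keys=("device", "email")):
    """Flag a blocked event as a likely false positive if any good event
    shares at least one linking key (a deliberately simple heuristic)."""
    index = defaultdict(set)                # (key, value) -> good event ids
    for e in good:
        for k in keys:
            index[(k, e[k])].add(e["id"])
    labels = {}
    for b in blocked:
        matches = set().union(*(index[(k, b[k])] for k in keys))
        labels[b["id"]] = sorted(matches)   # non-empty => likely false positive
    return labels

print(link_false_positives(blocked, good))  # {'b1': ['g1'], 'b2': []}
```

Here b1 links to a later good event on the same device, so its decline becomes a high-precision false positive label; b2 stays unresolved rather than being guessed at.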

Method 4 - Control groups: Letting some fraud through on purpose

The fourth method is the one everyone is slightly uncomfortable with: Control groups.

You take a randomized sample of traffic, typically 1%, and whitelist it completely.

No rules. No model-based declines. No manual review. This population sails straight through to whatever the next step is: payment processing, account creation, feature access.

Then you simply watch what happens.

If your overall system performs well, you will see a meaningful amount of fraud in that control group: the fraud you would normally have caught. If, on the other hand, the events your system would have declined turn out to be mostly good customers, your false positive rate is higher than you expected, and you know you need to re-examine your setup.

Control groups are powerful because they answer a very direct question: “What would happen if we removed the system entirely for some users?” The answer usually isn’t pretty to look at, but it is honest.

This approach, however, is only practical at scale. You need enough traffic that one or two percent still gives you statistically meaningful numbers in a timeframe that makes sense.

You also need the budget to absorb the loss. For a small fintech with thin margins and limited volume, this may not be feasible. For larger players, it’s one of the few tools that gives you a truly unbiased measurement.
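To size such an experiment, a standard binomial sample-size estimate works as a first pass. This Python sketch (function name and figures are illustrative assumptions) approximates how many unscreened events you need for a given precision:

```python
import math

def control_group_size(expected_rate, rel_error=0.2, z=1.96):
    """Approximate events needed so a binomial fraud-rate estimate has the
    given relative margin of error (normal approximation)."""
    p = expected_rate
    margin = rel_error * p
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# E.g. expecting ~0.5% fraud and wanting +/-20% relative precision:
print(control_group_size(0.005))  # -> 19112 unscreened events
```

At a 1% holdout, roughly 19,000 control events implies on the order of 1.9 million total events over the measurement window, which is why this method is practical mainly for larger players.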

Method 5 - Asking the user

The fifth method, requesting direct user feedback, is a more opportunistic tactic only available in select contexts. If you’re looking at declined transactions on accounts you already know and trust, you can trigger an out-of-band notification: “We blocked a transaction ending in 1234. Was this you?”

If the user is logged-in and authenticated, an in-app prompt or email can explicitly ask whether a given block was legitimate.

But this isn’t something you can do across the board. It’s unusable at onboarding when you’ve never seen the user before; asking a stranger if they are a fraudster yields no useful data. 

Additionally, and I cannot stress this enough, do not give fraudsters a way to mark their own attempts as “good.” If a 'Yes, it was me' button automatically lifts the block, you haven't built a feedback loop, you've built a backdoor for fraudsters to whitelist themselves.

But as an additional source of ground truth in very specific situations, such as failed login attempts or contact-detail changes, this can be surprisingly helpful.
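As a design note, the safe shape of that feedback loop can be sketched as follows. The function and label names are hypothetical; the point is that a ‘yes, it was me’ response only produces a label and a review item, never an automatic unblock:

```python
def handle_feedback(event_id, user_says_me, labels, review_queue):
    """Record user feedback as a label without ever lifting the block
    automatically; 'yes, it was me' only escalates to human review."""
    if user_says_me:
        labels[event_id] = "user_claims_legit"    # candidate false positive label
        review_queue.append(event_id)             # a human decides on unblocking
    else:
        labels[event_id] = "user_confirms_fraud"  # strong true positive label
    # Deliberately, nothing here changes the decline decision itself.

labels, queue = {}, []
handle_feedback("tx-1234", True, labels, queue)
handle_feedback("tx-5678", False, labels, queue)
print(labels, queue)
```

Keeping the unblock path behind human review is what separates a feedback loop from a self-service backdoor.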

There is no single magic method

If your conclusion is 'None of these are perfect,' you are correct. That is the nature of the beast.

You’re not looking for a single magic metric that tells you the exact number of false positives in your system. You’re looking to combine several independent indicators to build both accuracy and scalability.

In practice, most teams end up with some combination of:

  • Simulation for quick rule-by-rule sense checks.
  • Manual review for high-value flows and top offenders.
  • Linking as the main continuous measurement pipeline once the engineering work is done.
  • Control groups for large populations where you can tolerate some controlled loss.
  • User feedback in narrow, carefully chosen contexts.

If you mix these properly, you’ll end up with a workable estimate of your false positives, broken down by at least a few important dimensions: product flow, traffic type, decisioning layer, and so on.

And once you can measure the problem, you can finally ask the right question.

From “How many?” to “Where from?”

It’s one thing to know that you have a false positive problem. It’s another to know where it comes from.

Is it mostly your rules? Your model threshold? Your human reviewers? A specific integration that corrupts IP addresses on mobile? A third-party payment processor that’s hyper-aggressive in a certain region?

In the second part of this series, we’ll take the false positives you’ve managed to detect and bucket them by root cause. We’ll look at:

  • How much of your false positive problem is actually under your control.
  • How much comes from specific flows and platforms.
  • How much is driven by data quality issues that have nothing to do with fraud.
  • And most importantly: How to tell which buckets are worth fixing and which ones you’ll simply have to live with.

Zero false positives is a myth. But you can still get to a point where every remaining false positive is either consciously accepted or outside of your sphere of influence.

And that clarity alone is worth a lot.

See you all in part 2.

About the author
Chen Zamir
Head of Fraud Strategy

