Expense Anomaly Detection

Seventy-nine percent of organizations experienced attempted or actual payments fraud in 2024, according to the AFP Payments Fraud and Control Survey (Rillion, 2025). Duplicate vendor payments, expense report padding, miscoded GL entries, and outright fraud hide in transaction data that no human has time to review line by line. Enterprise companies use dedicated fraud detection software like AppZen or Oversight at $50,000+ per year. Mid-market companies — the ones processing 500 to 5,000 transactions per month — rely on periodic spot checks and hope for the best. This analysis applies machine learning anomaly detection to your expense data, flagging the transactions that deviate most from normal patterns and giving you a ranked shortlist to investigate instead of an ocean of line items.

The Problem With Threshold-Based Rules

Most companies that try to catch unusual expenses use simple rules: flag anything over $10,000, flag any vendor payment within 5 days of a previous payment to the same vendor, flag expense reports that exceed the monthly average by 2x. These rules catch the obvious cases — a $50,000 payment to an unknown vendor will trigger every time. But they miss the subtle ones and generate too many false positives on legitimate transactions.

A duplicate payment of $2,300 to a vendor who normally receives $2,300 monthly looks perfectly normal to a threshold rule. An expense report that bills $180 for dinner every Tuesday night — slightly above the $150 average but never exceeding the $200 threshold — accumulates roughly $400 per quarter in padding ($30 of excess across 13 weeks) that no rule catches. A vendor who gradually increases invoice amounts by 3% per month is invisible until someone audits the full history.

Machine learning anomaly detection works differently. Instead of looking at one dimension at a time (amount, frequency, vendor), it evaluates transactions across all dimensions simultaneously. A payment might look normal on amount alone and normal on frequency alone, but the combination of that amount, to that vendor, on that day of the week, in that department, is unusual relative to the full pattern of normal transactions. The algorithm — called Isolation Forest — finds these multi-dimensional outliers that single-dimension rules miss.

How Isolation Forest Works (Without the Math)

The core idea is intuitive: unusual things are easier to describe than normal things. If you played a game of 20 questions about a transaction, a normal payment ("$2,300, to Office Depot, on the 15th, from the admin department") would take many questions to isolate from the crowd of similar transactions. But an unusual payment ("$2,300, to a vendor we have never paid before, on a Sunday, from the engineering department") can be isolated in just a few questions.

The algorithm builds a forest of random decision trees that split transactions along random dimensions at random thresholds. Transactions that end up in short branches — isolated quickly — get high anomaly scores. Transactions deep in the tree — hard to isolate because they look like everything else — get low scores. The algorithm builds hundreds of these trees and averages the results for stability.

You do not need to tell the model what fraud looks like. You do not need labeled examples of "good" and "bad" transactions. You do not need to configure rules or set thresholds (beyond specifying roughly what percentage of transactions you expect to be anomalous). The algorithm learns what "normal" looks like from your data and flags everything that deviates.
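In scikit-learn, that workflow fits in a few lines. The sketch below is illustrative: the transaction values are synthetic, and the contamination level (the rough share of rows you expect to be anomalous) is an assumption you would tune to your own data.

```python
# Minimal sketch of unsupervised anomaly scoring with scikit-learn's
# IsolationForest. Feature values and contamination are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Simulate 1,000 "normal" transactions: amount and day-of-week.
normal = np.column_stack([
    rng.normal(2300, 150, size=1000),   # amounts clustered near $2,300
    rng.integers(0, 5, size=1000),      # weekdays only (Mon-Fri)
])
# One suspicious transaction: typical amount, but on a Sunday.
suspect = np.array([[2300.0, 6.0]])
X = np.vstack([normal, suspect])

# contamination is the only knob: the expected share of anomalies (~1%).
model = IsolationForest(n_estimators=200, contamination=0.01,
                        random_state=0).fit(X)

# decision_function: lower (more negative) = more anomalous.
scores = model.decision_function(X)
print("suspect score:", scores[-1])
print("median normal score:", np.median(scores[:-1]))
```

Note that no labels are passed to `fit` — the model learns "normal" purely from the shape of the data, which is why the Sunday transaction stands out despite its ordinary amount.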

What Kinds of Anomalies It Catches

The key advantage over rules is that the model catches combinations that no single rule would flag. A $900 payment is not unusual. A payment to Vendor X is not unusual. But a $900 payment to Vendor X from Department Y on a Saturday — that specific combination might score as highly anomalous because it has never occurred before.
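A small synthetic demonstration of that point, with made-up department and amount values: each value in the flagged row is individually common, and only the pairing is rare.

```python
# Sketch of a multi-dimensional outlier: every value is common on its
# own, only the combination is rare. All data here is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Dept 0 normally makes small purchases, dept 1 makes large ones.
small = np.column_stack([rng.normal(100, 10, 500), np.zeros(500)])
large = np.column_stack([rng.normal(900, 50, 500), np.ones(500)])
# Anomaly: a $900 payment (common) from dept 0 (common) -- rare combo.
combo = np.array([[900.0, 0.0]])
X = np.vstack([small, large, combo])

model = IsolationForest(n_estimators=300, random_state=0).fit(X)
scores = model.decision_function(X)
# The combination row scores lower (more anomalous) than typical rows,
# even though $900 and dept 0 are each individually common.
```

A threshold rule on amount alone would never flag this row, because $900 payments happen five hundred times in this data; only the joint view isolates it.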

Who This Is For

Controllers, accounts payable managers, and internal auditors at companies with $5M to $100M in revenue — large enough to process meaningful transaction volume (500+ per month) but too small for enterprise fraud detection software. Industries with significant operational spending benefit the most: construction, healthcare, professional services, manufacturing, and hospitality. Companies with distributed purchasing authority — multiple cost centers, field offices, or project-based billing — are particularly vulnerable because no single person sees all transactions.

The current alternative for most of these companies is periodic manual sampling: the controller pulls a random sample of 50 transactions per month and reviews them. Against a 1,000-transaction month, that sample covers 5% of the ledger, so it catches roughly 5% of anomalies. Quarterly audit spot-checks are even less effective. The analysis replaces random sampling with targeted investigation — review the 20 most anomalous transactions instead of 50 random ones, and you are far more likely to find actual problems.

What Data You Need

A CSV export from your general ledger, accounts payable system, or expense management platform (QuickBooks, Xero, Expensify, Concur, Brex). You need at least two numeric columns that characterize each transaction: at minimum, the transaction amount plus a timing feature such as day of week or day of month.

Additional numeric features significantly improve detection: days since the previous payment to the same vendor, payment counts per vendor over a trailing window, and transaction counts or spend totals per department or cost center.

The model accepts 2-20 numeric features. More dimensions give it more ways to distinguish anomalies from normal transactions. Categorical columns like vendor name or GL code should be excluded from the feature set — the model requires numeric inputs only. Use those columns for investigation after the anomalies are flagged.
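A pandas sketch of that split, with hypothetical column names: numeric columns become model features, while categorical ones are set aside for the follow-up investigation.

```python
# Hedged sketch: keep numeric columns as model features, set aside
# categorical ones (vendor, GL code) for post-hoc investigation.
# Column names here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "transaction_amount": [2300.0, 180.0, 45000.0],
    "day_of_week": [1, 2, 6],
    "vendor_name": ["Office Depot", "Cafe Roma", "Acme LLC"],
    "gl_code": ["6100", "6220", "7300"],
})

features = df.select_dtypes(include="number")   # model inputs
lookup = df.select_dtypes(exclude="number")     # kept for investigation

print(list(features.columns))  # ['transaction_amount', 'day_of_week']
```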

Minimum: 200 transactions. The sweet spot is 1,000 to 5,000 rows — enough to establish normal patterns without overwhelming the model. A typical month-end GL export falls squarely in this range.

How to Read the Report

Anomaly Score Distribution — a histogram showing scores across all transactions. Normal transactions cluster at low scores; anomalies sit in the right tail. A clean separation between normal and anomalous scores means the model is confident. A gradual tail with no clear gap means the boundary is fuzzy and you should interpret borderline cases with caution.

Top Anomalies Table — this is your action list. The 20-50 most anomalous transactions ranked by score, with all feature values shown. For each transaction, you can see exactly why it was flagged — unusual amount, unusual timing, unusual vendor frequency, or some combination. Start your investigation here.
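Building that ranked list is straightforward once scores exist. This sketch uses synthetic data and hypothetical column names; the sign flip converts scikit-learn's decision_function, where lower means more anomalous, into a high-is-suspicious score.

```python
# Sketch of the ranked action list: score every row, then take the
# N most anomalous with all feature values attached. Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "transaction_amount": rng.normal(2300, 200, 1000),
    "days_since_last_vendor_payment": rng.normal(30, 3, 1000),
})
# Plant one obvious outlier for the demo: huge amount, paid next day.
df.loc[999] = [45000.0, 1.0]

model = IsolationForest(n_estimators=200, random_state=0).fit(df)
scores = -model.decision_function(df)   # flip sign: high = suspicious
df["anomaly_score"] = scores

# The action list: 20 most anomalous rows, all features attached.
top = df.nlargest(20, "anomaly_score")
```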

Feature Importance — which dimensions drove the anomaly scores most. If "transaction_amount" dominates, your anomalies are primarily unusually large or small payments. If multiple features contribute roughly equally, the anomalies are multi-dimensional — they look unusual across several characteristics simultaneously, which often indicates more sophisticated patterns.
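scikit-learn's IsolationForest does not expose feature importances directly, so a common workaround, sketched here on synthetic data, is permutation importance: shuffle one feature at a time and measure how much the anomaly scores move.

```python
# Permutation-style importance for IsolationForest: shuffling an
# informative feature shifts the scores more than shuffling noise.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = np.column_stack([
    rng.normal(2300, 200, 800),   # transaction amount (drives anomalies)
    rng.integers(0, 5, 800),      # day of week (mostly uninformative)
])
X[:8, 0] = 50000                  # a handful of huge payments

model = IsolationForest(n_estimators=200, random_state=0).fit(X)
base = model.decision_function(X)

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break this feature's signal
    shifted = model.decision_function(Xp)
    importance.append(float(np.mean(np.abs(shifted - base))))
# Here feature 0 (amount) should dominate.
```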

Normal vs. Anomaly Comparison — side-by-side statistics showing how anomalous transactions differ from normal ones. Maybe anomalous transactions are 10x the average amount. Maybe they have normal amounts but come from vendors with unusually low payment frequency. This comparison table makes the differences concrete.
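The comparison table can be produced with a groupby once each row is labeled. This sketch assumes synthetic data and hypothetical feature names; predict() returns -1 for anomalies and +1 for normal rows.

```python
# Sketch of the normal-vs-anomaly comparison: label each row with
# predict(), then compare mean feature values across the two groups.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
# 980 normal rows plus 20 planted anomalies: huge amounts, rare vendors.
df = pd.DataFrame({
    "transaction_amount": np.r_[rng.normal(2300, 200, 980),
                                rng.normal(30000, 2000, 20)],
    "payments_to_vendor_last_90d": np.r_[rng.normal(3.0, 0.5, 980),
                                         rng.normal(1.0, 0.2, 20)],
})

model = IsolationForest(contamination=0.02, random_state=0).fit(df)
labels = model.predict(df)              # -1 = anomaly, +1 = normal
df["label"] = np.where(labels == -1, "anomaly", "normal")

# Side-by-side means: how do flagged rows differ from the rest?
comparison = df.groupby("label")[["transaction_amount",
                                  "payments_to_vendor_last_90d"]].mean()
```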

Building a Review Workflow

Monthly Cadence

Run the analysis at month-end on the full month's GL export. Review the top 20 anomalies. Most will have innocent explanations — a legitimate one-time purchase, a vendor who changed billing frequency, a correctly coded but unusual project expense. Flag the ones that do not have obvious explanations for deeper investigation.

Prioritize by Score

Anomaly scores are continuous, not binary. A transaction scoring 0.85 is far more suspicious than one scoring 0.55. Start with the highest scores and work down. You will quickly develop a sense for what score level separates "worth investigating" from "probably fine" in your specific data.

Track False Positives

Keep a log of which flagged transactions turned out to be legitimate. Over time, this helps you tune the contamination parameter — if the model flags 5% of transactions but only 1% are actually problematic, reduce the contamination setting from 5% to 2%. This narrows the investigation list and reduces reviewer fatigue.
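In scikit-learn terms the contamination parameter is a fraction of the training data, so that adjustment means moving from 0.05 to 0.02. A quick sketch on synthetic stand-in data:

```python
# Contamination sets the score threshold as a quantile of the training
# scores, so it directly controls how many rows get flagged.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(11)
X = rng.normal(size=(1000, 3))   # stand-in for real transaction features

wide = IsolationForest(contamination=0.05, random_state=0).fit(X)
narrow = IsolationForest(contamination=0.02, random_state=0).fit(X)

n_wide = int((wide.predict(X) == -1).sum())      # roughly 50 of 1,000
n_narrow = int((narrow.predict(X) == -1).sum())  # roughly 20 of 1,000
```

Lowering the fraction does not change which rows score worst; it only shortens the list by raising the bar for what counts as a flag.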

When to Use Something Else

References