User 136 · Fintech · Transactions · Fraud Anomaly

Executive Summary

Key model performance metrics for fraud detection

Metric	Value
Observations	5000
Fraud Cases	492
Legitimate	4508
Fraud Rate (%)	9.84
Logistic AUC	0.9836
Isolation Forest AUC	0.8823
Optimal F1	0.9193
Precision at F1	0.9771
Recall at F1	0.8679
Fraud Caught	427
Fraud Missed	65

Logistic regression achieved AUC 0.984 and isolation forest achieved AUC 0.882 on the 5,000-transaction sample containing 492 confirmed fraud cases (9.84%). At the F1-optimal threshold, the logistic model catches 86.8% of fraud cases with precision 97.7% — meaning 10 legitimate transactions are flagged for every 427 fraud cases correctly identified.
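The headline precision, recall and F1 figures can be re-derived from the reported counts. The following is a minimal sketch using only numbers stated in this report (427 fraud caught, 65 missed, 10 legitimate transactions flagged); the variable names are illustrative.

```python
# Confusion-matrix counts reported for the F1-optimal threshold.
true_positives = 427    # fraud caught
false_negatives = 65    # fraud missed
false_positives = 10    # legitimate transactions flagged

# Precision: share of flagged transactions that are actually fraud.
precision = true_positives / (true_positives + false_positives)
# Recall: share of all fraud cases that were flagged.
recall = true_positives / (true_positives + false_negatives)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: precision=0.9771 recall=0.8679 f1=0.9193
```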

Visualization

Fraud Prevalence: Sample vs Original Dataset

Fraud rate comparison between the 5,000-row fraud-enriched sample and the full Kaggle dataset

Interpretation

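The training sample has a fraud rate of 9.84%, compared with just 0.17% in the full 284,807-transaction dataset. This roughly 57x enrichment ensures both models see enough fraud examples to learn discriminative patterns. The sample is enriched rather than balanced: legitimate transactions still outnumber fraud about 9 to 1. When estimating expected false-positive volume in production, interpret the results with the original prevalence in mind, because the same false-positive rate applied to far more legitimate traffic yields much lower precision.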
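To make the prevalence caveat concrete, the sketch below rescales precision from the sample to the original fraud rate. It assumes the recall (427/492) and false-positive rate (10/4508) measured on the sample carry over unchanged to production traffic, which the report does not verify; the function name is hypothetical.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Expected precision when fraud makes up `prevalence` of all transactions."""
    flagged_fraud = tpr * prevalence           # true positives per transaction
    flagged_legit = fpr * (1.0 - prevalence)   # false positives per transaction
    return flagged_fraud / (flagged_fraud + flagged_legit)

tpr = 427 / 492     # recall at the F1-optimal threshold
fpr = 10 / 4508     # false positives / legitimate transactions in the sample

# Assumption: tpr and fpr are the same in the sample and in production.
precision_sample = precision_at_prevalence(tpr, fpr, 492 / 5000)
precision_full = precision_at_prevalence(tpr, fpr, 0.0017)

print(f"sample: {precision_sample:.3f}  full dataset: {precision_full:.3f}")
```

Under these assumptions precision falls from about 0.98 in the sample to about 0.40 at the original 0.17% fraud rate, meaning more legitimate transactions than fraud cases would be flagged.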

Visualization

Isolation Forest Anomaly Score by Class

Distribution of isolation forest anomaly scores for fraudulent vs legitimate transactions

Interpretation

Fraudulent transactions have a median isolation forest score of 0.612 versus 0.587 for legitimate transactions. Higher scores indicate observations that are easier to isolate — i.e., they fall in sparse regions of feature space. The degree of separation between these distributions reflects how well the unsupervised model can discriminate without access to the fraud label.
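For orientation, the sketch below shows how an isolation forest score is conventionally derived from path length, following the standard definition s = 2^(-E[h(x)] / c(n)). That this report's scores use exactly this convention is an assumption, and the subsample size of 256 is illustrative only.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n: int) -> float:
    """c(n): expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length: float, subsample_size: int) -> float:
    """Shorter average path (easier to isolate) gives a score closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_length(subsample_size))

n = 256                                  # illustrative subsample size (assumption)
typical = average_path_length(n)         # a point of average isolation depth
score_typical = anomaly_score(typical, n)
score_sparse = anomaly_score(0.7 * typical, n)   # isolated in 30% fewer splits
print(round(score_typical, 3), round(score_sparse, 3))
```

Under this convention a score of 0.5 marks a point of average isolation depth, so medians of 0.612 and 0.587 would both sit on the anomalous side of that mark, with only a small gap between the classes.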

Visualization

Logistic Regression Coefficients (Top Features)

Log-odds coefficients from logistic regression showing each PCA component's contribution to fraud probability

Interpretation

Of the top 20 features by coefficient magnitude, 9 are statistically significant at p < 0.05. The largest-magnitude coefficient is V4 (+0.855), giving it the strongest per-unit influence on fraud log-odds among the features shown. Positive coefficients increase estimated fraud probability and negative coefficients reduce it; features like V14 and V17 are well-known fraud discriminators in this dataset.
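A log-odds coefficient is easier to read as an odds ratio. The sketch below converts the reported V4 coefficient (+0.855) and applies it to a hypothetical baseline fraud probability of 10%. The baseline is an assumption for illustration, and "one unit" means one unit on whatever scale the feature had when the model was fit.

```python
import math

coef_v4 = 0.855   # log-odds coefficient reported for V4

# A one-unit increase in the feature multiplies the odds of fraud by exp(coef).
odds_ratio = math.exp(coef_v4)

def shifted_probability(p: float, log_odds_delta: float) -> float:
    """Probability after adding `log_odds_delta` to the log-odds of probability p."""
    odds = p / (1.0 - p) * math.exp(log_odds_delta)
    return odds / (1.0 + odds)

baseline = 0.10   # hypothetical baseline fraud probability (assumption)
after = shifted_probability(baseline, coef_v4)
print(f"odds ratio={odds_ratio:.2f}, probability {baseline:.2f} -> {after:.2f}")
```

With these inputs the odds ratio is about 2.35, and a 10% baseline probability rises to roughly 21% for a one-unit increase in V4, holding the other components fixed.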

Visualization

ROC Curve — Logistic vs Isolation Forest

ROC curves comparing logistic regression and isolation forest across all classification thresholds

Interpretation

Logistic regression achieves AUC = 0.984 and isolation forest achieves AUC = 0.882 on this dataset. AUC measures the probability that a randomly chosen fraud transaction is ranked above a randomly chosen legitimate transaction. At any given false-positive rate, the model with the higher true-positive rate (the curve closer to the top-left corner) is preferable for production use where analyst capacity limits review volume.
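The ranking interpretation of AUC can be checked directly by counting pairs. The sketch below uses small made-up score lists, not data from this report, and the function name is illustrative.

```python
def pairwise_auc(fraud_scores, legit_scores):
    """Share of (fraud, legitimate) pairs where the fraud score is higher; ties count half."""
    wins = 0.0
    for f in fraud_scores:
        for g in legit_scores:
            if f > g:
                wins += 1.0
            elif f == g:
                wins += 0.5
    return wins / (len(fraud_scores) * len(legit_scores))

# Toy scores for illustration only.
fraud = [0.9, 0.8, 0.4]
legit = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(fraud, legit)
print(round(auc, 4))   # 11 of 12 pairs are ordered correctly -> 0.9167
```

This quadratic loop is meant for understanding; on a 5,000-row sample a rank-based implementation gives the same value more efficiently.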

Visualization

Precision-Recall Curve — Logistic vs Isolation Forest

PR curves showing the precision-recall trade-off for both models under class imbalance

Interpretation

Under the class imbalance present in the sample (9.8% fraud), precision-recall curves provide a more honest picture of operational performance than ROC curves. A model with high PR-AUC maintains strong precision even as recall is pushed toward 1.0, minimising the number of legitimate transactions sent to a human reviewer for each additional fraud case caught.

Visualization

Confusion Matrix at Optimal F1 Threshold

Confusion matrix for logistic regression at the threshold (0.428) that maximises F1-score

Interpretation

At the F1-optimal threshold of 0.428, the logistic model correctly identifies 427 of 492 fraud cases (recall = 86.8%) while generating 10 false positives. 65 fraud cases are missed (false negatives), and these represent the highest operational risk — transactions that pass through undetected.
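One plausible way to find an F1-optimal threshold is a sweep over candidate cut-offs, sketched below. The report does not state how 0.428 was selected, so this is an illustration on toy data, with fraud predicted when the score is at or above the threshold.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, f1) maximising F1 over the observed score values."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue   # F1 is zero or undefined without true positives
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Toy predicted probabilities and labels (1 = fraud), for illustration only.
scores = [0.95, 0.80, 0.60, 0.45, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

threshold, f1 = best_f1_threshold(scores, labels)
print(threshold, round(f1, 3))   # 0.45 0.857
```

F1 weights a missed fraud case and a false alarm equally. If missed fraud is costlier, as the paragraph above suggests, a lower threshold than the F1 optimum may be worth evaluating.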

Data Table

Top 20 Most Anomalous Transactions

Highest-risk transactions ranked by blended isolation forest and logistic regression score

Transaction ID	Amount	Fraud Prob	Anomaly Score	Class Label
1999	0.01	1	0.7242	Fraud
4597	2.28	1	0.7222	Fraud
4927	0	1	0.7183	Fraud
28	1	1	0.7154	Fraud
737	1	1	0.7101	Fraud
1554	1	1	0.7101	Fraud
3238	1	1	0.7101	Fraud
3644	1	1	0.7101	Fraud
1297	2.28	1	0.7077	Fraud
1996	1	1	0.7039	Fraud
2911	1.63	1	0.7025	Fraud
3938	9.82	1	0.701	Fraud
301	364.2	1	0.6944	Fraud
1733	1.63	1	0.6888	Fraud
1976	106.5	1	0.6874	Fraud
1371	8.64	1	0.6869	Fraud
1506	1219	1	0.6851	Fraud
4223	1	1	0.6851	Fraud
1654	139.9	1	0.6823	Fraud
3025	30.31	1	0.6745	Fraud

Interpretation

All 20 transactions with the highest combined anomaly score are confirmed fraud cases; none are legitimate. The blended score averages the normalised isolation forest score and the logistic regression probability, prioritising transactions that both models agree are suspicious. Transaction amounts in the top 20 range from $0 to $1,218.89 (shown rounded in the table).
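The blending step can be sketched as follows. The report says the isolation forest score is normalised before averaging but not how, so min-max scaling is an assumption here; the IDs and values are toy data and the function names are hypothetical.

```python
def min_max(values):
    """Rescale values to the [0, 1] range (assumed normalisation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def blended_scores(anomaly_scores, fraud_probs):
    """Average the normalised anomaly score with the logistic fraud probability."""
    normalised = min_max(anomaly_scores)
    return [(a + p) / 2.0 for a, p in zip(normalised, fraud_probs)]

def top_k(ids, scores, k):
    """IDs of the k highest-scoring transactions, highest first."""
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return [tx_id for tx_id, _ in ranked[:k]]

# Toy inputs for illustration only.
ids = ["a", "b", "c"]
anomaly = [0.70, 0.60, 0.50]
probs = [1.0, 0.2, 0.9]

blended = blended_scores(anomaly, probs)
print(top_k(ids, blended, 2))   # ['a', 'c']
```

Transaction "c" outranks "b" despite a lower anomaly score because its fraud probability is much higher, which is the agreement-weighted behaviour the blended ranking is meant to provide.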

Visualization

Feature Importance — Isolation Forest

Which PCA components contribute most to the isolation forest's anomaly detection

Interpretation

The most important feature for the isolation forest is V19, the PCA component on which the model most frequently and shallowly splits to isolate anomalies. Features that appear at shallow splits are particularly discriminating because they alone can separate outliers from the bulk of the data. Comparing this ranking to the logistic regression coefficients reveals whether supervised and unsupervised methods agree on which PCA components carry fraud signal.