Executive Summary
Key findings from fraud vs legitimate transaction analysis
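This analysis covers 2000 transactions, of which 216 are fraudulent (10.8% fraud rate). Fraudulent transactions have a higher mean amount than legitimate ones ($115.30 vs $73.82) but a lower median ($11.86 vs $22.00), so a small number of large fraudulent amounts pulls the fraud mean upward. Several PCA features separate the classes clearly: pca_feature_3 has the largest difference in class means (7.366) and pca_feature_14 has the strongest correlation with the fraud label (r = -0.805). These high-variance features, together with time-based patterns, warrant further investigation.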
Analysis Overview
Dataset overview and analysis scope
This analysis examined 2000 credit card transactions with 216 fraudulent cases (10.8% fraud rate). Data includes 28 PCA-transformed features, transaction amounts, timestamps, and fraud labels for comprehensive class-wise comparison.
Data Quality
Data quality assessment and preprocessing summary
No missing values detected in the 2000 analyzed transactions. All 28 PCA features are numeric and properly scaled. Transaction amounts and timestamps are complete. Data quality is suitable for exploratory analysis without additional preprocessing.
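A completeness check of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the code used for this report: the column names (`amount`, `time`, `is_fraud`) and the rows are hypothetical, and the function simply counts `None` or NaN entries per column.

```python
import math

def count_missing(rows):
    """Count None or NaN entries per column in a list of dict rows."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            is_nan = isinstance(value, float) and math.isnan(value)
            missing[column] = missing.get(column, 0) + int(value is None or is_nan)
    return missing

# Made-up rows for illustration only; not taken from the analyzed dataset.
rows = [
    {"amount": 12.5, "time": 0.0, "is_fraud": 0},
    {"amount": float("nan"), "time": 1.0, "is_fraud": 1},
    {"amount": 7.0, "time": None, "is_fraud": 0},
]
missing = count_missing(rows)
print(missing)
```

A dataset passes the check when every count is zero.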
Fraud vs Legitimate Count
Distribution of fraudulent and legitimate transactions
Out of 2000 total transactions, 1784 are legitimate (89.2%) and 216 are fraudulent (10.8%). Legitimate transactions therefore outnumber fraudulent ones by about 8.3 to 1. Class imbalance toward legitimate transactions is typical for fraud detection datasets.
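The class shares follow directly from the two reported counts. The short check below uses only the totals stated in this report (2000 transactions, 216 fraudulent); the variable names are illustrative.

```python
total = 2000          # transactions analyzed (from the report)
fraud = 216           # fraudulent cases (from the report)
legit = total - fraud

fraud_rate = fraud / total
legit_rate = legit / total
ratio = legit / fraud  # legitimate transactions per fraudulent one

print(f"legitimate: {legit} ({legit_rate:.1%})")
print(f"fraudulent: {fraud} ({fraud_rate:.1%})")
print(f"legitimate per fraudulent: {ratio:.1f}")
```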
Transaction Amount by Class
Box plot showing distribution of transaction amounts for each fraud class
Fraudulent transactions have a median amount of $11.86 (IQR $104.92) vs $22.00 (IQR $67.89) for legitimate transactions. The lower median shows that a typical fraudulent transaction is smaller than a typical legitimate one, while the wider interquartile range shows that fraudulent amounts are more dispersed.
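The median and interquartile range per class can be computed as sketched below. This is an illustration under assumptions: the amounts are made up, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from the convention used by the plotting tool behind the box plot.

```python
import statistics

def median_and_iqr(amounts):
    """Return (median, IQR) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
    return q2, q3 - q1

# Made-up amounts for illustration only; not the report's data.
amounts_by_class = {
    "fraudulent": [1.0, 2.0, 3.0, 4.0, 5.0],
    "legitimate": [10.0, 20.0, 30.0, 40.0, 50.0],
}
summary = {name: median_and_iqr(vals) for name, vals in amounts_by_class.items()}
for name, (median, iqr) in summary.items():
    print(f"{name}: median={median:.2f}, IQR={iqr:.2f}")
```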
Amount Distribution (Density)
Violin plot showing the density and shape of transaction amount distributions
The density distributions show that both classes are right skewed, fraudulent transactions more strongly (mean - median = 103.45) than legitimate ones (mean - median = 51.82). Most fraudulent amounts concentrate at low values, with a long tail of larger amounts. This may indicate that fraudsters favor specific amount ranges, which a closer look at the low-value region would need to confirm.
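The mean-minus-median gap used here as a skew indicator is simple to compute: a positive gap means the mean is pulled above the median by a right tail. The sketch below applies it to made-up amounts, then recombines the medians and gaps reported in this document to recover the implied class means, which agree with the Executive Summary up to rounding.

```python
import statistics

def mean_median_gap(amounts):
    """Positive values indicate right skew, negative values left skew."""
    return statistics.mean(amounts) - statistics.median(amounts)

# Made-up amounts for illustration only: one large value creates a right tail.
amounts = [1.0, 2.0, 3.0, 10.0]
gap = mean_median_gap(amounts)
print(f"gap = {gap}")

# Implied means from the reported medians and gaps (values from the report).
fraud_mean = 11.86 + 103.45
legit_mean = 22.00 + 51.82
print(f"implied fraud mean: {fraud_mean:.2f}, implied legitimate mean: {legit_mean:.2f}")
```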
Mean PCA Features by Class
Comparison of mean values for top PCA features by fraud class
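The top 10 PCA features are those with the largest absolute difference in mean between fraudulent and legitimate transactions. pca_feature_3 shows the largest difference, 7.366 (fraud mean -7.382 vs legitimate mean -0.0162), indicating strong separation between fraud and legitimate patterns. A mean difference does not account for within-class spread, so it is a first indication of discriminative power rather than a measure of it. These features are candidates for fraud detection models.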
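The ranking can be reproduced from the class means listed in the Summary Statistics by Class table. The sketch below copies those ten pairs of means and sorts the features by absolute difference; the dictionary layout and variable names are illustrative.

```python
# (fraud mean, legitimate mean) per feature, copied from the summary table.
class_means = {
    "pca_feature_3": (-7.382, -0.0162),
    "pca_feature_14": (-7.085, -0.0184),
    "pca_feature_17": (-7.02, 0.0369),
    "pca_feature_12": (-6.432, 0.027),
    "pca_feature_7": (-6.109, -0.0253),
    "pca_feature_10": (-5.949, 0.0166),
    "pca_feature_1": (-5.168, 0.0386),
    "pca_feature_4": (4.64, -0.0157),
    "pca_feature_16": (-4.316, 0.0006),
    "pca_feature_11": (3.936, -0.0093),
}

# Sort by absolute difference of class means, largest first.
ranked = sorted(
    ((name, abs(fraud - legit)) for name, (fraud, legit) in class_means.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, diff in ranked:
    print(f"{name}: {diff:.3f}")
```

The resulting order matches the row order of the table, with pca_feature_3 first.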
Fraud Rate Over Time
Heatmap showing fraud rates across time periods (Early, Middle, Late phases)
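Fraud rates are compared across three time phases of the transaction dataset. The reported peak fraud rate of 10.8% occurs during the early phase. Because this peak equals the overall fraud rate of 10.8%, and the overall rate is a weighted average of the phase rates, the other phases cannot be much lower, so any temporal clustering is weak at this granularity. Whether fraudsters target times when detection is less likely or monitoring is reduced cannot be concluded from these rates alone and would need a finer time breakdown.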
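A per-phase fraud rate can be computed as sketched below. The report does not state how the Early, Middle, and Late phases were defined, so this example assumes three equal-width bins over the timestamp range; the function name and the records are hypothetical.

```python
def fraud_rate_by_phase(records, labels=("Early", "Middle", "Late")):
    """records: iterable of (timestamp, is_fraud) pairs.

    Assumes equal-width time bins; returns {phase: fraud rate or None}.
    """
    records = list(records)
    times = [t for t, _ in records]
    t_min, t_max = min(times), max(times)
    width = (t_max - t_min) / len(labels) or 1.0  # avoid zero width
    counts = {name: [0, 0] for name in labels}    # [fraud count, total count]
    for t, is_fraud in records:
        # Clamp so the latest timestamp falls into the last bin.
        idx = min(int((t - t_min) / width), len(labels) - 1)
        counts[labels[idx]][0] += int(is_fraud)
        counts[labels[idx]][1] += 1
    return {name: (f / n if n else None) for name, (f, n) in counts.items()}

# Made-up records for illustration only; not the report's data.
records = [(0, 0), (10, 1), (20, 0), (40, 0), (50, 0), (70, 1), (80, 1), (90, 0)]
rates = fraud_rate_by_phase(records)
print(rates)
```

Since the overall fraud rate is the count-weighted average of the phase rates, the highest phase rate is always at least the overall rate.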
Feature Correlations
Heatmap of Pearson correlations between all features and fraud label
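The correlation matrix shows which features are most strongly associated with fraud. pca_feature_14 has the strongest correlation with the fraud label (r = -0.805), making it a key discriminative feature; the negative sign means lower values of this feature go with fraud. Multicollinearity between PCA features is expected to be minimal, since principal components are uncorrelated by construction on the data they were fitted to, although small residual correlations can appear in a subsample.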
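The Pearson correlation between a numeric feature and a 0/1 fraud label (also known as the point-biserial correlation) can be computed as sketched below. This is a plain-Python illustration with made-up values; it is not the code behind the heatmap.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only: low feature values go with fraud (label 1).
feature = [4.0, 3.0, 2.0, 1.0]
is_fraud = [0, 0, 1, 1]
r = pearson(feature, is_fraud)
print(f"r = {r:.3f}")
```

A negative r, as in this toy example and as reported for pca_feature_14, means the feature tends to be lower for fraudulent transactions.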
Summary Statistics by Class
Mean, median, and standard deviation for top PCA features by fraud class
| Statistic | Feature Name | Fraudulent Value | Legitimate Value |
|---|---|---|---|
| Mean | pca_feature_3 | -7.382 | -0.0162 |
| Mean | pca_feature_14 | -7.085 | -0.0184 |
| Mean | pca_feature_17 | -7.02 | 0.0369 |
| Mean | pca_feature_12 | -6.432 | 0.027 |
| Mean | pca_feature_7 | -6.109 | -0.0253 |
| Mean | pca_feature_10 | -5.949 | 0.0166 |
| Mean | pca_feature_1 | -5.168 | 0.0386 |
| Mean | pca_feature_4 | 4.64 | -0.0157 |
| Mean | pca_feature_16 | -4.316 | 0.0006 |
| Mean | pca_feature_11 | 3.936 | -0.0093 |
| Median | pca_feature_3 | -5.139 | 0.1602 |
| Median | pca_feature_14 | -6.797 | 0.0386 |
| Median | pca_feature_17 | -5.756 | -0.0486 |
| Median | pca_feature_12 | -5.503 | 0.1526 |
| Median | pca_feature_7 | -3.161 | 0.0363 |
| Median | pca_feature_10 | -4.698 | -0.097 |
| Median | pca_feature_1 | -2.493 | 0.0738 |
| Median | pca_feature_4 | 4.223 | -0.0143 |
| Median | pca_feature_16 | -3.838 | 0.0424 |
| Median | pca_feature_11 | 3.712 | -0.0255 |
| Std Dev | pca_feature_3 | 7.377 | 1.405 |
| Std Dev | pca_feature_14 | 4.13 | 0.936 |
| Std Dev | pca_feature_17 | 7.056 | 0.7564 |
| Std Dev | pca_feature_12 | 4.639 | 0.9249 |
| Std Dev | pca_feature_7 | 7.62 | 0.9975 |
| Std Dev | pca_feature_10 | 5.061 | 1.044 |
| Std Dev | pca_feature_1 | 7.198 | 1.895 |
| Std Dev | pca_feature_4 | 2.869 | 1.346 |
| Std Dev | pca_feature_16 | 3.884 | 0.8168 |
| Std Dev | pca_feature_11 | 2.603 | 1.001 |
Summary statistics for the top 10 most discriminative PCA features reveal systematic differences between fraudulent and legitimate transactions. Fraudulent cases show greater variability (higher std dev) in all ten features. Their means and medians are lower than the legitimate values for eight features and higher for two (pca_feature_4 and pca_feature_11), while the legitimate means and medians stay close to zero. These differences form the basis for anomaly detection approaches.
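The variability comparison can be checked directly against the Std Dev rows of the table. The sketch below copies those values and computes the ratio of fraudulent to legitimate standard deviation per feature; the names are illustrative.

```python
# (fraud std dev, legitimate std dev) per feature, copied from the table above.
class_stds = {
    "pca_feature_3": (7.377, 1.405),
    "pca_feature_14": (4.13, 0.936),
    "pca_feature_17": (7.056, 0.7564),
    "pca_feature_12": (4.639, 0.9249),
    "pca_feature_7": (7.62, 0.9975),
    "pca_feature_10": (5.061, 1.044),
    "pca_feature_1": (7.198, 1.895),
    "pca_feature_4": (2.869, 1.346),
    "pca_feature_16": (3.884, 0.8168),
    "pca_feature_11": (2.603, 1.001),
}

# Ratio > 1 means the fraudulent class is more variable for that feature.
ratios = {name: fraud / legit for name, (fraud, legit) in class_stds.items()}
widest = max(ratios, key=ratios.get)
narrowest = min(ratios, key=ratios.get)

print(all(r > 1 for r in ratios.values()))
print(f"largest ratio: {widest} ({ratios[widest]:.2f})")
print(f"smallest ratio: {narrowest} ({ratios[narrowest]:.2f})")
```

Every ratio exceeds 1, ranging from roughly 2 (pca_feature_4) to roughly 9 (pca_feature_17).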