..

2022-12-01

ML for fraud detection

3 broad types of features

Account related features
Transaction related features
Customer related features

Major challenges

Class imbalance for fraud detection

Cost sensitive methods

Loss function level
Imbalance ratio: number of minority class samples divided by number of majority class samples
Problems
- Small sample size
- Class overlap
- Noisy or borderline instances
Can consider misclassification cost as a hyperparameter as well
Balanced accuracy
Can lead to a lot of false positives, affecting precision
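
A minimal sketch of the loss-level ideas above, in plain Python. The function names and the toy labels/probabilities are made up for illustration, and both classes are assumed to be present. The misclassification costs are plain arguments, so they can be tuned like any other hyperparameter.

```python
import math

def imbalance_ratio(labels):
    """Minority-class count divided by majority-class count (labels are 0/1)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return min(positives, negatives) / max(positives, negatives)

def weighted_log_loss(labels, probs, fraud_cost=1.0, legit_cost=1.0):
    """Binary cross-entropy where each class carries its own cost."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            total += -fraud_cost * math.log(p)      # cost of missing a fraud
        else:
            total += -legit_cost * math.log(1 - p)  # cost of a false alarm
    return total / len(labels)

def balanced_accuracy(labels, preds):
    """Mean of the recall on the fraud class and on the legitimate class."""
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Toy data: 8 legitimate transactions, 2 frauds (illustrative values only)
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.6, 0.1, 0.4, 0.9]
preds = [1 if p >= 0.5 else 0 for p in probs]

print(imbalance_ratio(labels))
print(balanced_accuracy(labels, preds))
print(weighted_log_loss(labels, probs, fraud_cost=4.0))
```

Raising `fraud_cost` pushes a model trained on this loss towards flagging more transactions, which is where the extra false positives come from.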

Resampling methods

Data level
Oversampling
- Random duplication (Naive)
- SMOTE
- ADASYN
Undersampling
- Random undersampling
- Edited nearest neighbor
- Replacing subsets of samples by their centroid
Hybrid
- Almost always improves performance
- SMOTE + edited nearest neighbour, SMOTE + Tomek links

Resampling is generally beneficial to AUC ROC, but tends to decrease average precision.
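
A minimal sketch of the two naive data-level methods listed above (random duplication and random undersampling), using only the standard library. The helper names are made up for illustration, and SMOTE, ADASYN and the hybrid variants are not shown. Resampling should only ever be applied to the training set, never to validation or test data.

```python
import random

def _minority_majority(labels):
    """Return (minority_indices, majority_indices) for 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def random_oversample(rows, labels, seed=0):
    """Duplicate minority rows at random until both classes have equal counts."""
    rng = random.Random(seed)
    minority, majority = _minority_majority(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = majority + minority + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

def random_undersample(rows, labels, seed=0):
    """Drop majority rows at random until both classes have equal counts."""
    rng = random.Random(seed)
    minority, majority = _minority_majority(labels)
    keep = rng.sample(majority, len(minority)) + minority
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy training set: 8 legitimate rows, 2 fraud rows (illustrative values only)
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(len(over_rows), sum(over_labels))    # size and fraud count after oversampling
print(len(under_rows), sum(under_labels))  # size and fraud count after undersampling
```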

Concept drift
NRT systems
Categorical features transformation for fraud detection
1. Converting timestamp
  1. Weekend/Weekdays
  2. Day/Night
2. Converting customer id/terminal id to:
  1. Rolling window: average amount and number of txns in the rolling window
Sequential modeling
Class overlap
Performance measures
Lack of dataset (addressed by [[Data simulation for fraud detection]])
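
A minimal sketch of the feature transformations listed above: weekend/night flags from the timestamp, and per-customer rolling-window count and average amount. The function names, the 0h-6h definition of "night", the 7-day window and the toy transactions are all assumptions for illustration. The same rolling logic would apply per terminal id.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def time_flags(ts):
    """Return (is_weekend, is_night); night is assumed to mean 0h-6h."""
    is_weekend = 1 if ts.weekday() >= 5 else 0
    is_night = 1 if ts.hour < 6 else 0
    return is_weekend, is_night

def rolling_customer_features(txns, window=timedelta(days=7)):
    """For each transaction, return (count, mean amount) of the same customer's
    transactions in the trailing window, current transaction included.
    txns: list of (timestamp, customer_id, amount), sorted by timestamp."""
    history = defaultdict(deque)  # customer_id -> recent (timestamp, amount)
    sums = defaultdict(float)     # customer_id -> sum of amounts in window
    features = []
    for ts, customer, amount in txns:
        recent = history[customer]
        recent.append((ts, amount))
        sums[customer] += amount
        # Drop transactions that fell out of the window (ts - window, ts]
        while recent[0][0] <= ts - window:
            _, old_amount = recent.popleft()
            sums[customer] -= old_amount
        features.append((len(recent), sums[customer] / len(recent)))
    return features

# Toy transactions (illustrative values only)
txns = [
    (datetime(2022, 12, 1, 10), "c1", 10.0),
    (datetime(2022, 12, 2, 10), "c2", 50.0),
    (datetime(2022, 12, 3, 2), "c1", 30.0),
    (datetime(2022, 12, 10, 10), "c1", 20.0),
]

for (ts, customer, amount), feats in zip(txns, rolling_customer_features(txns)):
    print(customer, time_flags(ts), feats)
```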

Training

Delay period between train and test sets

Model validation

Evaluate the trained model on a validation dataset and tune it to improve performance

Hold out
- Sensitive to the dataset
Repeated hold out
- Only subsets of data are used for training
Prequential validation
- Fixed test set
- Moving test set
- Computationally expensive
- More testing so more general results
- Also gives confidence intervals
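
A minimal sketch of how prequential folds could be laid out in time, with a delay period between each training and test window. The function names, the 7-day defaults and the forward-shifting scheme are assumptions for illustration. The interval at the end is a rough normal approximation over the fold scores, which is only indicative when there are few folds.

```python
import math
import statistics
from datetime import date, timedelta

def prequential_splits(first_day, n_folds, train_days=7, delay_days=7, test_days=7):
    """Return (train_start, train_end, test_start, test_end) per fold.
    End dates are exclusive; each fold is shifted forward by test_days."""
    splits = []
    for k in range(n_folds):
        train_start = first_day + timedelta(days=k * test_days)
        train_end = train_start + timedelta(days=train_days)
        test_start = train_end + timedelta(days=delay_days)  # skip the delay period
        test_end = test_start + timedelta(days=test_days)
        splits.append((train_start, train_end, test_start, test_end))
    return splits

def mean_and_ci(scores, z=1.96):
    """Mean fold score with an approximate 95% interval (normal approximation)."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width

splits = prequential_splits(date(2022, 1, 1), n_folds=3)
for split in splits:
    print(split)

# Hypothetical per-fold scores, e.g. average precision on each test window
fold_scores = [0.6, 0.7, 0.8]
print(mean_and_ci(fold_scores))
```

Training once per fold is what makes this approach computationally expensive compared with a single hold-out split.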

Model selection

Training vs validation/test model performance tradeoff
Performance summary
1. Default parameters
2. Estimated parameters - Parameters that perform best on the validation dataset
3. Optimal parameters - Parameters that perform best on the test dataset
Random grid search
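
A minimal sketch of random search over a parameter grid, using only the standard library. The parameter names, the grid values and `toy_validation_score` are hypothetical stand-ins. In practice `score_fn` would train a model with the sampled parameters and return its score on the validation dataset, which gives the "estimated parameters" above.

```python
import random

def random_search(param_space, score_fn, n_iter=10, seed=0):
    """Sample n_iter random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per parameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid and scoring function, for illustration only
param_space = {"max_depth": [2, 4, 6, 8], "n_estimators": [50, 100, 200]}

def toy_validation_score(params):
    """Stand-in for 'train the model and score it on the validation set'."""
    return -abs(params["max_depth"] - 6) - abs(params["n_estimators"] - 100) / 100

best_params, best_score = random_search(param_space, toy_validation_score, n_iter=10)
print(best_params, best_score)
```

Unlike an exhaustive grid search, the cost is fixed by `n_iter` rather than by the size of the grid.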