ML for fraud detection
3 broad types of features
- Account related features
- Transaction related features
- Customer related features
Major challenges
Class imbalance for fraud detection
Cost sensitive methods
- Loss function level
- Imbalance ratio: ratio of the number of minority-class samples to majority-class samples
- Problems
- Small sample size
- Class overlap
- Noisy or borderline instances
- Can consider misclassification cost as a hyperparameter as well
- Balanced accuracy
- Can lead to a lot of false positives, hurting precision
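A minimal sketch of the two ideas above (all function names are my own): a cross-entropy loss where the misclassification cost of the minority (fraud) class is a tunable hyperparameter, and balanced accuracy as the mean of per-class recalls.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, fraud_cost=10.0):
    # Cost-sensitive cross-entropy: errors on the fraud (minority)
    # class are weighted by `fraud_cost`, treated as a hyperparameter.
    eps = 1e-12
    p = np.clip(p_pred, eps, 1 - eps)
    w = np.where(y_true == 1, fraud_cost, 1.0)
    return float(np.mean(-w * (y_true * np.log(p)
                               + (1 - y_true) * np.log(1 - p))))

def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recalls; insensitive to the imbalance ratio.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

Increasing `fraud_cost` pushes the model toward catching more fraud at the price of more false positives, which is exactly the precision trade-off noted above.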
Resampling methods
- Data level
- Oversampling
- Random duplication (Naive)
- SMOTE
- ADASYN
- Undersampling
- Random undersampling
- Edited nearest neighbor
- Replacing subsets by samples of their centroid
- Hybrid
- Almost always improves performance
- SMOTE + edited nearest neighbour, SMOTE + Tomek links
Sampling generally improves AUC ROC, but tends to decrease average precision.
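A minimal sketch of the SMOTE idea from the oversampling list above (in practice a library such as imbalanced-learn would be used; this toy version is just to show the mechanism): each synthetic sample interpolates between a minority point and one of its k nearest minority neighbours.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    # Toy SMOTE: for each synthetic sample, pick a random minority
    # point, one of its k nearest minority neighbours, and interpolate
    # at a random position along the line segment between them.
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :min(k, n - 1)]
    new = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nn[i, rng.integers(nn.shape[1])]
        lam = rng.random()
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```

ADASYN follows the same interpolation scheme but generates more samples near minority points that are hard to classify.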
- Concept drift
- NRT systems
- Categorical features transformation for fraud detection
- Converting timestamp
- Weekend/Weekdays
- Day/Night
- Converting customer id/terminal id to:
- Rolling window: average and number of txns in the rolling window
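A sketch of both transformations above (the 22:00-06:00 night window and the 7-day default are my own illustrative choices): binary calendar features from the timestamp, and per-customer/per-terminal rolling-window aggregates.

```python
from datetime import datetime, timedelta

def timestamp_features(ts, night_start=22, night_end=6):
    # Binary calendar features derived from the raw timestamp.
    # The 22:00-06:00 night window is an assumption, not a standard.
    return {
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
        "is_night": ts.hour >= night_start or ts.hour < night_end,
    }

def rolling_stats(txns, now, window_days=7):
    # Rolling-window aggregates for one customer or terminal:
    # number of transactions and their average amount.
    # `txns` is a list of (timestamp, amount) pairs for that entity.
    cutoff = now - timedelta(days=window_days)
    amounts = [a for t, a in txns if cutoff <= t <= now]
    count = len(amounts)
    return {"txn_count": count,
            "txn_avg": sum(amounts) / count if count else 0.0}
```

In practice these aggregates are computed over several window sizes (e.g. 1, 7, and 30 days) to capture behaviour at different time scales.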
- Sequential modeling
- Class overlap
- Performance measures
- Lack of dataset (addressed by [[Data simulation for fraud detection]])
Training
- Delay in train/test set
Model validation
Evaluate the trained model on validation dataset and tune the performance
- Hold out
- Sensitive to the dataset
- Repeated hold out
- Only subsets of data are used for training
- Prequential validation
- Fixed test set
- Moving test set
- Computationally expensive
- More testing so more general results
- Also gives confidence intervals
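A sketch of prequential splits over time-ordered data blocks, reading "fixed vs moving" as a growing vs sliding training window (my simplification; names are illustrative). The `delay` gap mimics the label delay between training data and usable test labels.

```python
def prequential_splits(n_blocks, train_size, delay=1, moving=True):
    # Yields (train_block_ids, test_block_id) pairs over time-ordered
    # blocks (e.g. one block per day or week). `delay` leaves a gap
    # between training and test blocks; `moving` selects a sliding
    # training window instead of a growing one.
    for test in range(train_size + delay, n_blocks):
        end = test - delay  # last block (exclusive) usable for training
        start = end - train_size if moving else 0
        yield list(range(start, end)), test
```

Each pair yields one test score; averaging across the folds gives the more general results and confidence intervals mentioned above, at the cost of training one model per fold.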
Model selection
- Training vs validation/test model performance tradeoff
- Performance summary
- Default parameters
- Estimated parameters - best parameters found on the validation dataset
- Optimal parameters - best parameters as measured on the test dataset (an oracle upper bound)
- Random grid search
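A minimal random-search sketch (all names are my own; `evaluate` stands in for training a model and scoring it on the validation set): sample configurations at random from a grid of candidate values and keep the best.

```python
import random

def random_search(evaluate, param_space, n_iter=20, rng=None):
    # Randomly sample `n_iter` configurations from `param_space`
    # (a dict of name -> list of candidate values) and keep the one
    # with the best validation score returned by `evaluate`.
    rng = random.Random(rng)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Compared with an exhaustive grid search, random search covers large grids at a fixed budget and tends to find good configurations faster when only a few hyperparameters matter.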