ML for fraud detection
Three broad types of features
- Account related features
- Transaction related features
- Customer related features
Major challenges
Class imbalance for fraud detection
Cost sensitive methods
- Loss function level
- Imbalance ratio: number of minority-class samples divided by the number of majority-class samples
- Problems: small sample size, class overlap, noisy or borderline instances
- The misclassification cost can also be treated as a hyperparameter
- Balanced accuracy
- Can lead to a lot of false positives, affecting precision
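A minimal sketch of the loss-function-level idea above: the log loss of each fraud sample is multiplied by a cost factor, and balanced accuracy is computed alongside it. The names `weighted_log_loss`, `balanced_accuracy` and `fraud_cost` are illustrative, not from any library, and the toy labels and probabilities are made up; `fraud_cost` plays the role of the misclassification cost that would be tuned as a hyperparameter.

```python
import math

def weighted_log_loss(y_true, y_prob, fraud_cost=1.0, eps=1e-12):
    """Mean log loss where errors on fraud (label 1) are scaled by fraud_cost."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -fraud_cost * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Average of recall on fraud and recall on genuine txns.
    Assumes both classes are present in y_true."""
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    tn = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 0)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Made-up toy data: 2 frauds among 10 txns.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.6, 0.1, 0.4, 0.9]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

imbalance_ratio = sum(y_true) / (len(y_true) - sum(y_true))  # minority / majority
plain_loss = weighted_log_loss(y_true, y_prob, fraud_cost=1.0)
costly_loss = weighted_log_loss(y_true, y_prob, fraud_cost=4.0)
bal_acc = balanced_accuracy(y_true, y_pred)
print(imbalance_ratio, plain_loss, costly_loss, bal_acc)
```

Raising `fraud_cost` makes a missed fraud more expensive, which pushes a model trained on this loss towards flagging more txns; that is where the extra false positives come from.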
Resampling methods
- Data level
- Oversampling: random duplication (naive), SMOTE, ADASYN
- Undersampling: random undersampling, edited nearest neighbour, replacing subsets of samples by their centroid
- Hybrid (almost always improves performance): SMOTE + nearest neighbour, Tomek links
Resampling generally improves ROC AUC but tends to lower average precision.
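The data-level methods above can be sketched in a few lines. This is a simplified illustration, not the reference algorithms: `random_undersample` and `smote_like_oversample` are hypothetical helper names, and the oversampler interpolates between two randomly paired minority samples, whereas real SMOTE interpolates towards one of the k nearest minority neighbours.

```python
import random

def random_undersample(X, y, seed=0):
    """Drop majority-class (label 0) samples at random until classes are balanced."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    kept = rng.sample(majority, len(minority))
    idx = sorted(minority + kept)
    return [X[i] for i in idx], [y[i] for i in idx]

def smote_like_oversample(X, y, n_new, seed=0):
    """Add n_new synthetic fraud points, each on the segment between two
    randomly chosen minority samples (simplification of SMOTE)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return X + synthetic, y + [1] * n_new

# Made-up toy data: 8 genuine txns, 2 frauds, 2 features each.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X, y)
X_over, y_over = smote_like_oversample(X, y, n_new=6)
print(len(X_under), sum(y_under), len(X_over), sum(y_over))
```

Resampling should be applied to the training split only, so that validation and test sets keep the real class distribution.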
- Concept drift
- Near-real-time (NRT) systems
- Categorical features transformation for fraud detection
- Converting timestamp
- Weekend/Weekdays
- Day/Night
- Converting customer id/terminal id to:
- Converting timestamp
- Rolling window: average and number of txns in the rolling window
- Sequential modeling
- Class overlap
- Performance measures
- Lack of dataset (addressed by [[Data simulation for fraud detection]])
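The timestamp and rolling-window transformations listed above can be sketched as follows. Several details are assumptions made for the example: txns are dicts with hypothetical keys `ts`, `customer_id` and `amount`, "night" means before 06:00, the "average" is taken over txn amounts, and the window includes the current txn.

```python
from datetime import datetime, timedelta

def time_flags(ts, night_end_hour=6):
    """Binary weekend and night flags derived from a timestamp."""
    return {
        "is_weekend": int(ts.weekday() >= 5),  # Saturday=5, Sunday=6
        "is_night": int(ts.hour < night_end_hour),
    }

def rolling_customer_features(txns, window_days=7):
    """For each txn, count and average the amounts of the same customer's txns
    inside the window ending at (and including) the current txn."""
    txns = sorted(txns, key=lambda t: t["ts"])
    window = timedelta(days=window_days)
    history = {}  # customer_id -> list of (ts, amount) seen so far
    rows = []
    for t in txns:
        past = history.setdefault(t["customer_id"], [])
        past.append((t["ts"], t["amount"]))
        recent = [amt for ts, amt in past if t["ts"] - ts < window]
        rows.append({
            **t,
            **time_flags(t["ts"]),
            "n_txns_window": len(recent),
            "avg_amount_window": sum(recent) / len(recent),
        })
    return rows

# Made-up toy txns.
txns = [
    {"customer_id": "c1", "ts": datetime(2024, 1, 1, 10, 0), "amount": 10.0},
    {"customer_id": "c1", "ts": datetime(2024, 1, 6, 2, 0), "amount": 30.0},
    {"customer_id": "c2", "ts": datetime(2024, 1, 6, 12, 0), "amount": 5.0},
    {"customer_id": "c1", "ts": datetime(2024, 1, 20, 9, 0), "amount": 50.0},
]
rows = rolling_customer_features(txns)
for r in rows:
    print(r["customer_id"], r["is_weekend"], r["is_night"],
          r["n_txns_window"], r["avg_amount_window"])
```

The same pattern applies to terminal ids by grouping on a terminal key instead of `customer_id`.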
Training
- Delay between train and test sets
Model validation
Evaluate the trained model on a validation dataset and tune its performance
- Hold-out: sensitive to the dataset
- Repeated hold-out: only subsets of the data are used for training
- Prequential validation: fixed test set or moving test set; computationally expensive; more testing, so more general results; also gives confidence intervals
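A rough sketch of how prequential folds with a moving test set could be generated. The function name `prequential_splits`, the day-index representation and the default 7-day periods are assumptions for illustration; `delay_days` models the gap between the training and test periods noted under Training.

```python
def prequential_splits(start_day, n_folds, train_days=7, delay_days=7, test_days=7):
    """Return (train_start, train_end, test_start, test_end) day indices per fold.

    Ends are exclusive. Each fold is shifted forward by test_days, so both the
    training window and the test window move through time.
    """
    splits = []
    for k in range(n_folds):
        train_start = start_day + k * test_days
        train_end = train_start + train_days
        test_start = train_end + delay_days  # gap between train and test
        test_end = test_start + test_days
        splits.append((train_start, train_end, test_start, test_end))
    return splits

splits = prequential_splits(start_day=0, n_folds=3)
for train_start, train_end, test_start, test_end in splits:
    print(f"train days [{train_start}, {train_end}) -> test days [{test_start}, {test_end})")
```

Training and scoring a model once per fold yields one metric value per fold; the spread of those values is what allows a confidence interval to be reported, at the price of one training run per fold.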
Model selection
- Training vs validation/test model performance tradeoff
- Performance summary
- Default parameters
- Estimated parameters: the parameters that perform best on the validation dataset
- Optimal parameters: the parameters that perform best on the test dataset
- Random grid search
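Random grid search can be sketched as sampling a few points from the full parameter grid and keeping the best one. Everything named here is hypothetical: `random_grid_search` is an illustrative helper, and `toy_validation_score` is a made-up stand-in for "fit on the training set, score on the validation set".

```python
import itertools
import random

def random_grid_search(param_grid, score_fn, n_iter=5, seed=0):
    """Evaluate n_iter randomly chosen grid points.
    Returns (best_params, best_score, all_results)."""
    rng = random.Random(seed)
    names = sorted(param_grid)
    grid = [dict(zip(names, values))
            for values in itertools.product(*(param_grid[n] for n in names))]
    candidates = rng.sample(grid, min(n_iter, len(grid)))
    results = [(params, score_fn(params)) for params in candidates]
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, results

param_grid = {"max_depth": [2, 5, 10], "n_estimators": [50, 100, 200]}

def toy_validation_score(params):
    # Placeholder score that peaks at max_depth=5, n_estimators=100.
    return -abs(params["max_depth"] - 5) - abs(params["n_estimators"] - 100) / 100

# Sample 4 of the 9 grid points, then the full grid for comparison.
best_params, best_score, results = random_grid_search(param_grid, toy_validation_score, n_iter=4)
full_best, full_score, _ = random_grid_search(param_grid, toy_validation_score, n_iter=9)
print(best_params, best_score)
print(full_best, full_score)
```

The parameters selected this way are the estimated parameters in the sense above; they need not coincide with the optimal parameters on the test set.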
Neural networks for fraud detection
- Instead of using only tree-based models or only NNs, use an ensemble of both.
- NN inputs need to be scaled
- Tree-based models are usually the ones used in real-world scenarios
Shortcomings of tree-based models
- Use the whole dataset to compute splits
- Meaning they don’t learn iteratively
- Aggregate features have to be created by hand, requiring expert human knowledge and time; NNs can learn feature representations and do classification in one go
Advantages of learning iteratively
- Can learn on only the newer data instead of retraining on all data every time
- No need to store older data once learning is complete
- NNs can learn per sample and iteratively, hence these benefits over tree-based models
- Federated learning is possible in NNs
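To make the per-sample, iterative learning above concrete, here is a small sketch that uses a single logistic unit trained with SGD as a stand-in for an NN. The class `OnlineLogisticRegression`, its `partial_fit` method and the toy batches are all made up for illustration; the point is only that the model is updated from the newest batch without revisiting older data.

```python
import math

class OnlineLogisticRegression:
    """Logistic regression trained one sample at a time with SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, x, y):
        """One gradient step on the log loss for a single (x, y) pair."""
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

model = OnlineLogisticRegression(n_features=1, lr=0.5)

# Made-up batches: one scaled feature, label 1 = fraud.
old_batch = [([0.1], 0), ([0.2], 0), ([0.9], 1), ([0.8], 1)]
new_batch = [([0.15], 0), ([0.3], 0), ([0.95], 1), ([0.7], 1)]

for batch in (old_batch, new_batch):
    for _ in range(50):  # a few passes over the current batch only
        for x, y in batch:
            model.partial_fit(x, y)
    # from here on the model no longer needs this batch

fraud_score = model.predict_proba([0.9])
genuine_score = model.predict_proba([0.1])
print(fraud_score, genuine_score)
```

A tree-based model would instead be refit on the union of old and new data, because its splits are computed from the whole dataset at once.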
Autoencoders and anomaly detection
- Can use autoencoder techniques to generate an embedding of the input
- For autoencoders, we will use the MSE loss
- Encode all the txns (both fraud and genuine). A higher reconstruction MSE signals a likely fraud (rarer) txn
- Embeddings of the txns can also be used for clustering and visualization
- We can also combine the results of unsupervised and supervised learning: train the unsupervised model on all data (labelled and unlabelled), train the supervised model on labelled data, then average the scores from both models
- Another way is to use an autoencoder architecture and add another layer on top of the latent space for binary classification
- The reconstruction score can also be used as a feature in the supervised learning scenario
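The scoring side of this can be sketched without training a network. In the snippet below, `toy_reconstruct` is a hand-written stand-in for `decoder(encoder(x))` of a trained autoencoder (it projects 2-D points onto a 1-D latent line), and `supervised_scores` are made-up classifier outputs; only the MSE scoring, the score averaging and the use of the reconstruction error as an extra feature are the point.

```python
def reconstruction_mse(x, reconstruct):
    """Mean squared error between a txn vector and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def min_max_scale(scores):
    """Rescale scores to [0, 1] so they can be averaged with probabilities."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def toy_reconstruct(x):
    """Stand-in for a trained autoencoder: project onto the line x1 == x2."""
    latent = (x[0] + x[1]) / 2.0  # 1-D latent code
    return [latent, latent]

txns = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [1.0, 4.0]]  # last one is unusual
supervised_scores = [0.05, 0.10, 0.10, 0.60]  # hypothetical classifier outputs

mse = [reconstruction_mse(x, toy_reconstruct) for x in txns]
anomaly_scores = min_max_scale(mse)

# Option 1: average the unsupervised and supervised scores.
combined = [(a + s) / 2.0 for a, s in zip(anomaly_scores, supervised_scores)]

# Option 2: append the reconstruction error as an extra input feature.
augmented = [x + [m] for x, m in zip(txns, mse)]
print(mse, combined)
```

Txns that follow the pattern the stand-in has "learned" reconstruct almost perfectly, while the unusual one gets a large error and therefore the highest combined score.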
Sequential modeling
- Things to consider: Sequence length and fixed input dimension
- 1D CNN for feature generation and then do binary classification
- LSTMs for feature generation, then use the last hidden state for binary classification
- LSTMs + attention layer over all hidden states: use the current txn as the context vector and all hidden states as input, apply the attention module to get an output, then run a binary classifier on that output
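The attention step in the last bullet can be illustrated with plain dot-product attention. This is a sketch under assumptions: the notes do not say which scoring function is used, the hidden states and context vector below are made-up numbers rather than LSTM outputs, and the classifier weights are hypothetical. In a real model the context vector would be a learned projection of the current txn with the same dimension as the hidden states.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, context):
    """Weight each hidden state by its similarity to the context vector and
    return (weights, weighted sum of hidden states)."""
    weights = softmax([dot(h, context) for h in hidden_states])
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled

def classify(pooled, w, b):
    """Logistic layer on the attention output."""
    return 1.0 / (1.0 + math.exp(-(dot(pooled, w) + b)))

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one per past txn
context = [0.0, 2.0]                                  # current txn representation

weights, pooled = attend(hidden_states, context)
fraud_prob = classify(pooled, w=[1.0, -1.0], b=0.0)   # hypothetical weights
print(weights, pooled, fraud_prob)
```

Because the pooled vector has a fixed size regardless of how many hidden states go in, this also addresses the fixed-input-dimension concern listed at the top of this section.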