Article
Title: "Robust Hybrid Data-Level Approach for Handling Skewed Fat-Tailed Distributed Datasets and Diverse Features in Financial Credit Risk"
Authors: Musara Keith R., Edmore Ranganai, Charles Chimedza, Florence Matarise, Sheunesu Munyira
Pages: 229-270
DOI: 10.2478/fcds-2025-0009
Abstract:

Skewed fat-tailed distributed (imbalance or class-imbalance) datasets pose over- whelming aberrations in numerous machine learning (ML) algorithms, particularly in real-life ap- plications, especially in the domain of credit risk modelling, where default cases (minority-classes) are often outnumbered by non-default cases (majority-classes) cases or vice versa. Data-level (DL) approaches have been suggested in the recent literature as remedies for skewed fat-tailed distributed datasets. The popularized DL approach in contemporary studies is the synthetic minority oversampling technique (SMOTE) and its variants that are capable of mitigating the risk of overfitting and minimizing the generalization errors. However, these approaches can introduce noisy instances that adversely diminish the robustness of the ML algorithms. Also, they are often amenable to the presence of nominal features with mismatching labels that are inherent in real-world datasets. To bridge these gaps, we proposed a hybrid innovation framework that effectively mitigates the aberrations presented by nominal features with mismatching labels and noisy instances simultaneously. The proposed approach is the SMOTE-edited nearest neighbors-encoding nominal and continuous (SMOTEENN-ENC) features. The efficacy of our novelty was evaluated against DL approaches suggested in the literature, orchestrated to handle skewed fat-tailed distributed datasets with inherent diverse features. This approach was coupled with widely employed ensemble algorithms, namely the random forest (RF) and the extreme gradient boost (XGBoost). The results suggested that our novelty, SMOTEENN-ENC, integrated with the XGBoost algorithm demonstrated superiority and stability in the predictive performance when applied to skewed fat-tailed distributed datasets with inherent diverse features.

Open access to full text at De Gruyter Online