Application of SMOTE-ENN Method in Data Balancing for Classification of Diabetes Health Indicators with C4.5 Algorithm

Authors

  • Bakti Putra Pamungkas Department of Informatics Engineering, University of Nahdlatul Ulama Sunan Giri, Bojonegoro
  • Muhammad Jauhar Vikri Department of Informatics Engineering, University of Nahdlatul Ulama Sunan Giri, Bojonegoro
  • Ita Aristia Sa'ida Department of Informatics Engineering, University of Nahdlatul Ulama Sunan Giri, Bojonegoro

DOI:

https://doi.org/10.32736/sisfokom.v14i2.2350

Keywords:

SMOTE-ENN, Data Imbalance, C4.5, Diabetes, Classification

Abstract

Class imbalance in health datasets often degrades the performance of classification models, especially in detecting minority classes such as people with diabetes. This study evaluates the effect of the SMOTE-ENN method on the performance of the C4.5 algorithm in classifying diabetes health indicators. The dataset is the 2021 Diabetes Binary Health Indicators BRFSS dataset from Kaggle, which contains 236,378 respondent records with an imbalanced class distribution: 85.80% non-diabetic and 14.20% diabetic. SMOTE was used to add synthetic samples to the minority class, while ENN was applied to remove samples considered noise. After balancing, the C4.5 algorithm was used for classification. Performance was evaluated using accuracy, precision, recall, and F1-score. Applying SMOTE-ENN improved accuracy from 79.49% to 80.33% and precision from 29% to 30%. Although recall did not increase, the method improved the overall stability of the predictions, particularly the precision of positive-class predictions. The novelty of this research lies in applying SMOTE-ENN specifically to a large-scale health dataset with the C4.5 algorithm, a combination that has not been widely explored. Further exploration of other balancing techniques and algorithms is needed to obtain better classification results on imbalanced data.
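The two resampling steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a standard-library Python toy, assuming a binary problem, Euclidean distance, k = 3, and a majority-vote ENN rule (other ENN variants remove a sample if any neighbour disagrees). The function names `smote`, `enn`, and `nearest`, as well as the toy data, are hypothetical and chosen only for illustration. The C4.5 classification step is not shown.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(idx, X, k, pool):
    """Indices of the (up to) k points in `pool` closest to X[idx], excluding idx."""
    others = [j for j in pool if j != idx]
    return sorted(others, key=lambda j: sq_dist(X[idx], X[j]))[:k]

def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until both classes have equal size.

    Assumes a binary problem and at least two minority samples.
    """
    rng = rng or random.Random(0)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = (len(y) - len(min_idx)) - len(min_idx)  # majority minus minority
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)                    # a real minority sample
        j = rng.choice(nearest(i, X, k, min_idx))  # one of its minority neighbours
        gap = rng.random()                         # position along the segment i -> j
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out

def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours."""
    all_idx = range(len(X))
    keep = []
    for i in all_idx:
        votes = [y[j] for j in nearest(i, X, k, all_idx)]
        if votes.count(y[i]) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 8 majority points (label 0) and 3 minority points (label 1).
X = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.9), (1.0, 1.0), (0.8, 0.3), (0.2, 0.6),
     (0.9, 0.7), (0.4, 0.4), (5.0, 5.0), (5.5, 5.2), (5.1, 5.9)]
y = [0] * 8 + [1] * 3

X_bal, y_bal = smote(X, y, minority=1)  # oversample the minority class
X_clean, y_clean = enn(X_bal, y_bal)    # then remove samples that look like noise
print("before:", y.count(0), y.count(1))
print("after SMOTE:", y_bal.count(0), y_bal.count(1))
print("after ENN:", y_clean.count(0), y_clean.count(1))
```

Each synthetic point lies on the line segment between a minority sample and one of its minority neighbours, so SMOTE fills in the minority region rather than duplicating records. ENN then runs over both classes, which is why the combined method can shrink the majority class as well as discard poorly placed synthetic points.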

Published

2025-05-26

Section

Articles