Enhancing XGBoost Performance in Malware Detection through Chi-Squared Feature Selection

Salma Rosyada; Fauzi Adi Rafrastara; Arsabilla Ramadhani; Wildanil Ghozi; Warusia Yassin

doi:10.32736/sisfokom.v13i3.2293

Authors

Salma Rosyada, Department of Informatics Engineering, Universitas Dian Nuswantoro, Semarang
Fauzi Adi Rafrastara, Department of Informatics Engineering, Universitas Dian Nuswantoro, Semarang
Arsabilla Ramadhani, Department of Informatics Engineering, Universitas Dian Nuswantoro, Semarang
Wildanil Ghozi, Department of Informatics Engineering, Universitas Dian Nuswantoro, Semarang
Warusia Yassin, Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka

Keywords:

malware detection, XGBoost, chi-squared, machine learning, feature selection

Abstract

The growing prevalence of malware poses significant risks, including data loss and unauthorized access. These threats take various forms, such as viruses, Trojans, worms, and ransomware, each of which continually evolves to exploit system vulnerabilities. Ransomware has increased particularly rapidly, as evidenced by the devastating WannaCry attack of 2017, which crippled critical infrastructure and caused immense economic damage. Traditional anti-malware solutions rely heavily on signature-based techniques and therefore struggle to keep pace with malware's evolving nature: even slight code modifications can allow malware to evade detection. This exposes weaknesses in current cybersecurity defenses and underscores the need for more sophisticated detection methods. To address these challenges, this study proposes an enhanced malware detection approach that combines Extreme Gradient Boosting (XGBoost) with Chi-Squared Feature Selection. XGBoost was applied to a malware dataset after preprocessing steps such as class balancing and feature scaling. Adding Chi-Squared Feature Selection improved the model's accuracy from 99.1% to 99.2% and reduced testing time by 89.28%. These results indicate that prioritizing relevant features improves both the accuracy and the computational speed of the model, and that combining feature selection with machine learning is effective for modern malware detection.
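The feature scaling and chi-squared selection steps named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' code: the function names, the toy data, and the choice of k are hypothetical, and the abstract does not state how many features were retained. The score follows the common formulation for non-negative features, in which the observed value is the per-class sum of a feature and the expected value is the class share multiplied by the feature total. In practice one would more likely use scikit-learn's `SelectKBest` with `chi2` and then train xgboost's `XGBClassifier` on the reduced feature set.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature against the class label.

    observed = sum of the feature over the samples of one class
    expected = (share of samples in that class) * (feature total)
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    class_share = {
        c: sum(1 for label in y if label == c) / n_samples for c in classes
    }
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_share[c] * total
            # Assumption of this sketch: an all-zero feature scores 0
            # instead of producing a division by zero.
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring features and the reduced matrix."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


# Hypothetical toy data: feature 0 separates the classes,
# feature 1 is constant, feature 2 is unrelated to the label.
raw = [[10, 5, 1], [9, 5, 0], [1, 5, 1], [0, 5, 0]]
labels = [1, 1, 0, 0]  # 1 = malware, 0 = benign

# Scale column by column, then transpose back to rows.
scaled_columns = [min_max_scale([row[j] for row in raw]) for j in range(3)]
X_scaled = [list(values) for values in zip(*scaled_columns)]

kept, X_reduced = select_top_k(X_scaled, labels, k=1)
print(kept)  # [0]
```

Scaling to [0, 1] before scoring matters here because this form of the chi-squared statistic assumes non-negative feature values. A classifier trained on `X_reduced` sees fewer columns at prediction time, which is the mechanism behind the shorter testing time reported in the abstract.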
