Comparative Analysis of Feature Selection Methods with XGBoost for Malware Detection on the Drebin Dataset

Authors

  • Ines Aulia Latifah Department of Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Fauzi Adi Rafrastara Department of Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Jevan Bintoro Department of Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Wildanil Ghozi Department of Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Waleed Mahgoub Osman Mathematics Department, College of Education, Sudan University of Science and Technology, Sudan

DOI:

https://doi.org/10.32736/sisfokom.v13i3.2294

Keywords:

Android malware detection, Drebin, Information Gain, XGBoost, machine learning

Abstract

Malware, or malicious software, continues to evolve alongside the growth of cyberattacks targeting individual devices and critical infrastructure. Traditional detection methods, such as signature-based detection, are often ineffective against new or polymorphic malware, so more advanced detection methods are needed to counter these evolving threats. This study compares the performance of several feature selection methods combined with the XGBoost algorithm for malware detection on the Drebin dataset, and identifies the feature selection method that best improves accuracy and efficiency. The experimental results show that XGBoost with Information Gain achieves the highest accuracy, 98.7%, with faster training times than the other methods, Chi-Squared and ANOVA, which each reached 98.3%. Information Gain thus delivered the best performance in both accuracy and training-time efficiency, while Chi-Squared and ANOVA offered competitive but slightly lower results. These findings highlight that appropriate feature selection combined with machine learning algorithms can significantly improve malware detection accuracy, with potential benefit for real-world cybersecurity applications aimed at preventing harmful cyberattacks.
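The comparison described above can be sketched as follows. This is an illustrative approximation, not the authors' code: scikit-learn's GradientBoostingClassifier stands in for XGBoost (the xgboost library may not be installed), a small synthetic binary dataset stands in for Drebin, Information Gain is approximated by mutual information, and the feature count `k=15` is an arbitrary placeholder rather than a value from the paper.

```python
# Sketch: compare three feature-selection methods (Information Gain via mutual
# information, Chi-Squared, ANOVA) in front of a gradient-boosting classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for Drebin; abs() keeps features non-negative, which
# chi2 requires (Drebin's features are binary, so this holds there too).
X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           random_state=0)
X = np.abs(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

selectors = {
    "information_gain": mutual_info_classif,  # IG approximated by mutual info
    "chi_squared": chi2,
    "anova": f_classif,
}

results = {}
for name, score_fn in selectors.items():
    pipe = Pipeline([
        ("select", SelectKBest(score_fn, k=15)),          # keep top-15 features
        ("clf", GradientBoostingClassifier(random_state=0)),
    ])
    pipe.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, pipe.predict(X_te))

print(results)
```

On real data one would additionally time each pipeline's `fit` call, since the study's comparison covers training-time efficiency as well as accuracy.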


Published

2024-11-18

Section

Articles