Comparative Analysis of Feature Selection Methods with XGBoost for Malware Detection on the Drebin Dataset
DOI:
https://doi.org/10.32736/sisfokom.v13i3.2294
Keywords:
Android malware detection, Drebin, information gain, XGBoost, machine learning
Abstract
Malware, or malicious software, continues to evolve alongside the growing number of cyberattacks targeting individual devices and critical infrastructure. Traditional approaches such as signature-based detection are often ineffective against new or polymorphic malware, so more advanced detection methods are needed to counter these threats. This study compares the performance of several feature selection methods combined with the XGBoost algorithm for malware detection on the Drebin dataset, with the aim of identifying the method that best improves accuracy and efficiency. In the experiments, XGBoost with Information Gain achieved the highest accuracy, 98.7%, and trained faster than XGBoost with Chi-Squared or ANOVA, each of which reached an accuracy of 98.3%. Information Gain therefore gave the best results in both accuracy and training time, while Chi-Squared and ANOVA remained competitive with slightly lower accuracy. These results indicate that choosing an appropriate feature selection method for a machine learning algorithm can significantly improve malware detection accuracy, which may support real-world cybersecurity applications in preventing harmful cyberattacks.
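To make the Information Gain ranking step concrete, the following is a minimal sketch, not the authors' implementation. It assumes discrete (for example, binary presence/absence) features and uses only the Python standard library. The toy rows and labels are invented for illustration and are not taken from Drebin, and the function names `entropy`, `information_gain`, and `top_k_features` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_column, labels):
    """IG(Y; X) = H(Y) - sum over v of P(X=v) * H(Y | X=v), for a discrete feature X."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_column):
        subset = [y for x, y in zip(feature_column, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def top_k_features(rows, labels, k):
    """Return the indices of the k highest-scoring features and all scores."""
    n_features = len(rows[0])
    scores = [
        information_gain([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (assumption, for illustration only): 6 apps, 3 binary features.
# Label 1 = malware, 0 = benign. Feature 0 separates the classes perfectly.
rows = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 1],
]
labels = [1, 1, 1, 0, 0, 0]

selected, scores = top_k_features(rows, labels, k=2)
print("selected feature indices:", selected)
print("information gain per feature:", [round(s, 4) for s in scores])
```

In a fuller pipeline, the selected columns would then be passed to a classifier such as `xgboost.XGBClassifier`. Chi-Squared and ANOVA rankings could be obtained in the same way by swapping the scoring function, for instance with scikit-learn's `chi2` and `f_classif`. Those library choices are assumptions here, since the abstract does not state which implementations were used.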
License
The copyright of an article accepted for publication shall be assigned to Jurnal Sisfokom (Sistem Informasi dan Komputer) and LPPM ISB Atma Luhur as the publisher of the journal. Copyright includes the right to reproduce and deliver the article in all forms and media, including reprints, photographs, microfilms, and any other similar reproductions, as well as translations.
Jurnal Sisfokom (Sistem Informasi dan Komputer), LPPM ISB Atma Luhur, and the Editors make every effort to ensure that no wrong or misleading data, opinions, or statements are published in the journal. In any case, the contents of the articles and advertisements published in Jurnal Sisfokom (Sistem Informasi dan Komputer) are the sole and exclusive responsibility of their respective authors.
Jurnal Sisfokom (Sistem Informasi dan Komputer) has full publishing rights to the published articles. Authors are allowed to distribute published articles by sharing the link or DOI of the article. Authors are allowed to use their articles for any legal purpose deemed necessary without the written permission of the journal, provided that initial publication in Jurnal Sisfokom (Sistem Informasi dan Komputer) is acknowledged.
The Copyright Transfer Form can be downloaded: [Copyright Transfer Form Jurnal Sisfokom (Sistem Informasi dan Komputer)].
This agreement is to be signed by at least one of the authors, who must have obtained the assent of the co-author(s). After the agreement signed by the corresponding author has been submitted, changes of authorship or in the order of the authors listed will not be accepted. The copyright form should bear an original signature and be sent to the Editorial Office as a scanned document at sisfokom@atmaluhur.ac.id.