Supervised and Unsupervised Learning Techniques Utilizing Malware Datasets

EasyChair Preprint 9667
7 pages • Date: February 4, 2023

Abstract

Malware continues to gain momentum as it becomes more sophisticated at evading detection. Monitoring tools and antivirus software cannot keep up with the constant changes in these malicious variants. Because of these limitations, machine learning has gained popularity for classifying and detecting malware-related data. In this study, two separate datasets, Malware-Exploratory and CIC-MalMem-2022, undergo a series of supervised and unsupervised learning procedures, first to gather information for observation. The model developed in this research uses three clustering algorithms for analysis: K-Means, DBSCAN, and GMM. It also uses seven classification algorithms to predict malware: Decision Tree, Random Forest, AdaBoost, KNeighbors, Stochastic Gradient Descent, Extra Trees, and Gaussian Naïve Bayes. Results show that the Malware-Exploratory dataset averaged an accuracy score of 90%, while the CIC-MalMem-2022 dataset averaged a score of 99%. Both datasets also showed consistency across all three clustering algorithms. In addition, the variables do not necessarily need to be highly correlated for malware detection. Future studies will determine whether the results remain stable under feature selection and genetic algorithms.

Keyphrases: Gaussian Mixture Model (GMM), Supervised Machine Learning, Unsupervised Machine Learning, area under the curve-receiver operating characteristics (AUC-ROC), density-based spatial clustering of applications with noise (DBSCAN), hierarchical density-based spatial clustering of applications with noise (HDBSCAN)
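
The pipeline summarized in the abstract (three clustering algorithms followed by seven classifiers) can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the authors' implementation: the data is a synthetic stand-in generated with `make_classification` rather than Malware-Exploratory or CIC-MalMem-2022, and every hyperparameter (cluster counts, `eps`, `min_samples`, split ratio, random seeds) is an assumption chosen only so the example runs.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a malware dataset (assumption: binary labels,
# numeric features). Replace with the real feature matrix and labels.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scale features; the scaler is fit on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Unsupervised step: the three clustering algorithms named in the abstract.
# DBSCAN assigns the label -1 to points it treats as noise.
cluster_labels = {
    "KMeans": KMeans(n_clusters=2, n_init=10,
                     random_state=0).fit_predict(X_train_s),
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5).fit_predict(X_train_s),
    "GMM": GaussianMixture(n_components=2,
                           random_state=0).fit_predict(X_train_s),
}

# Supervised step: the seven classifiers named in the abstract,
# with default settings apart from fixed seeds.
classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
    "SGD": SGDClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
}

accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train_s, y_train)
    accuracy[name] = accuracy_score(y_test, clf.predict(X_test_s))

for name, score in accuracy.items():
    print(f"{name}: {score:.3f}")
```

Accuracy values from this sketch reflect only the synthetic data and say nothing about the 90% and 99% averages reported for the two real datasets.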