Original authors: Raja Khurram Shahzad, Muhammad Mustaqeem, Haroon Elahi
Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Technical Summary: A Hybrid Approach For Malware Classification Using Secondary Features Fusion
Problem Statement
The rapid evolution of malware, characterized by polymorphism, obfuscation, and zero-day variants, renders traditional detection methods insufficient. Existing anti-malware software often fails to detect variant samples or to classify them into specific families, which hinders effective mitigation. While machine learning (ML) has been applied to malware detection, challenges remain: features that do not generalize across families, class imbalance in datasets, and reliance on static or dynamic analysis alone. Furthermore, the widely used Microsoft Malware Classification Challenge dataset contains no benign examples, so it supports multi-class family classification but not binary detection (benign vs. malicious).
Methodology
The authors propose a hybrid approach addressing two distinct stages: feature engineering and modeling. The methodology involves the following steps:
Dataset Extension and Preparation:
- The study modifies the Microsoft Kaggle dataset by adding 1,609 benign disassembled files (`.asm`) to the existing 10,868 malware samples across nine families.
- This extension enables both binary classification (malware vs. benign) and multi-class classification (specific malware families).
- Stratified random sampling with replacement is employed to mitigate class imbalance issues inherent in the original dataset.
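The resampling step can be illustrated with a minimal sketch. It assumes that each class is drawn with replacement up to the same target size; the summary does not state the per-class sample size the authors used, so `per_class` and the toy family names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_resample(labels, per_class, seed=0):
    """Return sample indices, drawing per_class items with replacement from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    sampled = []
    for label in sorted(by_class):
        # random.choices samples with replacement, so small classes can be upsampled.
        sampled.extend(rng.choices(by_class[label], k=per_class))
    return sampled

# Toy, imbalanced label list (hypothetical class names).
labels = ["benign"] * 5 + ["family_a"] * 20 + ["family_b"] * 2
indices = stratified_resample(labels, per_class=10)

counts = defaultdict(int)
for i in indices:
    counts[labels[i]] += 1
print(dict(counts))
```

Because sampling is with replacement, minority classes such as the two-sample `family_b` above contribute repeated rows, which is one reason to resample only the training split.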
Feature Extraction:
- Primary Features: The system extracts Application Programming Interface (API) calls, Dynamic Link Library (DLL) imports, and Operation Code (OpCode) mnemonics from the `.text` section of disassembled files.
- Secondary Features:
  - OpCodes: Extracted as unigrams, filtered via a dictionary-based selection (removing irregular/custom OpCodes), and then transformed into fixed-length quad-grams and variable-length n-grams.
  - APIs and DLLs: Combinational analysis determined that bi-grams are the optimal size for these features, balancing accuracy and computational cost.
- Noise Reduction: A frequency analysis is conducted to discard features with low occurrence (threshold < 50), ensuring only representative features are retained.
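A rough sketch of the extraction steps above follows. The opcode whitelist, the sample tokens, and the choice to count frequency across all samples are assumptions made for illustration. The n-grams are also assumed to be contiguous token windows, which the summary does not confirm.

```python
from collections import Counter

# Hypothetical whitelist standing in for the paper's opcode dictionary.
KNOWN_OPCODES = {"mov", "push", "pop", "call", "add", "sub", "jmp", "ret", "xor", "cmp"}

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def opcode_quadgrams(opcodes):
    """Drop opcodes outside the dictionary, then form 4-grams."""
    kept = [op for op in opcodes if op in KNOWN_OPCODES]
    return ngrams(kept, 4)

def frequent_features(samples, min_count):
    """Keep features whose total count across all samples is at least min_count."""
    totals = Counter()
    for feats in samples:
        totals.update(feats)
    return {f for f, c in totals.items() if c >= min_count}

# "db_custom" is a made-up irregular token that the dictionary filter removes.
sample = ["push", "mov", "db_custom", "call", "add", "ret"]
grams = opcode_quadgrams(sample)
api_bigrams = ngrams(["CreateFileA", "WriteFile", "CloseHandle"], 2)
print(grams)
print(api_bigrams)
```

With the threshold from the text, the noise-reduction call would be `frequent_features(all_samples, min_count=50)`, where `all_samples` is a list of per-file feature lists.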
Feature Selection:
- A two-stage selection process is implemented:
  - Primary Selection: Dictionary-based filtering and frequency analysis to remove irregular and rare features.
  - Secondary Selection: Evaluation of filter (Shannon Entropy), wrapper (proposed Backward Selection using Random Forest and Regularized Greedy Forest), and embedded (Lasso, XGBoost) methods.
- A customized backward selection algorithm is proposed to iteratively remove the least important features until a minimum feature count is reached, optimizing the feature set for specific algorithms.
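The backward selection loop described above might look roughly like the sketch below. It is not the authors' algorithm: the importance function is pluggable, and a fixed toy weight table stands in for refitting a Random Forest (or Regularized Greedy Forest) each round. All feature names and weights are hypothetical.

```python
def backward_select(features, importance_fn, min_features, drop_per_round=1):
    """Repeatedly drop the least important features until min_features remain."""
    current = list(features)
    while len(current) > min_features:
        scores = importance_fn(current)  # dict: feature -> importance
        n_drop = min(drop_per_round, len(current) - min_features)
        ranked = sorted(current, key=lambda f: scores[f])
        dropped = set(ranked[:n_drop])
        current = [f for f in current if f not in dropped]
    return current

# Toy importances; a real run would retrain a model on `feats` and read its importances.
TOY_WEIGHTS = {
    "api_bigram_1": 0.40,
    "dll_bigram_1": 0.05,
    "quadgram_1": 0.30,
    "vargram_1": 0.25,
}

def toy_importance(feats):
    """Renormalize the fixed weights over the features still in play."""
    total = sum(TOY_WEIGHTS[f] for f in feats)
    return {f: TOY_WEIGHTS[f] / total for f in feats}

selected = backward_select(list(TOY_WEIGHTS), toy_importance, min_features=2)
print(selected)  # ['api_bigram_1', 'quadgram_1']
```

Recomputing importances after every removal is what distinguishes this wrapper-style loop from a one-shot ranking: a feature's importance can change once a correlated feature has been dropped.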
Feature Fusion:
- Instead of selecting a single best feature set, the authors perform feature fusion by taking the union of the best features from all representations (API bi-grams, DLL bi-grams, quad-grams, and variable-length grams) to create a comprehensive input matrix.
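As an illustration of the fusion step, the sketch below builds one fused column list from the features selected in each representation and maps a sample onto it. The view names, feature strings, and counts are invented for the example. Keying columns by `(view, feature)` is an assumption that keeps identically named features from different representations apart.

```python
def fuse(selected_by_view, sample_counts):
    """Return the fused column list and one sample's row of counts."""
    columns = []
    for view in sorted(selected_by_view):
        for feat in sorted(selected_by_view[view]):
            columns.append((view, feat))
    # Features absent from the sample get a count of 0.
    row = [sample_counts.get(view, {}).get(feat, 0) for view, feat in columns]
    return columns, row

# Hypothetical best features per representation.
selected = {
    "api_2gram": {"CreateFileA|WriteFile"},
    "dll_2gram": {"kernel32.dll|user32.dll"},
    "op_4gram": {"push|mov|call|add"},
}

# Hypothetical counts observed in one file; unselected features are ignored.
sample = {
    "api_2gram": {"CreateFileA|WriteFile": 3},
    "op_4gram": {"push|mov|call|add": 7, "mov|mov|mov|mov": 1},
}

columns, row = fuse(selected, sample)
print(columns)
print(row)  # [3, 0, 7]
```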
Algorithm Fusion (Ensemble):
- Ten base classifiers are evaluated, including CART, Naive Bayes, SVM, Logistic Regression, kNN, Neural Networks, Random Forest, AdaBoost, XGBoost, and LightGBM.
- A weighted voting-based ensemble is constructed using the top five performing classifiers.
- Weights for each classifier are determined using Sequential Least Squares Programming (SLSQP) to minimize log loss on the test set.
- The final prediction is derived by calculating the geometric mean of the weighted probability outputs from the ensemble members.
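The combination rule can be sketched as follows, under one reading of "geometric mean of the weighted probability outputs": a weighted geometric mean of each class probability, renormalized to sum to 1. The weights here are fixed by hand. In the paper they come from an SLSQP search, which could be run with `scipy.optimize.minimize(..., method="SLSQP")` using the `log_loss` function below as the objective. The probabilities and weights are made up.

```python
import math

EPS = 1e-15  # guards against log(0)

def weighted_geometric_mean(prob_rows, weights):
    """Fuse per-classifier class-probability vectors for a single sample."""
    n_classes = len(prob_rows[0])
    total_w = sum(weights)
    fused = []
    for c in range(n_classes):
        log_sum = sum(w * math.log(max(p[c], EPS)) for p, w in zip(prob_rows, weights))
        fused.append(math.exp(log_sum / total_w))
    norm = sum(fused)
    return [v / norm for v in fused]

def log_loss(true_idx, probs_per_sample):
    """Mean negative log-probability assigned to the true class."""
    total = sum(math.log(max(p[t], EPS)) for t, p in zip(true_idx, probs_per_sample))
    return -total / len(true_idx)

# Three hypothetical classifiers scoring one sample over three classes.
preds = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
weights = [0.5, 0.3, 0.2]

fused = weighted_geometric_mean(preds, weights)
predicted_class = fused.index(max(fused))
loss = log_loss([0], [fused])
print(predicted_class, round(loss, 4))
```

Compared with an arithmetic average, a geometric mean penalizes a class more strongly when any heavily weighted classifier assigns it a low probability.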
Key Contributions
- Dataset Modification: Extending the Microsoft dataset with benign samples to facilitate both binary and multi-class classification tasks.
- Feature Engineering: Utilizing a combination of API calls, DLL imports, and OpCode n-grams (specifically quad-grams and variable-length grams) as primary and secondary features.
- Customized Feature Selection: Proposing a backward selection algorithm and evaluating a hybrid approach that combines filter, wrapper, and embedded methods to identify the most valuable features.
- Dual Fusion Strategy: Implementing both feature fusion (combining diverse feature sets) and algorithm fusion (weighted voting ensemble) to enhance detection robustness.
- Comprehensive Evaluation: Providing a detailed comparison against state-of-the-art methods, including the winners of the original Microsoft Kaggle challenge and other recent studies.
Experimental Results
The proposed method was evaluated on a standard hardware setup (Intel i7-8700, 16GB RAM) without GPU acceleration.
- Performance Metrics: The ensemble model achieved an accuracy of 99.72%, an Area Under the Curve (AUC) of 0.989, and a log loss of 0.01.
- Comparison with State-of-the-Art:
- Compared to the winners of the original Microsoft Kaggle competition (who achieved a log loss of ~0.0023), the proposed model achieved a slightly higher log loss (0.01) but with significantly lower computational resource requirements (standard desktop vs. Google Compute Engine with 104GB memory).
- The authors argue that the winning team's approach relied heavily on encrypted file features and hard-coded hyperparameters specific to the competition, potentially limiting generalizability. In contrast, the proposed approach uses features (API, DLL, variable-length n-grams) that are traceable to file functionality and generalize better.
- Compared to a study by Ahmadi et al. (2016), the proposed method offers better generalizability by avoiding features that vary significantly with dataset changes (e.g., file size-dependent instruction counts) and by using a more robust feature selection process.
Significance and Claims
The paper claims that the proposed hybrid approach effectively automates malware detection and family classification. The significance lies in the demonstration that:
- Feature Fusion of secondary features (n-grams) with primary features (API/DLL) creates a more robust input matrix than using any single feature type.
- Algorithm Fusion via a weighted voting ensemble outperforms individual base classifiers, achieving high accuracy even on resource-constrained machines.
- The approach is generalizable and practical for real-world deployment, as it does not rely on the massive computational resources or competition-specific feature engineering (like pixel intensity of encrypted files) used by top-tier Kaggle solutions.
- The inclusion of benign files allows for a complete security workflow: first determining if a file is malicious, and subsequently identifying its specific family for targeted mitigation.
The authors conclude that while their log loss is slightly higher than the competition winner's, their method offers a more sustainable, generalizable, and resource-efficient solution for malware classification. Future work is planned to investigate fusion between hexadecimal and disassembled data features and to include encrypted samples in the training set.