Pre-prints
On the Effectiveness of Adversarial Training on Malware Classifiers
H. Bostani, J. Cortellazzi, D. Arp, F. Pierazzi, V. Moonsamy and L. Cavallaro
Under peer review (2024).
Adversarial Training (AT) has been widely applied to harden learning-based classifiers against adversarial evasive attacks. However, its effectiveness in identifying and strengthening vulnerable areas of the model's decision space while maintaining high performance on clean data of malware classifiers remains an under-explored area. In this context, the robustness that AT achieves has often been assessed against unrealistic or weak adversarial attacks, which negatively affect performance on clean data and are arguably no longer threats. Previous work seems to suggest robustness is a task-dependent property of AT. We instead argue it is a more complex problem that requires exploring AT and the intertwined roles played by certain factors within data, feature representations, classifiers, and robust optimization settings, as well as proper evaluation factors, such as the realism of evasion attacks, to gain a true sense of AT's effectiveness. In our paper, we address this gap by systematically exploring the role such factors have in hardening malware classifiers through AT. Contrary to recent prior work, a key observation of our research and extensive experiments confirm the hypotheses that all such factors influence the actual effectiveness of AT, as demonstrated by the varying degrees of success from our empirical analysis. We identify five evaluation pitfalls that affect state-of-the-art studies and summarize our insights in ten takeaways to draw promising research directions toward better understanding the factors' settings under which adversarial training works at best.
2025
Beyond Learning Algorithms: The Crucial Role of Data in Robust Malware Detection
H. Bostani and V. Moonsamy
Accepted to the IEEE Security & Privacy Magazine on 5 March 2025
In the battle against evolving cyber threats, robust malware detection with machine learning requires more than just advanced algorithms—it demands high-quality data. This article emphasizes data quality's role and how leveraging coresets—constructed from key samples that preserve the datasets’ properties—to enrich training sets can enhance malware classifiers’ resilience.
Level Up with ML Vulnerability Identification: Leveraging Domain Constraints in Feature Space for Robust Android Malware Detection
H. Bostani, Z. Zhao, Z. Liu, and V. Moonsamy
ACM Transactions on Privacy and Security (TOPS), vol. 28, No. 2, pp. 1–32, 2025
Machine Learning (ML) promises to enhance the efficacy of Android Malware Detection (AMD); however, ML models are vulnerable to realistic evasion attacks—crafting realizable Adversarial Examples (AEs) that satisfy Android malware domain constraints. To eliminate ML vulnerabilities, defenders aim to identify susceptible regions in the feature space where ML models are prone to deception. The primary approach to identifying vulnerable regions involves investigating realizable AEs, but generating these feasible apps poses a challenge. For instance, previous work has relied on generating either feature-space norm-bounded AEs or problem-space realizable AEs in adversarial hardening. The former is efficient but lacks full coverage of vulnerable regions while the latter can uncover these regions by satisfying domain constraints but is known to be time-consuming. To address these limitations, we propose an approach to facilitate the identification of vulnerable regions. Specifically, we introduce a new interpretation of Android domain constraints in the feature space, followed by a novel technique that learns them. Our empirical evaluations across various evasion attacks indicate effective detection of AEs using learned domain constraints, with an average of 89.6%. Furthermore, extensive experiments on different Android malware detectors demonstrate that utilizing our learned domain constraints in Adversarial Training (AT) outperforms other AT-based defenses that rely on norm-bounded AEs or state-of-the-art non-uniform perturbations. Finally, we show that retraining a malware detector with a wide variety of feature-space realizable AEs results in a 77.9% robustness improvement against realizable AEs generated by unknown problem-space transformations, with up to 70x faster training than using problem-space realizable AEs.
2024
Improving Adversarial Robustness in Android Malware Detection by Reducing the Impact of Spurious Correlations
Hamid Bostani, Zhengyu Zhao, and Veelasha Moonsamy
ESORICS 2024 International Workshops
Machine learning (ML) has demonstrated significant advancements in Android malware detection (AMD); however, the resilience of ML against realistic evasion attacks remains a major obstacle for AMD. One of the primary factors contributing to this challenge is the scarcity of reliable generalizations. Malware classifiers with limited generalizability tend to overfit spurious correlations derived from biased features. Consequently, adversarial examples (AEs), generated by evasion attacks, can modify these features to evade detection. In this study, we propose a domain adaptation technique to improve the generalizability of AMD by aligning the distribution of malware samples and AEs. Specifically, we utilize meaningful feature dependencies, reflecting domain constraints in the feature space, to establish a robust feature space. Training on the proposed robust feature space enables malware classifiers to learn from predefined patterns associated with app functionality rather than from individual features. This approach helps mitigate spurious correlations inherent in the initial feature space. Our experiments conducted on DREBIN, a renowned Android malware detector, demonstrate that our approach surpasses the state-of-the-art defense, Sec-SVM, when facing realistic evasion attacks. In particular, our defense can improve adversarial robustness by up to 55% against realistic evasion attacks compared to Sec-SVM.
EvadeDroid: A practical evasion attack on machine learning for black-box Android malware detection
H. Bostani and V. Moonsamy
Computers & Security, vol. 139, pp. 1–18, 2024
Over the last decade, researchers have extensively explored the vulnerabilities of Android malware detectors to adversarial examples through the development of evasion attacks; however, the practicality of these attacks in real-world scenarios remains arguable. The majority of studies have assumed attackers know the details of the target classifiers used for malware detection, while in reality, malicious actors have limited access to the target classifiers. This paper introduces EvadeDroid, a problem-space adversarial attack designed to effectively evade black-box Android malware detectors in real-world scenarios. EvadeDroid constructs a collection of problem-space transformations derived from benign donors that share opcode-level similarity with malware apps by leveraging an n-gram-based approach. These transformations are then used to morph malware instances into benign ones via an iterative and incremental manipulation strategy. The proposed manipulation technique is a query-efficient optimization algorithm that can find and inject optimal sequences of transformations into malware apps. Our empirical evaluations, carried out on 1K malware apps, demonstrate the effectiveness of our approach in generating real-world adversarial examples in both soft- and hard-label settings. Our findings reveal that EvadeDroid can effectively deceive diverse malware detectors that utilize different features with various feature types. Specifically, EvadeDroid achieves evasion rates of 80%–95% against DREBIN, Sec-SVM, ADE-MA, MaMaDroid, and Opcode-SVM with only 1–9 queries. Furthermore, we show that the proposed problem-space adversarial attack is able to preserve its stealthiness against five popular commercial antiviruses with an average of 79% evasion rate, thus demonstrating its feasibility in the real world.
2023
Targeted and Troublesome: Tracking and Advertising on Children's Websites
Z. Moti, A. Senol, H. Bostani, F. Zuiderveen Borgesius, V. Moonsamy, A. Mathur, and G. Acar
Proceedings of the 45th IEEE Symposium on Security and Privacy (IEEE S&P 2024)
On the modern web, trackers and advertisers frequently construct and monetize users’ detailed behavioral profiles without consent. Despite various studies on web tracking mechanisms and advertisements, there has been no rigorous study focusing on websites targeted at children. To address this gap, we present a measurement of tracking and (targeted) advertising on websites directed at children. Motivated by the lack of a comprehensive list of child-directed (i.e., targeted at children) websites, we first build a multilingual classifier based on web page titles and descriptions. Applying this classifier to over two million pages from the Common Crawl dataset, we compile a list of two thousand child-directed websites. Crawling these sites from five vantage points, we measure the prevalence of trackers, fingerprinting scripts, and advertisements. Our crawler detects ads displayed on child-directed websites and determines if ad targeting is enabled by scraping ad disclosure pages whenever available. Our results show that around 90% of child-directed websites embed one or more trackers, and about 27% contain targeted advertisements—a practice that should require verifiable parental consent. Next, we identify improper ads on child-directed websites by developing an ML pipeline that processes both images and text extracted from ads. The pipeline allows us to run semantic similarity queries for arbitrary search terms, revealing ads that promote services related to dating, weight loss, and mental health, as well as ads for sex toys and flirting chat services. Some of these ads feature repulsive, sexually-explicit and highly-inappropriate imagery. In summary, our findings indicate a trend of non-compliance with privacy regulations and troubling ad safety practices among many advertisers and child-directed websites. To ensure the protection of children and create a safer online environment, regulators and stakeholders must adopt and enforce more stringent measures.
2022
Chapter 5 - Hybrid and modified OPFs for intrusion detection systems and large-scale problems
M. Sheikhan, H. Bostani
In Optimum-Path Forest, pp. 109-136, Academic Press, Elsevier, 2022.
In this chapter, in order to show the efficiency of OPF in the intrusion detection systems (IDSs) and also in the large-scale problems, we introduce five hybrid and modified OPFs as follows: (a) a modified OPF using unsupervised learning and social network concept; (b) a hybrid IDS using unsupervised OPF based on MapReduce approach; (c) a hybrid IDS using a modified OPF (MOPF) and selected input features; (d) a modified OPF using Markov cluster process algorithm; and (e) a modified OPF based on the coreset concept. Furthermore, the MOPF-based IDS is improved in the last section as a contribution by using an outperformed clustering algorithm.
2020
A Strong Coreset Algorithm to Accelerate OPF as a Graph-based Machine Learning in Large-Scale Problems
H. Bostani, M. Sheikhan, B. Mahboobi
Information Sciences, vol. 555, pp. 424–441, 2021
Optimum-path forest (OPF) is one of the efficient graph-based frameworks that can determine the patterns of input dataset by extracting the optimal partitions of graph obtained through encoding data into a graph. Since OPF was introduced based on simple assumptions without considering the requirements of large-scale problems, this machine learning is an effective algorithm only for a reasonable size of input datasets. To provide a scalable OPF, this study introduces a strong coreset for accelerating OPF algorithm. Applying this approach can expedite OPF procedure, especially when it is working on massive datasets. Accordingly, a novel algebra is developed to represent the problem of OPF as an optimization problem for the proposed coreset definition. A novel coreset construction algorithm that can approximate the OPF solutions is subsequently proposed in order to improve the OPF construction speed. The simulation results of diverse experiments on various benchmark datasets illustrate computation gain and superiority of the proposed algorithm in terms of the construction and classification speeds as compared to the original algorithm while displaying reliably accurate performance. The presented coreset construction algorithm performs the training and testing phases of OPF up to 6.1 and 4.9 times faster than before, respectively.
2017
Developing a Fast Supervised Optimum-path Forest Based on Coreset
H. Bostani, M. Sheikhan, B. Mahboobi
In Proc. 19th International Symposium on Artificial Intelligence and Signal Processing (AISP’2017), pp. 172-177, 2017
Optimum-path forest (OPF) is an effective graph-based machine learning that simplifies the pattern
recognition problems into the partitioning the corresponding derived graphs of the input datasets.
The amounts of the samples in the input datasets and, consequently the size of the node set of their
corresponding derived graphs has a major effect on the speed of OPF. In this study a novel version of
OPF is introduced which utilizes coreset approach for reducing the scale of the input dataset. From the
aspect of the computational geometry, coreset is a small set of points that includes the best representative
points of the original point set with regard to a geometric objective function. Our method finds the most
informative vertices (samples) by proposing a novel incremental coreset construction algorithm. The
experimental results of the proposed method reduces the input data samples, and the execution times of
the construction and the classification phases of OPF by 80%, 60%, and 12%, respectively, in contrast to
the traditional OPF.
Hybrid of Anomaly-Based and Specification-Based IDS for Internet of Things Using Unsupervised OPF based on Map-Reduce Approach
H. Bostani, M. Sheikhan
Computer Communications, vol. 98, pp. 52-71, 2017
Internet of Things (IoT) is a novel paradigm in computer networks in which resource-constrained
objects connect to unreliable Internet by using a wide range of technologies. The insecure nature of
the Internet and wireless sensor networks, that are the main components of IoT, make IoT vulnerable to
different attacks, especially routing attacks (as insider attacks). A novel real-time hybrid intrusion
detection framework is proposed in this study that consists of anomaly-based and specification-based
intrusion detection modules for detecting two well-known routing attacks in IoT called sinkhole and
selective-forwarding attacks. For this purpose, the specification-based intrusion detection agents,
that are located in the router nodes, analyze the behavior of their host nodes and send their local
results to the root node through normal data packets. In addition, an anomaly-based intrusion detection
agent, that is located in the root node, employs the unsupervised optimum-path forest algorithm for
projecting clustering models by using incoming data packets. This agent, which is based on the MapReduce
architecture, can work in a distributed platform for projecting clustering models and consequently parallel
detecting of anomalies as a global detection approach. The proposed method makes decision about suspicious
behavior by using a voting mechanism. Notably, the proposed method is also extended to detect wormhole
attack. The deployment of the hybrid proposed model is investigated in a smart-city scenario by an existing
platform, as well. The free network's scale and the ability to identify malicious nodes are two key features
of the proposed framework that are evaluated through different experiments in this study. The experimental
results of simulated scenarios showed that the proposed hybrid method can achieve true positive rate
of 76.19% and false positive rate of 5.92% when both sinkhole and selective-forwarding attacks were
launched simultaneously. These rates in detecting wormhole attack are 96.02% and 2.08%, respectively.
Modifying Supervised Optimum-Path Forest in Intrusion Detection Systems Using Social Network Approaches and Unsupervised Learning
H. Bostani, M. Sheikhan
Pattern Recognition, vol. 62, pp. 56-72, 2017
Optimum-path forest (OPF) is a graph-based machine learning method that can overcome some limitations of
the traditional machine learning algorithms that have been used in intrusion detection systems. This paper
presents a novel approach for intrusion detection using a modified OPF (MOPF) algorithm for improving the
performance of traditional OPF in terms of detection rate (DR), false alarm rate (FAR), and time of execution.
To address the problem of scalability in large datasets and also for achieving high attack recognition rates,
the proposed framework employs the k-means clustering algorithm, as a partitioning module, for generating
different homogeneous training subsets from original heterogeneous training samples. In the proposed MOPF
algorithm, the distance between unlabeled samples and the root (prototype) of every sample in OPF is also
considered in classifying unlabeled samples with the aim of improving the accuracy rate of traditional OPF
algorithm. Moreover, the centrality and the prestige concepts in the social network analysis are employed in
a pruning module for determining the most informative samples in training subsets to speed up the traditional
OPF algorithm. The experimental results on NSL-KDD dataset show that the proposed method performs better than
traditional OPF in terms of accuracy rate, DR, FAR, and cost per example (CPE) evaluation metrics.
A Security Mechanism for Detecting Intrusions in Internet of Things Using Selected Features Based on MI-BGSA
M. Sheikhan, H. Bostani
International Journal of Information & Communication Technology Research, vol. 9, no. 2, pp. 53-62, 2017
Internet of things (IoT) is a novel emerging approach in computer networks wherein all heterogeneous objects
around us, which usually are resource-constrained objects, can connect to each other and also the Internet by
using a broad range of technologies. IoT is a hybrid network which includes the Internet and also wireless sensor
networks (WSNs) as the main components of IoT; so, implementing security mechanisms in IoT seems necessary.
This paper introduces a novel intrusion detection architecture model for IoT that provides the possibility of
distributed detection. The proposed hybrid model uses anomaly and misuse intrusion detection agents based on
the supervised and unsupervised optimum-path forest models for providing the ability to detect internal and
externals attacks, simultaneously. The number of input features to the proposed classifier is reduced by a hybrid
feature selection algorithm, as well. The experimental results of simulated scenarios show the superior
performance of proposed security mechanism in multi-faceted detection.
2016
Binary Gravitational Search Algorithm (BGSA): Improved Efficiency
M. Sheikhan, H. Bostani
Encyclopedia of Information Assurance, 2016
Today, detecting anomalous traffic and preventing it in computer networks has become increasingly
important for the community of security researchers. An intrusion detection system (IDS) is an effective
tool for reaching high security. This is a software tool for analyzing system behavior or network traffic as
input data to detect deviations from normal behavior. With the development of computer networks, highdimensional
input data analysis has become a huge problem in IDSs. One solution for overcoming this
problem is feature selection, which is a process for selecting an optimal subset of features. Populationbased
heuristic search algorithms have been widely used for this optimization problem. This entry presents
a novel feature selection method based on a binary gravitational search algorithm (BGSA). The proposed
method, which is called modified BGSA (MBGSA), uses BGSA for performing the global search to find
the best subset of features through the wrapper method. Moreover, for improving the efficiency of BGSA,
mutual information (MI) feature selector under the uniform information distribution (MIFS-U) method,
which works as a filter method, is integrated into BGSA as the inner optimization layer. In fact, with the
computation of the relevance between each selected feature and the target class and the redundancy between
the selected features (in the feature subset generated by the wrapper), MIFS-U will find more valuable
features that have maximum relevance to the target class and minimum redundancy to each other. The
experimental results on NSL-KDD dataset using different classifiers show that the proposed method can
find better subset features and achieve higher accuracy and an improved detection rate using fewer features
as compared to standard BGSA and binary particle swarm optimization (BPSO) feature selection methods.
Modification of Optimum-Path Forest using Markov Cluster Process Algorithm
H. Bostani, M. Sheikhan
In Proc. 2nd International Conference on Signal Processing and Intelligent Systems (ICSPIS’2016), pp. 1-5, 2016 (Winner of the Outstanding Paper Award)
Optimum-path forest (OPF) is a novel supervised graph-based classifier which reduces the
classification problem into partitioning of vertices in a graph derived from the data samples.
One of the main processes in OPF is identifying the optimum set of key samples named prototypes.
This process is based on creating a minimum spanning tree on a complete weighted graph which is
derived from the training samples; hence, it is much time-consuming for large-scale problems.
In this study, for overcoming this limitation, the process of finding the prototypes in
traditional OPF is modified by using Markov cluster (MCL) algorithm. The graph partitioning
in MCL is based on finding key samples named attractors, which attract other related samples;
so the obtained attractors can be selected as prototypes for generating optimum-path trees.
Experiments on public benchmark datasets show that the speed of proposed modified OPF is
improved considerably as compared to the traditional OPF.
A Hybrid Intrusion Detection Architecture for Internet of Things
M. Sheikhan, H. Bostani
In Proc. 8th International Symposium on Telecommunication (IST’2016), pp. 601-606, 2016 (Selected as one of the Best Paper)
In computer networks, Internet of things (IoT) is an emerging paradigm wherein smart and resource-constrained
objects can connect to Internet by using a wide range of technologies. Due to the insecure nature of
Internet and also wireless sensor networks (WSNs), which are the main components of IoT, implementing security
mechanisms in IoT seems necessary. To deal with intrusions which may occur in IoT, a novel intrusion detection
architecture model for IoT is proposed in this paper. This model is based on MapReduce approach with the aim of
distributed detection. To provide multi-faceted detection (from the Internet and WSNs sides), the proposed model
consists of anomaly-based and misuse-based intrusion detection agents that use supervised and unsupervised
optimum-path forest model for intrusion detection. The experimental results of simulated scenarios show the
superior performance of proposed method in intrusion detection for IoT.
2015
Hybrid of Binary Gravitational Search Algorithm and Mutual Information for Feature Selection in Intrusion Detection Systems
H. Bostani, M. Sheikhan
Soft Computing, vol. 21, no. 9, pp. 2307-2324, 2017
Intrusion detection systems (IDSs) play an important role in the security of computer
networks. One of the main challenges in IDSs is the high-dimensional input data analysis.
Feature selection is a solution to overcoming this problem. This paper presents a hybrid
feature selection method using binary gravitational search algorithm (BGSA) and mutual
information (MI) for improving the efficiency of standard BGSA as a feature selection algorithm.
The proposed method, called MI-BGSA, used BGSA as a wrapper-based feature selection method for
performing global search. Moreover, MI approach was integrated into the BGSA, as a filter-based
method, to compute the feature–feature and the feature–class mutual information with the aim of
pruning the subset of features. This strategy found the features considering the least
redundancy to the selected features and also the most relevance to the target class.
A two-objective function based on maximizing the detection rate and minimizing the false
positive rate was defined as a fitness function to control the search direction of the
standard BGSA. The experimental results on the NSL-KDD dataset showed that the proposed
method can reduce the feature space dramatically. Moreover, the proposed algorithm found
better subset of features and achieved higher accuracy and detection rate as compared to
the some standard wrapper-based and filter-based feature selection methods.
|