Because large data sets contain many malware samples, malware detection
techniques have been introduced. Machine learning techniques are being
applied to classify applications, with a focus on malware detection. Android
has shown impressive growth in the smartphone domain. Hence, to cope with
the resulting volume of malware, it is better to group malware samples with
structural similarities. Clustering is an important machine learning
technique for Android applications: it classifies applications automatically
by categorizing malware. Clustering keeps similar applications in one
cluster, and it gives good results in information retrieval. The following
steps can be included in the process of clustering applications:
The Android manifest file specifies the permissions needed by the application.
These files ask for permission to access restricted elements of the Android
operating system, such as hardware devices and contacts. To cluster the
malware behavior ...
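The permission step can be illustrated with a minimal sketch. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format), and the sample manifest, the `VOCABULARY` list, and both helper functions are hypothetical names introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace used for android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def extract_permissions(manifest_xml):
    """Return the set of permission names requested in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{" + ANDROID_NS + "}name"
    return {
        elem.attrib[name_attr]
        for elem in root.iter("uses-permission")
        if name_attr in elem.attrib
    }


def to_feature_vector(permissions, vocabulary):
    """Encode a permission set as a 0/1 vector over a fixed vocabulary."""
    return [1 if name in permissions else 0 for name in vocabulary]


# Hypothetical manifest, for illustration only.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <uses-permission android:name="android.permission.READ_CONTACTS"/>
    <uses-permission android:name="android.permission.CAMERA"/>
</manifest>"""

# Assumed, fixed ordering of the permissions used as features.
VOCABULARY = [
    "android.permission.CAMERA",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
]

permissions = extract_permissions(SAMPLE_MANIFEST)
vector = to_feature_vector(permissions, VOCABULARY)
```

A binary vector of this kind is one possible input representation for the clustering algorithms; the text does not state which encoding is actually used.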
It uses the agglomerative hierarchical clustering
algorithm. Starting from N singleton clusters, this process successively merges
the two nearest clusters until a single cluster is formed. At each merge
iteration the best number of clusters is determined. The K-medoids algorithm
can also be used to create a partition instead of agglomerative hierarchical
clustering.
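The merging process can be sketched as follows. The text does not say how "nearest" is measured, so this sketch assumes single linkage over Hamming distance on binary feature vectors; the sample `points` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def single_link(c1, c2, points):
    """Cluster distance (assumed): distance of the closest pair of members."""
    return min(hamming(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Merge the two nearest clusters until only one remains.

    Returns every partition seen along the way, from N singleton
    clusters down to a single cluster.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    history = [list(clusters)]
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(list(clusters))
    return history


# Hypothetical binary feature vectors for four applications.
points = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
history = agglomerate(points)
```

Keeping the whole `history` makes it possible to score each intermediate partition and pick a number of clusters afterwards; the scoring criterion is not given in the text, so none is implemented here.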
5.2.1 Weighted K-Medoids
This approach dynamically assigns a weight to each feature of a malware
sample. It detects the common features in the data set as well as clusters
hidden in subspaces. The importance of a feature to a cluster can be
estimated by how consistent its values are across the samples in the
cluster and by how well its values differentiate samples in different clusters.
A feature is important if it varies little within a cluster and varies
widely between clusters. One issue with K-medoids is that it may not
produce the desired number of clusters.
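The weighting idea can be illustrated with a small sketch. The ratio of between-cluster to within-cluster variance used below is an assumption chosen for illustration, not the formula of the weighted K-medoids method itself, and `feature_weights`, `weighted_hamming`, and the sample data are hypothetical.

```python
from statistics import fmean, pvariance


def feature_weights(samples, labels, eps=1e-9):
    """Weight each feature by between-cluster over within-cluster variance.

    Illustrative assumption: a feature scores high when its values are
    consistent inside each cluster and differ between clusters.
    """
    n_features = len(samples[0])
    cluster_ids = sorted(set(labels))
    scores = []
    for f in range(n_features):
        groups = [
            [s[f] for s, lab in zip(samples, labels) if lab == c]
            for c in cluster_ids
        ]
        within = fmean([pvariance(g) for g in groups])
        between = pvariance([fmean(g) for g in groups])
        scores.append(between / (within + eps))
    total = sum(scores)
    if total == 0:
        # No feature separates the clusters: fall back to uniform weights.
        return [1.0 / n_features] * n_features
    return [s / total for s in scores]


def weighted_hamming(a, b, weights):
    """Hamming distance in which each mismatch counts by its feature weight."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


# Hypothetical samples: feature 0 separates the clusters, feature 1 is
# noise inside each cluster, feature 2 is constant everywhere.
samples = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 1]]
labels = [0, 0, 1, 1]
weights = feature_weights(samples, labels)
```

In this toy data nearly all of the weight falls on feature 0, so a medoid search using `weighted_hamming` would largely ignore the other two features.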
5.2.2 K-means Clustering algorithm
The K-means algorithm is a simple way to partition a data set into k clusters.
k centroids are defined, one for each cluster. The algorithm chooses the
initial centroids randomly from the set of applications. The next step is to
take each application in the data set and associate it with the nearest
centroid. The centroids are then recomputed and the assignment is repeated
until it no longer changes. We extract features from many Android
applications and use them in the clustering technique. Precision and recall
measure the performance of clustering Android applications to detect
malicious applications in a large set. Precision measures how well the
clustering algorithm assigns samples with different features to different
clusters. Recall measures how well the clustering algorithm recognizes
similar samples. With these performance measures we may even detect threats
without installing and running the applications. False positive and false
negative values have to be minimized for better performance. Clustering is
difficult because it lacks supervision information. Because of random
initialization, a clustering algorithm may produce different results each
time it is run.
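The K-means procedure and the two performance measures can be sketched together. The sketch assumes squared Euclidean distance and one common definition of clustering precision and recall (largest overlap per cluster and per reference family, divided by the number of samples); the text does not give exact formulas, and the `points` and `families` below are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means with centroids initialised from randomly chosen samples."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def precision_recall(labels, families):
    """Clustering precision and recall against known reference families."""
    n = len(labels)
    overlap = Counter(zip(labels, families))
    best_in_cluster = Counter()
    best_in_family = Counter()
    for (lab, fam), count in overlap.items():
        best_in_cluster[lab] = max(best_in_cluster[lab], count)
        best_in_family[fam] = max(best_in_family[fam], count)
    return sum(best_in_cluster.values()) / n, sum(best_in_family.values()) / n


# Hypothetical binary feature vectors and reference families.
points = [
    [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1],
]
families = ["A", "A", "A", "B", "B", "B"]

# Different seeds give different initial centroids and may give
# different partitions of the same data.
for seed in range(3):
    labels = kmeans(points, k=2, seed=seed)
    precision, recall = precision_recall(labels, families)
    print(seed, labels, round(precision, 2), round(recall, 2))
```

Fixing the seed makes a run reproducible, and comparing several seeds is a simple way to see how sensitive the result is to initialization.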