My research is about Android malware detection. This repository provides APIs and scripts for static analysis. Most of the code is written in Python for rapid development.
The workflow of static analysis is the same as in most machine learning applications: acquire a dataset, extract features, and classify the samples.
There are several ways to get APK files. You can download them either from Google Play or from third-party app stores such as Wandoujia. Downloading from Google Play may require extra tools such as APKs Downloader. Note that the daily quota of a Google account is about 300 to 500 APK files, so collecting ten thousand apps with a single account takes roughly three to five weeks.
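The quota arithmetic can be checked with a few lines of Python. This is only a rough sketch: the function name `days_to_collect` and the `accounts` parameter are illustrative assumptions, and the quota figures are the approximate ones quoted above.

```python
import math


def days_to_collect(target_apps, daily_quota, accounts=1):
    """Estimate the days needed to download `target_apps` APK files.

    Assumes every account can fetch `daily_quota` files per day and that
    the accounts download in parallel (a simplification).
    """
    per_day = daily_quota * accounts
    return int(math.ceil(target_apps / float(per_day)))


# Worst and best case for a single account with a 300 to 500 file quota.
slow = days_to_collect(10000, 300)   # 34 days
fast = days_to_collect(10000, 500)   # 20 days
```

Spreading the downloads over several accounts shortens the wait proportionally, for example `days_to_collect(10000, 300, accounts=4)` gives 9 days.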
After downloading, we need to label each APK file as malicious or benign. Some papers treat all samples from Google Play as benign and take their malicious samples from the Android Malware Genome Project. However, this scheme is too coarse, since malicious applications may exist in Google Play. Google Bouncer is not almighty.
VirusTotal is another common way to do this job. It provides a public API for scanning arbitrary files online with several anti-virus scanners. The Drebin paper labels an application as malicious if at least two of ten common scanners detect it, and combines those samples with the Android Malware Genome Project dataset. Another paper, MAMA, chooses the detection ratio across all scanners empirically.
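A Drebin-style threshold rule could be sketched as follows. The input format is an assumption: `scan_results` is a plain dict mapping scanner names to booleans, not the actual VirusTotal response schema, and the scanner names are placeholders rather than the ten scanners used in the paper.

```python
def label_by_detections(scan_results, scanners, min_detections=2):
    """Label a sample from per-scanner verdicts.

    scan_results   -- dict: scanner name -> True if that scanner flagged the file
                      (hypothetical format, not VirusTotal's own schema)
    scanners       -- the scanners that are allowed to vote
    min_detections -- how many of them must flag the file
    """
    hits = sum(1 for name in scanners if scan_results.get(name, False))
    return "malicious" if hits >= min_detections else "benign"


# Placeholder scanner names; substitute the ten scanners you trust.
TRUSTED = ["scanner_%02d" % i for i in range(1, 11)]

one_hit = {"scanner_01": True}
two_hits = {"scanner_01": True, "scanner_07": True}

label_a = label_by_detections(one_hit, TRUSTED)    # "benign"
label_b = label_by_detections(two_hits, TRUSTED)   # "malicious"
```

Keeping the threshold as a parameter makes it easy to try a stricter or looser rule, which is closer in spirit to tuning the detection ratio empirically.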
Everything in an APK file can be a feature. The most basic and common features are permissions and APIs. The latter have dominated for almost two years thanks to their finer granularity.
You can use either the Python-based Androguard or the Java-based Baksmali to disassemble APK files and extract the features you need. I chose the former because writing Python scripts is more intuitive for me.
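As a minimal sketch of permission-based features, the code below turns permission sets into binary vectors. The helper names are illustrative and not part of this repository. `extract_permissions` assumes the Androguard 1.9 module layout (`androguard.core.bytecodes.apk`), which differs in later releases; the other two helpers are plain Python.

```python
def extract_permissions(apk_path):
    """Return the permissions requested by one APK file.

    Assumes Androguard 1.9 is on PYTHONPATH; imported lazily so the
    helpers below also work without it.
    """
    from androguard.core.bytecodes import apk
    return set(apk.APK(apk_path).get_permissions())


def build_vocabulary(permission_sets):
    """Collect every permission seen in the dataset, in a stable order."""
    return sorted(set().union(*permission_sets))


def to_vector(permissions, vocabulary):
    """Encode one sample as a 0/1 vector over the vocabulary."""
    return [1 if name in permissions else 0 for name in vocabulary]


# Hand-written permission sets standing in for real extraction results.
samples = [
    set(["android.permission.INTERNET", "android.permission.SEND_SMS"]),
    set(["android.permission.INTERNET"]),
]
vocabulary = build_vocabulary(samples)
vectors = [to_vector(s, vocabulary) for s in samples]
# vectors == [[1, 1], [1, 0]]
```

With real data, `samples` would be built by calling `extract_permissions` on each downloaded file. The same encoding works for API calls once they have been extracted.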
Many algorithms can be applied, such as Bayesian networks, support vector machines (SVMs), and J48. You can implement those algorithms yourself. However, as a programmer, I suggest using libraries like scikit-learn (Python-based) and Weka (Java-based). Never reinvent the wheel, dude.
If you have not decided which algorithm to use yet, I recommend Weka as a starting point. It provides a great GUI, so you can do a lot with just a mouse, which is convenient for early experiments. You can also build a more complex classification system with the Java-based Weka API if you are a Java fan.
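For the scikit-learn route, a linear SVM on binary feature vectors might look like the sketch below. The toy data is made up for illustration, the three permission columns are an assumption, and `LinearSVC` is just one of several SVM classes the library offers. This is not the classifier used in this repository.

```python
from sklearn.svm import LinearSVC

# Made-up training data. Columns: [SEND_SMS, INTERNET, READ_CONTACTS],
# where 1 means the app requests that permission.
X_train = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1],   # labelled malicious
    [0, 1, 0], [0, 1, 1], [0, 0, 0], [0, 1, 0],   # labelled benign
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]                # 1 = malicious, 0 = benign


def train_classifier(features, labels):
    """Fit a linear SVM; random_state is fixed for repeatable runs."""
    clf = LinearSVC(random_state=0)
    clf.fit(features, labels)
    return clf


clf = train_classifier(X_train, y_train)

# Classify two unseen samples; each prediction is 0 or 1.
predictions = clf.predict([[1, 0, 0], [0, 1, 0]])
```

On a real dataset you would hold out a test set or use cross-validation instead of predicting on hand-picked vectors.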
- Ubuntu 14.04 & PostgreSQL 9.3.5

  You can use Ubuntu 14.10, too. CentOS 7 and Fedora 20 are also good options. Note that CentOS 6.5 may need virtualenv because it ships stable but older libraries, which annoyed me a bit every time I started a screen session.

- Python 2.7.6

  Two Python modules, psycopg and poster, are needed for database manipulation and for uploading files via HTTP, respectively. Use apt-get to install them:

      $ sudo apt-get install python-psycopg2 python-poster
- Androguard 1.9

  To install Androguard, make a local clone on your computer:

      $ git clone https://github.com/androguard/androguard

  Do not forget to add the path to PYTHONPATH in your .bashrc:

      export PYTHONPATH=/path/to/androguard:$PYTHONPATH
- Weka 3.6.11

  After downloading Weka, change into its directory and run the following command to start it:

      $ java -Xmx1000M -jar weka.jar

  Note that -jar overrides the CLASSPATH variable, so only weka.jar is used. Personally, I prefer another way. First, add the path to CLASSPATH in your .bashrc:

      export CLASSPATH=/path/to/weka:$CLASSPATH

  Then you can run the program from anywhere:

      $ java weka.gui.GUIChooser # or java weka.gui.Main

  If your operating system is Ubuntu, its tab completion makes this convenient. Finally, do not forget to add the SVM Java library to the CLASSPATH variable if you need it.