Identifying the compiler family and optimization level of binaries using machine learning and deep learning approaches on the BinComp dataset.

DohaElHady/Compiler-Provenance-Attribution

 
 


Compiler Provenance using Machine and Deep Learning

Problem: Identifying the compiler family and optimization level of a binary is a crucial step in malware analysis and reverse engineering. Extracting this provenance information from binary files supports faster detection of malware.

Methodologies:

  1. Feature engineering was carried out with the strings utility and the Ndisasm disassembler from the Linux command line.
  2. Feature selection was implemented using ANOVA and Chi-squared tests.
  3. Feature pre-processing, including data balancing and standardization, was applied.
  4. Logistic Regression, Support Vector Machines (SVM), Multi-Layer Perceptron (MLP), Decision Tree, AdaBoost, Random Forest, and ensemble learning were used for the two classification tasks.
  5. The optimization level classification task was also tested with a deep learning model.
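
As a rough illustration of steps 1 to 3, the sketch below counts instruction mnemonics in Ndisasm-style text output, builds a count matrix, and then applies Chi-squared and ANOVA feature selection followed by standardization with scikit-learn. The function names (`opcode_counts`, `to_matrix`, `select_and_scale`), the choice of opcode counts as features, and the rule of keeping the union of both top-k selections are assumptions made for illustration; they are not taken from the notebook. Data balancing is not shown.

```python
from collections import Counter

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import StandardScaler


def opcode_counts(disasm_text):
    """Count mnemonics in ndisasm-style output.

    Each instruction line is assumed to look like:
        00000001  89E5              mov ebp,esp
    i.e. offset, hex bytes, mnemonic, then optional operands.
    The text could come from a command such as `ndisasm -b 32 <binary>`.
    """
    counts = Counter()
    for line in disasm_text.splitlines():
        parts = line.split()
        # Skip blank lines and continuation lines, which have no mnemonic column.
        if len(parts) < 3 or parts[0].startswith("-"):
            continue
        counts[parts[2]] += 1
    return counts


def to_matrix(count_dicts):
    """Turn a list of Counters into a dense count matrix plus its vocabulary."""
    vocab = sorted(set().union(*count_dicts))
    X = np.array([[c.get(op, 0) for op in vocab] for c in count_dicts], dtype=float)
    return X, vocab


def select_and_scale(X, y, k):
    """Keep features in the top k by Chi-squared or ANOVA F-score, then standardize.

    Combining the two rankings by union is an assumption of this sketch.
    Chi-squared requires non-negative features, so it runs on the raw counts.
    """
    chi_mask = SelectKBest(chi2, k=k).fit(X, y).get_support()
    anova_mask = SelectKBest(f_classif, k=k).fit(X, y).get_support()
    keep = chi_mask | anova_mask
    X_scaled = StandardScaler().fit_transform(X[:, keep])
    return X_scaled, keep
```

In a real workflow the scaler and selectors would be fitted on the training split only and then applied to the test split, so that no test information leaks into feature selection.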

Results: The stacking model achieved the best test accuracy for compiler family classification (100%), and the deep learning model achieved the best test accuracy for optimization level classification (85.9%).
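
The README does not say which base learners or meta-learner the stacking model combined. The sketch below is one plausible configuration using scikit-learn's `StackingClassifier`, with an SVM, a decision tree, and a random forest as base learners and logistic regression as the meta-learner. It runs on synthetic data from `make_classification`, not on the BinComp features, so its accuracy says nothing about the results reported above. The helper name `build_stacking_model` and all hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_stacking_model(random_state=0):
    """Stack three base classifiers under a logistic regression meta-learner."""
    base_learners = [
        # The SVM is sensitive to feature scale, so it gets its own scaler.
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=random_state))),
        ("tree", DecisionTreeClassifier(random_state=random_state)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=random_state)),
    ]
    # cv=5: the meta-learner is trained on out-of-fold base predictions.
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


# Synthetic stand-in for the extracted features (three hypothetical classes).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = build_stacking_model().fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

Training the meta-learner on cross-validated base predictions, rather than on predictions for data the base learners have already seen, is what keeps a stacked model from simply memorizing the training set.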

Figures: compiler family confusion matrix; optimization level results.

Dataset

BinComp compiler fingerprinting dataset. https://github.com/BinSigma/BinComp/tree/master/Dataset.

The disassembly and strings CSV files are available upon request.

Request Compiler Provenance CSV Dataset!

Contributors

Mohamed Elahl - Hassan Mohamed - Karim Youssef - Doha ElHady

Languages

  • Jupyter Notebook 100.0%