Final project for an MIT class in Advanced NLP.
This codebase is for the PowerShell setting only.
See final paper HERE
Distributed representations of words, sentences, and documents have been crucial to applying neural networks to NLP tasks (word2vec, doc2vec, seq2seq). The same distributed representations serve us well in the related domain of programming languages, since programs themselves are just structured text. As the programming languages and NLP communities recognize their common interests, NLP-derived neural models are increasingly being applied to programmatic data as well (code2vec). In this project, we investigate whether these neural models, which use distributed representations of lines of code or of abstract syntax trees, are robust to various types of obfuscation and adversarial inputs. We find that perturbing Java programs by variable substitution or by dead-code insertion causes little difference in code2vec's classifications, but that obfuscated PowerShell programs cause an otherwise well-performing malware classifier to perform close to chance.
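As a rough illustration of the kind of semantics-preserving perturbation described above, the sketch below inserts a single unused variable declaration into a Java program given as a string. The function name and the naive brace-based insertion strategy are hypothetical and are not taken from this project's actual perturbation code, which would typically operate on a parsed syntax tree.

```python
# Illustrative sketch only (not this project's implementation): add one
# "dead" statement to a Java program without changing its behavior.
def insert_dead_code(java_source: str, dead_stmt: str = "int __unused_0 = 0;") -> str:
    """Return a semantically equivalent program with one dead statement spliced in."""
    # Naively splice the dead statement in right after the first opening brace.
    # Real tooling would use a Java parser to pick a valid insertion point.
    idx = java_source.find("{")
    if idx == -1:
        return java_source
    return java_source[: idx + 1] + " " + dead_stmt + java_source[idx + 1 :]

if __name__ == "__main__":
    program = "class Foo { int add(int a, int b) { return a + b; } }"
    print(insert_dead_code(program))
```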
Autoencoder Seq2Seq model adapted from here
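For orientation, here is a minimal sketch of what a GRU-based seq2seq autoencoder looks like in PyTorch. All class names, attribute names, and dimensions below are illustrative assumptions and do not reflect this repo's actual code.

```python
# Minimal illustrative sketch of a seq2seq autoencoder in PyTorch.
# Names and dimensions are assumptions, not this repo's API.
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer token ids
        emb = self.embedding(tokens)
        _, hidden = self.encoder(emb)           # (1, batch, hid_dim) "code" vector
        dec_out, _ = self.decoder(emb, hidden)  # teacher-forced reconstruction
        return self.out(dec_out)                # (batch, seq_len, vocab_size) logits

# Usage sketch: reconstruct the input sequence and train with cross-entropy.
model = Seq2SeqAutoencoder(vocab_size=1000)
batch = torch.randint(0, 1000, (4, 20))
logits = model(batch)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), batch.reshape(-1))
```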
- Miniconda3/Anaconda3
- Python3
- PyTorch
- Clone this repo: `git clone git@github.com:sanjas/Seq2SeqCode.git`
- Create the Conda environment: `conda env create -f environment.yml`
- Make sure you are in the repo's root and run `. ./start.sh` to activate the environment and set the PYTHONPATH.
Make sure you set the dataset path correctly in `tools/data_load.py`. On the first run, processing the whole dataset takes a long time (how long depends on the number of CPUs, since preprocessing is parallelized); on subsequent runs the data loads in under 30 seconds.
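The snippet below is a hypothetical example of the preprocess-once-then-cache pattern described above; the actual variable names and logic in `tools/data_load.py` may differ.

```python
# Hypothetical sketch of the pattern described above, not the repo's actual code:
# edit the dataset path, preprocess once in parallel, then cache for fast reloads.
import pickle
from multiprocessing import Pool
from pathlib import Path

DATASET_PATH = Path("/path/to/your/dataset")   # <-- set this to your data
CACHE_PATH = Path("dataset_cache.pkl")

def preprocess(path: Path):
    return path.read_text().split()             # placeholder tokenization

def load_dataset():
    if CACHE_PATH.exists():                     # subsequent runs: fast load
        return pickle.loads(CACHE_PATH.read_bytes())
    files = sorted(DATASET_PATH.glob("*.txt"))
    with Pool() as pool:                        # first run: parallel preprocessing
        data = pool.map(preprocess, files)
    CACHE_PATH.write_bytes(pickle.dumps(data))
    return data
```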
Run `python tools/train.py`. Some hyperparameters are set in `model/hyperparams.py`.
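The names and values below are hypothetical; they only show the kind of constants typically collected in a file like `model/hyperparams.py`, not its actual contents.

```python
# Hypothetical contents of a hyperparameter module; the real names and values
# in model/hyperparams.py may differ.
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_LAYERS = 1
LEARNING_RATE = 1e-3
BATCH_SIZE = 64
NUM_EPOCHS = 10
```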
To see the performance on the test data, run `python tools/test.py`.
Run `python tools/inference.py`.
Code is located in the `classifier` subdirectory.