N-gram extractor for ASTs.
python3 main.py -i ./ast -o features.json
- -i, --input_folder: input folder containing ASTs (in JSON format);
- -o, --output_file: output file that will contain the extracted features (in JSON format: "feature": "found_number").
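The two options above could be parsed with the standard argparse module. This is a minimal sketch, not the code from main.py: the function name build_parser, the help strings, and the choice to make both options required are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented -i / -o options.
    parser = argparse.ArgumentParser(description="N-gram extractor for ASTs")
    parser.add_argument("-i", "--input_folder", required=True,
                        help="input folder with ASTs (in JSON format)")
    parser.add_argument("-o", "--output_file", required=True,
                        help="output file for the extracted features (JSON)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run in isolation.
args = build_parser().parse_args(["-i", "./ast", "-o", "features.json"])
print(args.input_folder, args.output_file)
```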
Configuration is specified in main.py and contains the following parameters:
- n: maximum n for the n-grams;
- max_distance: maximum distance between neighboring nodes (window);
- no_normalize: flag controlling whether the values (n-gram counts) are normalized;
- include: array of arrays with sub-n-grams which should be contained in the found n-grams;
- include_strict: required n-grams (all other found n-grams will be removed);
- exclude: array of arrays with sub-n-grams which should not be contained in the found n-grams;
- exclude_strict: n-grams which should be excluded.
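One possible reading of the four filter parameters is sketched below. It is not taken from main.py. It assumes that n-gram keys are node types joined by ":" (as in the output format), that a sub-n-gram matches when it occurs as a contiguous run, and that matching any one entry of include or exclude is enough. The names filter_ngrams and contains_sub_ngram are hypothetical.

```python
def contains_sub_ngram(ngram, sub):
    """Return True if `sub` occurs as a contiguous run inside `ngram`."""
    n, m = len(ngram), len(sub)
    return any(ngram[i:i + m] == sub for i in range(n - m + 1))


def filter_ngrams(counts, include=None, include_strict=None,
                  exclude=None, exclude_strict=None):
    """Filter a {"A:B:C": count} dict (assumed semantics, see lead-in)."""
    result = {}
    for key, value in counts.items():
        ngram = key.split(":")
        # include: keep only n-grams containing at least one listed sub-n-gram
        if include and not any(contains_sub_ngram(ngram, s) for s in include):
            continue
        # include_strict: keep only n-grams that are listed exactly
        if include_strict and ngram not in include_strict:
            continue
        # exclude: drop n-grams containing any listed sub-n-gram
        if exclude and any(contains_sub_ngram(ngram, s) for s in exclude):
            continue
        # exclude_strict: drop n-grams that are listed exactly
        if exclude_strict and ngram in exclude_strict:
            continue
        result[key] = value
    return result


counts = {
    "BLOCK": 1,
    "BLOCK:LBRACE": 1,
    "BLOCK:DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION": 14,
    "LPAR": 11,
}
kept = filter_ngrams(counts, include=[["BLOCK"]], exclude=[["LBRACE"]])
print(kept)
```

With these assumptions, the call keeps "BLOCK" and the three-element n-gram, and drops "BLOCK:LBRACE" and "LPAR".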
The program expects input ASTs in the following format (example input):
[
{
"type":"FUN",
"chars":"override fun onCreateView(inflater: LayoutInflater?, container: ViewGroup?, savedInstanceState: Bundle?): View? {\n dialog.window.requestFeature(Window.FEATURE_NO_TITLE)\n\n DaggerAppComponent.builder()\n .appModule(AppModule(context))\n .mainModule((activity.application as MyApplication).mainModule)\n .build().inject(this)\n\n var view = inflater?.inflate(R.layout.dialog_signup, container, false)\n\n ButterKnife.bind(this, view!!)\n\n return view\n }",
"children":[
{
"type":"MODIFIER_LIST",
"chars":"override",
"children":[
{
"type":"override",
"chars":"override"
}
]
},
{
"type":"IDENTIFIER",
"chars":"onCreateView"
},
{
"type":"VALUE_PARAMETER_LIST",
"chars":"(inflater: LayoutInflater?, container: ViewGroup?, savedInstanceState: Bundle?)",
"children":[
{
"type":"LPAR",
"chars":"("
},
{
"type":"VALUE_PARAMETER",
"chars":"inflater: LayoutInflater?",
"children":[
{
"type":"IDENTIFIER",
"chars":"inflater"
}
]
}
]
}
]
}
]
This is a Kotlin AST generated by a custom Kotlin compiler.
An AST transformer is also required; it is part of kotlin-source2ast (see lib/helper/AstHelper.py).
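To illustrate the input format, the sketch below walks a small AST shaped like the example (a list of root nodes, each with "type", "chars" and an optional "children" list) and counts the node types. The helper names iter_nodes and count_node_types are hypothetical and are not part of this repository or of kotlin-source2ast.

```python
import json
from collections import Counter


def iter_nodes(node):
    """Yield a node and all of its descendants in pre-order."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def count_node_types(roots):
    """Count how often each "type" occurs across a list of root nodes."""
    return Counter(node["type"] for root in roots for node in iter_nodes(root))


# A shortened version of the example input, embedded as a JSON string.
sample = json.loads("""
[
  {"type": "FUN", "chars": "override fun onCreateView()", "children": [
    {"type": "MODIFIER_LIST", "chars": "override", "children": [
      {"type": "override", "chars": "override"}
    ]},
    {"type": "IDENTIFIER", "chars": "onCreateView"}
  ]}
]
""")

type_counts = count_node_types(sample)
print(type_counts)
```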
N-grams are written in JSON format.
For example:
{
"MODIFIER_LIST":1,
"override":1,
"MODIFIER_LIST:override":1,
"WHITE_SPACE":2,
"fun":1,
"IDENTIFIER":2,
"VALUE_PARAMETER_LIST":1,
"LPAR":11,
"VALUE_PARAMETER_LIST:LPAR":1,
"VALUE_PARAMETER":1,
"VALUE_PARAMETER_LIST:VALUE_PARAMETER":3,
"VALUE_PARAMETER_LIST:RPAR":1,
"BLOCK":1,
"LBRACE":1,
"BLOCK:LBRACE":1,
"BLOCK:WHITE_SPACE":11,
"DOT_QUALIFIED_EXPRESSION":13,
"BLOCK:DOT_QUALIFIED_EXPRESSION":6,
"DOT_QUALIFIED_EXPRESSION:DOT_QUALIFIED_EXPRESSION":12,
"BLOCK:DOT_QUALIFIED_EXPRESSION:DOT_QUALIFIED_EXPRESSION":9,
"DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION":29,
"BLOCK:REFERENCE_EXPRESSION":8,
"DOT_QUALIFIED_EXPRESSION:DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION":29,
"BLOCK:DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION":14,
"DOT_QUALIFIED_EXPRESSION:IDENTIFIER":24,
"DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION:IDENTIFIER":29
}
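Keys such as "BLOCK:REFERENCE_EXPRESSION" in the example suggest that an n-gram is a chain of node types running from an ancestor to a descendant, possibly skipping intermediate nodes. The sketch below implements that interpretation only. It is an assumption drawn from the example output, not the algorithm in main.py: here n caps the chain length, max_distance caps how many tree levels may separate two consecutive elements, and no normalization is applied. The function name extract_ngrams is hypothetical.

```python
from collections import Counter


def extract_ngrams(roots, n=3, max_distance=2):
    """Count ancestor-to-descendant type chains of length 1..n."""
    counts = Counter()

    def descendants(node, max_depth):
        # Yield every descendant at most `max_depth` levels below `node`.
        frontier = [(child, 1) for child in node.get("children", [])]
        while frontier:
            current, depth = frontier.pop()
            yield current
            if depth < max_depth:
                frontier.extend((c, depth + 1)
                                for c in current.get("children", []))

    def extend(node, prefix):
        # Record the chain ending at `node`, then try to grow it further.
        ngram = prefix + [node["type"]]
        counts[":".join(ngram)] += 1
        if len(ngram) < n:
            for nxt in descendants(node, max_distance):
                extend(nxt, ngram)

    def visit(node):
        # Start a new chain at every node of the tree.
        extend(node, [])
        for child in node.get("children", []):
            visit(child)

    for root in roots:
        visit(root)
    return dict(counts)


tree = [{"type": "A", "children": [
    {"type": "B", "children": [{"type": "C"}]}
]}]
ngrams = extract_ngrams(tree, n=3, max_distance=2)
print(ngrams)
```

For the three-node chain A, B, C this yields the unigrams, the bigrams "A:B", "B:C" and "A:C" (the last one skips B), and the trigram "A:B:C", each with count 1.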
The n-gram list can be used in feature-selection or ast2vec.