Transforms a tree (e.g. AST, CST, PSI) into a vector. The vector is constructed by extracting features from the tree.
The program includes the following feature extractors:
- DepthExtractor - extracts the min, max or mean depth of the tree;
- CharsLengthExtractor - extracts the min, max or mean character length (of some node) in the tree;
- NGramsExtractor - counts occurrences of the specified n-grams;
- AllNGramsExtractor - counts all n-grams that match the specified configuration (n, max_distance, etc.). See ast-ngram-extractor (for n-gram extraction only).
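As an illustration of the simplest of these extractors, the following is a minimal sketch of depth extraction, not the actual DepthExtractor code. It assumes that depth is measured at the leaves, that the root has depth 1, and that nodes are nested dicts with an optional "children" list. The feature name depth_min is hypothetical.

```python
from statistics import mean


def leaf_depths(node, depth=1):
    """Yield the depth of every leaf below `node` (assumption: the root has depth 1)."""
    children = node.get("children", [])
    if not children:
        yield depth
        return
    for child in children:
        yield from leaf_depths(child, depth + 1)


def depth_features(tree):
    """Return the min, max and mean leaf depth of a tree given as nested dicts."""
    depths = list(leaf_depths(tree))
    return {
        "depth_min": min(depths),   # hypothetical feature name
        "depth": max(depths),
        "depth_avg": mean(depths),
    }


tree = {
    "type": "FUN",
    "children": [
        {"type": "MODIFIER_LIST", "children": [{"type": "override"}]},
        {"type": "IDENTIFIER"},
    ],
}
features = depth_features(tree)
print(features)
```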
This program is used as part of tree-set2matrix
python3 main.py -i ./trees/ast.json -o ./trees_as_vectors/ast_as_vector.json --no_normalize
- -i, --input: file with the tree;
- -o, --output: output file, which will contain the features and feature values as JSON;
- -d, --is_normalize: whether the vectors should be normalized by the maximum value.
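One possible reading of normalization by the maximum value is sketched below: every feature value is divided by the largest value in the same vector. This is an assumption for illustration only; the function name normalize_by_max is hypothetical, and main.py may compute the maximum differently (for example, across a whole set of trees).

```python
def normalize_by_max(features):
    """Divide every value by the largest one; return a plain copy if the maximum is 0."""
    largest = max(features.values(), default=0)
    if largest == 0:
        return dict(features)
    return {name: value / largest for name, value in features.items()}


raw = {"CALL_EXPRESSION": 4, "IDENTIFIER": 10, "LPAR": 5}
normalized = normalize_by_max(raw)
print(normalized)
```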
The program expects an input tree in the following format (example input):
[
  {
    "type": "FUN",
    "children": [
      {
        "type": "MODIFIER_LIST",
        "children": [
          {
            "type": "override"
          }
        ]
      },
      {
        "type": "IDENTIFIER"
      },
      {
        "type": "VALUE_PARAMETER_LIST",
        "children": [
          {
            "type": "LPAR"
          },
          {
            "type": "VALUE_PARAMETER",
            "children": [
              {
                "type": "IDENTIFIER"
              }
            ]
          }
        ]
      }
    ]
  }
]
This is a Kotlin AST generated by a custom Kotlin compiler.
Parsing Kotlin code into a CST (PSI) requires a transformer, which is part of github-kotlin-code-collector (see src/lib/helper/AstHelper.py).
The file with the JSON representation of the tree must be passed as an argument to the program.
For example: python main.py trees/ast_of_my_program.json
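To show how a tree in this format can be consumed, here is a small sketch that parses the JSON and walks every node. The helper names iter_nodes and count_node_types are hypothetical and not taken from main.py. The input is assumed to be a list of root nodes, as in the example input.

```python
import json
from collections import Counter


def iter_nodes(node):
    """Yield `node` and all of its descendants in pre-order."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def count_node_types(roots):
    """Count how often each node type occurs in a list of root nodes."""
    counts = Counter()
    for root in roots:
        counts.update(node["type"] for node in iter_nodes(root))
    return counts


# With a real file, use: with open(path) as f: roots = json.load(f)
text = """
[{"type": "FUN", "children": [
    {"type": "IDENTIFIER"},
    {"type": "VALUE_PARAMETER", "children": [{"type": "IDENTIFIER"}]}
]}]
"""
roots = json.loads(text)
counts = count_node_types(roots)
print(counts)
```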
The program output is a map from feature names to feature values.
The feature values are the components of the vector.
For example:
{
    'chars_length_avg': 47.297029702970299,
    'chars_length_max': 2047,
    'depth': 16,
    'depth_avg': 6.3469387755102042,
    'CALL_EXPRESSION': 0.06373937677053824,
    'DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION:IDENTIFIER': 0.028368794326241134,
    'DOT_QUALIFIED_EXPRESSION:DOT_QUALIFIED_EXPRESSION': 0.005689900426742532
}
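Because the output is a map, turning it into an actual vector requires a fixed order of feature names. The sketch below shows one way to do that. The function to_vector, the sorted ordering and the default of 0.0 for missing features are assumptions for illustration, not documented behavior of this program.

```python
def to_vector(feature_map, feature_names):
    """Return the feature values in the order given by `feature_names` (missing -> 0.0)."""
    return [float(feature_map.get(name, 0.0)) for name in feature_names]


feature_map = {
    "depth": 16,
    "chars_length_max": 2047,
    "CALL_EXPRESSION": 0.06373937677053824,
}
# Any fixed order works, as long as every tree uses the same one.
names = sorted(feature_map)
vector = to_vector(feature_map, names)
print(names)
print(vector)
```

Using the same list of names for every tree keeps the components aligned when several vectors are stacked into a matrix.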
Features are specified in main.py: an array of keys for simple features, and an array of objects for n-grams and (in the future) other features.
Example with specified n-grams:
simple_features = [
    'depth',
    'depth_avg',
    'chars_length_avg',
    'chars_length_max'
]
features = [
    {
        'type': 'ngram',
        'params': {
            'name': 'CALL_EXPRESSION',
            'node_types': ['CALL_EXPRESSION']
        }
    },
    {
        'type': 'ngram',
        'params': {
            'name': 'DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION:IDENTIFIER',
            'node_types': ['DOT_QUALIFIED_EXPRESSION', 'REFERENCE_EXPRESSION', 'IDENTIFIER'],
            'max_distance': 3
        }
    },
    {
        'type': 'ngram',
        'params': {
            'name': 'DOT_QUALIFIED_EXPRESSION:DOT_QUALIFIED_EXPRESSION',
            'node_types': ['DOT_QUALIFIED_EXPRESSION', 'DOT_QUALIFIED_EXPRESSION'],
            'max_distance': 1
        }
    }
]
- node_types - types of the nodes that must lie on a single path in the tree (subject to the specified distance);
- name - name of the feature; it is used in the output (feature names).
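The matching rule described by node_types and max_distance can be sketched as follows. This is an interpretation, not the NGramsExtractor code: it assumes that consecutive nodes of an n-gram lie on one downward path and are separated by at most max_distance edges, and that max_distance defaults to 1. All function names are hypothetical.

```python
def iter_nodes(node):
    """Yield `node` and all of its descendants in pre-order."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def descendants_within(node, max_distance):
    """Yield descendants of `node` that are at most `max_distance` edges below it."""
    if max_distance == 0:
        return
    for child in node.get("children", []):
        yield child
        yield from descendants_within(child, max_distance - 1)


def count_from(node, node_types, max_distance):
    """Count matches of `node_types` whose first type is matched at `node`."""
    if node["type"] != node_types[0]:
        return 0
    if len(node_types) == 1:
        return 1
    return sum(
        count_from(descendant, node_types[1:], max_distance)
        for descendant in descendants_within(node, max_distance)
    )


def count_ngram(tree, node_types, max_distance=1):
    """Count occurrences of the n-gram `node_types` anywhere in `tree`."""
    return sum(count_from(node, node_types, max_distance) for node in iter_nodes(tree))


tree = {"type": "FUN", "children": [
    {"type": "IDENTIFIER"},
    {"type": "VALUE_PARAMETER_LIST", "children": [
        {"type": "LPAR"},
        {"type": "VALUE_PARAMETER", "children": [{"type": "IDENTIFIER"}]},
    ]},
]}
print(count_ngram(tree, ["FUN", "IDENTIFIER"], max_distance=1))
print(count_ngram(tree, ["FUN", "IDENTIFIER"], max_distance=3))
```

With a distance of 1 only the direct child IDENTIFIER is matched. Widening the window to 3 also reaches the IDENTIFIER nested under VALUE_PARAMETER.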
Example that extracts all n-grams matching a specified configuration:
simple_features = [
    'depth',
    'depth_avg',
    'chars_length_avg',
    'chars_length_max'
]
features = [
    {
        'type': 'all_ngrams',
        'params': {
            'n': 3,
            'max_distance': 3,
            'no_normalize': True,
            'include': [['CALL_EXPRESSION', 'LPAR'], ['VALUE_ARGUMENT_LIST'], ['SAFE_ACCESS_EXPRESSION']],
            'exclude': [['FUN']]
        }
    }
]
- n: maximum n in an n-gram;
- max_distance: maximum distance between neighboring nodes (window);
- no_normalize: flag that controls normalization of the values (n-gram counts);
- include: array of arrays with sub-n-grams that must be contained in the found n-grams;
- include_strict: required n-grams (all other found n-grams are removed);
- exclude: array of arrays with sub-n-grams that must not be contained in the found n-grams;
- exclude_strict: n-grams that should be excluded.
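The include and exclude filters can be illustrated with the sketch below. It rests on several assumptions that the description above does not settle: n-grams are represented as tuples of node types, a sub-n-gram matches when it occurs as a contiguous run, include keeps an n-gram if it contains at least one of the listed sub-n-grams, and exclude drops it if it contains any of them. The function names are hypothetical.

```python
def contains(ngram, sub):
    """Return True if `sub` occurs in `ngram` as a contiguous run."""
    k = len(sub)
    return any(
        tuple(ngram[i:i + k]) == tuple(sub)
        for i in range(len(ngram) - k + 1)
    )


def filter_ngrams(ngrams, include=None, exclude=None):
    """Keep n-grams that satisfy the (assumed) include/exclude semantics."""
    kept = []
    for ngram in ngrams:
        if include and not any(contains(ngram, sub) for sub in include):
            continue  # contains none of the required sub-n-grams
        if exclude and any(contains(ngram, sub) for sub in exclude):
            continue  # contains a forbidden sub-n-gram
        kept.append(ngram)
    return kept


found = [
    ("FUN", "VALUE_PARAMETER_LIST"),
    ("CALL_EXPRESSION", "LPAR"),
    ("CALL_EXPRESSION", "VALUE_ARGUMENT_LIST", "LPAR"),
    ("FUN", "CALL_EXPRESSION", "LPAR"),
    ("SAFE_ACCESS_EXPRESSION",),
]
kept = filter_ngrams(
    found,
    include=[["CALL_EXPRESSION", "LPAR"], ["VALUE_ARGUMENT_LIST"], ["SAFE_ACCESS_EXPRESSION"]],
    exclude=[["FUN"]],
)
print(kept)
```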