Program for step-by-step conversion of a set of trees into a matrix (dataset) using tree2vec.
Available steps (stages):
- trees2vectors: converts ASTs from the specified input folder to vectors in the specified output folder (keeping the same relative paths);
- sparse_transformation: converts "vectors" (actually "feature-value" maps) to a sparse representation (two formats: list or map);
- normalize: normalizes feature values by the number of all features in the current file;
- collect_statistic: collects feature statistics (creates sorted lists of features and their frequencies for the specified n);
- vectors2matrix: transforms files with vectors into a matrix in JSON format;
- matrix2csv: transforms a matrix in JSON format into a CSV dataset.
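The stages above are meant to be chained, each one reading what the previous one wrote. As a rough sketch (not part of the program itself), the chain could be driven from Python as below. The helper names `build_pipeline_commands` and `run_pipeline` and the intermediate folder names are assumptions made for illustration; the stage names and flags are the ones documented in this README. The optional normalize and collect_statistic stages are left out.

```python
import subprocess


def build_pipeline_commands(trees_dir, work_dir, features_file):
    """Return one argv list per stage, in the order the stages are listed.

    Hypothetical helper: folder names under work_dir are illustrative only.
    """
    vectors = f"{work_dir}/trees_as_vectors"
    sparsed = f"{work_dir}/trees_as_vectors_sparsed"
    all_features = f"{vectors}/all_features.json"  # written by trees2vectors
    matrix = f"{work_dir}/dataset.json"
    base = ["python3", "main.py"]
    return [
        base + ["-s", "trees2vectors", "-i", trees_dir, "-o", vectors,
                "--features_file", features_file],
        base + ["-s", "sparse_transformation", "-i", vectors, "-o", sparsed,
                "--sparse_format", "list", "--all_features_file", all_features],
        base + ["-s", "vectors2matrix", "-i", sparsed, "--output_file", matrix],
        base + ["-s", "matrix2csv", "--input_file", matrix,
                "--output_file", f"{work_dir}/dataset.csv"],
    ]


def run_pipeline(commands):
    """Run the stages one after another, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_pipeline_commands("./trees", ".", "features_config.json"))` from the repository root would then run the four stages in sequence.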
- -s, --stage: trees2vectors;
- -i, --input_folder: input folder with trees (in JSON format);
- -o, --output_folder: output folder with "vectors" (actually "feature-value" maps);
- -f, --features_file: path to the file with the feature list (for the feature list format, see here).
This stage uses tree2vec as a submodule; see also the tree2vec README (tree format, output format, etc.).
At this stage, a file with statistics for all features is generated (all_features.json in the output folder). This file is required by the next stages.
python3 main.py -s trees2vectors -i ./trees -o ./trees_as_vectors --features_file features_config.json
[
  {
    "type": "all_ngrams",
    "params": {
      "n": 3,
      "max_distance": 3,
      "no_normalize": true,
      "exclude": [["FUN"]]
    }
  }
]
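If the feature configuration needs to be generated rather than written by hand, a minimal sketch could look like this. It assumes the JSON above is the content of the file passed via --features_file (the example command passes features_config.json); the helper name `write_features_config` is hypothetical, and the meaning of each parameter is defined by tree2vec, not by this snippet.

```python
import json

# Same content as the JSON example above, expressed as Python data.
features_config = [
    {
        "type": "all_ngrams",
        "params": {
            "n": 3,
            "max_distance": 3,
            "no_normalize": True,
            "exclude": [["FUN"]],
        },
    }
]


def write_features_config(path, config):
    """Write the feature configuration as indented JSON (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```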
- -s, --stage: sparse_transformation;
- -i, --input_folder: input folder with "vectors" (actually "feature-value" maps) from the previous stage (trees2vectors);
- -o, --output_folder: output folder with sparsed "vectors";
- --sparse_format: list or map;
- --all_features_file: file with all features, generated by the trees2vectors stage.
If 'list' is specified, the output looks like this (for example):
[4, 0, 0, 11, 0, 0, 0, 12, 0, 0, 0, 0, 65, 4, 0, 0, 0, 0, 0, 91, 0, 1, 0, 0, 0, 0, 0, 0, 45, 0, 0, 103, 0, 0, 0, 0, 0, 9, 3]
The elements are sorted by feature name.
If 'map' is specified, the output looks like this (for example):
{"FUN": 14, "MODIFIER_LIST": 3, "FUN:MODIFIER_LIST": 1, "public": 0, "MODIFIER_LIST:public": 0, "FUN:public": 1, "FUN:MODIFIER_LIST:public": 0, "MODIFIER_LIST:WHITE_SPACE": 0, "FUN:MODIFIER_LIST:WHITE_SPACE": 0, "open": 8, "MODIFIER_LIST:open": 0, "FUN:open": 1, "FUN:MODIFIER_LIST:open": 0, "fun": 1, "FUN:fun": 0, "FUN:IDENTIFIER": 2, "VALUE_PARAMETER_LIST": 3, "FUN:VALUE_PARAMETER_LIST": 0, "VALUE_PARAMETER_LIST:LPAR": 1, "FUN:LPAR": 1, "FUN:VALUE_PARAMETER_LIST:LPAR": 0, "VALUE_PARAMETER": 6, "VALUE_PARAMETER_LIST:VALUE_PARAMETER": 1, "FUN:VALUE_PARAMETER": 1, "FUN:VALUE_PARAMETER_LIST:VALUE_PARAMETER": 0, "VALUE_PARAMETER:IDENTIFIER": 1, "VALUE_PARAMETER_LIST:IDENTIFIER": 1, "VALUE_PARAMETER_LIST:VALUE_PARAMETER:IDENTIFIER": 0, "FUN:VALUE_PARAMETER:IDENTIFIER": 1, "FUN:VALUE_PARAMETER_LIST:IDENTIFIER": 0, "COLON": 2, "VALUE_PARAMETER:COLON": 0, "VALUE_PARAMETER_LIST:COLON": 1, "FUN:COLON": 22, "VALUE_PARAMETER_LIST:VALUE_PARAMETER:COLON": 2, "FUN:VALUE_PARAMETER:COLON": 0, "FUN:VALUE_PARAMETER_LIST:COLON": 0, "VALUE_PARAMETER:WHITE_SPACE": 13}
python3 main.py -s sparse_transformation -i ./trees_as_vectors -o ./trees_as_vectors_sparsed --all_features_file ./trees_as_vectors/all_features.json
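The two output formats can be illustrated with a small sketch. It is not the program's actual code: the function name `to_sparse` is hypothetical, and it assumes that features missing from a file are filled with 0 (as the zeros in the examples above suggest) and that features are ordered by name.

```python
def to_sparse(vector, all_features, sparse_format="list"):
    """Expand a "feature-value" map over the full feature set.

    vector: map of feature name to value for one file.
    all_features: iterable with every known feature name.
    Assumption: features absent from `vector` get the value 0.
    """
    names = sorted(all_features)  # elements are ordered by feature name
    if sparse_format == "list":
        return [vector.get(name, 0) for name in names]
    if sparse_format == "map":
        return {name: vector.get(name, 0) for name in names}
    raise ValueError(f"unknown sparse format: {sparse_format!r}")


all_features = ["public", "FUN", "open", "FUN:open"]
vector = {"open": 8, "FUN": 14}
print(to_sparse(vector, all_features, "list"))  # [14, 0, 8, 0]
print(to_sparse(vector, all_features, "map"))
```

The list format carries no feature names, so every file must be expanded over the same feature set for the positions to line up.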
- -s, --stage: normalize;
- -i, --input_folder: input folder with "vectors" (actually "feature-value" maps) from a previous stage (sparse_transformation with sparse_format=map, or trees2vectors; sparsed or not);
- -o, --output_folder: output folder with normalized "vectors";
- --all_features_file: file with all features, generated by the trees2vectors stage.
Example of a normalized file:
{"FUN": 0.0007215007215007215, "MODIFIER_LIST": 0.0007215007215007215, "FUN:MODIFIER_LIST": 0.0007215007215007215, "public": 0.0007215007215007215, "MODIFIER_LIST:public": 0.0007215007215007215, "FUN:public": 0.0007215007215007215, "FUN:MODIFIER_LIST:public": 0.0007215007215007215, "MODIFIER_LIST:WHITE_SPACE": 0.0007215007215007215, "FUN:MODIFIER_LIST:WHITE_SPACE": 0.0007215007215007215, "open": 0.0007215007215007215, "MODIFIER_LIST:open": 0.0007215007215007215, "FUN:open": 0.0007215007215007215, "FUN:MODIFIER_LIST:open": 0.0007215007215007215, "fun": 0.0007215007215007215, "FUN:fun": 0.0007215007215007215, "FUN:IDENTIFIER": 0.0007215007215007215, "VALUE_PARAMETER_LIST": 0.0007215007215007215, "FUN:VALUE_PARAMETER_LIST": 0.0007215007215007215, "VALUE_PARAMETER_LIST:LPAR": 0.0007215007215007215, "FUN:LPAR": 0.0007215007215007215, "FUN:VALUE_PARAMETER_LIST:LPAR": 0.0007215007215007215, "VALUE_PARAMETER_LIST:RPAR": 0.0007215007215007215, "LBRACE": 0.0007215007215007215, "BLOCK:LBRACE": 0.0007215007215007215, "FUN:LBRACE": 0.0007215007215007215, "FUN:BLOCK:LBRACE": 0.0007215007215007215, "RETURN": 0.0007215007215007215, "BLOCK:RETURN": 0.0007215007215007215, "FUN:RETURN": 0.0007215007215007215, "FUN:BLOCK:RETURN": 0.0007215007215007215, "return": 0.0007215007215007215}
python3 main.py -s normalize -i ./trees_as_vectors -o ./trees_as_vectors_normalized --all_features_file ./trees_as_vectors/all_features.json
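The normalization step can be sketched as follows. This is an illustration under a stated assumption, not the program's implementation: "the number of all features in the current file" is read here as the sum of all feature values in that file. If the stage instead divides by the count of distinct features, the denominator would be `len(vector)`. The function name `normalize` is hypothetical.

```python
def normalize(vector):
    """Divide every feature value by the total over the file.

    Assumption: the total is the sum of all feature values in the file.
    """
    total = sum(vector.values())
    if total == 0:
        # Nothing to scale; return an unchanged copy to avoid dividing by zero.
        return dict(vector)
    return {name: value / total for name, value in vector.items()}


print(normalize({"FUN": 1, "open": 3}))  # {'FUN': 0.25, 'open': 0.75}
```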
Collects feature frequency statistics from the feature statistics file generated by the trees2vectors stage.
- -s, --stage: collect_statistic;
- -o, --output_folder: output folder with statistic files (all_features_sorted.json, all_features_sorted_1.json, all_features_sorted_2.json, ..., all_features_sorted_n.json);
- --all_features_file: file with all features, generated by the trees2vectors stage.
python3 main.py -s collect_statistic -o ./features_statistic --all_features_file ./trees_as_vectors/all_features.json
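What such a statistics step might compute is sketched below. Several things here are assumptions, not documented behaviour: that all_features.json maps each feature name to its frequency, that lists are sorted by descending frequency, and that the numbered files group features by n-gram size, taken as the number of ":"-separated parts of the name. The function name `collect_statistic` simply mirrors the stage name.

```python
def collect_statistic(all_features, max_n):
    """Build sorted (feature, frequency) lists keyed by output file name.

    all_features: map of feature name to frequency (assumed format).
    max_n: largest n-gram size to produce a separate list for.
    """
    # Most frequent first; ties are broken by feature name for stable output.
    by_frequency = sorted(all_features.items(), key=lambda item: (-item[1], item[0]))
    result = {"all_features_sorted.json": by_frequency}
    for n in range(1, max_n + 1):
        result[f"all_features_sorted_{n}.json"] = [
            (name, freq) for name, freq in by_frequency
            if len(name.split(":")) == n  # assumed: n = number of name parts
        ]
    return result


stats = collect_statistic(
    {"FUN": 14, "open": 8, "FUN:open": 1, "FUN:MODIFIER_LIST:open": 2}, max_n=3
)
print(stats["all_features_sorted_1.json"])  # [('FUN', 14), ('open', 8)]
```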
Transforms files with vectors into a matrix in JSON format.
- -s, --stage: vectors2matrix;
- -i, --input_folder: path to the directory with trees represented as vectors (in JSON format), produced by the sparse_transformation stage;
- --output_file: path to the output file, which will contain the matrix in JSON format.
python3 main.py -s vectors2matrix -i ./trees_as_vectors_sparsed --output_file ./dataset.json
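The idea behind this stage can be sketched in a few lines. This is an illustration only: it assumes every JSON file in the input folder holds one vector in the list format, that the matrix is simply a JSON list of those rows, and that rows are ordered by file path. The function name `vectors_to_matrix` is hypothetical.

```python
import json
from pathlib import Path


def vectors_to_matrix(input_folder, output_file):
    """Stack every vector file under input_folder into one JSON matrix."""
    rows = []
    # Sorted traversal keeps the row order reproducible between runs.
    for path in sorted(Path(input_folder).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            rows.append(json.load(f))
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(rows, f)
    return rows
```

In this sketch the output file should live outside the input folder, otherwise a second run would pick up the matrix itself as a vector.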
Transforms a matrix in JSON format into a CSV dataset.
- -s, --stage: matrix2csv;
- --input_file: path to the input file with the matrix in JSON format;
- --output_file: path to the output file, which will contain the matrix in CSV format.
python3 main.py -s matrix2csv --input_file ./dataset.json --output_file ./dataset.csv
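For reference, an equivalent conversion can be written with the Python standard library alone. The sketch assumes the JSON matrix is a list of rows, each row a list of numbers, and writes no header row; the function name `matrix_to_csv` is hypothetical and the real stage may format its output differently.

```python
import csv
import json


def matrix_to_csv(input_file, output_file):
    """Write each row of a JSON matrix as one CSV line."""
    with open(input_file, encoding="utf-8") as f:
        matrix = json.load(f)
    # newline="" lets the csv module control line endings itself.
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(matrix)
```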