Step-by-step conversion of a set of parse trees to a matrix (dataset): parse tree factorization by specified features and conversion to a dataset for ML algorithms

petukhovv/tree-set2matrix

tree-set2matrix

A program for step-by-step conversion of a set of trees to a matrix (dataset) using tree2vec.

Available steps (stages):

  • trees2vectors: converts the ASTs in the specified input folder to vectors in the specified output folder (with the same paths);
  • sparse_transformation: converts the "vectors" (which are actually feature-value maps) to a sparse representation (two formats: list or map);
  • normalize: normalizes feature values by the number of all features in the current file;
  • collect_statistic: collects feature statistics (creates sorted lists of features and their frequencies for the specified n);
  • vectors2matrix: transforms the files with vectors into a matrix in JSON format;
  • matrix2csv: transforms the matrix in JSON format into a CSV dataset.
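
The stages can be chained from a small driver script. The following is a minimal sketch, not part of the repository: build_commands and run_pipeline are hypothetical helper names, the folder layout is assumed, and running only four of the six stages with sparse_format set to list is an assumption about a typical run. Only flags documented in this README are used.

```python
import subprocess
import sys


def build_commands(trees_dir, work_dir, features_file):
    """Return one argv list per stage, in the order the stages run.

    The folder names below are illustrative assumptions.
    """
    vectors = f"{work_dir}/trees_as_vectors"
    sparsed = f"{work_dir}/trees_as_vectors_sparsed"
    all_features = f"{vectors}/all_features.json"  # written by trees2vectors
    dataset_json = f"{work_dir}/dataset.json"
    dataset_csv = f"{work_dir}/dataset.csv"
    base = [sys.executable, "main.py"]
    return [
        base + ["-s", "trees2vectors", "-i", trees_dir, "-o", vectors,
                "--features_file", features_file],
        base + ["-s", "sparse_transformation", "-i", vectors, "-o", sparsed,
                "--sparse_format", "list",
                "--all_features_file", all_features],
        base + ["-s", "vectors2matrix", "-i", sparsed,
                "--output_file", dataset_json],
        base + ["-s", "matrix2csv", "--input_file", dataset_json,
                "--output_file", dataset_csv],
    ]


def run_pipeline(trees_dir, work_dir, features_file):
    """Run the stages in order and stop at the first one that fails."""
    for argv in build_commands(trees_dir, work_dir, features_file):
        subprocess.run(argv, check=True)
```

Calling run_pipeline("./trees", ".", "features_config.json") from the repository root would run the four stages one after another. The normalize and collect_statistic stages can be added to the list in the same way.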

Program use

Tree to vectors

Program arguments

  • -s, --stage -> trees2vectors;
  • -i, --input_folder: input folder with trees (in JSON format);
  • -o, --output_folder: output folder for the "vectors" (which are actually feature-value maps);
  • -f, --features_file: path to the file with the feature list (for the feature list format, see here).

This stage uses tree2vec as a submodule; see also the tree2vec README (tree format, output format, etc.).

At this stage, a file with statistics for all features is generated (all_features.json in the output folder). This file is required by the next stages.

Example of use

python3 main.py -s trees2vectors -i ./trees -o ./trees_as_vectors --features_file features_config.json

Example features config

[
    {
        "type": "all_ngrams",
        "params": {
            "n": 3,
            "max_distance": 3,
            "no_normalize": true,
            "exclude": [["FUN"]]
        }
    }
]

Sparse transformation

Program arguments

  • -s, --stage -> sparse_transformation;
  • -i, --input_folder: input folder with "vectors" (actually feature-value maps) from the previous stage (trees2vectors);
  • -o, --output_folder: output folder for the sparse "vectors";
  • --sparse_format: list or map;
  • --all_features_file: file with all features, generated by the trees2vectors stage.

Output format

If 'list' is specified, the output looks like this:

[4, 0, 0, 11, 0, 0, 0, 12, 0, 0, 0, 0, 65, 4, 0, 0, 0, 0, 0, 91, 0, 1, 0, 0, 0, 0, 0, 0, 45, 0, 0, 103, 0, 0, 0, 0, 0, 9, 3]

The elements are sorted by feature name.

If 'map' is specified, the output looks like this:

{"FUN": 14, "MODIFIER_LIST": 3, "FUN:MODIFIER_LIST": 1, "public": 0, "MODIFIER_LIST:public": 0, "FUN:public": 1, "FUN:MODIFIER_LIST:public": 0, "MODIFIER_LIST:WHITE_SPACE": 0, "FUN:MODIFIER_LIST:WHITE_SPACE": 0, "open": 8, "MODIFIER_LIST:open": 0, "FUN:open": 1, "FUN:MODIFIER_LIST:open": 0, "fun": 1, "FUN:fun": 0, "FUN:IDENTIFIER": 2, "VALUE_PARAMETER_LIST": 3, "FUN:VALUE_PARAMETER_LIST": 0, "VALUE_PARAMETER_LIST:LPAR": 1, "FUN:LPAR": 1, "FUN:VALUE_PARAMETER_LIST:LPAR": 0, "VALUE_PARAMETER": 6, "VALUE_PARAMETER_LIST:VALUE_PARAMETER": 1, "FUN:VALUE_PARAMETER": 1, "FUN:VALUE_PARAMETER_LIST:VALUE_PARAMETER": 0, "VALUE_PARAMETER:IDENTIFIER": 1, "VALUE_PARAMETER_LIST:IDENTIFIER": 1, "VALUE_PARAMETER_LIST:VALUE_PARAMETER:IDENTIFIER": 0, "FUN:VALUE_PARAMETER:IDENTIFIER": 1, "FUN:VALUE_PARAMETER_LIST:IDENTIFIER": 0, "COLON": 2, "VALUE_PARAMETER:COLON": 0, "VALUE_PARAMETER_LIST:COLON": 1, "FUN:COLON": 22, "VALUE_PARAMETER_LIST:VALUE_PARAMETER:COLON": 2, "FUN:VALUE_PARAMETER:COLON": 0, "FUN:VALUE_PARAMETER_LIST:COLON": 0, "VALUE_PARAMETER:WHITE_SPACE": 13}

Example of use

python3 main.py -s sparse_transformation -i ./trees_as_vectors -o ./trees_as_vectors_sparsed --all_features_file ./trees_as_vectors/all_features.json
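
The two output formats can be illustrated with a short sketch. It is not the repository's implementation: to_sparse is a hypothetical name, and it assumes that the list of all features is available as plain feature names and that features absent from a vector get the value 0. The list format is ordered by feature name, as described above; the ordering of the map format is an assumption.

```python
def to_sparse(vector, all_features, sparse_format="list"):
    """Expand a feature-value map so that every known feature is present.

    Features missing from `vector` are filled in with 0 (an assumption).
    """
    if sparse_format == "list":
        # List elements are ordered by feature name.
        return [vector.get(name, 0) for name in sorted(all_features)]
    if sparse_format == "map":
        # Keeps the order of `all_features`; the real ordering is not documented.
        return {name: vector.get(name, 0) for name in all_features}
    raise ValueError(f"unknown sparse format: {sparse_format}")


vector = {"FUN": 14, "FUN:open": 1}
all_features = ["open", "FUN", "FUN:open", "MODIFIER_LIST"]
print(to_sparse(vector, all_features, "list"))  # [14, 1, 0, 0]
print(to_sparse(vector, all_features, "map"))
```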

Normalization

Program arguments

  • -s, --stage -> normalize;
  • -i, --input_folder: input folder with "vectors" (actually feature-value maps) from a previous stage (sparse_transformation with sparse_format=map, or trees2vectors), sparse or not;
  • -o, --output_folder: output folder for the normalized "vectors";
  • --all_features_file: file with all features, generated by the trees2vectors stage.

Example output

{"FUN": 0.0007215007215007215, "MODIFIER_LIST": 0.0007215007215007215, "FUN:MODIFIER_LIST": 0.0007215007215007215, "public": 0.0007215007215007215, "MODIFIER_LIST:public": 0.0007215007215007215, "FUN:public": 0.0007215007215007215, "FUN:MODIFIER_LIST:public": 0.0007215007215007215, "MODIFIER_LIST:WHITE_SPACE": 0.0007215007215007215, "FUN:MODIFIER_LIST:WHITE_SPACE": 0.0007215007215007215, "open": 0.0007215007215007215, "MODIFIER_LIST:open": 0.0007215007215007215, "FUN:open": 0.0007215007215007215, "FUN:MODIFIER_LIST:open": 0.0007215007215007215, "fun": 0.0007215007215007215, "FUN:fun": 0.0007215007215007215, "FUN:IDENTIFIER": 0.0007215007215007215, "VALUE_PARAMETER_LIST": 0.0007215007215007215, "FUN:VALUE_PARAMETER_LIST": 0.0007215007215007215, "VALUE_PARAMETER_LIST:LPAR": 0.0007215007215007215, "FUN:LPAR": 0.0007215007215007215, "FUN:VALUE_PARAMETER_LIST:LPAR": 0.0007215007215007215, "VALUE_PARAMETER_LIST:RPAR": 0.0007215007215007215, "LBRACE": 0.0007215007215007215, "BLOCK:LBRACE": 0.0007215007215007215, "FUN:LBRACE": 0.0007215007215007215, "FUN:BLOCK:LBRACE": 0.0007215007215007215, "RETURN": 0.0007215007215007215, "BLOCK:RETURN": 0.0007215007215007215, "FUN:RETURN": 0.0007215007215007215, "FUN:BLOCK:RETURN": 0.0007215007215007215, "return": 0.0007215007215007215}

Example of use

python3 main.py -s normalize -i ./trees_as_vectors -o ./trees_as_vectors_normalized --all_features_file ./trees_as_vectors/all_features.json
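
As a rough illustration of the normalization step, the sketch below divides every value by a per-file total. It reads "the number of all features in the current file" as the sum of all feature values in that file, which is an assumption; the function name normalize_vector is hypothetical, and the actual stage may compute the divisor differently.

```python
def normalize_vector(vector):
    """Divide each feature value by the sum of all values in the same file.

    The choice of divisor is an assumption, not taken from the repository.
    """
    total = sum(vector.values())
    if total == 0:
        return dict(vector)  # nothing to scale; avoid division by zero
    return {name: value / total for name, value in vector.items()}


print(normalize_vector({"FUN": 3, "open": 1}))  # {'FUN': 0.75, 'open': 0.25}
```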

Statistics collection

Collects feature frequency statistics from the feature statistics file generated by the trees2vectors stage.

Program arguments

  • -s, --stage -> collect_statistic;
  • -o, --output_folder: output folder for the statistics files (all_features_sorted.json, all_features_sorted_1.json, all_features_sorted_2.json, ..., all_features_sorted_n.json);
  • --all_features_file: file with all features, generated by the trees2vectors stage.

Example of use

python3 main.py -s collect_statistic -o ./features_statistic --all_features_file ./trees_as_vectors/all_features.json
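
The kind of sorting this stage performs can be sketched as follows. The sketch rests on two assumptions that the README does not confirm: that all_features.json maps feature names to frequencies, and that n is the n-gram size, counted as the number of colon-separated parts in a feature name. The function name sort_by_frequency is hypothetical.

```python
from collections import defaultdict


def sort_by_frequency(all_features):
    """Return (overall, per_n), both sorted by descending frequency.

    `per_n` groups features by an assumed n-gram size, e.g. "FUN:open" -> 2.
    """
    overall = sorted(all_features.items(),
                     key=lambda item: (-item[1], item[0]))
    per_n = defaultdict(list)
    for name, frequency in overall:
        per_n[name.count(":") + 1].append((name, frequency))
    return overall, dict(per_n)


stats = {"FUN": 14, "open": 8, "FUN:open": 1, "FUN:MODIFIER_LIST:open": 2}
overall, per_n = sort_by_frequency(stats)
print(overall[0])  # ('FUN', 14)
print(per_n[1])    # [('FUN', 14), ('open', 8)]
```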

vectors2matrix

Transforms the files with vectors into a matrix in JSON format.

Program arguments

  • -s, --stage -> vectors2matrix;
  • -i, --input_folder: path to the directory with trees represented as vectors (in JSON format), obtained from the sparse transformation stage;
  • --output_file: path to the output file, which will contain the matrix in JSON format.

Example of use

python3 main.py -s vectors2matrix -i ./trees_as_vectors_sparsed --output_file ./dataset.json
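
Conceptually, this stage stacks the per-tree vectors into rows of one matrix. A minimal sketch follows, assuming one JSON vector per file, one matrix row per file, and rows ordered by file path; vectors_to_matrix is a hypothetical name, and the real stage may order or filter files differently.

```python
import json
from pathlib import Path


def vectors_to_matrix(input_folder, output_file):
    """Collect every JSON vector under input_folder into a list of rows."""
    # Files are gathered (and sorted by path) before anything is written.
    files = sorted(Path(input_folder).rglob("*.json"))
    matrix = [json.loads(path.read_text()) for path in files]
    Path(output_file).write_text(json.dumps(matrix))
    return matrix
```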

matrix2csv

Transforms the matrix in JSON format into a CSV dataset.

Program arguments

  • -s, --stage -> matrix2csv;
  • --input_file: path to the input file with the matrix in JSON format;
  • --output_file: path to the output file, which will contain the matrix in CSV format.

Example of use

python3 main.py -s matrix2csv --input_file ./dataset.json --output_file ./dataset.csv
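
The conversion itself is small enough to sketch with the Python standard library. This is an illustration under assumptions, not the repository's code: matrix_to_csv is a hypothetical name, the JSON file is assumed to hold a list of rows, and no header row is written (whether the real stage writes one is not stated here).

```python
import csv
import json


def matrix_to_csv(input_file, output_file):
    """Write a JSON matrix (a list of rows) to a CSV file, one row per line."""
    with open(input_file) as src:
        matrix = json.load(src)
    # newline="" lets the csv module control line endings itself.
    with open(output_file, "w", newline="") as dst:
        csv.writer(dst).writerows(matrix)
```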
