AST factorization: transformation of the AST of Kotlin source code into a vector

petukhovv/tree2vec

tree2vec

Description

Transforms a tree (e.g. AST, CST, PSI) into a vector. The vector is constructed by extracting features from the tree.

The program consists of the following feature extractors:

  • DepthExtractor - extracts the min, max or mean depth of the tree;
  • CharsLengthExtractor - extracts the min, max or mean character length (for some nodes) from the tree;
  • NGramsExtractor - counts the occurrences of the specified n-grams;
  • AllNGramsExtractor - counts all n-grams that match a specified configuration (n, max_distance, etc.). See ast-ngram-extractor (for n-gram extraction only).

This program is used as part of tree-set2matrix.

Example of use

python3 main.py -i ./trees/ast.json -o ./trees_as_vectors/ast_as_vector.json --no_normalize

Program arguments

  • -i, --input: file with the tree;
  • -o, --output: output file, which will contain the features and feature values as JSON;
  • -d, --is_normalize: whether to normalize the vectors by the maximum value.

Tree format

The program expects an input tree in the following format (example input):

[
   {
      "type":"FUN",
      "children":[
         {
            "type":"MODIFIER_LIST",
            "children":[
               {
                  "type":"override"
               }
            ]
         },
         {
            "type":"IDENTIFIER"
         },
         {
            "type":"VALUE_PARAMETER_LIST",
            "children":[
               {
                  "type":"LPAR"
               },
               {
                  "type":"VALUE_PARAMETER",
                  "children":[
                     {
                        "type":"IDENTIFIER"
                     }
                  ]
               }
            ]
         }
      ]
   }
]

This is a Kotlin AST generated by a custom Kotlin compiler. Trees obtained by parsing Kotlin code to a CST (PSI) require a transformer, which is part of github-kotlin-code-collector (see src/lib/helper/AstHelper.py).

The file with the JSON representation of the tree must be passed as an argument to the program.

For example: python main.py trees/ast_of_my_program.json
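
To make the tree format concrete, here is a minimal sketch of how depth features could be computed from such a file. It is not the repository's DepthExtractor: the function names are hypothetical, and it assumes that the root has depth 1 and that the mean is taken over all nodes.

```python
import json


def node_depths(node, depth=1):
    """Yield the depth of every node in the subtree (assumption: root depth is 1)."""
    yield depth
    for child in node.get("children", []):
        yield from node_depths(child, depth + 1)


def depth_features(forest):
    """Hypothetical helper: max and mean node depth over a list of root nodes."""
    depths = [d for root in forest for d in node_depths(root)]
    return {"depth": max(depths), "depth_avg": sum(depths) / len(depths)}


# A small tree in the input format described above.
tree_json = """
[{"type": "FUN", "children": [
    {"type": "MODIFIER_LIST", "children": [{"type": "override"}]},
    {"type": "IDENTIFIER"}
]}]
"""
forest = json.loads(tree_json)
features = depth_features(forest)
print(features)  # {'depth': 3, 'depth_avg': 2.0}
```

Nodes without a "children" key are treated as leaves, which matches the example input above.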

Vector format

The program output is a map from feature names to feature values.

The feature values are the components of the vector.

For example:

{
  'chars_length_avg': 47.297029702970299,
  'chars_length_max': 2047,
  'depth': 16,
  'depth_avg': 6.3469387755102042,
  'CALL_EXPRESSION': 0.06373937677053824,
  'DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION:IDENTIFIER': 0.028368794326241134,
  'DOT_QUALIFIED_EXPRESSION:DOT_QUALIFIED_EXPRESSION': 0.005689900426742532
}
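
Since the output is a map, a consumer has to fix an order of features to obtain an actual vector. The sketch below shows one way to do that. The helper `to_vector`, the alphabetical ordering and the 0.0 default for missing features are assumptions made for illustration, not behaviour of this program.

```python
def to_vector(feature_map, feature_order):
    """Return the feature values in a fixed order (assumption: missing features become 0.0)."""
    return [float(feature_map.get(name, 0.0)) for name in feature_order]


feature_map = {
    "chars_length_max": 2047,
    "depth": 16,
    "CALL_EXPRESSION": 0.06373937677053824,
}

# Sorting the names gives a reproducible component order across trees.
feature_order = sorted(feature_map)
vector = to_vector(feature_map, feature_order)
print(feature_order)
print(vector)
```

Using the same `feature_order` for every tree keeps the components of different vectors aligned.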

Feature configuration

Features are specified in main.py: an array of keys for simple features, and an array of objects for n-grams and (in the future) other feature types.

Example with specified n-grams:

simple_features = [
    'depth',
    'depth_avg',
    'chars_length_avg',
    'chars_length_max'
]

features = [
    {
        'type': 'ngram',
        'params': {
            'name': 'CALL_EXPRESSION',
            'node_types': ['CALL_EXPRESSION']
        }
    },
    {
        'type': 'ngram',
        'params': {
            'name': 'DOT_QUALIFIED_EXPRESSION:REFERENCE_EXPRESSION:IDENTIFIER',
            'node_types': ['DOT_QUALIFIED_EXPRESSION', 'REFERENCE_EXPRESSION', 'IDENTIFIER'],
            'max_distance': 3
        }
    },
    {
        'type': 'ngram',
        'params': {
            'name': 'DOT_QUALIFIED_EXPRESSION:DOT_QUALIFIED_EXPRESSION',
            'node_types': ['DOT_QUALIFIED_EXPRESSION', 'DOT_QUALIFIED_EXPRESSION'],
            'max_distance': 1
        }
    }
]

node_types - the types of nodes that must lie on a single path in the tree (within the specified distance).

name - the name of the feature, used as the feature name in the output.

Example extracting all n-grams with a specified configuration:
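
The following sketch illustrates one possible reading of node_types and max_distance. It is not the repository's NGramsExtractor. It assumes that each next node of the n-gram is a descendant of the previous one, at most max_distance edges below it, and that max_distance defaults to 1. All function names are hypothetical, and the result is a raw count, without the normalization seen in the example output.

```python
def iter_nodes(node):
    """Yield every node of the subtree rooted at `node`."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def descendants_within(node, max_distance):
    """Yield descendants of `node` that are at most `max_distance` edges below it."""
    if max_distance <= 0:
        return
    for child in node.get("children", []):
        yield child
        yield from descendants_within(child, max_distance - 1)


def matches_from(node, node_types, max_distance):
    """Count n-gram occurrences whose first element is `node`."""
    if node["type"] != node_types[0]:
        return 0
    if len(node_types) == 1:
        return 1
    return sum(
        matches_from(desc, node_types[1:], max_distance)
        for desc in descendants_within(node, max_distance)
    )


def count_ngram(forest, node_types, max_distance=1):
    """Count occurrences of the n-gram `node_types` in a list of root nodes."""
    return sum(
        matches_from(node, node_types, max_distance)
        for root in forest
        for node in iter_nodes(root)
    )


forest = [{"type": "DOT_QUALIFIED_EXPRESSION", "children": [
    {"type": "DOT_QUALIFIED_EXPRESSION", "children": [
        {"type": "REFERENCE_EXPRESSION", "children": [{"type": "IDENTIFIER"}]}
    ]},
    {"type": "CALL_EXPRESSION"}
]}]

trigram = ["DOT_QUALIFIED_EXPRESSION", "REFERENCE_EXPRESSION", "IDENTIFIER"]
print(count_ngram(forest, ["CALL_EXPRESSION"]))          # 1
print(count_ngram(forest, trigram, max_distance=1))      # 1
print(count_ngram(forest, trigram, max_distance=3))      # 2
```

With max_distance=3 the outer DOT_QUALIFIED_EXPRESSION also matches, because REFERENCE_EXPRESSION is two edges below it. With max_distance=1 only directly nested nodes are counted.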

simple_features = [
    'depth',
    'depth_avg',
    'chars_length_avg',
    'chars_length_max'
]

features = [
    {
        'type': 'all_ngrams',
        'params': {
            'n': 3,
            'max_distance': 3,
            'no_normalize': True,
            'include': [['CALL_EXPRESSION', 'LPAR'], ['VALUE_ARGUMENT_LIST'], ['SAFE_ACCESS_EXPRESSION']],
            'exclude': [['FUN']]
        }
    }
]

All n-grams extractor arguments

  • n: the maximum n of the n-grams;
  • max_distance: the maximum distance between neighboring nodes (window);
  • no_normalize: flag controlling whether the values (n-gram counts) are normalized;
  • include: array of arrays of sub-n-grams that must be contained in the found n-grams;
  • include_strict: required n-grams (any other n-grams found are removed);
  • exclude: array of arrays of sub-n-grams that must not be contained in the found n-grams;
  • exclude_strict: n-grams that should be excluded.
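
The include and exclude arguments can be pictured as a filter over the n-grams that were found. The sketch below is an illustration under assumptions, not the repository's implementation. It assumes an n-gram is a tuple of node types, that "contained" means a contiguous run, that matching any one include entry is enough, and that matching any exclude entry removes the n-gram. The names `contains` and `filter_ngrams` are hypothetical.

```python
def contains(ngram, sub):
    """True if `sub` occurs in `ngram` as a contiguous run of node types."""
    n, m = len(ngram), len(sub)
    return any(tuple(ngram[i:i + m]) == tuple(sub) for i in range(n - m + 1))


def filter_ngrams(ngrams, include=None, exclude=None):
    """Keep n-grams matching at least one include entry and no exclude entry."""
    kept = []
    for ngram in ngrams:
        if include and not any(contains(ngram, sub) for sub in include):
            continue
        if exclude and any(contains(ngram, sub) for sub in exclude):
            continue
        kept.append(ngram)
    return kept


found = [
    ("FUN", "VALUE_ARGUMENT_LIST"),
    ("CALL_EXPRESSION", "LPAR", "IDENTIFIER"),
    ("CALL_EXPRESSION", "IDENTIFIER"),
    ("SAFE_ACCESS_EXPRESSION",),
]
kept = filter_ngrams(
    found,
    include=[["CALL_EXPRESSION", "LPAR"], ["VALUE_ARGUMENT_LIST"], ["SAFE_ACCESS_EXPRESSION"]],
    exclude=[["FUN"]],
)
print(kept)
```

Here the first n-gram is dropped because it contains FUN, and the third because it matches no include entry.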
