CppCodeAnalyzer is a Python-based parsing tool for C/C++ that constructs code property graphs. It is the Python version of CppCodeAnalyzerJava, and most of its functions are similar to Joern's. The differences are:
- The grammar we use comes from the official Antlr grammars-v4 repository, so the input of the `ast` module (the Antlr AST) is quite different from Joern's, while the output customized AST is the same. The parsing module in the `ast` package therefore differs from Joern's.
- When constructing the CFG, CppCodeAnalyzer takes `for-range` and `try-catch` into consideration.
  - When parsing code such as `for (auto p: vec){ xxx }`, the CFG looks like Graph 1.
  - When parsing `try-catch`, we simply ignore the statements in the catch block, because they are not executed under normal conditions and the control flow in `try-catch` is quite hard to compute.
- When extracting use-def information in the `udg` package, we take pointer uses into account. For example, `memcpy(dest, src, 100);` defines symbol `* dest` and uses symbol `* src`. Joern considers pointer definitions through the variable `Tainted` but does not consider pointer uses.
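The pointer def/use rule for calls such as `memcpy` can be sketched as a small lookup table. This is a minimal illustration under stated assumptions: `POINTER_EFFECTS` and `call_def_use` are hypothetical names, the table layout is not the schema of calleeInfos.json, and treating every identifier argument as a plain use is a simplification made for this example.

```python
# Hypothetical table: for each callee, the argument positions whose
# pointed-to memory is defined (written) or used (read).
# Illustrative only; this is not the calleeInfos.json schema.
POINTER_EFFECTS = {
    "memcpy": {"defs": [0], "uses": [1]},
    "memset": {"defs": [0], "uses": []},
    "strcpy": {"defs": [0], "uses": [1]},
}


def call_def_use(callee, args):
    """Return (defined, used) symbol sets for one call expression."""
    effects = POINTER_EFFECTS.get(callee, {"defs": [], "uses": []})
    # Assumption: passing an identifier as an argument reads that identifier.
    used = {a for a in args if a.isidentifier()}
    # Reading through a pointer argument uses the symbol "* name".
    used |= {f"* {args[i]}" for i in effects["uses"] if i < len(args)}
    # Writing through a pointer argument defines the symbol "* name".
    defined = {f"* {args[i]}" for i in effects["defs"] if i < len(args)}
    return defined, used


defined, used = call_def_use("memcpy", ["dest", "src", "100"])
print(sorted(defined))  # ['* dest']
print(sorted(used))     # ['* src', 'dest', 'src']
```

A callee that is missing from the table contributes no pointer definitions or uses, which is why the choice of listed APIs affects the resulting data dependences.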
Graph 1
graph LR
EmptyCondition --> A[auto p: vec]
A --> B[xxx]
B --> EmptyCondition
EmptyCondition --> Exit
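The edges of Graph 1 can also be written down as a successor map. The sketch below is only an illustration: `for_range_cfg` is a hypothetical helper, and nodes are plain strings rather than the CFG node classes of the `cfg` package.

```python
from collections import defaultdict


def for_range_cfg(header, body):
    """Build successor lists for `for (header) { body }` as drawn in Graph 1."""
    succ = defaultdict(list)
    chain = [header] + list(body)
    # EmptyCondition either enters the loop or leaves it.
    succ["EmptyCondition"].append(chain[0])
    succ["EmptyCondition"].append("Exit")
    # Statements inside the loop run in order.
    for src, dst in zip(chain, chain[1:]):
        succ[src].append(dst)
    # The last statement of the loop jumps back to the condition.
    succ[chain[-1]].append("EmptyCondition")
    return dict(succ)


cfg = for_range_cfg("auto p: vec", ["xxx"])
for node, targets in cfg.items():
    print(node, "->", targets)
```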
The pipeline of CppCodeAnalyzer is similar to Joern's and can be illustrated as:
graph LR
AntlrAST --Transform --> AST -- control flow analysis --> CFG
CFG -- dominate analysis --> CDG
CFG -- symbol def use analysis --> UDG
UDG -- data dependence analysis --> DDG
For more details, refer to "Joern工具工作流程分析" (Analysis of the Joern Tool Workflow).
- Package `ast` transforms the Antlr AST into the customized AST.
- Package `cfg` conducts control flow analysis and converts the customized AST into a CFG.
- Package `cdg` conducts statement dominance analysis and constructs control dependence relations between statements.
- Package `udg` analyzes the symbols defined and used in each statement independently.
- Package `ddg` constructs data dependence relations between statements from the def-use information computed in the `udg` package.
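To make the dominance step more concrete, here is a minimal sketch of the textbook approach: compute post-dominator sets iteratively, then derive control dependences from CFG edges. It is not taken from the `cdg` package, whose implementation may differ; the function names and the string-based CFG are assumptions for illustration.

```python
def post_dominators(succ, exit_node):
    """Iteratively compute the set of post-dominators of every CFG node."""
    nodes = set(succ) | {t for targets in succ.values() for t in targets}
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            targets = succ.get(n, [])
            # A node is post-dominated by itself plus whatever
            # post-dominates all of its successors.
            common = set.intersection(*(pdom[t] for t in targets)) if targets else set()
            new = {n} | common
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependences(succ, exit_node):
    """Return pairs (a, y) meaning statement y is control dependent on a."""
    pdom = post_dominators(succ, exit_node)
    deps = set()
    for a, targets in succ.items():
        for b in targets:
            # y post-dominates successor b but does not strictly post-dominate a.
            for y in pdom[b] - (pdom[a] - {a}):
                deps.add((a, y))
    return deps


# The for-range CFG from Graph 1, written as a successor map.
cfg = {
    "EmptyCondition": ["auto p: vec", "Exit"],
    "auto p: vec": ["xxx"],
    "xxx": ["EmptyCondition"],
}
for edge in sorted(control_dependences(cfg, "Exit")):
    print(edge)
```

On this CFG the only controlling node is `EmptyCondition`: the loop header statement, the body, and the condition itself all depend on it.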
The test files in the directory `test/mainToolTests` illustrate how each module works; refer to those test cases to learn how to use the CppCodeAnalyzer API.
Environment:
- python 3.8
- antlr4-python3-runtime 4.9.2
Used as a Python package:
- Download the release first and unzip it.
- Run `python setup.py bdist_wheel` and `pip install dist/CppCodeAnalyzer-1.0-py3-none-any.whl`.
- After installing, when importing APIs from CppCodeAnalyzer, you only need to add the prefix `CppCodeAnalyzer` to the package name. For example, the import statement `from mainTool.udg.astAnalyzers import ASTDefUseAnalyzer, CalleeInfos, CFGToUDGConverter` in CPGBuildTest.py becomes `from CppCodeAnalyzer.mainTool.udg.astAnalyzers import ASTDefUseAnalyzer, CalleeInfos, CFGToUDGConverter`.
- When we ran experiments using Joern to parse the SARD datasets, we found some errors. The statement `wchar_t data[50] = L'A';` should be in a single CFG node, but each token in the statement was assigned to its own CFG node. After checking the source code, we believe the root cause is the grammar used by Joern.
- Also, most researchers write deep-learning programs in Python, so it is more convenient to parse code with Python: the parsing module can connect directly to the deep-learning module, and there is no need to write scripts to parse the output of Joern.
- Parsing control flow in `for-range` and `try-catch` is difficult; we found no materials describing the CFG of `for-range` and `try-catch`.
- Parsing def-use information of pointer variables is difficult. For example, in `*(p+i+1) = a[i][j];`, the symbols defined include `* p` and the symbols used include `p, i, j, a, * a`. This is not very accurate, but computing memory locations statically is difficult. This can cause the following problem:
s1: memset(source, 'A', 100);
s2: source[99] = '\0';
s3: memcpy(data, source, 100);
- In the results of CppCodeAnalyzer, s1 and s2 both define symbol `* source`, but the latter kills the former, so the DDG contains only the edge `s2 -> s3`.
- However, s1 defines `* source` and s2 defines `* (source + 99)`, so a precise DDG should contain the edges `s1 -> s3` and `s2 -> s3`.
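The kill behaviour described above can be reproduced with a few lines of straight-line reaching-definition logic. This is a sketch under stated assumptions: `data_dependences` is a hypothetical helper, the def/use sets are written by hand rather than produced by the `udg` package, and the "precise" variant assumes an analysis that models `memcpy` as also reading `* (source + 99)`.

```python
def data_dependences(statements):
    """statements: list of (label, defined_symbols, used_symbols) in execution order.

    Straight-line reaching definitions: a new definition of a symbol
    kills the previous definition of the same symbol.
    """
    reaching = {}  # symbol -> label of the statement that last defined it
    edges = []
    for label, defs, uses in statements:
        for sym in sorted(uses):
            if sym in reaching:
                edges.append((reaching[sym], label, sym))
        for sym in defs:
            reaching[sym] = label
    return edges


# Symbol-level view: s1 and s2 define the same symbol, so s2 kills s1.
coarse = [
    ("s1", {"* source"}, {"source"}),
    ("s2", {"* source"}, {"source"}),
    ("s3", {"* data"}, {"data", "source", "* source"}),
]
print(data_dependences(coarse))  # [('s2', 's3', '* source')]

# Hand-written "precise" view: s2 defines a distinct location.
precise = [
    ("s1", {"* source"}, {"source"}),
    ("s2", {"* (source + 99)"}, {"source"}),
    ("s3", {"* data"}, {"data", "source", "* source", "* (source + 99)"}),
]
print(sorted({(a, b) for a, b, _ in data_dependences(precise)}))
# [('s1', 's3'), ('s2', 's3')]
```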
Also, our tool is much slower than Joern: parsing a file in the SARD dataset normally takes 20 to 30 seconds, so we recommend dumping the output CPG to JSON first if you need to train a model. The Java version, CppCodeAnalyzerJava, is much faster; if you prefer fast analysis, use the Java version.
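Caching parsed graphs on disk could look like the sketch below. The node and edge layout, the file name, and the helpers `dump_cpg` and `load_cpg` are assumptions made for illustration; they are not part of CppCodeAnalyzer, and the real CPG objects would first need to be converted to plain dictionaries and lists.

```python
import json
import os
import tempfile


def dump_cpg(path, nodes, edges):
    """Write a parsed graph to disk so training code can reload it quickly."""
    payload = {
        "nodes": nodes,
        "edges": [{"src": s, "dst": d, "type": t} for s, d, t in edges],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)


def load_cpg(path):
    """Read back a graph written by dump_cpg."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical, hand-written graph fragment.
nodes = [
    {"id": "s2", "code": "source[99] = '\\0';"},
    {"id": "s3", "code": "memcpy(data, source, 100);"},
]
edges = [("s2", "s3", "DDG")]

path = os.path.join(tempfile.mkdtemp(), "example_cpg.json")
dump_cpg(path, nodes, edges)
loaded = load_cpg(path)
print(loaded["edges"][0]["type"])  # DDG
```

Parsing each file once and reloading the JSON during training avoids paying the parsing cost on every epoch.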
calleeInfos.json stores the APIs that define or use pointer-type variables. You can use the `json` package to load these callee infos and set `ASTDefUseAnalyzer.calleeInfos` according to your own preference when analysing the use-def information of each code line.
Note that calleeInfos.json is important for parsing data dependence; without it you would lose the data dependences of pointer variables generated by APIs (such as `memcpy`). You can load it like this:
import json
from CppCodeAnalyzer.mainTool.CPG import initialCalleeInfos, CFGToUDGConverter, ASTDefUseAnalyzer
calleeInfs = json.load(open("path to calleeInfos.json", 'r', encoding='utf-8'))
calleeInfos = initialCalleeInfos(calleeInfs)
converter: CFGToUDGConverter = CFGToUDGConverter()
astAnalyzer: ASTDefUseAnalyzer = ASTDefUseAnalyzer()
astAnalyzer.calleeInfos = calleeInfos
converter.astAnalyzer = astAnalyzer
Remember to set `astAnalyzer.calleeInfos = calleeInfos` and `converter.astAnalyzer = astAnalyzer` so that calleeInfos is actually loaded.
The package `extraTools` contains some preprocessing code for the vulnerability detectors IVDetect, SySeVR and DeepWuKong. For usage, refer to the files in `test/extraToolTests`.
Li Y, Wang S, Nguyen T N. Vulnerability Detection with Fine-grained Interpretations. 2021.