Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Eclipse CDT parser, create SymbolTable #1479

Closed
wants to merge 1 commit into from

Conversation

ivangalkin
Copy link
Contributor

@ivangalkin ivangalkin commented May 20, 2018

Hi @guwirth, @Bertk
Hi All!

this patch implements the symbol table for C++ files (US #1401). It uses free and open source Eclipse CDT parser (https://github.com/eclipse/cdt) in order to preprocess the source file and create an AST. I believe, that the code of Eclipse CDT project in general and the design of AST in particular have very high quality. I think it's a good idea to outsource the parser to the 3rd party project, because it's a) a very complex code b) not a principal purpose of sonar-cxx c) subject of constant changes (C++11, C++14, 17 etc)

The parser is used for SymbolTable only and doesn't break any other visitors of sonar-cxx.

Regarding to the code itself:

a) I tested the SymbolTable on the test from Bertk@78643d3#diff-399f59d3cf1cdb0fa3997373554e742f. I've even added templates in order to make the parsing more complex. The testing is tricky, so as a first step I just added a kind of a dump (see symbols.cc, symbols_declarations.cc and symbols_references.cc). In my opinion, the completeness is 99%.

b) I tested the SymbolTable on the productive and very template-heavy code. It works very satisfying.

the pseudo-code works as following...

ast = cdtparser.preprocess_and_parse(source_file)
declarations = ast.find_all_declarations(); # in source_file and all includes
for declaration in declarations:
    references = ast.search_all_references_in_file(declaration, source_file)
    if len(references) == 0:
        continue
    if not ast.declaration_in_file(declaration, source_file):
        fake_declaration = references[0]
        table[fake_declaration] = references[1: ]
    else:
         table[declaration] = references

Please take a look at the code. There is not much of logic and it isn't complex. I tried to avoid too much SonarQube dependencies in the parser wrapper, so I had to introduce some glue-code w.r.t. to the TextPoint and TextRange.

I'm looking forward for your feedback.


This change is Reviewable

@Bertk
Copy link
Contributor

Bertk commented May 20, 2018

Hi @ivangalkin,
i am still busy with other tasks and we need some more time to revert a former decision to remove CDT (see #272, #640).

Conceptually, youre right: why reenvent the whell, why not reuse the already existing components? In practice, we tried already and failed: some time ago Przemyslaw Kociolek used the components from Eclipse CDT to parse and build AS Trees. The result was pretty bad: we never got it stable, the code throwed exceptions at various occations, it was lots of code, it was unclear how to build on that etc. All this despite the fact that it also was Java, we werent forced to maintain a glue layer using JNI or other wrapping technology etc.

Finally, Ive thrown it away and replaced with the SSLR based thing, which delivered more value from the very beginning.

It has also advantages to be technology compatible with the rest of the SQ language plugins, as one can port featured found there quite straightforward to sonar-cxx.

So Im sceptical about this. But Im ready to change my mind somebody can prove the opposite by showing off a well-working prototype based on clang ;)

I like the last comment of "The art of parsing" discussion:

Exactly. We dont want to fiddle with XML files or anything. libclang is a lib, and as a such, should not force you to use certain output channels. Ideally: we give it the source code in form of a string or a stream along with some context (include paths, options) and it outputs a tree of Java objects. Thats the deal. The 'narrow matching' and excluding the vast amount of code coming from the external libs is of course an important sub-problem. Currently, this is done via our PP implementation, whose handling of an #include is not actually including the code, but recursively scanning the included code for macros.

Unfortunatly, we did not spend effort selecting another parser which solves the current issues.
http://clang.llvm.org/docs/IntroductionToTheClangAST.html
http://blog.glehmann.net/2014/12/29/Playing-with-libclang/

@guwirth
Copy link
Collaborator

guwirth commented May 20, 2018

Hi @ivangalkin,

Thanks for your proposal.

Think there are several points we have to check:

  • Which license does this parser use? Is the license compatible to our one (and open source)?
  • How good maintained is the parser?
  • Which C++ standards are supported? We are supporting C89, C99, C11, C++03, C++11, C++14 and C++17 standards currently
  • Which C++ variants are supported? We are supporting currently
    • Microsoft extensions: C++/CLI, Attributed ATL
    • GNU extensions
    • CUDA extensions
  • Is there a preprocessor included?
  • What's about the performance? Is it faster/slower as our current implementation?
  • What's about the memory consumption?
    ...

Regards,

@guwirth
Copy link
Collaborator

guwirth commented May 22, 2018

... and just another point: the parser must have error recovery: https://github.com/SonarOpenCommunity/sonar-cxx/wiki/Error-Recovery

@ivangalkin
Copy link
Contributor Author

Hi @guwirth, @Bertk
Hi All!

thank you for your feedback!

Although I see a big advantage of using of Eclipse CDT (or other mature 3rd party C/C++ parser) over the SSLR-based solution, I don't suggest to replace the current parsing completely. SSLR seems to work well for the (advanced) tokenization, which is enough for the highlighting and the simple CxxSquidSensor. However the creation of the consistent symbol table requires much more than this. In fact it requires the complete syntactic and semantic analysis (e.g. type inference, ambiguity/overload resolutions, templates processor and much more). Even GCC/clang face frontend bugs from time to time, so unfortunately I just don't see any chance to solve US #1401 properly without an external parser.

So for now, please let us see the scope of Eclipse CDT parser as a tool for symbol table creation only.

Regarding Eclipse CDT (or more precise org.eclipse.cdt.core)

I haven't found anything about other language extensions, but I think, that even the combination of GNU C++14 and the robust parser covers the biggest part of the symbol table. I have no information about the performance/memory consumption yet.

The usage of the API for the extraction of symbol information is easy (please see my patch) and it works really good.

Best regards,
Ivan

@guwirth
Copy link
Collaborator

guwirth commented May 23, 2018

@ivangalkin

It is licensed under Eclipse Public License Version 1.0

As far as I understand EPL and LGPL seems to be compatible. As long we are using the EPL code/component only without changing it we can put the result under LGPL again.

Although I see a big advantage of using of Eclipse CDT (or other mature 3rd party C/C++ parser) over the SSLR-based solution

See also the advantage. But I don't like to have two parsers in the plug-in. In case we change the parser we should switch completely.

Eclipse CDT supports C/C++ with GNU extensions (builtin functions, GNU C/C++ operators, attributes, keywords etc).

@Bertk and me are using mainly Visual Studio and we both have legacy code with Microsoft extensions: C++/CLI, and Attributed ATL. So question is how to handle this?

Preprocessor is supported

Would be another advantage. Current Preprocessor is deprecated.
Are precompiled header files supported?

it supports the error recovery

That's important and would be a no-go if not.

@Bertk / @jmecosta any opinion to this point?

Regards

@ivangalkin
Copy link
Contributor Author

@guwirth

But I don't like to have two parsers in the plug-in. In case we change the parser we should switch completely.

Maybe we could merge the eclipse CDT parser as part of symbol table and try to rewrite the Squid visitors in terms of a new AST? I literally went through all existing visitors and there is no complex algorithms IMHO. So I am not sure if its an argument for the introduction of a new parser or against it. But I am still 100% sure, that building of symbol table on basis of the current AST is a hell of a lot of work.

C++/CLI, and Attributed ATL. So question is how to handle this?

I believe there should be no big problem. Most probably it works already pretty well. Also extending of keywords, built-ins etc. is comparable easy. I've seen even examples of extending of a syntactic constructs (e.g. gcc's labels extension).

Are precompiled header files supported?

From the source code perspective precompiled header files are nothing special. So if all include paths are specified correctly, such [precompiled] header will be preprocessed and parsed correctly. I don't believe, that eclipse CDT can cache and "inline" the ASTs of [precompiled] headers.

@Bertk
Copy link
Contributor

Bertk commented May 24, 2018

Replacing the parser will change the behavior of the plugin and I am not sure how fast the CDT will be adapted to a new C++ standard e.g. C++20 e.g. TS modules. I have more confidence with the Clang parser but I do not know CDT.
Maybe we should create new repo for this parser migration and deprecate the old Cxx plug-in.
Just use a PR to replace the parser might not be the right strategy.

@jmecosta
Copy link
Member

my opinion is always to reuse a parser that will give us the most benefit in terms of community support, maintainability, features. Clang would always be the best option for this. And we can even reuse most of the checks clang already provides. This would be inline with C# plugin that uses roslyn, and the java part is simple consuming the metrics.

CDT was in used before the SSLR, and there was a decision to deprecated in favour of the SSLR back in the day. Then i think it made sense to do it. Now im not sure, having both SSLR and CDT to maintain i think its not the way to go since it will make maintenance of the plugin a harder.

Also the size of the plugin has jumped from
onar-cxx-plugin\target\sonar-cxx-plugin-1.0.0.jar
File 15 MB
to
sonar-cxx-plugin\target\sonar-cxx-plugin-1.0.0.jar
File 25 MB

@ivangalkin
Copy link
Contributor Author

ivangalkin commented May 24, 2018

Hi @Bertk @jmecosta and @guwirth,
Hi All,

  1. regarding clang: I hear much positives from you about usage of clang with respect to the Sonar-CXX/SonarQube. I use clang by myself and would prefer clang for any productive scenarios where compilation of source code is required. I [let's say, indirectly] contributed to its backend and see LLVM as one of the biggest advantages.

What's about the clang's C++ frontend: its ideology is to follow the C++ standard as strict as possible. It it good if you want to check the portability of your code or maybe receive more/better warnings. However clang parser (as far as I know it) is definitely not a superset of all possible dialects. It can be configured to accept the GCC-conform code (I mean GCC's sloppy interpretation of standard w.r.t. templates etc.) , but I am wondering about the support of C++/CLI, Attributed ATL, CUDA etc.

Does clang's parser support something like error recovery?

The design idea of Eclipse CDT is the opposite one: produce the most generic parser. It parses the code in IDE while coding process and is very robust.

  1. regarding the checks: I personally like the idea, that the main responsibility of sonar-cxx is the import of external reports. It gives endless number of freedom degrees w.r.t. tools, rules, configurations, [parallel] analysis execution etc . The only thing I expect to work well (besides the import of course) is the highlighting: both keywords and symbols usage. I believe Eclipse CDT suites this task well, but it's ok to me if some of you wants to implement it with clang.

Having a proper AST will bring advantage to the SquidSensor or maybe other visitors. But, to be honest, the most of them are written like apply-regexp-to-the-each-source-line or apply-regexp-to-the-each-comment-line, so even that benefit is arguable.

  1. regarding the plugin size: is it ciritical?

@ivangalkin
Copy link
Contributor Author

one more question w.r.t. clang

  1. Technical implementation: I searched for the way to get AST from clang to java and haven't find anything satisfying. JNI bridges (if exist) require the installation of clang on the worker's machine. This is a disruptive change and could be a requirement, which is hard to implement. Native front-end port for clang (https://github.com/java-port/clank) seems to exist, but it isn't official and therefore of unknown quality and support.

@jmecosta
Copy link
Member

jmecosta commented May 24, 2018 via email

@ivangalkin
Copy link
Contributor Author

ivangalkin commented May 24, 2018

@jmecosta extracting checks and highlighting to C++ component means almost a re-implementation of the whole plugin from scratch IMHO. The only benefits you'll have are...

+ highlighting and symbol table
? the native execution of clang analysis, which can be imported from XML already
? the AST in order to re-implement all existing checks in C++, which almost always do not use AST, but run on the raw text or tokens
? support of the latest C++ standard, which is not really relevant/critical for the highlighting and probably not important for the CxxSquidSensors checks (see above)

... but it comes on costs of

- more complex cross-language design (parts are written in C++, parts in java, re-invention of some exchange protocols etc)
- problems with building, distribution, execution of binary packages

  • it's not python, java or c# - you'll have to pre-build linux-x64/x86, windows x64/x86 and mac-os like SonarCFamily does
  • depends on when you want to execute this native part, you'll need to implement build interceptor (strace-like/bear-like or SonarCFamily-like one) (?) or "fork" it from java

- undetermined properties of clang parser (how fault-tolerant/robust is it w.r.t. to different C/C++ dialects e.g. C++/CLI?, how easy is to "parse" a superset of all C/C++ standards and C/C++ dialects [if possible])

I am not really sure, clang is worth it.

@jmecosta
Copy link
Member

@ivangalkin yes, i agree that is a massive work to port all those to clang. just sharing all available options, and the pros and cons of each.

@guwirth
Copy link
Collaborator

guwirth commented May 31, 2018

What's about the clang's C++ frontend: its ideology is to follow the C++ standard as strict as possible. ... The design idea of Eclipse CDT is the opposite one: produce the most generic parser.

This would be a clear point preferring Eclipse CDT.

But, to be honest, the most of them are written like apply-regexp-to-the-each-source-line or apply-regexp-to-the-each-comment-line, so even that benefit is arguable.

There are still some static code analysis checks but I anyway like to remove it. But the other reason why we need the parser is to create metrics. At the moment there are also some users using XPath on top of the AST.

more complex cross-language design (parts are written in C++, parts in java, re-invention of some exchange protocols etc)

Think we should stay within one technology. SQ is Java.

@ivangalkin is there a possibility to play around wit the CDT like our sslr-cxx-toolkit?

@ivangalkin
Copy link
Contributor Author

@guwirth the existing metrics are different;

  • some of them do not require the AST at all (FILES, LINES, LINES_OF_CODE, COMMENT_LINES), maybe at most a lexer. What I know for sure, is that one can obtain the information about all existing comments and preprocessor directives (I played with it in order to implement the highlighting, based on the EclipseCDT). So maybe it will help.
  • implementation of the following metrics will become trivial, because of well-typed AST (STATEMENTS, FUNCTIONS, CLASSES)
  • it's hard to tell about the metrics like [COGNITIVE_]COMPLEXITY or PUBLIC[_UNDOCUMENTED]_API. It depends on the business logic. I believe they will profit from the AST too, because the actual type of expressions/declarations is stored in the AST

is there a possibility to play around wit the CDT like our sslr-cxx-toolkit

I haven't found any convenient way to play with the CDT/parser/AST etc yet. The code is open sourced, so it's easy to learn how eclipse C/C++ plugin is implemented. You can see the resulting AST in the Java debugger and finally one can always install EclipseCDT and have a good overview about its functionalities.

@guwirth
Copy link
Collaborator

guwirth commented Jun 15, 2018

@Bertk that's your favourite feature :-)

How should we continue:

  1. Look for a non Eclipse CDT solution?
  2. Merge it and have two parsers in the code?

@Bertk
Copy link
Contributor

Bertk commented Jun 15, 2018

I do not like using 2 parsers for the AST in a plug-in and losing the sslr-cxx-toolkit is a major drawback.
CDT also adds 15 MB to the plug-in which is downloaded with every analysis and I want to introduce PR static code analysis with SQ7.2.
So - my preferred option is 1. "Look for a non-Eclipse CDT solution"

@ivangalkin
Copy link
Contributor Author

Just for the sake of accuracy: CDT adds 10 MB (25 MB vs 15 MB). This might sound gigantic if you measure the diff in percentage (+66%), but for the local [corporate] network the transport of 10 MB takes less than a second. It's not a major argument IMHO

w.r.t. the non-Eclipse CDT solution: @Bertk, do you have any idea about the possible one?

I repeat myself, but I am certain, that in the current parser is a bad basis for implementation of a symbol table. Maybe in the middle term it even should be replaced completely. E.g. because

  • sslr-squid-bridge will lose the preprocessor support soon
  • SquidAstVisitorContext will be removed and therefore all visitors (from highlighting to diverse code analyzers) will be forced to be rewritten
  • the existing AST is very basic and doesn't provide any semantic information. Implementation of a complete and consistent symbol table on top this AST will be at least very painful (and, if you think about the complexity of c++, rather near to impossible).

So if we claim to implement a correct symbol table, we need to rework / to replace the current parser and the AST IMHO.

As alternative we could implement something similar to the highlighting of basic online editors (see e.g. http://ideone.com). The highlighting doesn't use any semantic analysis at all. It uses string matching only, and this minimalistic claim is obvious. I believe such implementation is better than to have an erroneous self-written semantic analysis. And it's better than nothing.

@ivangalkin
Copy link
Contributor Author

I've updated the recent version of the patch and it works even better now (macro expansions are excluded, which prevents indexing non-existing symbols). Please give it a try on your SonarQube installations.

@ivangalkin
Copy link
Contributor Author

One more update: use the newest org.eclipse.cdt.core version 6.5.0 (Eclipse Photon)

@Bertk
Copy link
Contributor

Bertk commented Aug 26, 2018

@guwirth @ivangalkin This PR is not a proper solution for SymbolTable because a second independent AST parser only for SymbolTable is introduced and the existing AST parser is still active.
I agree it is working but the C++ grammar maintenance is more complex because the SymbolTable is created with CDT parser and not with the SSLR parser.
Until now I did not see the performance impact for this PR as well and we only know the package size increases 15 MB.

@ivangalkin
Copy link
Contributor Author

@Bertk

  1. w.r.t. to the JAR size: the last time I compared it, the increase was 10 MB
  2. w.r.t. run-time overhead: introduce CxxAstVisitorProfiler for CxxAstScanner #1507 is able to measure it; I'll do the benchmarking and provide the results (unfortunately not earlier than in 2 or 3 weeks)

* Use Eclipse CDT parser (incl. preprocessor) in order
  to create an AST

* Use it to retrieve the invormation about all symbols
  (declarations and references) of the compilation unit

patchset log:
* rebase and conflict resolving
* use preprocessor (include paths used, defines are not)
* filter macro expansions, since symbols are not placed in
  original source code
* update org.eclipse.cdt.core to 6.5.0 (photon), which
  claims to support c++14 & c++17 features
@guwirth
Copy link
Collaborator

guwirth commented Mar 21, 2020

@ALL: This discussion is an older one. Still not sure in which direction we should go. Found article below where they state at the end, that CDT also wanna use Clang. Does someone know the current state?

https://www.eclipse.org/community/eclipse_newsletter/2017/april/article3.php

@guwirth guwirth closed this Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants