Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Java-CPG language module #1659

Open
wants to merge 25 commits into
base: develop
Choose a base branch
from
Open

Conversation

robinmaisch
Copy link
Contributor

This Draft PR contains the new "Java Code Property Graph" module.
It uses a CPG library and graph transformations to normalise substructures of the code prior to tokenization, which should lead to a higher resilience against refactoring attacks.

It also serves to fix SonarCloud errors before a proper PR is opened.

robinmaisch and others added 17 commits December 29, 2023 11:55
Content:
 - A new CPG language frontend for JPlag
 - An interface to transform submissions into CPGs
 - An interface to transform CPGs into token lists
 - A Graph Transformation Engine (to be extended)
   . interfaces representing node and graph patterns, matches of these patterns, transformations
   . an isomorphism detector
   . a transformation algorithm
 - Some graph transformations (to be extended)
- implemented multi-root graph patterns
- implemented searching for "all matches at once"
- GraphOperations should leave EOG intact
- implemented new kinds of edges and properties for graph patterns
- tokenization works well
- implemented DFG sort pass
-- this requires specialized treatment for all kinds of language features. Surely the considered feature set is incomplete.
It was designed for local use.
Add many comments and put a file in the 'passes' package. When JavaDoc cannot find a Java file in there, it quits.
@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change language PR / Issue deals (partly) with new and/or existing languages for JPlag labels Mar 20, 2024
@robinmaisch robinmaisch marked this pull request as ready for review March 23, 2024 20:22
Copy link
Contributor

@TwoOfTwelve TwoOfTwelve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of pointers concerning your code style. I only added the comments for one occurrence each. It would be great if you could fix those things to make the style more consistent with the rest of the code.

language-api/src/main/java/de/jplag/Language.java Outdated Show resolved Hide resolved
language-api/src/main/java/de/jplag/Language.java Outdated Show resolved Hide resolved
language-api/src/main/java/de/jplag/Language.java Outdated Show resolved Hide resolved
language-api/src/main/java/de/jplag/Language.java Outdated Show resolved Hide resolved
language-api/src/main/java/de/jplag/TokenType.java Outdated Show resolved Hide resolved
* A ATransformationPass is an abstract transformation pass. All transformation passes function the same way, but need
* to be separate classes to work with the CPG pipeline.
*/
abstract class ATransformationPass(ctx: TranslationContext) : TranslationResultPass(ctx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use Kotlin, as using more languages complicates the process of compiling JPlag and make the code harder to understand for new contributors.

Copy link
Contributor Author

@robinmaisch robinmaisch Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of the CPG library requires that Kotlin is used, any subclass of Pass written in Java leads to an exception, unfortunately. I could ask the authors to change that, if that is an absolute requirement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the exception? Do you remember this?

@robinmaisch

This comment was marked as outdated.

@Kr0nox
Copy link
Member

Kr0nox commented Mar 28, 2024

@robinmaisch

This comment was marked as outdated.

@tsaglam tsaglam requested review from tsaglam, uuqjz and a team April 2, 2024 14:56
Copy link

sonarqubecloud bot commented Apr 3, 2024

Quality Gate Passed Quality Gate passed for 'JPlag Plagiarism Detector'

Issues
0 New issues
1 Accepted issue

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Copy link
Member

@tsaglam tsaglam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not forget to add a language module readme. Especially to explain the differences to the other Java modules. It should probably both include information for users and for developers due to the complexity of the PR.

On another note: We should just include Kotlin in the spotless config: https://github.com/diffplug/spotless/tree/main/plugin-maven#kotlin. We can add the config directly, but I would wait to apply the formatting after the reviews are done and the feedback addressed.

Finally, some questions for our next meeting:
This is a LOT of code, I initially hoped more functionality would be provided by the API itself. In your opinion, what is the main cause for the size of the PR? And why does the CPG API require so much boilerplate code?

cli/pom.xml Show resolved Hide resolved
languages/java-cpg/pom.xml Show resolved Hide resolved
clearTransformations();
setReorderingEnabled(false);
}
// TokenizationPass sets tokenList
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field tokenList is set to null a few lines before. I thought I'd include that comment to clarify that the tokenList is not supposed to still be null when it is returned in line 49, but it is set by the TokenizationPass via the method CpgAdapter.setTokenList(List<Token>).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, this was not clear to me, thanks.

} catch (InterruptedException e) {
throw e;
} catch (ExecutionException | ConfigurationException e) {
throw new ParsingException(List.copyOf(files).getFirst(), e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the copy when only one element is passed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because any one element can not be gotten from the Set<File> files. I agree that this is not very elegant.
Maybe replace it with files.stream().findAny().orElse(null)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use files.iterator().next(); instead.

Comment on lines +12 to +17
/**
* This is a wrapper for a graph edge (with a 1:1 relation).
* @param <T> The type of the source node
* @param <R> The type of the related node
*/
public class CpgEdge<T extends Node, R extends Node> extends AEdge<T, R> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear to me: Why are we wrapping the AEdge which is just a base class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you refer to IEdge vs. AEdge:
IEdge existed before AEdge, and I did not realize that it could actually just fully replace it. I'll remove the IEdge interface, then.

If you refer to the phrase "wrapper for a graph edge", then let me explain it in a bit more detail.
To summarize: The Edge classes make up for the fact that in the CPG library, edges are represented inconsistently.
Either:

  1. the edge target node is a field of the source (e.g., an IfStatement node has the fields condition, thenStatement, elseStatement); the AST edge itself is not modeled
    --- or, if it is a 1:n relation ----
  2. that relation is represented as a list of nodes (e.g. a FunctionType has the field parameters)
  3. that relation is represented as a list of edges (e.g. a Block node has the field statementEdges)

The edge class of the CPG library is called PropertyEdge.

For the transformations, I need to get and the target(s) of edges and eventually set them to new values. My Edge classes save the appropriate getter and setter functions for all three cases.
In the class de.jplag.java_cpg.transformation.matching.edges.Edges, you can see all the kinds of edges I use, and how I save the appropriate getter and setter functions from the node classes in them.

import de.jplag.java_cpg.transformation.matching.pattern.NodePattern;

/**
* This is a wrapper for a graph edge (1:n relation).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate here?

Comment on lines +7 to +12
/**
* This serves as an interface to wrap any kind of {@link PropertyEdge}.
* @param <T> the source node type
* @param <R> the related node type
*/
public interface IEdge<T extends Node, R extends Node> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I am kind of lost in the edge hierarchy.

Comment on lines +664 to +666
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, very large class, with many nested classes. In the long term, maintaining this language module will be very tricky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move them out, that's not a problem.

import de.jplag.java_cpg.transformation.matching.pattern.Match;
import de.jplag.java_cpg.transformation.matching.pattern.NodePattern;

public abstract sealed class Relation<T extends Node, R extends Node, V> permits OneToNRelation, RelatedNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDoc?

Copy link
Contributor

@uuqjz uuqjz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have a lock at the things i marked, I've not checked tests

*/
@MetaInfServices(de.jplag.Language.class)
public class JavaCpgLanguage implements Language {
private static final int DEFAULT_MINIMUM_TOKEN_MATCH = 9;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could take the dafault mtm of the java module instead of setting it to 9

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is possible, but it would require a module dependency on the Java frontend.
Do you think that would be justified?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that would not be justified.

node: Statement,
essentialNodes: MutableList<Statement>,
): Boolean {
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is necessary? Or can it be formatted so it doesn't span 10 lines?

* @param astRoot the root of the sub-AST
* @return the EOG {@link SubgraphWalker.Border} of the AST
*/
public static SubgraphWalker.Border getEogBorders(Node astRoot) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does EOG stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the Evaluation Order Graph, a concept introduced by the authors of the CPG library, where the term is also often used. The EOG is a type of Control Flow Graph, but instead of blocks, it operates on individual Nodes.
For example, in a binary operation expression, the left operand is evaluated first (except if it is an assignment), then the right operand is evaluated, and then the operation itself. An EOG of a binary operation expression would contain a directed edge from the left operand node to the right operand node and from the right operand to the operation node to reflect that. In control statements, the EOG may split.
In the CPG library, the EOG is the basis for complex name analysis, data flow analysis, and other types of analyses.
I use the EOG as the basis for CPG Linearization.

@tsaglam tsaglam changed the title Add Java-CPG frontend Add Java-CPG language module Jul 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issue/PR that involves features, improvements and other changes language PR / Issue deals (partly) with new and/or existing languages for JPlag major Major issue/feature/contribution/change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants