-
Notifications
You must be signed in to change notification settings - Fork 332
4. Adding New Languages
To add support for a new language to JPlag, a JPlag frontend needs to be created for that specific language. The core purpose of a frontend is to transform each submission to a list of Tokens, an abstraction of the content of the submission files independent of the language of the submissions.
The token lists of the different submissions are then passed on to a comparison algorithm that checks the token lists for matching sequences.
In the context of JPlag, a Token does not represent a lexical unit, as identifiers, keywords or operators. Instead, Tokens represent syntactic entities, like statements, or control structures. More than one token might be needed to represent the nested structure of a statement or expression in a linear token list.
class MyClass extends SuperClass { private String name; }
//CLASS_DECLARATION //CLASS_BODY_BEGIN
//FIELD_DECLARATION
//CLASS_BODY_END
Each comment is intended to represent one token.
From this example in Java, you may be able to see the following things:
- a class declaration is represented by three tokens of different types:
CLASS_DECLARATION
,CLASS_BODY_BEGIN
andCLASS_BODY_END
- a token is associated with a position in a code file.
- the abstraction is incomplete, many details of the code are omitted. The original code cannot be reconstructed from the token list, but its structure can.
A few more points about Tokens in JPlag:
- a token list contains the Tokens from all files of one submission. For that reason, Tokens save the filename of their origin in addition to their position.
- Token types are represented by the
TokenType
interface which has to be adapted for each language individually.- For brevity, each token type is also associated with a String description, usually shorter than their name. Looking at the String representations used in existing frontends, you may recognize a kind of convention about how they are formed. The example above uses the full names of token types.
Here is an outline of the transformation process.
- each submitted file is parsed. The result is a set of ASTs for each submission.
- each AST is now traversed depth-first. The nodes of the AST represent the grammatical units of the language.
- upon entering and exiting a node, Tokens can be created that match the type of the node. They are added to the current token list.
- for block-type nodes like bodies of classes or if expressions, the point of entry and exit correspond to the respective
BEGIN
andEND
token types. If done correctly, the token list should contain balanced pairs of matchingBEGIN
andEND
tokens.
@Override
public void enterClassDeclaration(ClassDeclarationContext context) {
Token token = new Token(CLASS_DECLARATION, /* more parameters ... */);
addToken(token);
}
@Override
public void exitClassDeclaration(ClassDeclarationContext context) {
// class declarations get no end token -> do nothing
}
@Override
public void enterClassBody(ClassBodyContext context) {
Token token = new Token(CLASS_BODY_START, /* more parameters ... */);
addToken(token);
}
@Override
public void enterClassDeclaration(ClassBodyContext context) {
Token token = new Token(CLASS_BODY_END, /* more parameters ... */);
addToken(token);
}
The way the traversal works and how you can interact with the process depends on the parser technology used. In the example above, ANTLR-generated parsers were used, as was in most of the current JPlag frontends. We recommend to use ANTLR for any new frontend.
If a hard-coded (as opposed to dynamically generated) parser library is available for your language, it may make sense to use it. An implementation of the visitor pattern for the resulting AST should be included.
A frontend consists of these parts:
Component/Class | Superclass | Function | How to get there |
---|---|---|---|
Language class | de.jplag.Language |
access point for the frontend | copy with small adjustments |
pom.xml |
- | Maven submodule descriptor | copy with small adjustments; add dependencies for parser |
README.md |
- | documentation for the frontend | copy for consistent structure; adjust from there |
TokenType class | de.jplag.TokenType |
contains the language-specific token types | implement new |
Lexer and Parser | - | transform code into AST | depends on technology |
ParserAdapter class | de.jplag.AbstractParser |
sets up Parser and calls Traverser | depends on technology |
Traverser/ TraverserListener classes |
- | creates tokens traversing the AST | depends on technology |
For example, if ANTLR is used, the setup is as follows:
Antlr specific parts/files | Superclass | Function | How to get there |
---|---|---|---|
Grammar files (.g4 ) |
- | capture syntax of the language | most likely available in ANTLRv4 Grammar Repository |
Lexer and Parser |
Lexer , Parser (ANTLR) |
transform code into AST | generated from grammar files by antlr4-maven-plugin |
Traverser |
ParseTreeWalker (ANTLR) |
traverses AST and calls listener | included in antlr4-runtime library, can be used as is |
TraverserListener class |
ParseTreeListener (ANTLR) |
creates tokens when called | implement new |
ParserAdapter class | de.jplag.AbstractParser |
sets up Parser and calls Traverser | copy with small adjustments |
As the table shows, much of a frontend can be reused, especially when using ANTLR. The only parts left to implement specifically for each frontend are
- the ParserAdapter (for custom parsers)
- the TokenTypes, and
- the TraverserListener.
Note for parser libraries other than ANTLR:
- It should still be rather easy to implement the ParserAdapter from the library documentation.
- Instead of using a listener pattern, the library may require you to do the token extraction in a Visitor subclass. In that case, there is only one method call per element, called e.g.
traverseClassDeclaration
. The advantage of this version is that the traversal of the subtrees can be controlled freely. See the Scala frontend for an example.
flowchart LR
JPlag -->|"parse(files)"| Language
subgraph frontend[LanguageFrontend]
Language -->|"parse(files)"| ParserAdapter
ParserAdapter -->|"parse(files)"| Parser -.->|ASTs| ParserAdapter
ParserAdapter -->|"walk(ASTs)"| Traverser
Traverser -->|"enterXYNode()"| TraverserListener
Traverser -->|"exitXYNode()"| TraverserListener
TraverserListener -.-> action{{"tokenList.add(new Token())"}}
end
Note: In existing frontends, the token list is managed by the ParserAdapter, and from there it is returned to the Language class and then to JPlag.
The following adjustments have to be made beyond creating the frontend submodule itself:
- Register the submodule in the aggregator POM for every build profile.
<!-- ${jPlagRoot}/pom.xml -->
<modules>
...
<module>jplag.frontend.my-frontend</module>
...
</modules>
- Add a dependency from the aggregator module to the new frontend
- Add a dependency from the jplag module to the new frontend
<!-- ${jPlagRoot}/pom.xml -->
<!-- ${jPlagRoot}/jplag/pom.xml -->
<dependencies>
...
<dependency>
<groupId>de.jplag</groupId>
<artifactId>jplag.frontend.my-frontend</artifactId>
<version>${revision}</version>
</dependency>
...
</dependencies>
That's it! The new frontend should now be usable as described in the main README. The name of the frontend used with the CLI -l
option is the IDENTIFIER
set in the Language class.
Apart from extracting the tokens correctly, the task of deciding which syntactical elements should be assigned a token is the essential part when designing a frontend.
This guideline is solely based on experience and intuition – this "worked well" so far. More research might hint towards a more systematic process of token selection.
The goal of the abstraction is to create a token list that is
- accurate: a fair representation of the code as input to the comparison algorithm
- consistent per se: insensitive to small changes in the code that might obfuscate plagiarism; constructs are represented equally throughout the file
- consistent with the output of other trusted frontends—only to the extent that their respective languages are comparable, naturally.
To create a set of tokens in line with these objectives, we offer the tips below.
Elements with BIG_AND_FAT
text represent tokens, while elements in lowercase surrounded by <angle-brackets>
represent subexpressions that may produce any number of tokens themselves.
? marks optional parts which may occur zero or one times, * marks elements that may occur any number of times.
-
Use a separate token for both ends of every type of block or body.
-
More generally, for any type of composite expression or statement, the number of designated token types needed to separate them in the token list is the number of subexpressions + 1. Additional tokens may make be needed in certain locations, like optional parts.
Expression type | #expressions | #tokens | Example code and tokenization pattern |
---|---|---|---|
loop (Rust) | 1 | 2 |
loop { println!("{}", 1) }
LOOP{ <statements> }LOOP
|
if (C) | 2 | 3 |
if (true) { printf("1"); }
IF <condition> IF{ <statements> }IF
|
do-while (C) | 2 | 3 |
do { printf("1") } while (true);
DO{ <statements> }DO-WHILE( <condition> )WHILE alt.: DO{ <statements> }DO WHILE( <condition> )WHILE
|
Ternary – ?: (C) | 3 | 4 |
true ? 1 : 0
COND( <condition> IF_TRUE: <expression> IF_FALSE: <expression> )COND
|
In the do-while example above, the }DO-WHILE(
marks the end of the loop block and the beginning of the condition expression at once. For the sake of having a designated token for the ending of the loop block, a possible alternative tokenization with an extra token is given.
-
For list subtrees, a single token to mark the beginning of each element may suffice.
Note: If lists of the same type are nested, the end of the inner list may become unclear. Additional tokens for both ends of the list may be appropriate in that case. -
For leaf subtrees (that do not subdivide further), a single token may suffice.
-
For optional subtrees, a single token may suffice to indicate that it occurred.
Optional expression type | #expressions | #tokens | Example code and tokenization pattern |
---|---|---|---|
Class declaration: generic type parameters | n + 1 | n + 2 |
class Map<K,V> { ... }
CLASS (TYPE_PARAM )* CLASS{ <body> }CLASS
|
Method invocation: arguments | n | n + 1 |
printf("%d: %s", 1, "one");
APPLY (ARG <expression> )* |
if statement: else block | 2 (+ 1) | 3 (+ 2) |
if (true) { printf("1"); } else { printf("0"); }
IF <condition> IF{ <statements> }IF (ELSE{ <statements> }ELSE )? |
-
Keywords that influence the control flow receive a token, for example
-
return
,break
,continue
-
-
Semantic information, references and concrete values are generally omitted, for example
- identifiers
- type information
-
final
modifier - access modifiers
- instructions to the compiler/VM:
transient
,volatile
- references to classes, objects, fields, array accesses
- numbers and other literals, as well as built-in operations
-
Statements with side effects generally receive a token, for example
- constructor calls
- declarations of variables and fields
- assignments
- method calls
-
Regarding sensitivity: Very similar constructs may receive the same token even if they are syntactically distinct, for example
- variable and constant declarations
- method and function declarations
- different variations of
for
-loops - (See the documentation about token extraction for more examples)
The token extraction is the most time-consuming part of the frontend design. How difficult it is is largely dependent on the underlying grammar of the parser.
This article deals with the implementation of the listener which is called at every stage of traversal of the AST. The examples center around tokenizing the Java language, using a grammar written in ANTLR4.
flowchart LR
AstVisitor -->|"enterNode(Node)"| Listener
AstVisitor -->|"exitNode(Node)"| Listener
Listener --> |"add(Token)"| List["List<Token>"]
In the actual listener, there will be concrete enter
and exit
implementations for each syntactic category, e.g. enterClassDeclaration
, enterIfStatement
and so on.
The basic case is that a syntactic category of the grammar corresponds to a token directly.
@Override
public void enterIfStatement(IfStatementContext context) {
addToken(IF_STATEMENT, context);
}
@Override
public void enterIfBody(IfBodyContext context) {
addToken(IF_BODY_START, context);
}
private void addToken(TokenType tokenType, ParserRuleContext context) {
tokens.add(new Token(tokenType, context.getFile(), context.getLine(), context.getColumn(), context.getLength()));
}
The complexity comes from the degree to which syntactic categories are reused in different contexts in the grammar. Instead of a distinct ifBody
category, for example, there may be only one category for 'control structure bodies' or 'block expressions' of any kind.
@Override
public void enterBlockExpression(BlockExpressionContext context) {
// Now, is this the body of a method, a for/while/if/try/catch expression?
}
// unedited grammar
ifStatement:
'if' '(' expression ')' statement
('else' statement))? ;
statement:
blockStatement | ifStatement | forStatement | ... ;
If you can alter the grammar definition, it may be well worth considering. Introduce a separate rule for each use of rules that you want to differentiate. Example in antlr4:
// after editing
ifStatement:
'if' '(' expression ')' ifBody ('else' elseBody)? ;
ifBody:
statement ;
elseBody:
statement ;
statement:
blockStatement | ifStatement | forStatement | ... ;
This does introduce some kind of redundancy to the grammar, but it makes the implementation of the listener methods much easier. The caveat that comes with this approach is that updated grammars will have to be edited again. The licence of the grammar should also be considered.
If you are stuck with the grammar and parser as they are, you can mimic what the parser does and introduce a stack machine to the listener. Then, anytime a relevant syntactical structure is entered, you add a context to the stack, and when you enter an ambiguous subtree, the current context will help distinguish the different cases.
@Override
public void enterIfStatement(IfStatementContext context) {
addToken(IF_STATEMENT, context);
contexts.enterContext(IF_CONTEXT);
}
@Override
public void enterBlockExpression(BlockExpressionContext context) {
switch (contexts.peek()) {
case IF_BODY -> addToken(IF_BODY_START, context);
case FOR_BODY -> addToken(FOR_BODY_START, context);
...
}
}
@Override
public void exitIfStatement(IfStatementContext context) {
contexts.popContext();
}
The management of the context makes the listener much more complicated to read and maintain.
Notes:
- If there are optional parts in the grammar rule, the index of terminals may not be static. A more sophisticated search method may be necessary, possibly using the text content of the child nodes (
ParserRuleContext::getText
). - In this example, the curly braces themselves are optional. The case where they are omitted needs to be covered as well.
The exit
methods can be used to add END
tokens for bodies and blocks. If you put the ´enter` and ´exit´ methods of a kind directly next to each other in the code as a pair, there should be little room for confusion about which token types should be added there.
Depending on the implementation of the grammar, some keywords or symbols may not have a rule for themselves. Using Antlr, you can always catch their occurrences in the visitTerminal(TerminalNode)
method.
@Override
public void visitTerminal(TerminalNode node) {
switch (node.getText()) {
case "catch" -> addToken(CATCH, node.getToken());
//...
}
}
The token extraction process can support the prevention of simple refactoring attacks by treating equivalent constructs the same. For example, a language may support multi-definitions:
var a, b, c = 1;
This statement could translate to the token list VAR_DEF VAR_DEF VAR_DEF ASSIGN
. An easy refactoring would produce a different token list:
var a = 1;
var b = 1;
var c = 1;
Now, this looks more like VAR_DEF ASSIGN VAR_DEF ASSIGN VAR_DEF ASSIGN
. It might be a good idea to duplicate the ASSIGN
token in the first case as well, so that this simple attack is overcome easily. (The resulting token list may be unsorted, but that is acceptable.)
This almost goes in the direction of partly compiling the code in your mind. Another example would be the treatment of control structures, which can be refactored into each other fairly easily as well:
for (<initialization>; <condition>; <incrementation>) {
<body>;
}
// is the same as:
<initialization>
while (<condition>) {
<body>;
<incrementation>;
}
<ref> = <condition> ? <valueIfTrue> : <valueIfFalse>;
// is the same as
if (<condition>) {
<ref> = <valueIfTrue>;
} else {
<ref> = <valueIfFalse>;
}
The degree to which the Token extraction process should try to preemptively generalize over similar grammatical elements is unclear and heavily dependent on the language.
To make the language configurable language specific options can be added. To do so you need to subclass LanguageOptions.
import de.jplag.options.LanguageOption;
import de.jplag.options.LanguageOptions;
import de.jplag.options.OptionType;
public class ExampleLanguageOptions extends LanguageOptions {
final LanguageOption<String\> testOption = createOption(OptionType.string(), "testOption");
final LanguageOption<Integer\> sensitivity = createDefaultOption(OptionType.integer(), "sensitivity", 3);
}
Then you need to publish it in your language:
public class Language implements de.jplag.Language {
[...]
private ExampleLanguageOptions options = new ExampleLanguageOptions();
[...]
@Override
public ExampleLanguageOptions getOptions() { // Make sure to set your subclass as the return type. Otherwise it will not be useful for api access.
return options;
}
}
These options will then be available in the cli. You can then access the options values the following way:
options.sensitivity.getValue();
You should pass the options to your parser if neccesary.
To check the output of your frontend against the input, the TokenPrinter
can be helpful. The TokenPrinter
prints the input line by line, and the tokens of each line below it.
10 public class Example {
|CLASS |CLASS{
11 private int number;
|FIELD
12 public int getNumber() {
|METHOD |METHOD{
13 return number;
|RETURN
14 }
|}METHOD
15 }
|}CLASS
To test a frontend, set up a JUnit test class where the TokenPrinter
prints the output of the parse
method of the frontend. Read through the output and check whether the List<Token>
satisfies the given requirements.
The frontend should be tested with 'authentic' sample code as well as a 'complete' test file that covers all syntactic elements that the frontend should take into account. If you are using an ANTLR parser, such a complete test file may be included in the parser test files in the ANTLRv4 Grammar Repository.
-
The token list represents the input code correctly.
- In particular, the nesting tokens are correctly nested and balanced.
-
The token list represents the input code with an acceptable coverage —how that can be measured and what coverage is acceptable depends on the language. One approach would be line coverage, e.g. 90 percent of code lines should contain a token.
-
There are no
TokenTypes
that can never be produced by the frontend for any input.- Put another way, the complete test code produces a token list that contains every type of token.
The language-testutils module provides a simple way to implement tests for language modules. To use that you have to make your test class extend LanguageModuleTest:
import de.jplag.testutils.datacollector.TestDataCollector;
import de.jplag.testutils.datacollector.TestSourceIgnoredLinesCollector;
public class ExampleTest extends LanguageModuleTest {
public ExampleTest() {
super(new ExampleLanguage(), Arrays.asList(ExampleLanguageTokens.values()));
}
@Override
protected void collectTestData(TestDataCollector collector) {
}
@Override
protected void configureIgnoredLines(TestSourceIgnoredLinesCollector collector) {
}
}
In the 'collectTestData' method you can configure the test sources, that you want to test and what kind of test should be run. For more details take a look at the javadoc.
collector.testFile("firstFile.java", "secondFile.java").testSourceCoverage().testContainedTokens(<someTokenType>);
collector.inlineSource("public static void main(String\[\] args) {\\n}").testCoverages();
The 'configureIgnoredLines' method can be used to exclude lines from the source coverage check. This example ignores all java comments. Empty lines are always ignored.
collector.ignoreLinesByPrefix("//");
collector.ignoreMultipleLines("/*", "*/");
In case you don't use the default layout for test source files, you can override the 'getTestFileLocation'
@Override
protected File getTestFileLocation() {
return super.getTestFileLocation();
}
To ensure your language gets properly registered and its code is correctly highlighted in the report-viewer:
- Add your language to the
ParserLanguage
enum in 'src/model/Language.ts'. As the value for the entry use its frontend name. - Add your language to the switch-case in
src/utils/CodeHighlighter.ts
and return the correct highlight.js name. If your language is not supported by default, also register the language here.