diff --git a/docs/1.-How-to-Use-JPlag.md b/docs/1.-How-to-Use-JPlag.md
index 8702a9706..c940816d7 100644
--- a/docs/1.-How-to-Use-JPlag.md
+++ b/docs/1.-How-to-Use-JPlag.md
@@ -119,7 +119,15 @@ The report will always be zipped unless there is an error during the zipping pro
 
 ## Viewing Reports
 
-The newest version of the report viewer is always accessible at https://jplag.github.io/JPlag/. Drop your `result.zip` folder on the page to start inspecting the results of your JPlag run. Your submissions will neither be uploaded to a server nor stored permanently. They are saved in the application as long as you view them. Once you refresh the page, all information will be erased.
+Starting with version 6.0.0, the report viewer is bundled with JPlag and is launched automatically. The `--mode` option controls this behavior.
+By default, JPlag processes the input files and produces a zipped result file. After that, the report viewer is launched (on localhost), and the report is shown in your browser.
+
+The option `--mode show` only opens the report viewer.
+This allows you to view existing reports.
+You can optionally provide the path to a report file to display it immediately in the viewer; otherwise, the viewer will ask you to select a report, just like the online version.
+By specifying `--mode run`, JPlag generates the zipped report but does not open the report viewer.
+
+An online version of the viewer is still hosted at https://jplag.github.io/JPlag/ so that pre-v6.0.0 reports can still be viewed. Your submissions will neither be uploaded to a server nor stored permanently. They are only kept as long as you view them. Once you refresh the page, all information will be erased.
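+
+For example, the three modes could be invoked as follows (the jar file name, report file name, and exact argument layout are illustrative; consult the CLI help for the precise syntax):
+
+```
+java -jar jplag.jar ./submissions              # default: analyze, then open the report viewer
+java -jar jplag.jar --mode run ./submissions   # analyze only, without opening the viewer
+java -jar jplag.jar --mode show result.zip     # only open the viewer for an existing report
+```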
 
 ## Basic Concepts
 
diff --git a/docs/3.-Contributing-to-JPlag.md b/docs/3.-Contributing-to-JPlag.md
index f87801e79..5f770753f 100644
--- a/docs/3.-Contributing-to-JPlag.md
+++ b/docs/3.-Contributing-to-JPlag.md
@@ -21,10 +21,12 @@ Please try to make well-documented and clearly structured submissions:
 
 ## Building from sources
 
 1. Download or clone the code from this repository.
+
+### Core
 2. Run `mvn clean package` from the root of the repository to compile and build all submodules. Run `mvn clean package assembly:single -P with-report-viewer` instead if you need the full jar, which includes all dependencies.
 3. You will find the generated JARs in the subdirectory `jplag.cli/target`.
+
+### Report Viewer
 2. Run `npm install` to install all dependencies.
 3. Run `npm run dev` to launch the development server. The report viewer will be available at `http://localhost:8080/`.
diff --git a/docs/4.-Adding-New-Languages.md b/docs/4.-Adding-New-Languages.md
index 1cfbb9563..5fd6f876c 100644
--- a/docs/4.-Adding-New-Languages.md
+++ b/docs/4.-Adding-New-Languages.md
@@ -84,7 +84,7 @@ For example, if ANTLR is used, the setup is as follows:
 | Lexer and Parser | `Lexer`, `Parser` (ANTLR) | transform code into AST | generated from grammar files by antlr4-maven-plugin |
 | Traverser | `ParseTreeWalker` (ANTLR) | traverses AST and calls listener | included in antlr4-runtime library, can be used as is |
 | TraverserListener class | `ParseTreeListener` (ANTLR) | creates tokens when called | **implement new** |
-| ParserAdapter class | `de.jplag.AbstractParser` | sets up Parser and calls Traverser | copy with small adjustments |
+| ParserAdapter class | `de.jplag.AbstractAntlrParser` | sets up Parser and calls Traverser | copy with small adjustments |
 
 As the table shows, much of a language module can be reused, especially when using ANTLR. The only parts left to implement specifically for each language module are
 - the ParserAdapter (for custom parsers)
@@ -95,7 +95,130 @@ As the table shows, much of a language module can be reused, especially when usi
 - It should still be rather easy to implement the ParserAdapter from the library documentation.
 - Instead of using a listener pattern, the library may require you to do the token extraction in a _Visitor subclass_. In that case, there is only one method call per element, called e.g. `traverseClassDeclaration`. The advantage of this version is that the traversal of the subtrees can be controlled freely. See the Scala language module for an example.
 
-### Basic procedure outline
+## Setting up a new language module with ANTLR
+
+JPlag provides a small framework that makes it easier to implement language modules with ANTLR.
+
+### Create the Language class
+
+Extend the AbstractAntlrLanguage class and implement all required methods. There are two options for creating the parser.
+It can either be passed to the superclass in the constructor, as shown below, or created later by overriding the initializeParser method.
+The latter option should be used if the parser requires dynamic parameters.
+
+```java
+public class TestLanguage extends AbstractAntlrLanguage {
+    public TestLanguage() {
+        super(new TestParserAdapter());
+    }
+
+    @Override
+    public String[] suffixes() {
+        return new String[] {"expression"}; // return a list of file suffixes for your language
+    }
+
+    @Override
+    public String getName() {
+        return "Test"; // return the name of the language (e.g. Java). Can be anything that briefly describes the language module
+    }
+
+    @Override
+    public String getIdentifier() {
+        return "test"; // return the identifier for the language (e.g. java). Should be something simple and unique
+    }
+
+    @Override
+    public int minimumTokenMatch() {
+        return 9; // the minimum number of tokens required to form a match. Leave this at 9 unless your module requires a different value
+    }
+}
+```
+
+### Implement the parser adapter
+
+The code generated by ANTLR looks slightly different for every grammar. The AbstractAntlrParserAdapter class is able to perform most of the required steps automatically.
+The implementation only needs to call the correct generated methods. They should be named roughly the same as in the example below. The javadoc of each method contains additional information.
+
+```java
+public class TestParserAdapter extends AbstractAntlrParserAdapter<TestParser> {
+    private static final TestListener listener = new TestListener();
+
+    @Override
+    protected Lexer createLexer(CharStream input) {
+        return new TestLexer(input);
+    }
+
+    @Override
+    protected TestParser createParser(CommonTokenStream tokenStream) {
+        return new TestParser(tokenStream);
+    }
+
+    @Override
+    protected ParserRuleContext getEntryContext(TestParser parser) {
+        return parser.expressionFile();
+    }
+
+    @Override
+    protected AbstractAntlrListener getListener() {
+        return listener;
+    }
+}
+```
+
+### Implement the token type enum
+
+This is the same as for non-ANTLR modules. The enum should look something like this:
+
+```java
+public enum TestTokenType implements TokenType {
+    TOKEN_NAME("TOKEN_DESCRIPTION"); // the description serves as a human-readable name. Look at other language modules for examples
+
+    private final String description;
+
+    TestTokenType(String description) {
+        this.description = description;
+    }
+
+    @Override
+    public String getDescription() {
+        return description;
+    }
+}
+```
+
+### Implement the listener
+
+In contrast to the Java module, the ANTLR framework requires defining a set of extraction rules instead of a traditional listener.
+All rules are independent of each other, which makes it easier to debug the token extraction.
+
+The basic structure looks like this:
+
+```java
+class TestListener extends AbstractAntlrListener {
+
+    TestListener() {
+        // add rules
+    }
+}
+```
+
+To make the class easier to read, the constructor should only call methods that contain the rules. These methods shouldn't be too long, and each should contain the rules for a specific category of tokens.
+
+Extraction rules can be very complicated, but in most cases simple ones will suffice. The easiest option is to directly map ANTLR contexts to JPlag tokens:
+
+```java
+visit(VarDefContext.class).map(VARDEF);
+```
+
+There are different variants of map that determine the length of the tokens; the javadoc contains details on that. map can also receive two JPlag token types, which creates one JPlag token for the start of the context and one for the end.
+visit can also receive a type of ANTLR terminal node to create tokens from terminal nodes.
+
+Additional features for rules (a combined sketch follows the list):
+
+1. Condition - can be passed as a second argument to visit. The rule only applies if the condition returns true (see the CPP language module for examples)
+2. Semantics - can be passed by calling withSemantics after the map call (see the CPP language module for examples)
+3. Delegate - to gain more precise control over the token position and length, a delegated visitor can be used (see the Go language module for examples)
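+
+As a sketch, a rule method could combine these variants as follows (the context classes, token types, and the condition are invented for illustration; only the visit and map calls themselves are part of the framework):
+
+```java
+private void addControlFlowRules() {
+    // two-token form of map: one token at the start of the context, one at its end
+    visit(WhileStatementContext.class).map(WHILE_BEGIN, WHILE_END);
+    // conditional rule: only extract a token if the if statement actually has an else branch
+    visit(IfStatementContext.class, context -> context.elseBranch() != null).map(ELSE);
+}
+```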
+
+## Basic procedure outline
 
 ```mermaid
 flowchart LR
@@ -113,7 +236,7 @@
 Note: In existing language modules, the token list is managed by the ParserAdapter, and from there it is returned to the Language class and then to JPlag.
 
-### Integration into JPlag
+## Integration into JPlag
 
 The following adjustments have to be made beyond creating the language module submodule itself:
diff --git a/endtoend-testing/README.md b/endtoend-testing/README.md
index 4a82070bd..f27e7019d 100644
--- a/endtoend-testing/README.md
+++ b/endtoend-testing/README.md
@@ -1,114 +1,73 @@
-
 # JPlag - End-To-End Testing
-With the help of the end-to-end module, changes to the detection of JPlag are to be tested.
-With the help of elaborated plagiarism, which has been worked out from suggestions in the literature on the topic of "plagiarism detection and avoidance", a wide range of detectable changes can be covered. The selected plagiarisms are the decisive factor here as to whether a change in recognition can be perceived.
-## References
-These elaborations provide basic ideas on how a modification of the plagiarized source code can look or be adapted.
-These code adaptations refer to various changes, from
-adding/removing comments to architectural changes in the deliverables.
+The end-to-end test module contains tests that report any change in the similarities reported by JPlag.
+There are two kinds of tests:
+1. Simple tests that fail if the similarity between two submissions changes
+2. Gold standard tests
 
-The following elaborations were used to be able to create the plagiarisms with the broadest coverage:
-- [Detecting Source Code Plagiarism on Introductory Programming Course Assignments Using a Bytecode Approach - Oscar Karnalim](https://ieeexplore.ieee.org/abstract/document/7910274 "Detecting Source Code Plagiarism on Introductory Programming Course Assignments Using a Bytecode Approach - Oscar Karnalim")
-- [Detecting Disguised Plagiarism - Hatem A. Mahmoud](https://arxiv.org/abs/1711.02149 "Detecting Disguised Plagiarism - Hatem A. Mahmoud")
-- [Instructor-centric source code plagiarism detection and plagiarism corpus](https://dl.acm.org/doi/abs/10.1145/2325296.2325328 "Instructor-centric source code plagiarism detection and plagiarism corpus")
+## Gold standard tests
 
-## Steps Towards Plagiarism
-The following changes were applied to sample tasks to create test cases:
-
+A gold standard test serves as a metric for the change in detection quality. It needs a list of plagiarism instances in the data set.
+JPlag outputs comparisons split into those that should be reported as plagiarism and those that shouldn't.
+The test will fail if the average similarity in one of those groups changes. In contrast to the other kind of test, this offers a rough way to check whether a change made JPlag better or worse.
 
-More detailed information about the creation as well as about the subject of the issue can be found in the issue [Develop an end-to-end testing strategy](https://github.com/jplag/JPlag/issues/193 "Develop an end-to-end testing strategy").
+## Updating tests
 
-**The changes listed above have been developed and evaluated for purely scientific purposes and are not intended to be used for plagiarism in the public or private domain.**
+If the similarities reported by JPlag change and these changes are wanted, the reference values for the end-to-end tests need to be updated.
+To do that, the test in [EndToEndGeneratorTest.java](src/test/java/de/jplag/endtoend/EndToEndGeneratorTest.java) has to be executed.
+This will generate new reference files.
 
-## JPlag - End-To-End TestSuite Structure
-The construction of an end-to-end test is done with the help of the JPlag API.
-The tests are generated dynamically according to the existing test data and allow the creation of end-to-end tests for all supported languages of JPlag without making any changes to the code.
-The helper loads the existing test data from the designated directory and creates dynamic tests for the individual directories. It is possible to create different test classes for the other languages.
-
-To distinguish which domain of the recognition changes have occurred, fine granular test cases are used. These are composed of the changes already mentioned above. The plagiarism is compared with the original delivery; thus, detecting and testing small sections of the recognition is possible.
-
-The comparative values were discussed and tested. The following results of the JPlag scan are used for the comparison:
-1. minimal similarity as `double`
-2. maximum similarity as `double`
-3. matched token number as `int`
-The comparative values were discussed and elaborated in the issue [End-to-end testing - "comparative values"](https://github.com/jplag/JPlag/issues/548 "End-to-end testing - \"comparative values\"").
-Additionally, it is possible to create several options for the test data. More information about the test options can be found at [JPlag - option variants for the end-to-end tests #590](https://github.com/jplag/JPlag/issues/590 "JPlag - option variants for the end-to-end tests #590"). Currently, various settings are supported by the `minimumTokenMatch`. This can be extended as desired in the record class `Options`.
-The current JPlag scans will be compared with the stored ones.
-This was done by storing the data in a *.json file which is read at the beginning of each test run.
-### JSON Result Structure
-The structures of the JSON file can be traced using the individual record classes, which can be found under `de.jplag.endtoend.model`.
-The outer structure of the JSON file is recorded in the `ResultDescription` record.
-The record contains a map of several options and the corresponding results.
-The internal structure consists of several `Option` records, each containing information about the test run's current configuration.
-Thus the results can be kept apart from the other configurations.
-The test results for the specified options are also specified in the object. This consists of the `ExpectedResult` record, which contains the detection results.
-Here the hierarchy is as follows:
-```JSON
-[{
-   "options":{
-      "minimum_token_match":"int"
-   },
-   "tests":{
-      "languageIdentifier":{
-         "minimal_similarity":"double",
-         "maximum_similarity":"double",
-         "matched_token_number":"int"
-      },
-      "/..."
-   }
-},
-{
-   "options":{
-      "minimum_token_match":"int"
-   },
-   "tests":{
-      "languageIdentifier":{
-         "minimal_similarity":"double",
-         "maximum_similarity":"double",
-         "matched_token_number":"int"
-      },
-      "/..."
-   }
-}]
-```
+## Adding new tests
+
+This segment explains the steps for adding new test data.
+
+### Obtain test data
+
+New test data can be obtained in multiple ways.
+
+Ideally, real-world data is used. To use gold standard tests, real-world data needs to contain information about which submission pairs are plagiarism and which aren't.
+
+Alternatively, test data can be generated using various methods. One such method is explained below.
----
+The test data should be placed under [data](src/test/resources/data). It can either be added as a directory containing submissions or as a zip file.
-## Create New Language End-To-End Tests
+
+### Defining the data set for the tests
+
+This is done in [dataSets](src/test/resources/dataSets). To add a new data set, a new JSON file needs to be placed there.
+
+A minimal example configuration can be found in [progpedia.json](src/test/resources/dataSets/progpedia.json); a full example using all options is in [sortAlgo.json](src/test/resources/dataSets/sortAlgo.json).
+
+For all available options, look at [dataSetTemplate.json](src/test/resources/dataSetTemplate.json).
+
+### Generating the reference results
+
+See [Updating tests](#updating-tests) above.
+
+## Creating test data manually
+
+The following changes were applied to sample tasks to create the sortAlgo data set:
+
+* Inserting comments or empty lines (normalization level)
+* Changing variable names or function names (normalization level)
+* Insertion of unnecessary or changed code lines (token generation)
+* Changing the program flow (token generation) (statements and functions must be independent of each other)
+  * Variable declaration at the beginning of the program
+  * Combining declarations of variables
+  * Reuse of the same variable for other functions
+* Changing control structures
+  * for(...) to while(...)
+  * if(...) to switch-case
+* Modification of expressions
+  * (X < Y) to !(X >= Y) and ++x to x = x + 1
+* Splitting and merging statements
+  * x = getSomeValue(); y = x - z; to y = getSomeValue() - z;
+
+More detailed information about the creation as well as about the subject of the issue can be found in the issue [Develop an end-to-end testing strategy](https://github.com/jplag/JPlag/issues/193 "Develop an end-to-end testing strategy").
+
+**The changes listed above have been developed and evaluated for purely scientific purposes and are not intended to be used for plagiarism in the public or private domain.**
 
 ### Creating The Plagiarism
+
 Before you add a new language to the end-to-end tests, I would like to point out that the quality of the tests depends dreadfully on the plagiarism techniques you choose, which were explained in section [Steps Towards Plagiarism](#steps-towards-plagiarism). If you need more information about creating plans for this purpose, you can also read the elaborations that can be found under [References](#references). The more various changes you apply, the more accurate the end-to-end tests for the language will be.
@@ -157,69 +116,13 @@ public void BubbleSortWithoutRecursion(Integer arr[]) {
 //...
 }
 ```
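+
+For illustration, a disguised variant of the bubble sort above could combine several of the techniques listed under "Creating test data manually" (this snippet is a constructed example, not part of the actual data set):
+
+```java
+public void BubbleSortWithoutRecursion(Integer arr[]) {
+    int i = 0; // declaration moved to the beginning, for(...) changed to while(...)
+    while (i < arr.length - 1) {
+        int j = 0;
+        while (j < arr.length - i - 1) {
+            if (!(arr[j] <= arr[j + 1])) { // (x > y) rewritten as !(x <= y)
+                Integer temporary = arr[j]; // renamed swap variable
+                arr[j] = arr[j + 1];
+                arr[j + 1] = temporary;
+            }
+            j = j + 1; // ++j rewritten as j = j + 1
+        }
+        i = i + 1;
+    }
+}
+```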
-### Copying Plagiarism To The Resources
-
-The plagiarisms created in [Creating The Plagiarism](#creating-the-plagiarism) must now be copied to the corresponding resources folder. For each test suite, the resources must be placed in `JPlag/jplag.endToEndTesting/src/test/resources/languageTestFiles/<language>/<testSuite>`. For example, for the existing test suite `sortAlgo` of language `java`, the path is `JPlag/jplag.endToEndTesting/src/test/resources/languageTestFiles/java/sortAlgo`.
-It is important to note that the language identifier must match `Language#getIdentifier` to load the language during testing correctly.
-
-To automatically generate expected results, the test in `EndToEndGeneratorTest` can be executed to generate a JSON result description file. This file has to be copied to `JPlag/jplag.endToEndTesting/src/test/resources/results/<language>/<testSuite>.json`.
-Once the test data has been copied, the end-to-end tests can be successfully tested. As soon as a change in the detection takes place, the results will differ from the stored results, and the tests will fail if the results have changed.
-
-### Extending The Comparison Value
-As already described, the current comparisons in the end-to-end test treat the values of `minimal similarity`, `maximum similarity`, and `matched token number`.
-As soon as there is a need to extend these comparison values, this section describes how this can be achieved.
-Beforehand, however, this should be discussed in a new issue about this need.
-
-- For new comparison values, these properties must be extended in the `ExpectedResult` record at the package `de.jplag.endtoend.model`. Here it is sufficient to add the values in the record and to enter the JSON name as `@JsonProperty("json_name")`.
-
-```JAVA
-public record ExpectedResult(
-        @JsonProperty("minimal_similarity") float resultSimilarityMinimum,
-        @JsonProperty("maximum_similarity") float resultSimilarityMaximum,
-        @JsonProperty("matched_token_number") int resultMatchedTokenNumber) {
-}
-```
-
-- To include the new value in the tests, they must be added to the `EndToEndSuiteTest` as a comparison operation at the package `de.jplag.endtoend`. The `runJPlagTestSuite()` function provided for this purpose must be extended to include the new comparison value. To do this, create the comparison as shown in the code example below.
-
-```JAVA
-//...
-    if (areDoublesDifferent(result.resultSimilarityMaximum(), jPlagComparison.maximalSimilarity())) {
-        addToValidationErrors("maximal similarity", String.valueOf(result.resultSimilarityMaximum()),
-                String.valueOf(jPlagComparison.maximalSimilarity()));
-    }
-//...
-```
-
-- Once the tests run the first time, they will fail due to the missing values in the old JSON result file used for the test cases. The old results must then be replaced with new ones.
-For this purpose, the last section of the chapter [Copying Plagiarism To The Resources](#copying-plagiarism-to-the-resources) can help.
-
-### Extending JPlag Test Run Options
-The end-to-end tests support the possible scan options of the JPlag API. Currently, `minimumTokenMatch` is used in the end-to-end tests. These values are also stored in the JSON as configuration to keep the test cases at the options apart. Likewise, also changes in the logic of the different options are to be determined to be able.
-
-- To extend new options to the end-to-end tests, they must be added to the record object `Options` in the package `de.jplag.endtoend.model`. Here it is sufficient to add the values in the record and to enter the JSON name as `@JsonProperty("json_name")`.
-
-```JAVA
-public record Options(
-        @JsonProperty("minimum_token_match") Integer minimumTokenMatch) {
-}
-```
-
-- After the new value has been added to the record, the creation of the object must now also be adjusted in the `EndToEndSuiteTest`. The 'setRunOptions' function is provided for this purpose. The options can be added in any order and combination. It should be noted that each test case is run with these options.
-
-```JAVA
-    private void setRunOptions() {
-        options = new ArrayList<>();
-        options.add(new Options(1));
-        options.add(new Options(15));
-    }
-```
-
-- If you want to create individual test cases by testing the options only on a specific dataset, a new test case must be created for this purpose. The transfer parameter options can be adjusted and specified for the new test cases. This can then be tested with the function `runTests`.
-
-```JAVA
-    runTests(directoryName, option, currentLanguageIdentifier, testCase, currentResultDescription);
-```
+## References
+These elaborations provide basic ideas of what modifications of plagiarized source code can look like.
+These code adaptations refer to various changes, from
+adding/removing comments to architectural changes in the deliverables.
-For this purpose, the last section of the chapter [Copying Plagiarism To The Resources](#copying-plagiarism-to-the-resources) can be used as help. +The following elaborations were used to be able to create the plagiarisms with the broadest coverage: +- [Detecting Source Code Plagiarism on Introductory Programming Course Assignments Using a Bytecode Approach - Oscar Karnalim](https://ieeexplore.ieee.org/abstract/document/7910274 "Detecting Source Code Plagiarism on Introductory Programming Course Assignments Using a Bytecode Approach - Oscar Karnalim") +- [Detecting Disguised Plagiarism - Hatem A. Mahmoud](https://arxiv.org/abs/1711.02149 "Detecting Disguised Plagiarism - Hatem A. Mahmoud") +- [Instructor-centric source code plagiarism detection and plagiarism corpus](https://dl.acm.org/doi/abs/10.1145/2325296.2325328 "Instructor-centric source code plagiarism detection and plagiarism corpus")