Common code segment selector (C2S2) is an independent tool to select common code segments for exclusion from code similarity detection. It accepts a set of student submissions and lists the common segments. The segments are subject to manual investigation before being excluded from similarity detection. If the similarity detection tool does not accommodate such exclusion but can deal with uncompilable code, C2S2 can remove the common segments from that set of programs. Further details can be found in the corresponding paper published at the 52nd ACM Technical Symposium on Computer Science Education (SIGCSE 2021), or in the recorded presentation. Currently, the tool covers two programming languages: Java and Python.
Mode 1 (selection) lists any common segments from the given student submissions and stores the result in an output file.
Quick command:
select <input_dirpath> <programming_language> <output_filepath>
Complete command:
select <input_dirpath> <programming_language> <output_filepath> <additional_keywords_path> <inclusion_threshold> <min_ngram_length> <max_ngram_length> coderesult generalised startident lineexclusive subremove
Any of the last five arguments can be removed to adjust the selection's behaviour. Further details about those can be seen below.
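As an illustration of how the complete command is assembled, the following Python sketch builds the argument list from 'select' onward. The helper name build_select_args is hypothetical and not part of C2S2, and how the tool itself is launched is not covered here. The default values mirror those documented below (0.75, 10, 50, 'null' for unused keywords, and the four flags enabled by the quick command).

```python
# Hypothetical helper (not part of C2S2): assembles the documented
# 'select' arguments in the order shown in the complete command.
FLAG_ORDER = ["coderesult", "generalised", "startident",
              "lineexclusive", "subremove"]


def build_select_args(input_dirpath, programming_language, output_filepath,
                      additional_keywords_path="null",
                      inclusion_threshold=0.75,
                      min_ngram_length=10, max_ngram_length=50,
                      flags=("generalised", "startident",
                             "lineexclusive", "subremove")):
    # Validate against the value ranges given in the parameter descriptions.
    if programming_language not in ("java", "py"):
        raise ValueError("programming_language must be 'java' or 'py'")
    if not 0 <= inclusion_threshold <= 1:
        raise ValueError("inclusion_threshold must be between 0 and 1")
    if min_ngram_length < 1 or max_ngram_length <= min_ngram_length:
        raise ValueError("need 1 <= min_ngram_length < max_ngram_length")
    unknown = set(flags) - set(FLAG_ORDER)
    if unknown:
        raise ValueError("unknown flags: " + ", ".join(sorted(unknown)))

    args = ["select", input_dirpath, programming_language, output_filepath,
            additional_keywords_path, str(inclusion_threshold),
            str(min_ngram_length), str(max_ngram_length)]
    # Keep only the requested flags, in the documented order.
    args += [flag for flag in FLAG_ORDER if flag in flags]
    return args
```

Keeping the arguments as a list (rather than one string) means that paths containing spaces need no manual quoting when the list is handed to a process launcher such as Python's subprocess.run.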
Mode 2 (removal) removes common code segments from the given student submissions. It accepts a directory containing the code files and generates the results under a new directory named '[result]' + the given input directory.
Command:
remove <input_dirpath> <programming_language> <common_code_filepath> <additional_keywords_path> <common_code_type>
<additional_keywords_path>: A string representing the path of a file containing additional keywords, one per line. Keywords with more than one token should be written with spaces between the tokens. For example, 'System.out.print' should be written as 'System . out . print'. If unused, please set this to 'null'.
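The keyword file format can be produced with a small script. The sketch below is an assumption-laden approximation: it splits keywords with a regular expression (word runs and single punctuation characters), whereas the tool itself tokenises with ANTLR, so multi-character operators may be split differently. The function names to_keyword_line and write_keywords_file are hypothetical.

```python
import re


def to_keyword_line(keyword):
    # Approximate tokenisation (assumption): runs of word characters stay
    # whole; every other non-space character becomes its own token.
    tokens = re.findall(r"\w+|[^\w\s]", keyword)
    return " ".join(tokens)


def write_keywords_file(keywords, path):
    # One keyword per line, newline as the delimiter.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(to_keyword_line(k) for k in keywords))
```

For instance, to_keyword_line('System.out.print') yields 'System . out . print', matching the example above.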
<common_code_filepath>: A string representing the path of a file containing common code segments. The file can be either mode 1's output or arbitrary code written in compliance with the programming language's syntax.
<common_code_type>: A string that should be either 'code', 'codegeneralised', or 'complete'. The first means the common code file is a regular code file. The second is similar to the first, except that the code tokens will be generalised prior to being compared for exclusion. The third means the common code file is mode 1's output produced without the 'coderesult' parameter.
<input_dirpath>: A string representing the input directory containing student submissions (each submission is represented by either one file or one sub-directory). Please use quotes if the path contains spaces.
<inclusion_threshold>: A floating-point number representing the minimum threshold for common segment inclusion. Any segment whose proportion of occurrence across submissions is higher than or equal to the threshold is included. The default is 0.75: all segments that occur in at least three quarters of the submissions are included.
Value: a floating-point number between 0 and 1 (inclusive).
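The threshold rule can be illustrated with a minimal Python sketch. It assumes the occurrences have already been collected as a mapping from each segment to the set of submissions that contain it; this data structure and the function name select_common_segments are hypothetical and do not describe the tool's internals.

```python
def select_common_segments(segment_to_submissions, total_submissions,
                           inclusion_threshold=0.75):
    # segment_to_submissions: hypothetical mapping from a segment to the
    # set of submission names in which that segment occurs.
    common = []
    for segment, submissions in segment_to_submissions.items():
        proportion = len(submissions) / total_submissions
        # A segment is included when its proportion meets the threshold.
        if proportion >= inclusion_threshold:
            common.append(segment)
    return common
```

With four submissions and the default threshold, a segment found in three of them (proportion 0.75) is included, while one found in only two (0.5) is not.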
<max_ngram_length>: An integer specifying the largest n-gram length of the filtered common segments. The default is 50.
Value: a positive integer higher than <min_ngram_length>.
<min_ngram_length>: An integer specifying the smallest n-gram length of the filtered common segments. The default is 10.
Value: a positive integer.
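To show how the two length bounds interact, here is a short Python sketch that enumerates token n-grams whose length lies between the minimum and the maximum (both inclusive). It assumes a submission is already available as a flat list of tokens; the function name ngrams_in_range is hypothetical and the tool's actual enumeration strategy may differ.

```python
def ngrams_in_range(tokens, min_ngram_length=10, max_ngram_length=50):
    # Enforce the documented constraint: max must be higher than min.
    if min_ngram_length < 1 or max_ngram_length <= min_ngram_length:
        raise ValueError("need 1 <= min_ngram_length < max_ngram_length")
    result = []
    for n in range(min_ngram_length, max_ngram_length + 1):
        # Slide a window of n tokens over the token list; lengths larger
        # than the token list simply produce no windows.
        for start in range(len(tokens) - n + 1):
            result.append(tuple(tokens[start:start + n]))
    return result
```

A wider range yields more candidate segments, so narrowing the bounds is one way to keep the output manageable for manual investigation.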
<output_filepath>: A string representing the filepath of the output file, which will contain the common segments. Please use quotes if the path contains spaces.
<programming_language>: A constant specifying the programming language used in the given student submissions.
Value: 'java' (for Java) or 'py' (for Python).
coderesult: This displays the suggested segments as raw code instead of in generalised form, without any information about the variation. The segments can then be passed directly to a code similarity detection tool for exclusion. It is set to false by default.
generalised: This enables token generalisation while selecting common segments. It is set to true for the quick command. See the paper for details.
lineexclusive: This ensures the common segment selection only considers segments that start at the beginning of a line and end at the end of a line. It is set to true for the quick command. See the paper for details.
startident: This ensures the common segment selection only considers segments that start with an identifier or keyword. It is set to true for the quick command. See the paper for details.
subremove: This removes from the result any common segments that are part of longer fragments. It is set to true for the quick command. See the paper for details.
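The idea of dropping segments that are part of longer fragments can be sketched as follows. This is only an illustration under the assumption that segments are token sequences and that "part of" means a contiguous sub-sequence; the paper describes the actual behaviour, and the function names here are hypothetical.

```python
def is_contiguous_subsequence(short, long):
    # True when 'short' appears as a contiguous run inside 'long'.
    n = len(short)
    return any(tuple(long[i:i + n]) == tuple(short)
               for i in range(len(long) - n + 1))


def remove_subsegments(segments):
    kept = []
    for segment in segments:
        # Drop a segment if any strictly longer segment contains it.
        covered = any(len(other) > len(segment)
                      and is_contiguous_subsequence(segment, other)
                      for other in segments)
        if not covered:
            kept.append(segment)
    return kept
```

For example, given the segments ('a', 'b', 'c'), ('b', 'c') and ('x', 'y'), the second is dropped because it is contained in the first.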
This tool uses ANTLR to tokenise given programs. It also adapts arunjeyapal's implementation of RKR-GST to remove common code segments.