This repository contains Oreo, a clone detector designed to find code clones in the Twilight Zone, along with the input data needed to run the tool and the materials used to evaluate it. The following sections describe these materials.
As mentioned, Oreo’s main goal is to find clones in the Twilight Zone. To define the Twilight Zone, we first need to explain clone types. Source code clones are categorized into four types in order of increasing detection difficulty, ranging from purely textual (Type-1) to purely semantic (Type-4). Between Type-3 and Type-4 lies a spectrum of clones that still exhibit some syntactic similarity but are extremely hard to detect; we call this category the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. Oreo is a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately but can also find harder-to-detect clones in the Twilight Zone. Oreo combines machine learning, information retrieval, and software metrics to attain this goal. The recall of Oreo has been evaluated using BigCloneEval, a popular benchmarking tool (https://jeffsvajlenko.weebly.com/bigcloneeval.html), and its precision has been estimated by manual inspection. Oreo has been demonstrated to have both high recall and high precision. More importantly, it pushes the boundary in the detection of clones with moderate to weak syntactic similarity in a scalable manner.
At the root directory of this repo, there is a folder named oreo; this folder contains Oreo’s source code. The source code is also available on GitHub (https://github.com/Mondego/oreo).
The deep neural network classifier used by Oreo is located in the aforementioned GitHub repository and also inside the ml_model folder, which is in the oreo folder of this repo. The model can also be downloaded from Google Drive (https://drive.google.com/drive/folders/1CHAYFbF42ZzZTGNnkfMg0MSiRdzFZwln?usp=sharing). Details on executing Oreo from the provided source code are given in the Installation file.
This repo also contains the input data for executing Oreo; these data were used in evaluating Oreo as reported in the paper. The input data are based on the BigCloneBench benchmarking dataset (https://github.com/clonebench/BigCloneBench). Please note that we provide the reduced version of BigCloneBench, which BigCloneEval (https://jeffsvajlenko.weebly.com/bigcloneeval.html) uses for conducting recall studies. Part of the input data is located in a folder named input at the root of this repo, and the rest is uploaded to Google Drive because of its size (about 300MB in total). The input files are as follows:
- bcb_dataset.zip is the zipped version of the Java source files in the reduced version of BigCloneBench, which can be used as the input to Oreo. This file is located at https://drive.google.com/open?id=1AUC2uA7ik7ZWrQ4x9ZcKGyyX6YNStG_y
- blocks.file contains the metrics calculated for each method in the above-mentioned BigCloneBench dataset. Each line of this file corresponds to one method and its metrics. Please note that Oreo needs the metrics for each method in its input dataset before it can launch clone detection. One can either run the Metrics Calculation component of Oreo on the unzipped version of bcb_dataset.zip to get the metrics file, or directly use blocks.file, which contains the calculated metrics. This file is located at https://drive.google.com/open?id=1AUC2uA7ik7ZWrQ4x9ZcKGyyX6YNStG_y
- blocks_small.file contains 10000 methods from the BigCloneBench dataset, along with their metrics, for the purpose of running Oreo quickly.
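Because each line of a metrics file corresponds to one method, a quick check of a downloaded file is to count its lines, and a smaller trial input can be cut from its first lines. The following is a minimal sketch, assuming the metrics file is plain text with one method per line; the helper names `count_methods` and `write_sample` are hypothetical and are not part of Oreo, and this is not necessarily how blocks_small.file was produced.

```python
from itertools import islice


def count_methods(metrics_path):
    """Count non-empty lines; each line is assumed to describe one method."""
    with open(metrics_path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


def write_sample(metrics_path, sample_path, limit):
    """Copy the first `limit` non-empty lines into a smaller metrics file.

    Returns the number of lines actually written.
    """
    written = 0
    with open(metrics_path, encoding="utf-8", errors="replace") as src, \
            open(sample_path, "w", encoding="utf-8") as dst:
        non_empty = (line for line in src if line.strip())
        for line in islice(non_empty, limit):
            # Make sure every copied record ends with a newline.
            dst.write(line if line.endswith("\n") else line + "\n")
            written += 1
    return written
```

For example, `count_methods("blocks_small.file")` reports how many methods a file holds under this one-line-per-method assumption, and `write_sample("blocks.file", "my_sample.file", 1000)` would write a 1000-method subset (the output file name is only an illustration).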
We have evaluated Oreo by comparing its recall and precision with those of other state-of-the-art tools. This repo provides all the materials that we used in the evaluation process. These materials are located in a folder named evaluation in the root directory of this repo.
The evaluation folder contains a folder named Recall; this folder contains the data produced during the evaluation of Oreo's recall using BigCloneEval. Recall contains one folder:
- The folder named BigCloneEvalReports contains the recall experiment reports generated by BigCloneEval on the BigCloneBench dataset for each tool mentioned in the recall study of Oreo in the paper: SourcererCC (Scc) [1], NiCad [2], CloneWorks [3], and Oreo. In other words, these are the reports generated by BigCloneEval after evaluating the clone pairs produced by these clone detectors. Please note that in the paper we compared Deckard’s [4] recall to Oreo’s as well; however, as mentioned in the paper, Deckard produced more than 400G of clone pairs, and we could not run Deckard on BigCloneEval because BigCloneEval failed to process this huge amount of clone pairs. The recall numbers for Deckard were therefore taken from SourcererCC’s paper [1]. For running BigCloneEval, we set mit (minimum number of tokens) to 50 (as also mentioned in the paper, to ensure that we do not have empty methods, and because it is the standard minimum clone size for benchmarking [5]), and mat (maximum number of tokens) to 500,000 (since it is highly unlikely for methods to have more than 500,000 tokens, and such huge methods are considered outliers). Also, mis (minimum similarity) was set to zero so that the evaluation is carried out for all clone types (Type-1 through Type-4).
- Since the clone pair files reported by each tool are very large (about 7GB in total), we are not including them in this repository. However, we have uploaded them to Google Drive, and they are accessible through this link: https://drive.google.com/open?id=1EtYWqaYk7C4ez5r8383HYGPtpGI37iJ1
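To make the effect of these three settings concrete, the sketch below shows how size and similarity thresholds of this kind restrict which clone pairs count in an evaluation. It is only an illustration: the values come from this README, but the function names are hypothetical, the inclusive bounds and the rule that both methods must satisfy the size limits are assumptions, and BigCloneEval's actual filtering logic is not reproduced here.

```python
# Values reported in this README for the BigCloneEval runs.
MIT = 50        # minimum number of tokens
MAT = 500_000   # maximum number of tokens
MIS = 0         # minimum similarity; zero keeps all clone types


def within_token_bounds(token_count, mit=MIT, mat=MAT):
    """Return True if a method's size lies in [mit, mat].

    Treating both bounds as inclusive is an assumption.
    """
    return mit <= token_count <= mat


def pair_is_evaluated(tokens_a, tokens_b, similarity,
                      mit=MIT, mat=MAT, mis=MIS):
    """Hypothetical filter: both methods must satisfy the size bounds
    and the pair must reach the minimum similarity."""
    return (within_token_bounds(tokens_a, mit, mat)
            and within_token_bounds(tokens_b, mit, mat)
            and similarity >= mis)
```

With mis set to zero, the similarity condition never excludes a pair, which is why the evaluation covers every clone type; only the token bounds remove pairs (for instance, a pair containing a 10-token method).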
The other folder inside the evaluation folder is Precision. This folder contains the files used in the precision experiment of the five tools: Oreo, SourcererCC, CloneWorks, Deckard, and NiCad. You will find the following folders inside this folder:
- The folder named inputfiles contains the 400 clone pairs used in the precision study of each tool. These clone pairs are sampled from the clone pairs reported by the tools during the recall experiment.
- The folder judge_1 contains the judgements of Judge 1. You will find a folder for each tool, named after the tool. Inside each of these folders you will find the input file (its name starts with the name of the tool), the true positives file (named tp.txt), and the false positives file (named fp.txt) generated during the precision experiment by Judge 1.
- The folder judge_2 contains the judgements of Judge 2. You will find a folder for each tool, named after the tool. Inside each of these folders you will find the input file (its name starts with the name of the tool), the true positives file (named tp.txt), and the false positives file (named fp.txt) generated during the precision experiment by Judge 2.
- We consolidated the false positives of both judges (for each tool), and the judges then went through these consolidated false positives again to resolve the conflicts. The files related to this part of the experiment are located inside the consolidated_fps_from_both_judges folder.
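A judge's precision for a tool follows from the two judgement files as TP / (TP + FP). The sketch below computes it under the assumption that tp.txt and fp.txt each list one clone pair per non-empty line; that file layout is an assumption, and the function names are hypothetical rather than part of the evaluation scripts used for the paper.

```python
def count_pairs(path):
    """Count non-empty lines; each line is assumed to hold one clone pair."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


def precision(tp_count, fp_count):
    """Precision = true positives / all judged pairs."""
    total = tp_count + fp_count
    if total == 0:
        raise ValueError("no judged pairs: precision is undefined")
    return tp_count / total


def precision_from_files(tp_path, fp_path):
    """Compute precision from a judge's tp.txt and fp.txt for one tool."""
    return precision(count_pairs(tp_path), count_pairs(fp_path))
```

It would be called with the paths of the tp.txt and fp.txt files found in one tool's folder under judge_1 or judge_2.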
In summary, this repo contains the following artifacts:
- Oreo source code, also located at: https://github.com/Mondego/oreo;
- The input files that can be used by Oreo;
- The files related to recall and precision experiments done for Oreo.
[1] Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K Roy, and Cristina V Lopes. 2016. SourcererCC: Scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering (ICSE16). IEEE, 1157–1168.
[2] Chanchal K Roy and James R Cordy. 2008. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In Proceedings of the 16th IEEE International Conference on Program Comprehension (ICPC08). IEEE, 172–181.
[3] Jeffrey Svajlenko and Chanchal K Roy. 2017. Fast and flexible large-scale clone detection with CloneWorks. In Proceedings of the 39th International Conference on Software Engineering Companion. IEEE Press, 27–30.
[4] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering. IEEE Computer Society, 96–105.
[5] Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore Merlo. 2007. Comparison and Evaluation of Clone Detection Tools. IEEE Transactions on Software Engineering 33, 9 (Sept 2007), 577–591.