CrossSim is a versatile tool that exploits a graph representation: it can incorporate various features into the similarity computation, e.g., third-party libraries, API function calls, and package names, to name a few. An evaluation on a dataset of 580 GitHub projects shows that the tool outperforms MUDABlue, CLAN, and RepoPal with respect to different quality metrics.
This repository contains the tools and the dataset for the following papers:
- A paper published in the Proceedings of the 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2018).
CrossSim: exploiting mutual relationships to detect similar OSS projects (Link)
Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei, Davide Di Ruscio
Department of Information Engineering, Computer Science and Mathematics, Università degli Studi dell'Aquila
Via Vetoio 2, 67100 -- L'Aquila, Italy
The tools and dataset that support this paper are available in the release https://github.com/crossminer/CrossSim/tree/0.0.1
- A paper that has been accepted for publication in the Software Quality Journal
An Automated Approach to Assess the Similarity of GitHub Repositories
Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei, Davide Di Ruscio
Department of Information Engineering, Computer Science and Mathematics, Università degli Studi dell'Aquila
Via Vetoio 2, 67100 -- L'Aquila, Italy
The tools and dataset that support this paper are available in the release https://github.com/crossminer/CrossSim/releases/tag/0.0.2
To execute CrossSim on a dataset consisting of 580 GitHub projects (the file tool/CrossSim/evalaution.properties specifies the input path of the mined data), please run the following command:
$ mvn -e exec:java -Dexec.mainClass="org.crossminer.similaritycalculator.CrossSim.Runner"
CrossSim takes two files as input. The graph is stored in the file "graph", where each line has the following format:
node1#node2
which represents one edge of the graph: node1 -> node2.
The file "dictionary" stores all the artifacts included in the computation and serves as a reference for the graph nodes. The nodes are either:
- users/developers;
- projects;
- dependencies (third-party libraries).
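The following Java sketch shows how the two input files could be loaded. The node1#node2 edge format follows the description above; treating "dictionary" as a plain list with one artifact per line is our assumption, and the class name is purely illustrative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Minimal loader for the two CrossSim input files described above.
// The "node1#node2" edge format follows this README; reading "dictionary"
// as one artifact per line is an assumption, and all names are illustrative.
public class InputLoader {

    public static void main(String[] args) throws IOException {
        List<String[]> edges = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("graph"))) {
            if (line.isEmpty()) {
                continue; // skip blank lines defensively
            }
            // edge[0] -> edge[1], as specified by the node1#node2 format
            edges.add(line.split("#", 2));
        }
        List<String> artifacts = Files.readAllLines(Paths.get("dictionary"));
        System.out.printf("Loaded %d edges over %d artifacts%n",
                edges.size(), artifacts.size());
    }
}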
CrossSim outputs a matrix of similarity scores. This matrix is stored as a set of files (one file for each project in the dataset). Each file contains the ranked list of similarity scores between that project and all the other projects in the dataset.
For example, the first few lines of the output file AKSW_RDFUnit.txt for the usage example above are:
git://github.com/AKSW/RDFUnit.git git://github.com/AKSW/RDFUnit.git 1.0
git://github.com/AKSW/RDFUnit.git git://github.com/pyvandenbussche/sparqles.git 0.0020606061443686485
git://github.com/AKSW/RDFUnit.git git://github.com/dbpedia/links.git 0.001839826931245625
git://github.com/AKSW/RDFUnit.git git://github.com/rdfhdt/hdt-java.git 0.001507760607637465
git://github.com/AKSW/RDFUnit.git git://github.com/AKSW/Sparqlify.git 9.407114703208208E-4
git://github.com/AKSW/RDFUnit.git git://github.com/streamreasoning/CSPARQL-engine.git 8.780991774983704E-4
git://github.com/AKSW/RDFUnit.git git://github.com/jprante/elasticsearch-plugin-rdf-jena.git 7.993730832822621E-4
git://github.com/AKSW/RDFUnit.git git://github.com/AKSW/jena-sparql-api.git 7.858243770897388E-4
git://github.com/AKSW/RDFUnit.git git://github.com/nkons/r2rml-parser.git 7.024793303571641E-4
git://github.com/AKSW/RDFUnit.git git://github.com/castagna/freebase2rdf.git 6.818181718699634E-4
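The ranked lists can also be consumed programmatically. Below is a minimal Java sketch that prints the top entries of an output file such as AKSW_RDFUnit.txt, assuming the whitespace-separated three-column layout shown above (source project, compared project, similarity score); the class name and the cut-off k are illustrative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Minimal reader for one CrossSim output file: each line holds
// <source project> <compared project> <similarity score>, already
// ranked by decreasing score. Class name and cut-off are illustrative.
public class TopSimilar {

    public static void main(String[] args) throws IOException {
        int k = 5; // number of top-ranked entries to print
        List<String> lines = Files.readAllLines(Paths.get("AKSW_RDFUnit.txt"));
        for (String line : lines.subList(0, Math.min(k, lines.size()))) {
            String[] cols = line.trim().split("\\s+");
            // cols[1] is the compared project, cols[2] its similarity score
            System.out.printf("%s -> %s%n", cols[1], cols[2]);
        }
    }
}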
For comparison purposes, we re-implemented MUDABlue, CLAN, and RepoPal. These are provided in the tool/MudaBlue, tool/CLAN, and tool/RepoPal folders, respectively. Each folder contains a readme file that describes how to run the corresponding tool.
MUDABlue is a Java implementation of the approach described in the following paper:
MUDABlue: An Automatic Categorization System for Open Source Repositories (link);
CLAN is a Java implementation of the approach described in the following paper:
Detecting Similar Software Applications (link);
RepoPal is a Java implementation of the approach described in the following paper:
Detecting Similar Repositories on GitHub (link);
The dataset used in the paper is available in the dataset/ subdirectory. In particular:
- queries.txt is the list of queries used in the evaluation;
- human evaluation.xlsx contains the qualitative analysis results, including the scores given by the human evaluators;
- repository.txt is the list of repositories;
- RepoPal results are in RepoPal_Results.csv;
- CrossSim results are in CrossSim_Results.csv;
- MudaBlue results are in MudaBlue_Results.csv;
- CLAN results are in Clan_Results.csv;
Please report any bugs using GitHub's issue tracker.
If you use the tool or the dataset in your research, please cite our work using the following BibTeX entry:
@INPROCEEDINGS{8498236,
author={P. T. Nguyen and J. Di Rocco and R. Rubei and D. Di Ruscio},
booktitle={2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)},
title={CrossSim: Exploiting Mutual Relationships to Detect Similar OSS Projects},
year={2018},
pages={388-395},
keywords={Libraries;Open source software;Ecosystems;Semantics;Computational modeling;Software systems;Mining software repositories, software similarities, SimRank},
doi={10.1109/SEAA.2018.00069},
month={Aug}
}
If you use any of the re-implementations of MUDABlue, CLAN, or RepoPal, please also cite the following paper:
@article{Nguyen:2019:SQJ:CrossSim,
doi = {10.1007/s11219-019-09483-0},
url = {https://doi.org/10.1007%2Fs11219-019-09483-0},
year = 2020,
month = {feb},
publisher = {Springer Science and Business Media {LLC}},
author = {Phuong T. Nguyen and Juri {Di Rocco} and Riccardo Rubei and Davide {Di Ruscio}},
title = {An automated approach to assess the similarity of {GitHub} repositories},
journal = {Software Quality Journal}
}
If you encounter any difficulties working with the tool or the datasets, please do not hesitate to contact us at one of the following email addresses: [email protected], [email protected]. We will do our best to answer you as soon as possible.