Configuration Datasets

We release the datasets we used to study the configuration practices and issues of real-world users. Some of the findings are summarized in the paper,

T. Xu, L. Jin, X. Fan, Y. Zhou, S. Pasupathy, and R. Talwadker, Hey, You Have Given Me Too Many Knobs! ---Understanding and Dealing with Over-Designed Configuration in System Software, Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE'15), Oct. 31-Sep. 4, 2015, Bergamo, Italy. [[Download] (http://cseweb.ucsd.edu/~tixu/papers/fse15.pdf)]

Please cite the paper if you use the datasets :-)

What is included in the datasets?

The datasets contain two subsets, the configuration files used by real-world users and the configuration issues encountered by the users. The former can help studies on the users' practices in configuring software systems, while the latter can help studies on the user's difficulties and errors during their configuration tasks.

The layout of the dataset:

configfiles: this directory contains the datasets of the configuration files
configissues: this directory contains the datasets of the configuration issues

What is the scale of the datasets?

configfiles

The configfiles dataset contains configuration files of two software systems, Apache HTTP server (httpd) (http://httpd.apache.org/) and MySQL database server (https://www.mysql.com/).

In our study, we only used specific versions of httpd and MySQL, i.e., httpd-2.2.x and mysqld-5.x. But the dataset contains the configuration files of all the versions we collected.

Software	Version	#files
httpd	all	311
httpd	2.2.x	168
mysqld	all	823
mysqld	5.x	260

Only the files we studied (httpd-2.2.x and mysqld-5.x) are included in the datasets. For the rest, we provide you with the link so you can download them directly if you needed.

configissues

The configissues dataset contains the issues of three software projects, Apache HTTP server (http://httpd.apache.org/), MySQL (https://www.mysql.com/), and Hadoop (https://hadoop.apache.org/).

Software	#issues
httpd	97
mysqld	96
hadoop	98

Note that we do not consider version in this datasets, which means an issue could be from any version of the software.

How did we collect the datasets?

Please refer to the paper. Basically, all the files are collected from the web, including Q&A forums and the official mailing lists of the software, including

StackOverflow (http://stackoverflow.com/)
ServerFault (http://serverfault.com/)
Pro Webmasters (http://webmasters.stackexchange.com/)
Database Administrators (http://dba.stackexchange.com/)
MYSQL General List ([email protected])
MySQL on Win32 ([email protected])
MySQL Cluster ([email protected])
MySQL Replication ([email protected])
MySQL Backup ([email protected])
InnoDB Storage Engine ([email protected])
Apache HTTP Server Users Mailing Lists ([email protected])
Hadoop User Mailing Lists ([email protected])

Believe or not, we crawled/downloaded the entire mailing list archives and online posts with specific tags from the first mail/post to the most recent ones at the collection period. The collection is conducted during January to March in 2014. We automatically parsed the entire mailing list archives and all the posts on the above sources via a number of filter pipelines. At the end of the pipeline, we manually collect the files from the sources.

The issue dataset was collected from the same sources. We used a rolling-based sampling methods to get the datasets, which means we didn't look at all the mail threads/posts (which is not possible).

How can you use the datasets?

The instruction of the file format and layout can be found in the README in the dataset folders.

Warning

When using the dataset, you have to be aware that the datasets only contain information about parameter-based configurations. It does not contain configuration issues or practices such as compatibility or components. In other words, it reflects a subset of the whole problem.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
configfiles		configfiles
configissues		configissues
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Configuration Datasets

What is included in the datasets?

What is the scale of the datasets?

configfiles

configissues

How did we collect the datasets?

How can you use the datasets?

Warning

About

Releases

Packages

Languages

tianyin/configuration_datasets

Folders and files

Latest commit

History

Repository files navigation

Configuration Datasets

What is included in the datasets?

What is the scale of the datasets?

configfiles

configissues

How did we collect the datasets?

How can you use the datasets?

Warning

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages