Spark Experiment Runner

All the code in this repo is licensed under the Apache License, Version 2.0.

This repo assumes that Hadoop, Spark, Hive, HDFS, and YARN are correctly installed and configured. Most of the shell scripts in this repo are Bourne shell compliant.

1. Edit the config.sh file to set parameters for PySpark and the TPC-DS benchmark data generation;
2. Generate the TPC-DS benchmark data using setup.sh in the gen_data folder;
3. Run experiments with run_pyspark_queries.sh.
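
The steps above can be sketched as a small wrapper. This is only an illustration: the function name run_workflow is hypothetical, and it assumes that both scripts are executable, take no arguments, read their settings from config.sh, and that setup.sh is meant to be started from inside gen_data. Check the scripts themselves before relying on any of this.

```shell
#!/bin/sh
# Hypothetical wrapper around the three steps of the workflow.
# Assumption: setup.sh and run_pyspark_queries.sh need no arguments.

run_workflow() {
    repo_dir=${1:-.}    # path to a checkout of this repo

    # Step 1 is manual: config.sh must have been edited beforehand.
    if [ ! -f "$repo_dir/config.sh" ]; then
        echo "config.sh not found in $repo_dir" >&2
        return 1
    fi

    # Step 2: generate the TPC-DS benchmark data.
    (cd "$repo_dir/gen_data" && ./setup.sh) || return 1

    # Step 3: run the experiments.
    (cd "$repo_dir" && ./run_pyspark_queries.sh)
}
```

Calling `run_workflow /path/to/Spark-Experiment-Runner` would then stop at the first failing step, so the experiments never start on incomplete benchmark data.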

Configuration

The configuration file, config.sh, is thoroughly commented.

Note that Spark versions before 1.5.0 do not provide a REST endpoint for retrieving all the logs related to an application. If your installation is recent enough, set REST_API=yes and set the HISTORY_SERVER variable to the HTTP address of the Spark History Server. Otherwise, disable REST_API and set the SPARK_LOGS variable to the HDFS path where the History Server stores its logs.
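
As an illustration, the two alternatives might look like this in config.sh. The host name, the HDFS path, and the use of `no` to disable REST_API are assumptions, so check the comments in config.sh for the values it actually accepts. The helper history_logs_url is hypothetical and is not part of this repo; it only shows how the History Server REST endpoint that serves an application's event logs is addressed.

```shell
# Option A: Spark 1.5.0 or later, logs fetched through the History Server REST API.
REST_API=yes
# Assumed address; 18080 is the default port of the Spark History Server.
HISTORY_SERVER=http://localhost:18080

# Option B: older Spark, logs read directly from HDFS.
# REST_API=no                  # assumed value for "disabled"
# SPARK_LOGS=/spark-history    # assumed HDFS path

# Hypothetical helper: print the REST URL that returns the event logs
# of one application as a zip archive.
history_logs_url() {
    # $1: application ID as reported by Spark or YARN
    printf '%s/api/v1/applications/%s/logs\n' "${HISTORY_SERVER%/}" "$1"
}
```

With Option A, the printed URL could be passed to a tool such as curl to download the archive by hand.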
