Uber Remote Shuffle Service provides the capability for Apache Spark applications to store shuffle data on remote servers. See more details on Spark community document: [SPARK-25299][DISCUSSION] Improving Spark Shuffle Reliability.
Please contact us ([email protected]) for any question or feedback.
- The master branch supports Spark 2.4.x. The spark30 branch supports Spark 3.0.x.
Make sure JDK 8+ and maven is installed on your machine.
- Run:
mvn clean package -Pserver -DskipTests
This command creates remote-shuffle-service-xxx-server.jar file for RSS server, e.g. target/remote-shuffle-service-0.0.9-server.jar.
- Run:
mvn clean package -Pclient -DskipTests
This command creates remote-shuffle-service-xxx-client.jar file for RSS client, e.g. target/remote-shuffle-service-0.0.9-client.jar.
- Pick up a server in your environment, e.g.
server1
. Run RSS server jar file (remote-shuffle-service-xxx-server.jar) as a Java application, for example,
java -Dlog4j.configuration=log4j-rss-prod.properties -cp target/remote-shuffle-service-0.0.9-server.jar com.uber.rss.StreamServer -port 12222 -serviceRegistry standalone -dataCenter dc1
-
Upload client jar file (remote-shuffle-service-xxx-client.jar) to your HDFS, e.g.
hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
-
Add configure to your Spark application like following (you need to adjust the values based on your environment):
spark.jars=hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.shuffle.rss.serviceRegistry.type=standalone
spark.shuffle.rss.serviceRegistry.server=server1:12222
spark.shuffle.rss.dataCenter=dc1
- Run your Spark application
Remote Shuffle Service could use a Apache ZooKeeper cluster and register live service instances in ZooKeeper. Spark applications will look up ZooKeeper to find and use active Remote Shuffle Service instances.
In this configuration, ZooKeeper serves as a Service Registry for Remote Shuffle Service, and we need to add those parameters when starting RSS server and Spark application.
- Assume there is a ZooKeeper server
zkServer1
. Pick up a server in your environment, e.g.server1
. Run RSS server jar file (remote-shuffle-service-xxx-server.jar) as a Java application onserver1
, for example,
java -Dlog4j.configuration=log4j-rss-prod.properties -cp target/remote-shuffle-service-0.0.9-server.jar com.uber.rss.StreamServer -port 12222 -serviceRegistry zookeeper -zooKeeperServers zkServer1:2181 -dataCenter dc1
-
Upload client jar file (remote-shuffle-service-xxx-client.jar) to your HDFS, e.g.
hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
-
Add configure to your Spark application like following (you need to adjust the values based on your environment):
spark.jars=hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.shuffle.rss.serviceRegistry.type=zookeeper
spark.shuffle.rss.serviceRegistry.zookeeper.servers=zkServer1:2181
spark.shuffle.rss.dataCenter=dc1
- Run your Spark application