diff --git a/core/pom.xml b/core/pom.xml
index be56911b9e45a..3e9b23733499f 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -35,7 +35,7 @@
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
-
+
       <groupId>net.java.dev.jets3t</groupId>
       <artifactId>jets3t</artifactId>
diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md
new file mode 100644
index 0000000000000..a3179fce59c13
--- /dev/null
+++ b/docs/openstack-integration.md
@@ -0,0 +1,237 @@
+---
+layout: global
+title: Accessing OpenStack Swift from Spark
+---
+
+# Accessing OpenStack Swift from Spark
+
+Spark's file interface allows it to process data in OpenStack Swift using the same URI
+formats that are supported for Hadoop. You can specify a path in Swift as input through a
+URI of the form `swift://container.PROVIDER/path`, where `container` is the Swift container
+name and `PROVIDER` is the provider name used in the Hadoop configuration, as described below.
+
+# Dependencies
+
+Spark should be compiled with the `hadoop-openstack` artifact that is distributed with
+Hadoop 2.3.0. For Maven builds, the `dependencyManagement` section of Spark's main
+`pom.xml` should include:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-openstack</artifactId>
+      <version>2.3.0</version>
+    </dependency>
+
+In addition, both the `core` and `yarn` projects should add `hadoop-openstack` to the
+`dependencies` section of their `pom.xml`:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-openstack</artifactId>
+    </dependency>
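+As a quick, optional sanity check (a suggestion beyond the build steps above), you can
+verify from `spark-shell` that the Swift driver is on the classpath by loading the class
+that is later declared in `fs.swift.impl`:
+
+    // Throws ClassNotFoundException if hadoop-openstack is not on the classpath.
+    Class.forName("org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")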
+# Configuration of Spark
+
+Create `core-site.xml` and place it inside Spark's `conf` directory. There are two main
+categories of parameters that should be configured: the declaration of the Swift driver,
+and the parameters required by Keystone.
+
+Configuring Hadoop to use the Swift file system is achieved via:
+
+<table class="table">
+<tr><th>Property Name</th><th>Value</th></tr>
+<tr>
+  <td><code>fs.swift.impl</code></td>
+  <td><code>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</code></td>
+</tr>
+</table>
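+The same declaration can also be made at runtime instead, through the `SparkContext`'s
+Hadoop configuration. A minimal `spark-shell` sketch:
+
+    // Declare the Swift driver programmatically rather than in core-site.xml.
+    // sc is the SparkContext that spark-shell provides.
+    sc.hadoopConfiguration.set("fs.swift.impl",
+      "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")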
+Additional parameters are required by Keystone and must be provided to the Swift driver.
+They are used to authenticate against Keystone in order to access Swift. The following
+table lists these parameters; `PROVIDER` can be any name.
+
+<table class="table">
+<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
+  <td>Keystone Authentication URL</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
+  <td>Keystone endpoints prefix</td>
+  <td>Optional</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
+  <td>Tenant</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.username</code></td>
+  <td>Username</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.password</code></td>
+  <td>Password</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
+  <td>HTTP port</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.region</code></td>
+  <td>Keystone region</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.public</code></td>
+  <td>Indicates if all URLs are public</td>
+  <td>Mandatory</td>
+</tr>
+</table>
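+These parameters can likewise be provided at runtime through the `SparkContext`'s Hadoop
+configuration rather than in `core-site.xml`. A minimal `spark-shell` sketch, assuming
+`PROVIDER=SparkTest` and the example values used below:
+
+    // Supply the Keystone parameters at runtime; the values mirror the
+    // core-site.xml example that follows.
+    val cfg = sc.hadoopConfiguration
+    cfg.set("fs.swift.service.SparkTest.auth.url", "http://127.0.0.1:5000/v2.0/tokens")
+    cfg.set("fs.swift.service.SparkTest.http.port", "8080")
+    cfg.set("fs.swift.service.SparkTest.region", "RegionOne")
+    cfg.set("fs.swift.service.SparkTest.public", "true")
+    cfg.set("fs.swift.service.SparkTest.tenant", "test")
+    cfg.set("fs.swift.service.SparkTest.username", "tester")
+    cfg.set("fs.swift.service.SparkTest.password", "testing")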
+
+For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password
+`testing`, defined for tenant `test`. Then `core-site.xml` should include:
+
+    <configuration>
+      <property>
+        <name>fs.swift.impl</name>
+        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.auth.url</name>
+        <value>http://127.0.0.1:5000/v2.0/tokens</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
+        <value>endpoints</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.http.port</name>
+        <value>8080</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.region</name>
+        <value>RegionOne</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.public</name>
+        <value>true</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.tenant</name>
+        <value>test</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.username</name>
+        <value>tester</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.password</name>
+        <value>testing</value>
+      </property>
+    </configuration>
+
+Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username` and
+`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in
+`core-site.xml` is not always a good approach. We suggest keeping those parameters in
+`core-site.xml` only for testing purposes, when running Spark via `spark-shell`. For job
+submissions they should be provided via `sparkContext.hadoopConfiguration`.
+
+# Usage examples
+
+Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and Keystone
+contains tenant `test` and user `tester` with password `testing`. In our example we define
+`PROVIDER=SparkTest`. Assume that Swift contains a container `logs` with an object
+`data.log`. To access `data.log` from Spark, the `swift://` scheme should be used.
+
+## Running Spark via spark-shell
+
+Make sure that `core-site.xml` contains `fs.swift.service.SparkTest.tenant`,
+`fs.swift.service.SparkTest.username` and `fs.swift.service.SparkTest.password`. Run Spark
+via `spark-shell` and access Swift via the `swift://` scheme:
+
+    val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
+    sfdata.count()
+
+## Job submission via spark-submit
+
+In this case `core-site.xml` need not contain `fs.swift.service.SparkTest.tenant`,
+`fs.swift.service.SparkTest.username` and `fs.swift.service.SparkTest.password`, since
+they are provided at runtime through the Hadoop configuration. Example of Java usage:
+
+    /* SimpleApp.java */
+    import org.apache.spark.api.java.*;
+    import org.apache.spark.SparkConf;
+
+    public class SimpleApp {
+      public static void main(String[] args) {
+        String logFile = "swift://logs.SparkTest/data.log";
+        SparkConf conf = new SparkConf().setAppName("Simple Application");
+        JavaSparkContext sc = new JavaSparkContext(conf);
+        // Provide the sensitive Keystone parameters at runtime rather than
+        // keeping them in core-site.xml.
+        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
+        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");
+        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");
+
+        JavaRDD<String> logData = sc.textFile(logFile).cache();
+
+        long num = logData.count();
+
+        System.out.println("Total number of lines: " + num);
+      }
+    }
+
+The directory structure is:
+
+    find .
+    ./src
+    ./src/main
+    ./src/main/java
+    ./src/main/java/SimpleApp.java
+
+The Maven `pom.xml` is:
+
+    <?xml version="1.0" encoding="UTF-8"?>
+    <project>
+      <groupId>edu.berkeley</groupId>
+      <artifactId>simple-project</artifactId>
+      <modelVersion>4.0.0</modelVersion>
+      <name>Simple Project</name>
+      <packaging>jar</packaging>
+      <version>1.0</version>
+      <repositories>
+        <repository>
+          <id>Akka repository</id>
+          <url>http://repo.akka.io/releases</url>
+        </repository>
+      </repositories>
+      <build>
+        <plugins>
+          <plugin>
+            <groupId>org.apache.maven.plugins</groupId>
+            <artifactId>maven-compiler-plugin</artifactId>
+            <version>2.3</version>
+            <configuration>
+              <source>1.6</source>
+              <target>1.6</target>
+            </configuration>
+          </plugin>
+        </plugins>
+      </build>
+      <dependencies>
+        <dependency>
+          <groupId>org.apache.spark</groupId>
+          <artifactId>spark-core_2.10</artifactId>
+          <version>1.0.0</version>
+        </dependency>
+      </dependencies>
+    </project>
+
+Compile and execute:
+
+    mvn package
+    SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar
diff --git a/pom.xml b/pom.xml
index 0d46bb4114f73..953f2d4c3d7b4 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
     <codahale.metrics.version>3.0.0</codahale.metrics.version>
    <avro.version>1.7.6</avro.version>
    <jets3t.version>0.7.1</jets3t.version>
-
+
    <PermGen>64m</PermGen>
    <MaxPermGen>512m</MaxPermGen>