First draft of python sql programming guide.
marmbrus authored and ahirreddy committed Apr 15, 2014
1 parent b0192d3 commit e00980f
96 changes: 92 additions & 4 deletions docs/sql-programming-guide.md
@@ -20,7 +20,7 @@ a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

-**All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell.**
+**All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`.**

</div>

@@ -33,6 +33,19 @@ a schema that describes the data types of each column in the row. A JavaSchemaRDD is similar to a table
in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
</div>

<div data-lang="python" markdown="1">

Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using
Spark. At the core of this component is a new type of RDD,
[SchemaRDD](). SchemaRDDs are composed of
[Row]() objects along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

**All of the examples on this page use sample data included in the Spark distribution and can be run in the `pyspark` shell.**
</div>
</div>

***************************************************************************************************
@@ -44,7 +57,7 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

The entry point into all relational functionality in Spark is the
[SQLContext](api/sql/core/index.html#org.apache.spark.sql.SQLContext) class, or one of its
-decendents. To create a basic SQLContext, all you need is a SparkContext.
+descendants. To create a basic SQLContext, all you need is a SparkContext.

{% highlight scala %}
val sc: SparkContext // An existing SparkContext.
@@ -60,7 +73,7 @@ import sqlContext._

The entry point into all relational functionality in Spark is the
[JavaSQLContext](api/sql/core/index.html#org.apache.spark.sql.api.java.JavaSQLContext) class, or one
-of its decendents. To create a basic JavaSQLContext, all you need is a JavaSparkContext.
+of its descendants. To create a basic JavaSQLContext, all you need is a JavaSparkContext.

{% highlight java %}
JavaSparkContext ctx = ...; // An existing JavaSparkContext.
@@ -69,6 +82,19 @@ JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx);

</div>

<div data-lang="python" markdown="1">

The entry point into all relational functionality in Spark is the
[SQLContext]() class, or one
of its descendants. To create a basic SQLContext, all you need is a SparkContext.

{% highlight python %}
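# sc is an existing SparkContext (created automatically in the pyspark shell).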
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
{% endhighlight %}

</div>

</div>

## Running SQL on RDDs
@@ -81,7 +107,7 @@ One type of table that is supported by Spark SQL is an RDD of Scala case classes
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as Sequences or Arrays. This RDD can be implicitly converted to a SchemaRDD and then be
-registered as a table. Tables can used in subsequent SQL statements.
+registered as a table. Tables can be used in subsequent SQL statements.

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
@@ -176,6 +202,27 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {

</div>

<div data-lang="python" markdown="1">

One type of table that is supported by Spark SQL is an RDD of dictionaries. The keys of the
dictionary define the column names of the table, and the types are inferred by looking at the first
row. Any RDD of dictionaries can be converted to a SchemaRDD and then registered as a table. Tables
can be used in subsequent SQL statements.

{% highlight python %}
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})

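# Infer the schema, and register the SchemaRDD as a table.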
peopleTable = sqlCtx.inferSchema(people)
peopleTable.registerAsTable("people")

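# SQL can be run over SchemaRDDs that have been registered as a table.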
teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
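# The results of SQL queries are RDDs and support all the normal RDD operations.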
teenNames = teenagers.map(lambda p: "Name: " + p.name)
{% endhighlight %}
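
To materialize the query results on the driver, collect the RDD with an ordinary RDD action. A
minimal continuation of the example above, assuming the sample `people.txt` data shipped with
Spark:

{% highlight python %}
# Collect the transformed names back to the driver and print them.
for name in teenNames.collect():
    print(name)
{% endhighlight %}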

</div>

</div>

**Note that Spark SQL currently uses a very basic SQL parser.**
@@ -231,6 +278,27 @@ parquetFile.registerAsTable("parquetFile");
JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");


{% endhighlight %}

</div>

<div data-lang="python" markdown="1">

{% highlight python %}

peopleTable # The SchemaRDD from the previous example.

# SchemaRDDs can be saved as parquet files, maintaining the schema information.
peopleTable.saveAsParquetFile("people.parquet")

# Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a SchemaRDD.
parquetFile = sqlCtx.parquetFile("people.parquet")

# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerAsTable("parquetFile")
teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")

{% endhighlight %}
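
Rows read back from a parquet file behave like those of any other SchemaRDD. As a short sketch
under the same sample data, the results of the query above can be transformed and collected as
usual:

{% highlight python %}
# Rows from the parquet-backed table can be mapped and collected like any RDD.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for name in teenNames.collect():
    print(name)
{% endhighlight %}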

</div>
@@ -318,4 +386,24 @@ Row[] results = hiveCtx.hql("FROM src SELECT key, value").collect();

</div>

<div data-lang="python" markdown="1">

When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext` and
adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
the `sql` method, a `HiveContext` also provides an `hql` method, which allows queries to be
expressed in HiveQL.

{% highlight python %}

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

hiveCtx.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveCtx.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL.
results = hiveCtx.hql("FROM src SELECT key, value").collect()

{% endhighlight %}
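
Since `collect()` returns a plain Python list of rows, the query results can be inspected like any
other sequence. A small sketch, assuming the `kv1.txt` sample data has been loaded as above:

{% highlight python %}
# Print the first few key/value rows returned by the HiveQL query.
for row in results[:5]:
    print(row)
{% endhighlight %}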

</div>
