Create starting point for combined user guide for DataFusion and Ballista (#20)
andygrove authored Apr 21, 2021
1 parent c365a4f commit abe84cf
Showing 19 changed files with 191 additions and 27 deletions.
2 changes: 0 additions & 2 deletions ballista/docs/user-guide/.gitignore

This file was deleted.

1 change: 1 addition & 0 deletions docs/user-guide/.gitignore
@@ -0,0 +1 @@
book
14 changes: 4 additions & 10 deletions ballista/docs/user-guide/README.md → docs/user-guide/README.md
@@ -16,21 +16,15 @@
specific language governing permissions and limitations
under the License.
-->
# Ballista User Guide Source
# DataFusion User Guide Source

This directory contains the sources for the user guide that is published at https://ballistacompute.org/docs/.
This directory contains the sources for the DataFusion user guide.

## Generate HTML

To generate the user guide in HTML format, run the following commands:

```bash
cargo install mdbook
mdbook build
```

## Deploy User Guide to Web Site

Requires ssh certificate to be available.

```bash
./deploy.sh
```
@@ -16,8 +16,8 @@
# under the License.

[book]
authors = ["Andy Grove"]
authors = ["Apache Arrow"]
language = "en"
multilingual = false
src = "src"
title = "Ballista User Guide"
title = "DataFusion User Guide"
33 changes: 33 additions & 0 deletions docs/user-guide/src/SUMMARY.md
@@ -0,0 +1,33 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Summary

- [Introduction](introduction.md)
- [Example Usage](example-usage.md)
- [Use as a Library](library.md)
- [Distributed](distributed/introduction.md)
  - [Create a Ballista Cluster](distributed/deployment.md)
    - [Docker](distributed/standalone.md)
    - [Docker Compose](distributed/docker-compose.md)
    - [Kubernetes](distributed/kubernetes.md)
  - [Ballista Configuration](distributed/configuration.md)
  - [Clients](distributed/clients.md)
    - [Rust](distributed/client-rust.md)
    - [Python](distributed/client-python.md)
- [Frequently Asked Questions](faq.md)
@@ -16,15 +16,6 @@
specific language governing permissions and limitations
under the License.
-->
# Summary
# Python

- [Introduction](introduction.md)
- [Create a Ballista Cluster](deployment.md)
- [Docker](standalone.md)
- [Docker Compose](docker-compose.md)
- [Kubernetes](kubernetes.md)
- [Ballista Configuration](configuration.md)
- [Clients](clients.md)
- [Rust](client-rust.md)
- [Python](client-python.md)
- [Frequently Asked Questions](faq.md)
Coming soon.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -33,8 +33,7 @@ The k8s deployment consists of:
Ballista is at an early stage of development and therefore has some significant limitations:

- There is no support for shared object stores such as S3. All data must exist locally on each node in the
cluster, including where any client process runs (until
[#473](https://github.com/ballista-compute/ballista/issues/473) is resolved).
cluster, including where any client process runs.
- Only a single scheduler instance is currently supported unless the scheduler is configured to use `etcd` as a
backing store.

File renamed without changes.
76 changes: 76 additions & 0 deletions docs/user-guide/src/example-usage.md
@@ -0,0 +1,76 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Example Usage

Run a SQL query against data stored in a CSV:

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // register the table
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;

    // create a plan to run a SQL query
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?;
    print_batches(&results)?;
    Ok(())
}
```

Use the DataFrame API to process data stored in a CSV:

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // create the dataframe
    let mut ctx = ExecutionContext::new();
    let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;

    let df = df
        .filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?
        .limit(100)?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?;
    print_batches(&results)?;
    Ok(())
}
```

Both of these examples produce the following output:

```text
+---+--------+
| a | MIN(b) |
+---+--------+
| 1 | 2 |
+---+--------+
```
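
To make the semantics of the two examples concrete, the following is a minimal sketch in plain Rust of what they compute. It deliberately does not use DataFusion: the function names `min_b_by_a` and `filtered_min_b_by_a`, the `(a, b)` tuple representation, and the `i64` column type are all assumptions made for illustration, and the contents of `tests/example.csv` are not reproduced here.

```rust
use std::collections::BTreeMap;

/// Plain-Rust sketch of `SELECT a, MIN(b) FROM example GROUP BY a LIMIT n`.
/// `rows` holds hypothetical (a, b) pairs; this is not DataFusion code.
fn min_b_by_a(rows: &[(i64, i64)], limit: usize) -> Vec<(i64, i64)> {
    let mut groups: BTreeMap<i64, i64> = BTreeMap::new();
    for &(a, b) in rows {
        // Keep the smallest `b` seen so far for each distinct `a`.
        groups
            .entry(a)
            .and_modify(|min_b| *min_b = (*min_b).min(b))
            .or_insert(b);
    }
    // A BTreeMap iterates in key order; SQL without ORDER BY promises no order.
    groups.into_iter().take(limit).collect()
}

/// Mirrors the DataFrame pipeline: keep rows with `a <= b`, then aggregate.
fn filtered_min_b_by_a(rows: &[(i64, i64)], limit: usize) -> Vec<(i64, i64)> {
    let kept: Vec<(i64, i64)> = rows
        .iter()
        .copied()
        .filter(|&(a, b)| a <= b)
        .collect();
    min_b_by_a(&kept, limit)
}
```

Calling either function with hypothetical rows such as `[(1, 2), (1, 7), (1, 4)]` and a limit of `100` returns a single group with `a = 1` and a minimum `b` of `2`. The two functions agree whenever every row satisfies `a <= b`; on data where some row has `a > b`, the filtered variant can return a different result.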
File renamed without changes.
44 changes: 44 additions & 0 deletions docs/user-guide/src/introduction.md
@@ -0,0 +1,44 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# DataFusion

DataFusion is an extensible query execution framework, written in
Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
in-memory format.

DataFusion provides both a SQL and a DataFrame API for building
logical query plans, together with a query optimizer and an execution
engine capable of parallel, multi-threaded execution against
partitioned data sources (CSV and Parquet).
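
The general idea of parallel execution over partitions can be sketched with the Rust standard library alone. This is an illustration of the concept, not DataFusion's actual execution engine: the function name `parallel_min` and the use of `Vec<i64>` as a stand-in for a partition are assumptions, and `std::thread::scope` requires Rust 1.63 or later.

```rust
use std::thread;

/// Computes the minimum of every partition on its own thread, then merges
/// the partial results. `partitions` is a hypothetical stand-in for the
/// files or chunks of a partitioned data source.
fn parallel_min(partitions: &[Vec<i64>]) -> Option<i64> {
    thread::scope(|scope| {
        // Spawn one scoped thread per partition.
        let handles: Vec<_> = partitions
            .iter()
            .map(|partition| scope.spawn(move || partition.iter().copied().min()))
            .collect();

        // Merge step: combine the per-partition minimums into a final answer.
        handles
            .into_iter()
            .filter_map(|handle| handle.join().expect("worker thread panicked"))
            .min()
    })
}
```

Each thread produces a partial aggregate for its partition, and a final merge combines them. Empty partitions contribute nothing to the merge, and an input with no values at all yields `None`.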

## Use Cases

DataFusion is used to create modern, fast and efficient data
pipelines, ETL processes, and database systems, which need the
performance of Rust and Apache Arrow and want to provide their users
the convenience of an SQL interface or a DataFrame API.

## Why DataFusion?

* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance.
* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem.
* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case.
* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

28 changes: 28 additions & 0 deletions docs/user-guide/src/library.md
@@ -0,0 +1,28 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Using DataFusion as a library

DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).

To get started, add the following to your `Cargo.toml` file:

```toml
[dependencies]
datafusion = "4.0.0-SNAPSHOT"
```
