Skip to content

Commit

Permalink
Move configuration information out of example usage page (#11300)
Browse files Browse the repository at this point in the history
  • Loading branch information
alamb authored Jul 11, 2024
1 parent ed65c11 commit 0b2eb50
Show file tree
Hide file tree
Showing 5 changed files with 177 additions and 133 deletions.
6 changes: 6 additions & 0 deletions datafusion/core/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -620,6 +620,12 @@ doc_comment::doctest!(
user_guide_example_usage
);

#[cfg(doctest)]
doc_comment::doctest!(
"../../../docs/source/user-guide/crate-configuration.md",
user_guide_crate_configuration
);

#[cfg(doctest)]
doc_comment::doctest!(
"../../../docs/source/user-guide/configs.md",
Expand Down
8 changes: 6 additions & 2 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -41,13 +41,16 @@ DataFusion offers SQL and Dataframe APIs, excellent
CSV, Parquet, JSON, and Avro, extensive customization, and a great
community.

To get started with examples, see the `example usage`_ section of the user guide and the `datafusion-examples`_ directory.
To get started, see

See the `developer’s guide`_ for contributing and `communication`_ for getting in touch with us.
* The `example usage`_ section of the user guide and the `datafusion-examples`_ directory.
* The `library user guide`_ for examples of using DataFusion's extension APIs
* The `developer’s guide`_ for contributing and `communication`_ for getting in touch with us.

.. _example usage: user-guide/example-usage.html
.. _datafusion-examples: https://github.com/apache/datafusion/tree/main/datafusion-examples
.. _developer’s guide: contributor-guide/index.html#developer-s-guide
.. _library user guide: library-user-guide/index.html
.. _communication: contributor-guide/communication.html

.. _toc.asf-links:
Expand Down Expand Up @@ -80,6 +83,7 @@ See the `developer’s guide`_ for contributing and `communication`_ for getting

user-guide/introduction
user-guide/example-usage
user-guide/crate-configuration
user-guide/cli/index
user-guide/dataframe
user-guide/expressions
Expand Down
21 changes: 19 additions & 2 deletions docs/source/library-user-guide/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,25 @@

# Introduction

The library user guide explains how to use the DataFusion library as a dependency in your Rust project. Please check out the user-guide for more details on how to use DataFusion's SQL and DataFrame APIs, or the contributor guide for details on how to contribute to DataFusion.
The library user guide explains how to use the DataFusion library as a
dependency in your Rust project and customize its behavior using its extension APIs.

If you haven't reviewed the [architecture section in the docs][docs], it's a useful place to get the lay of the land before starting down a specific path.
Please check out the [user guide] for getting started using
DataFusion's SQL and DataFrame APIs, or the [contributor guide]
for details on how to contribute to DataFusion.

If you haven't reviewed the [architecture section in the docs][docs], it's a
useful place to get the lay of the land before starting down a specific path.

DataFusion is designed to be extensible at all points, including

- [x] User Defined Functions (UDFs)
- [x] User Defined Aggregate Functions (UDAFs)
- [x] User Defined Table Source (`TableProvider`) for tables
- [x] User Defined `Optimizer` passes (plan rewrites)
- [x] User Defined `LogicalPlan` nodes
- [x] User Defined `ExecutionPlan` nodes

[user guide]: ../user-guide/example-usage.md
[contributor guide]: ../contributor-guide/index.md
[docs]: https://docs.rs/datafusion/latest/datafusion/#architecture
146 changes: 146 additions & 0 deletions docs/source/user-guide/crate-configuration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,146 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Crate Configuration

This section contains information on how to configure DataFusion in your Rust
project. See the [Configuration Settings] section for a list of options that
control DataFusion's behavior.

[configuration settings]: configs.md

## Add latest non published DataFusion dependency

DataFusion changes are published to `crates.io` according to the [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process)

If you would like to test out DataFusion changes which are merged but not yet
published, Cargo supports adding dependency directly to GitHub branch:

```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main"}
```

Also it works on the package level

```toml
datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"}
```

And with features

```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] }
```

More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies)

## Optimized Configuration

For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.

```toml
[dependencies]
datafusion = { version = "22.0" }
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
snmalloc-rs = "0.3"

[profile.release]
lto = true
codegen-units = 1
```

Then, in `main.rs.` update the memory allocator with the below after your imports:

```rust ,ignore
use datafusion::prelude::*;

#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
Ok(())
}
```

Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally
with `native` or at least `avx2`.

```shell
RUSTFLAGS='-C target-cpu=native' cargo run --release
```

## Enable backtraces

By default Datafusion returns errors as a plain message. There is option to enable more verbose details about the error,
like error backtrace. To enable a backtrace you need to add Datafusion `backtrace` feature to your `Cargo.toml` file:

```toml
datafusion = { version = "31.0.0", features = ["backtrace"]}
```

Set environment [variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables)

```bash
RUST_BACKTRACE=1 ./target/debug/datafusion-cli
DataFusion CLI v31.0.0
> select row_numer() over (partition by a order by a) from (select 1 a);
Error during planning: Invalid function 'row_numer'.
Did you mean 'ROW_NUMBER'?

backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13
3: std::backtrace::Backtrace::capture
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9
4: datafusion_common::error::DataFusionError::get_back_trace
at /datafusion/datafusion/common/src/error.rs:436:30
5: datafusion_sql::expr::function::<impl datafusion_sql::planner::SqlToRel<S>>::sql_function_to_expr
............
```
The backtraces are useful when debugging code. If there is a test in `datafusion/core/src/physical_planner.rs`
```
#[tokio::test]
async fn test_get_backtrace_for_failed_code() -> Result<()> {
let ctx = SessionContext::new();

let sql = "
select row_numer() over (partition by a order by a) from (select 1 a);
";

let _ = ctx.sql(sql).await?.collect().await?;

Ok(())
}
```
To obtain a backtrace:
```bash
cargo build --features=backtrace
RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture
```
Note: The backtrace wrapped into systems calls, so some steps on top of the backtrace can be ignored
129 changes: 0 additions & 129 deletions docs/source/user-guide/example-usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,29 +33,6 @@ datafusion = "latest_version"
tokio = { version = "1.0", features = ["rt-multi-thread"] }
```

## Add latest non published DataFusion dependency

DataFusion changes are published to `crates.io` according to [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process)
In case if it is required to test out DataFusion changes which are merged but yet to be published, Cargo supports adding dependency directly to GitHub branch

```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main"}
```

Also it works on the package level

```toml
datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"}
```

And with features

```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] }
```

More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies)

## Run a SQL query against data stored in a CSV

```rust
Expand Down Expand Up @@ -201,109 +178,3 @@ async fn main() -> datafusion::error::Result<()> {
| 1 | 2 |
+---+--------+
```

## Extensibility

DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:

- [x] User Defined Functions (UDFs)
- [x] User Defined Aggregate Functions (UDAFs)
- [x] User Defined Table Source (`TableProvider`) for tables
- [x] User Defined `Optimizer` passes (plan rewrites)
- [x] User Defined `LogicalPlan` nodes
- [x] User Defined `ExecutionPlan` nodes

## Optimized Configuration

For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.

```toml
[dependencies]
datafusion = { version = "22.0" }
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
snmalloc-rs = "0.3"

[profile.release]
lto = true
codegen-units = 1
```

Then, in `main.rs.` update the memory allocator with the below after your imports:

```rust ,ignore
use datafusion::prelude::*;

#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
Ok(())
}
```

Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally
with `native` or at least `avx2`.

```shell
RUSTFLAGS='-C target-cpu=native' cargo run --release
```

## Enable backtraces

By default Datafusion returns errors as a plain message. There is option to enable more verbose details about the error,
like error backtrace. To enable a backtrace you need to add Datafusion `backtrace` feature to your `Cargo.toml` file:

```toml
datafusion = { version = "31.0.0", features = ["backtrace"]}
```

Set environment [variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables)

```bash
RUST_BACKTRACE=1 ./target/debug/datafusion-cli
DataFusion CLI v31.0.0
> select row_number() over (partition by a order by a) from (select 1 a);
Error during planning: Invalid function 'row_number'.
Did you mean 'ROW_NUMBER'?

backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13
3: std::backtrace::Backtrace::capture
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9
4: datafusion_common::error::DataFusionError::get_back_trace
at /datafusion/datafusion/common/src/error.rs:436:30
5: datafusion_sql::expr::function::<impl datafusion_sql::planner::SqlToRel<S>>::sql_function_to_expr
............
```
The backtraces are useful when debugging code. If there is a test in `datafusion/core/src/physical_planner.rs`
```
#[tokio::test]
async fn test_get_backtrace_for_failed_code() -> Result<()> {
let ctx = SessionContext::new();

let sql = "
select row_number() over (partition by a order by a) from (select 1 a);
";

let _ = ctx.sql(sql).await?.collect().await?;

Ok(())
}
```
To obtain a backtrace:
```bash
cargo build --features=backtrace
RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture
```
Note: The backtrace wrapped into systems calls, so some steps on top of the backtrace can be ignored

0 comments on commit 0b2eb50

Please sign in to comment.