Improve and test dataframe API examples in docs #11290

alamb · 2024-07-05T19:02:15Z

Which issue does this PR close?

Rationale for this change

I am trying to make the documentation better and the examples easier to find.

I would like to consolidate the DataFrame examples from https://github.com/apache/datafusion/tree/main/datafusion-examples/examples (e.g. dataframe.rs, dataframe_in_memory, and dataframe_output).

However, first I wanted to make sure that the existing DataFrame API docs were in good shape -- and it turns out they needed some attention to get the examples compiling.

So to keep the size of this PR small, I started with some improvements to the initial content.

I will actually consolidate some of the dataframe examples in a follow-on PR.

What changes are included in this PR?

Run examples as part of doctests
Make examples run
Improve dataframe documentation while I was in there

Are these changes tested?

Yes, now these examples run as part of the CI

Are there any user-facing changes?

alamb · 2024-07-05T19:02:44Z

docs/source/library-user-guide/using-the-dataframe-api.md


-`DataFrame` in `DataFrame` is modeled after the Pandas DataFrame interface, and is a thin wrapper over LogicalPlan that adds functionality for building and executing those plans.
-
-```rust


I moved this example to the end as I think it makes more sense once you see the DataFrame in action

alamb · 2024-07-05T19:02:57Z

datafusion/core/src/lib.rs

 );
+
+#[cfg(doctest)]
+doc_comment::doctest!(


This runs the examples as part of `cargo test --doc`

alamb · 2024-07-05T19:03:51Z

docs/source/library-user-guide/using-the-sql-api.md

@@ -29,16 +29,15 @@ using the [`SessionContext::sql`] method. For lower level control such as
 preventing DDL, you can use [`SessionContext::sql_with_options`] or the
 [`SessionState`] APIs

-[`sessioncontext`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html


I moved this down to the bottom of the doc so the link for SessionContext::sql wasn't duplicated (there was a doc warning about this)

alamb · 2024-07-06T10:22:00Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let ctx = SessionContext::new();


While these examples are now more verbose, they all stand on their own, which I think @efredine suggested would be an improvement

efredine

I learnt a lot reading and reviewing this ;-). Left some suggestions as seen through the eyes of a new user like me.

efredine · 2024-07-06T15:52:25Z

docs/source/library-user-guide/using-the-dataframe-api.md

-You can use `collect` or `execute_stream` to execute the query.
+DataFusion [`DataFrame`]s are modeled after the [Pandas DataFrame] interface,
+and is implemented as thin wrapper over a [`LogicalPlan`] that adds
+functionality for building and executing those plans.



Ok taking advantage of my relative ignorance to share some things that might be confusing to new users like me.

I think it might be better to start with a short section called Reading a Dataframe? Because that's probably the first thing people will want to do. It should show a simple example of reading a csv and displaying it. We just want to establish that when you read from a file the thing you get back is a dataframe! Then I wonder if the next section might be called Generating a New Dataframe? And I think it might be preferable to have the SQL example second rather than first?

Interestingly, the distinction between a "table" and a "dataframe" is hazy to me. There is also some sort of subtle distinction going on here between a dataframe as a thing that contains some data (which is how I think about it mentally when I read from a csv) and a thing which contains an executable plan that performs a transformation.

This is great feedback -- I looked around and the more basic introduction seems like it is in https://datafusion.apache.org/user-guide/dataframe.html (the "user guide"). I'll add some text that points there and rearrange the content (as well as make a PR to clean up that page)
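The distinction raised above (a DataFrame as "a thing that contains data" versus "a thing that contains an executable plan") can be sketched with a toy model. This is only an illustration using the standard library: `Table`, `Plan`, `ToyDataFrame`, `filter_min_amount`, and `demo` are hypothetical names invented for this sketch and are not DataFusion's types or API. The sample values are borrowed from the in-memory example output quoted later in this PR.

```rust
// Toy model only: every type here is hypothetical and is NOT DataFusion's API.

/// An in-memory "table": rows of (id, amount).
#[derive(Clone, Debug)]
struct Table {
    rows: Vec<(i64, i64)>,
}

/// A plan describes *how* to produce rows; it holds no results itself.
#[derive(Clone, Debug)]
enum Plan {
    /// Read every row of a table (the simplest possible plan).
    Scan(Table),
    /// Keep only rows whose amount is at least `min_amount`.
    Filter { input: Box<Plan>, min_amount: i64 },
}

/// The toy "DataFrame" is just a thin wrapper around a plan.
#[derive(Clone, Debug)]
struct ToyDataFrame {
    plan: Plan,
}

impl ToyDataFrame {
    /// The simplest DataFrame: a scan of a table.
    fn scan(table: Table) -> Self {
        ToyDataFrame { plan: Plan::Scan(table) }
    }

    /// Wraps the current plan in a filter; no rows are read yet.
    fn filter_min_amount(self, min_amount: i64) -> Self {
        ToyDataFrame {
            plan: Plan::Filter { input: Box::new(self.plan), min_amount },
        }
    }

    /// Executes the plan and returns the resulting rows.
    fn collect(&self) -> Vec<(i64, i64)> {
        execute(&self.plan)
    }
}

/// Walks the plan tree and produces rows.
fn execute(plan: &Plan) -> Vec<(i64, i64)> {
    match plan {
        Plan::Scan(table) => table.rows.clone(),
        Plan::Filter { input, min_amount } => execute(input)
            .into_iter()
            .filter(|(_, amount)| amount >= min_amount)
            .collect(),
    }
}

/// Builds a scan, adds a filter, and only then executes.
fn demo() -> Vec<(i64, i64)> {
    let table = Table { rows: vec![(1, 9000), (2, 8000), (3, 7000)] };
    ToyDataFrame::scan(table).filter_min_amount(8000).collect()
}
```

In this sketch the table is the data, and the DataFrame is a recipe over it: nothing is computed until `collect` is called, which is roughly the mental model the comments above are reaching for.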

efredine · 2024-07-06T16:00:17Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+#[tokio::main]
+async fn main() -> Result<()>{
+    let ctx = SessionContext::new();


Ok - this comment really clarified the distinction between a table and a dataframe for me. In essence, the simplest possible dataframe is one that scans a table, and that table can be in a file or in memory. I think this might be worth including in the introduction. It may also be worth consistently using "scan" when reading a file.

I tried to clarify and add some additional comments.

This example really drives it home.

And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).

efredine · 2024-07-06T16:05:51Z

docs/source/library-user-guide/using-the-dataframe-api.md

+    // read the contents of a CSV file into a DataFrame
+    let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+    // execute the query and collect the results as a Vec<RecordBatch>
+    let batches = df.collect().await?;


For consistency with the next example, it might be worth iterating the batch here as well.

alamb

Thank you very much @efredine for the insightful feedback - I have tried to address all comments. I would appreciate knowing what you think of the new version if you get a chance

efredine

I think this looks great!

Feel free to tag me on these example changes. I share your view that reviewing and refining documentation and examples is high impact and it's a great way for me to continue learning more.

efredine · 2024-07-07T21:32:48Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+#[tokio::main]
+async fn main() -> Result<()>{
+    let ctx = SessionContext::new();


This example really drives it home.

And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).

Co-authored-by: Eric Fredine <[email protected]>

alamb · 2024-07-07T21:46:59Z

And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).

I actually think we can do it now:

datafusion/datafusion-examples/examples/plan_to_sql.rs

Lines 118 to 139 in 6f330c9

    
    // create a logical plan from a SQL string and then programmatically add new filters
    let df = ctx
        // Use SQL to read some data from the parquet file
        .sql(
            "SELECT int_col, double_col, CAST(date_string_col as VARCHAR) \
        FROM alltypes_plain",
        )
        .await?
        // Add id > 1 and tinyint_col < double_col filter
        .filter(
            col("id")
                .gt(lit(1))
                .and(col("tinyint_col").lt(col("double_col"))),
        )?;

    let sql = plan_to_sql(df.logical_plan())?.to_string();
    assert_eq!(
        sql,
        r#"SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))"#
    );

    Ok(())

Maybe we should add a function to DataFrame to make it easier to find 🤔

alamb · 2024-07-07T21:47:52Z

Feel free to tag me on these example changes. I share your view that reviewing and refining documentation and examples is high impact and it's a great way for me to continue learning more.

Thank you very much @efredine -- I couldn't agree more.

Now I just need to find more time (it takes me much longer to update docs typically than it does to work on code :) )

efredine · 2024-07-08T15:57:53Z

docs/source/library-user-guide/using-the-dataframe-api.md

+    // ---|-------------
+    // 1  | 9000
+    // 2  | 8000
+    // 3  | 7000


The in-memory examples are concise and it's easy to get the gist of what's going on. But they also throw people into the deep end of the Arrow format, which lacks a gentle introduction IMO. The Arrow-rs documentation gets immediately into the weeds!

It's likely that many users might never even need to know or access the arrow format directly. They will just read and write to csv or parquet.

I don't think this needs to change, but perhaps what's missing is a section on how and when to use the Arrow format? A gentler introduction to Record Batches.

I think a gentle arrow introduction would be awesome -- here is a ticket tracking such a thing upstream: apache/arrow-rs#4071

I actually think the basic content / structure could be copied from https://jorgecarleitao.github.io/arrow2/main/guide/ with the examples being updated to reflect arrow-rs

We could also add a small section in the DataFusion docs about record batches as well - filed #11336 to track that idea
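As a rough picture of what such a gentle introduction might start from, a record batch can be thought of as a set of named columns that all have the same number of rows. The sketch below uses only the standard library: `Column`, `ToyRecordBatch`, `try_new`, `num_rows`, and `demo_batch` are hypothetical names for illustration and are not the arrow-rs API, and the column names `id` and `amount` are assumptions (the values come from the example output quoted above).

```rust
// Toy model only: hypothetical types, NOT the arrow-rs API.

/// A column is one contiguous vector of values of a single type.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

impl Column {
    /// Number of values (rows) stored in this column.
    fn len(&self) -> usize {
        match self {
            Column::Int64(values) => values.len(),
            Column::Utf8(values) => values.len(),
        }
    }
}

/// A toy "record batch": named columns of equal length.
#[derive(Debug, Clone, PartialEq)]
struct ToyRecordBatch {
    columns: Vec<(String, Column)>,
}

impl ToyRecordBatch {
    /// Builds a batch, returning an error if the columns differ in length.
    fn try_new(columns: Vec<(String, Column)>) -> Result<Self, String> {
        if let Some((_, first)) = columns.first() {
            let expected = first.len();
            for (name, column) in &columns {
                if column.len() != expected {
                    return Err(format!(
                        "column {} has {} rows, expected {}",
                        name,
                        column.len(),
                        expected
                    ));
                }
            }
        }
        Ok(ToyRecordBatch { columns })
    }

    /// Number of rows, taken from the first column (0 if there are none).
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, column)| column.len())
    }
}

/// Builds the three-row table from the example output above, stored by column.
fn demo_batch() -> ToyRecordBatch {
    ToyRecordBatch::try_new(vec![
        ("id".to_string(), Column::Int64(vec![1, 2, 3])),
        ("amount".to_string(), Column::Int64(vec![9000, 8000, 7000])),
    ])
    .expect("both columns have three rows")
}
```

The point of the sketch is the layout: values are grouped by column rather than by row, which is the main idea a new user needs before reading the real Arrow documentation.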

comphead

lgtm thanks @alamb

alamb · 2024-07-09T21:45:29Z

Thank you @efredine and @comphead -- I think we'll have more to do to improve the docs, but this is a step forward I think so merging it in. I'll keep working to improve other sections over time

* Improve and test dataframe API examples in docs
* Update introduction with pointer to user guide
* Make example consistent
* Make read_csv comment consistent
* clarifications
* prettier + tweaks
* Update docs/source/library-user-guide/using-the-dataframe-api.md

Co-authored-by: Eric Fredine <[email protected]>

* Update docs/source/library-user-guide/using-the-dataframe-api.md

Co-authored-by: Eric Fredine <[email protected]>

---------

Co-authored-by: Eric Fredine <[email protected]>

github-actions bot added the core Core DataFusion crate label Jul 5, 2024

alamb commented Jul 5, 2024

Improve and test dataframe API examples in docs

d6540db

alamb force-pushed the alamb/more_exmaples branch from 7d3c37c to d6540db Compare July 6, 2024 10:21

alamb added the documentation Improvements or additions to documentation label Jul 6, 2024

alamb commented Jul 6, 2024

efredine reviewed Jul 6, 2024

alamb added 6 commits July 7, 2024 16:15

Merge remote-tracking branch 'apache/main' into alamb/more_exmaples

71315bd

Update introduction with pointer to user guide

1d868e1

Make example consistent

2ff5baf

Make read_csv comment consistent

567bb89

clarifications

2ba4fbd

prettier + tweaks

bb02f51

github-actions bot removed the documentation Improvements or additions to documentation label Jul 7, 2024

alamb commented Jul 7, 2024

alamb mentioned this pull request Jul 7, 2024

Improve DataFrame Users Guide #11324

Merged

efredine approved these changes Jul 7, 2024

alamb added the documentation Improvements or additions to documentation label Jul 7, 2024

alamb and others added 2 commits July 7, 2024 17:45

Update docs/source/library-user-guide/using-the-dataframe-api.md

bf66b0e

Co-authored-by: Eric Fredine <[email protected]>

Update docs/source/library-user-guide/using-the-dataframe-api.md

3bb54b4

Co-authored-by: Eric Fredine <[email protected]>

github-actions bot removed the documentation Improvements or additions to documentation label Jul 7, 2024

efredine reviewed Jul 8, 2024

alamb marked this pull request as ready for review July 8, 2024 16:19

alamb mentioned this pull request Jul 8, 2024

Add a "Gentle Introduction to Arrow / Record Batches" #11336

Open

comphead approved these changes Jul 8, 2024

alamb added 2 commits July 9, 2024 17:09

Merge remote-tracking branch 'apache/main' into alamb/more_exmaples

ac591ed

Merge remote-tracking branch 'apache/main' into alamb/more_exmaples

b1e75a1

alamb merged commit 1e0c06e into apache:main Jul 9, 2024
24 checks passed

alamb deleted the alamb/more_exmaples branch July 9, 2024 21:45
