[Epic]: Improve Documentation, Tutorials, and Examples #3058

Closed
20 of 22 tasks
andygrove opened this issue Aug 6, 2022 · 3 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@andygrove
Member

andygrove commented Aug 6, 2022

“Write cool software and tell people about it” – Paul Dix @pauldix (Founder and CTO of InfluxData)

Call to action:

The DataFusion community has invested a lot in the cool software; now is the time to do better on the “tell people about it” part.

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion is too difficult to learn for new users. See https://towardsdev.com/writing-a-data-pipeline-in-rust-with-datafusion-25b5e45410ca for one user's experience, which is summarized here:

  • The API documentation is pretty bad: the less frequently used functions have no documentation or incomplete documentation, and lack examples of how to use them. So I had to guess and try out many things to use some of the functions, such as to_timestamp, date_part, and when(...).otherwise()
  • The data-reading documentation also lacks examples; there are only 2–3 examples in the API doc. But you might need another way of reading the data, for example defining the schema first and then using it to read the data, rather than having the framework infer the schema. But this is not in the doc
  • There is no tutorial-like example in the doc; this is also true of the Rust API for the Polars DataFrame
  • From a user's point of view, I think documentation is the weakest part of the framework; on top of that, Rust itself is not that easy.

Describe the solution you'd like
TBD. This issue is an EPIC to track tasks to improve the situation.

User Guide

Rust Docs (docs.rs)

Developer/Contributor Guide

Python Docs

Blog posts

Older Issues To Be Reviewed

@andygrove andygrove added documentation Improvements or additions to documentation enhancement New feature or request labels Aug 6, 2022
@kmitchener
Contributor

As a relatively new user of DataFusion, I agree the docs on the site are pretty bad, but not so bad that I couldn't understand them without being overwhelmed :) However, there's plenty of room for improvement, and I've been doing a few things related to this:

  • reviewing all the documentation-tagged issues in this project
  • reviewing other similar projects' documentation to see how it's structured, what's helpful to me as a newbie to their project, what's confusing about their documentation
  • trying (locally) different structure for documentation based on the above

I've come up with a few changes that I think will help:

  • realize that the DataFusion library itself is the product here, not DataFusion-CLI. DataFusion-CLI is a tool for demonstrating the power of the library quickly and easily for potential users (among other uses, but to me this seems to be the primary use case -- if it's not, it's worth clarifying what the intended use of DataFusion-CLI is).
    • as such, DataFusion-CLI should have a supporting role in the documentation
    • as such, there should be more examples of how potential new users can use DataFusion-CLI to quickly demonstrate for themselves how the DataFusion library can help them
    • therefore there should be multiple examples of how to run DataFusion-CLI, how to register a variety of data into a DataFusion context, and the power of the SQL -- examples should cover loading data from local storage, from an object store like S3, partitioned and not, different formats, etc. Queries should be run. Explain plans and explain analyze should be shown.
    • the CLI itself will probably need some changes to make it easier to use against S3 or Azure Blob Storage. It should be able to parse the location given in a create external table command to automatically register an object store using sensible default authentication methods. For example, if I already have my environment configured to use the AWS CLI, then I should be able to start up DataFusion-CLI and run create external table test stored as parquet location 's3://my-bucket/content' and it should "just work".
  • there should be a separate User Guide and a Developer Guide (or maybe call it a "Contributor Guide"?)
  • yes, in the User Guide, I think the functions we have in DataFusion should be documented. for new prospective users browsing docs, it's important to see the wide variety of useful functions that exist.
    • related to this, some of the documentation makes it appear that DataFusion is less capable than it actually is -- in particular, the bit that describes the very basic SQL syntax that works (implying that more complex SQL won't work)
  • ETL is called out as a particular use case for DataFusion, but none of the examples demonstrate ETL pipelines. In my opinion, we definitely need some examples or even just write-ups about how DataFusion can work as an ETL tool.
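
To make the CLI suggestion in the list a bit more concrete, here is a minimal sketch of what "parse the location to pick an object store" could look like. It is not DataFusion-CLI code: `StoreKind` and `resolve_location` are hypothetical names, the `az` scheme for Azure Blob Storage is an assumption, and authentication and actual object store registration are left out entirely. It only shows splitting a location such as `s3://my-bucket/content` into a store and a path using the standard library.

```rust
/// Hypothetical classification of where a table's data lives.
/// None of these names come from DataFusion; they are for illustration only.
#[derive(Debug, PartialEq)]
enum StoreKind {
    LocalFile,
    S3 { bucket: String },
    AzureBlob { container: String },
}

/// Split a LOCATION string into a store kind and a path within that store.
/// A string without a `scheme://` prefix is treated as a local path.
fn resolve_location(location: &str) -> Result<(StoreKind, String), String> {
    let (scheme, rest) = match location.split_once("://") {
        Some(parts) => parts,
        None => return Ok((StoreKind::LocalFile, location.to_string())),
    };

    // The text before the first '/' is the bucket or container name.
    let (authority, path) = rest.split_once('/').unwrap_or((rest, ""));

    match scheme {
        "file" => Ok((StoreKind::LocalFile, format!("/{}", path))),
        "s3" | "az" if authority.is_empty() => {
            Err(format!("missing bucket or container in '{}'", location))
        }
        "s3" => Ok((
            StoreKind::S3 { bucket: authority.to_string() },
            path.to_string(),
        )),
        // Assumption: `az` is used here as the scheme for Azure Blob Storage.
        "az" => Ok((
            StoreKind::AzureBlob { container: authority.to_string() },
            path.to_string(),
        )),
        other => Err(format!("unsupported scheme '{}'", other)),
    }
}

fn main() {
    for loc in ["s3://my-bucket/content", "/tmp/data.parquet", "ftp://host/x"] {
        println!("{} -> {:?}", loc, resolve_location(loc));
    }
}
```

In a real implementation the resolved store would presumably be registered with the session before the table is created, with credentials picked up from the environment as described in the list; how that should work is exactly the open design question.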

It's still very much a WIP, but I have a structure that I think mostly makes sense in this branch: https://github.com/kmitchener/arrow-datafusion/tree/doc-improvements (related to PR #3005 )

The above incorporates some of the thoughts from #1821 and #1814 as well.

@andygrove
Member Author

Thanks @kmitchener that is great feedback

@alamb
Contributor

alamb commented Jul 18, 2023

I had a conversation with @MrPowers today which inspired me to try and organize ideas to improve the DataFusion documentation.

I moved all the unfinished tickets into a new epic #7013 and am going to close this one so the current state of things is clearer. Let's continue the conversation there


3 participants