
This issue was moved to a discussion.


Great Expectations vs Pandera #590

Closed
Veganveins opened this issue Aug 20, 2021 · 6 comments
Labels
question Further information is requested

Comments

@Veganveins

Question about pandera

Hi there, I've used pandera in the past to validate data processing pipelines for ML workflows. My current org is doing a spike on Great Expectations to try to improve the quality of our data ingestion process.

Could anyone here provide insight as to the differences between Great Expectations and Pandera, whether or not they overlap or do similar things? It seems like there is some overlap but I'm sure the group here could tell me more about the nuances and differences between the two resources.

Thanks in advance for your help!

@Veganveins added the "question" label Aug 20, 2021
@Veganveins changed the title from "Great Expectations vs Padera" to "Great Expectations vs Pandera" Aug 20, 2021
@cosmicBboy
Collaborator

cosmicBboy commented Aug 23, 2021

Hi @Veganveins thanks for your question!

So one big caveat here is that I haven't used GE very extensively but I'll do my best to summarize the similarities and differences.

Overlap

The main overlap is that both libraries aim to solve the same problem of ensuring data quality, but I think the approach pandera takes is closer in spirit to pydantic or dataclasses, in that it's a lightweight package that focuses on one thing: parsing and validating in-memory dataframes. Think of it as runtime-enforced type annotations for your dataframes.

Differences

GE provides data validation, profiling, and documentation, and is closer to a declarative tool that you'd integrate with your various data stores (SQL, etc.) or cluster computing environments like Spark. Their docs go into more detail on the package's functionality.

Pandera is designed to be useful with zero configuration, and its syntax is optimized for intuitiveness and ease of use for folks already familiar with pandas/pandas-like libraries. Currently only pandas is supported, but we're working on getting support for Koalas, Modin, and eventually Dask and other dataframe frameworks (SQL parsing/validation would be a heavy lift, but might be added to the roadmap if there's enough demand).

On the other hand, GE looks like it requires some upfront investment in configuration and setup, but once that's done it provides a whole suite of useful features (data profiling and docs look super useful), as well as a GUI for updating validation rules.

One thing that Pandera offers that GE doesn't is data synthesis strategies, which integrate with hypothesis to automatically generate mock data for use in a test suite (e.g. pytest).

Syntax

Syntactically, Pandera schemas are primarily written in Python, either with the object-based API or the class-based API, though it also supports a YAML format and reading from frictionless schemas. It keeps the schema specification separate from the object being validated.

With GE, it looks like the primary UX is to define validation rules declaratively in json files, which can then be loaded into a python runtime to validate your tables of interest. It also exposes a python API that (I think?) inherits from pandas dataframes and extends the pd.DataFrame object with additional methods like expect_column_to_exist.

Conclusion

Note that these two libraries are not mutually exclusive: e.g. you could use Pandera for in-memory parsing/validation and GE for validating data on disk, or use Pandera while prototyping and doing research and then port the Pandera schemas to a GE suite (a Pandera schema -> GE expectation suite converter seems like a good idea to facilitate this 🤔).
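To make the converter idea a bit more concrete, here is a library-free sketch. Everything in it is an assumption for illustration: the `schema_spec` dict is a hypothetical stand-in for a real Pandera schema, `to_expectation_suite` is not an existing function in either library, and the output layout (`expectation_suite_name`, `expectations`, `expectation_type`, `kwargs`) only approximates GE's JSON suite files as I understand them, so field names may differ between GE versions.

```python
import json

# Hypothetical, library-free schema description: column -> constraints.
schema_spec = {
    "user_id": {"nullable": False, "min": 0},
    "score": {"nullable": True, "min": 0.0, "max": 1.0},
}


def to_expectation_suite(spec, suite_name):
    """Translate the hypothetical spec into a GE-style suite dict."""
    expectations = []
    for column, rules in spec.items():
        # Every declared column must exist.
        expectations.append(
            {"expectation_type": "expect_column_to_exist",
             "kwargs": {"column": column}}
        )
        # Non-nullable columns get a not-null expectation.
        if not rules.get("nullable", True):
            expectations.append(
                {"expectation_type": "expect_column_values_to_not_be_null",
                 "kwargs": {"column": column}}
            )
        # Range constraints map onto a "between" expectation.
        if "min" in rules or "max" in rules:
            expectations.append(
                {"expectation_type": "expect_column_values_to_be_between",
                 "kwargs": {"column": column,
                            "min_value": rules.get("min"),
                            "max_value": rules.get("max")}}
            )
    return {"expectation_suite_name": suite_name, "expectations": expectations}


suite = to_expectation_suite(schema_spec, "users_suite")
suite_json = json.dumps(suite, indent=2)
```

A real converter would walk an actual Pandera schema object rather than a plain dict, and would need a mapping for each supported check.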

Let me know if you have other questions!

@rdmolony

You might find Fugue interesting. They are running pandera on Spark and Dask through Fugue; Kevin Kho (@kvnkho) wrote up a Medium post on it here!

@Veganveins
Author

Thank you @cosmicBboy and @rdmolony ! Great content and very useful context. I don't have any other questions right now but I will follow up if I think of anything else :)

@kvnkho
Contributor

kvnkho commented Aug 31, 2021

Thanks for tagging @rdmolony . Coincidentally, there is this pull request into the pandera docs on how to use pandera on top of the Spark execution engine through Fugue. We connected with @cosmicBboy after PyCon.

I talked about Great Expectations versus pandera in my PyCon presentation, but not in much detail since it was only 30 minutes. @goodwanghan and I will also be using both in our upcoming O'Reilly course, which came out of the PyCon presentation.

I don't have much more to add to what @cosmicBboy said. Let's just say that Great Expectations has a larger surface area when it comes to your project, but you have to opt in to get those benefits (like data documentation). pandera is lightweight and non-invasive in your code. I'd be happy to chat more @Veganveins through Zoom or wherever if you're interested. My contact info is in my GitHub bio. 😄

@Veganveins
Author

Wow thanks @kvnkho !! This presentation looks excellent and the O'Reilly course looks great!


@cosmicBboy
Collaborator

@Veganveins thanks for the question, the discussion in here is great! I'm going to convert this to a GitHub discussion. Would you mind selecting my response as the answer?

@unionai-oss unionai-oss locked and limited conversation to collaborators Sep 6, 2021


4 participants