Great Expectations vs Pandera #590

Veganveins · 2021-08-20T15:45:25Z

Question about pandera

Hi there, I've used pandera in the past to validate data processing pipelines for ML workflows. My current org is doing a spike on Great Expectations to try to improve the quality of our data ingestion process.

Could anyone here provide insight as to the differences between Great Expectations and Pandera, whether or not they overlap or do similar things? It seems like there is some overlap but I'm sure the group here could tell me more about the nuances and differences between the two resources.

Thanks in advance for your help!

cosmicBboy · 2021-08-23T14:07:28Z

Hi @Veganveins thanks for your question!

So one big caveat here is that I haven't used GE very extensively but I'll do my best to summarize the similarities and differences.

Overlap

The main overlap is that both libraries aim to solve the same problem of ensuring data quality. That said, I think the approach pandera takes is closer in spirit to pydantic or dataclasses: it's a lightweight package that focuses on one thing, which is parsing and validating in-memory dataframes. Think of it as runtime-enforced type annotations for your dataframes.

Differences

GE provides data validation, profiling, and documentation, and is closer to a declarative tool that you'd integrate with your various data stores (SQL, etc.) or cluster computing environments like Spark. Their docs go into more detail on the package's functionality.

Pandera is designed to be useful with zero configuration, and its syntax is optimized for intuitiveness and ease of use for folks already familiar with pandas/pandas-like libraries. Currently only pandas is supported, but we're working on support for Koalas, Modin, and eventually Dask and other dataframe frameworks (SQL parsing/validation would be a heavy lift, but might be added to the roadmap if there's enough demand).

On the other hand, GE looks like it requires some upfront investment on configuration and setup, but once that's done it provides a whole suite of useful features (data profiling and docs look super useful), as well as a GUI for updating validation rules.

One thing that Pandera offers that GE doesn't is data synthesis strategies, which integrate with hypothesis to automatically generate mock data for use in a test suite (e.g. pytest).

Syntax

Syntactically, Pandera schemas are primarily written in Python, using either the object-based API or the class-based API, though it also supports a YAML format and reading from frictionless schemas. It separates the schema specification from the object to be validated.

With GE, it looks like the primary UX is to define validation rules declaratively in JSON files, which can then be loaded into a Python runtime to validate your tables of interest. It also exposes a Python API that (I think?) inherits from pandas dataframes and extends the pd.DataFrame object with additional methods like expect_column_to_exist.

Conclusion

Note that these two libraries are not mutually exclusive: e.g. you could use Pandera for in-memory parsing/validation and GE for validating data on disk, or use Pandera when prototyping and doing research and then port the Pandera schemas to a GE suite (a Pandera schema -> GE expectation suite converter seems like a good idea to facilitate this 🤔).

Let me know if you have other questions!

rdmolony · 2021-08-31T16:20:08Z

You might find fugue interesting. They are running pandera on Spark and Dask through fugue, and Kevin Kho (@kvnkho) wrote up a Medium post on it here!

Veganveins · 2021-08-31T16:33:38Z

Thank you @cosmicBboy and @rdmolony ! Great content and very useful context. I don't have any other questions right now but I will follow up if I think of anything else :)

kvnkho · 2021-08-31T20:45:20Z

Thanks for tagging @rdmolony . Coincidentally, there is this pull request into the pandera docs on how to use pandera on top of the Spark execution engine through Fugue. We connected with @cosmicBboy after PyCon.

I talked about Great Expectations versus pandera in my PyCon presentation, though not in much detail since it was only 30 minutes. @goodwanghan and I will also be using both in our upcoming O'Reilly course, which came about as a result of the PyCon presentation.

I don't have much more to add to what @cosmicBboy said. Let's just say that Great Expectations has a larger surface area when it comes to your project, but you have to opt in to get those benefits (like data documentation). pandera is lightweight and non-invasive to your code. I'd be happy to chat more @Veganveins through Zoom or wherever if you're interested. My contact info is in my GitHub bio. 😄

Veganveins · 2021-08-31T23:04:09Z

Wow thanks @kvnkho !! This presentation looks excellent and the O'Reilly course looks great!

cosmicBboy · 2021-09-06T15:58:50Z

@Veganveins thanks for the question, the discussion in here is great! I'm going to convert this to a GitHub discussion. Would you mind selecting my response as the answer?

Veganveins added the question label (further information is requested) Aug 20, 2021

Veganveins changed the title ~~Great Expectations vs Padera~~ Great Expectations vs Pandera Aug 20, 2021

unionai-oss locked and limited conversation to collaborators Sep 6, 2021

cosmicBboy closed this as completed Sep 6, 2021

This issue was moved to a discussion.
