This repository is not being maintained anymore.
PySpark Expectations helps check for the most common data quality failures using PySpark modules. Building on PySpark makes fast quality testing of big data possible.
PySpark Expectations is inspired by Great Expectations.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
- First of all, install Python and pip on your computer.
- It is best practice to run the project in a virtual environment, created with pipenv or virtualenv (an example follows this list).
- Finally, install all packages from "requirements.txt" in the virtual environment.
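For example, a virtual environment can be created and activated with virtualenv as follows (a sketch assuming a Unix shell; the directory name venv is only an example, and pipenv creates its environment automatically on install):
$ pip install virtualenv
$ virtualenv venv
$ source venv/bin/activate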
With pipenv:
$ pipenv install -r requirements.txt
Or in the environment created with virtualenv:
$ pip install -r requirements.txt
A step-by-step series of examples that tells you how to get a development environment running:
- Download pyspark-expectations to your computer from GitHub.
- Open a terminal in the folder where you downloaded pyspark-expectations and create a source distribution:
$ python setup.py sdist
- Install the created distribution with pip in the Python environment where you are going to use the pyspark-expectations functions. After this step you can import the pyspark_expectations package and use its functions (a quick way to verify this is shown after these steps).
$ pip install dist/*.tar.gz
- You can also uninstall pyspark-expectations with pip:
$ pip uninstall pyspark_expectations
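To verify the installation, you can try importing the package directly (this only confirms that the package is importable in the current environment):
$ python -c "import pyspark_expectations"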
pyspark-expectations contains functions that check for the most common failures in big data. The functions are applied to a PySpark DataFrame. The following checks are implemented:
- expect_table_row_count_to_be_between
- expect_column_values_to_be_unique
- expect_column_values_to_be_null
- expect_column_values_to_not_be_null
- expect_column_values_to_match_regex
- expect_column_values_not_to_match_regex
- expect_column_values_to_be_in_set
- expect_column_values_to_be_between
The function names describe the expected behavior.
Each function returns a dictionary with two values:
- "success" is true if the condition is fulfilled, otherwise false.
- "summary" contains additional information about the check result.
The following example checks how many unique values the "height" column of a PySpark DataFrame contains:
import pyspark_expectations
from pyspark.sql import SparkSession
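# Create a local Spark session for the example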
spark = (
SparkSession.builder.master("local")
.config("spark.sql.shuffle.partitions", "1")
.config("spark.driver.host", "localhost")
.getOrCreate()
)
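# Sample data: 30 rows, but only three distinct values (including None) in the "height" column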
data = [["Patrik", 1.92], ["Tanja", 1.62], ["Tobi", None]] * 10
df = spark.createDataFrame(data, ["name", "height"])
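# Run the uniqueness check on the "height" column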
result_from_unique_values_analyse = df.expect_column_values_to_be_unique("height")
print("""
Percent of duplicate values : {duplicates_count_percent},
Percent of distinct values : {distinct_count_percent},
Total count : {total_count},
Number of duplicate values : {duplicates_count},
Number of distinct values : {distinct_count}
""".format(**result_from_unique_values_analyse["summary"]))
spark.stop()
Output after running:
Percent of duplicate values : 0.9,
Percent of distinct values : 0.1,
Total count : 30,
Number of duplicate values : 27,
Number of distinct values : 3
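Because each function also returns the "success" flag, checks can be used to gate further processing. Below is a minimal sketch that reuses df from the example above (it would have to run before spark.stop()); the single column-name argument assumed for expect_column_values_to_not_be_null mirrors the uniqueness check and is an assumption, not taken from the project's documentation:
result = df.expect_column_values_to_be_unique("height")
if not result["success"]:
    # The expectation failed: report the details from "summary" before deciding how to proceed.
    print("Duplicate values in 'height':", result["summary"])

# Assumed to take a single column-name argument, mirroring the check above.
null_check = df.expect_column_values_to_not_be_null("height")
if not null_check["success"]:
    print("Null values in 'height':", null_check["summary"])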
The tests are implemented with the pytest framework. To run the tests, execute the following command in the terminal:
$ pytest
If you run into errors, check the $PYTHONPATH variable of the environment.
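One common fix, assuming the tests are run from the repository root on a Unix shell, is to add the project directory to $PYTHONPATH before invoking pytest:
$ export PYTHONPATH=$(pwd):$PYTHONPATH
$ pytest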
This project is licensed under the MIT License - see the LICENSE file for details