This repository is not being maintained anymore.
PySpark Expectations helps check for the most common data quality failures using PySpark modules. Building on PySpark makes fast quality testing of big data possible.
PySpark Expectations is inspired by Great Expectations.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
- First of all, install Python and pip on your computer.
- It is best practice to run the project in a virtual environment, created with pipenv or virtualenv (an example follows this list).
- Finally, install all packages from "requirements.txt" in the virtual environment.
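For example, a virtual environment can be created and activated with virtualenv as follows (a sketch assuming a Unix shell; the directory name venv is only an example, and pipenv creates its environment automatically on install):
$ pip install virtualenv
$ virtualenv venv
$ source venv/bin/activate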
With pipenv:
$ pipenv install -r requirements.txt
Or in the environment created with virtualenv:
$ pip install -r requirements.txt
A step-by-step series of examples that tells you how to get a development environment running:
- Download pyspark-expectations to your computer from GitHub.
- Open a terminal in the folder where you downloaded pyspark-expectations and create a source distribution:
$ python setup.py sdist
- Install the created distribution with pip in the Python environment where you are going to use the pyspark-expectations functions. After this step you can import the pyspark_expectations package and use its functions (a quick way to verify this is shown after these steps).
$ pip install dist/*.tar.gz
- You can also uninstall pyspark-expectations with pip:
$ pip uninstall pyspark_expectations
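To verify the installation, you can try importing the package directly (this only confirms that the package is importable in the current environment):
$ python -c "import pyspark_expectations"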
pyspark-expectations contains functions that check for the most common failures in big data. The functions are applied to a PySpark DataFrame. The following checks are implemented:
- expect_table_row_count_to_be_between
- expect_column_values_to_be_unique
- expect_column_values_to_be_null
- expect_column_values_to_not_be_null
- expect_column_values_to_match_regex
- expect_column_values_not_to_match_regex
- expect_column_values_to_be_in_set
- expect_column_values_to_be_between
The function names describe the expected behavior.
Each function returns a dictionary with two values:
- "success" is true if the condition is fulfilled, otherwise false.
- "summary" contains additional information about the check result.
The following example checks how many unique values the "height" column of a PySpark DataFrame contains:
import pyspark_expectations
from pyspark.sql import SparkSession
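# Create a local Spark session for the example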
spark = (
SparkSession.builder.master("local")
.config("spark.sql.shuffle.partitions", "1")
.config("spark.driver.host", "localhost")
.getOrCreate()
)
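# Sample data: 30 rows, but only three distinct values (including None) in the "height" column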
data = [["Patrik", 1.92], ["Tanja", 1.62], ["Tobi", None]] * 10
df = spark.createDataFrame(data, ["name", "height"])
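# Run the uniqueness check on the "height" column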
result_from_unique_values_analyse = df.expect_column_values_to_be_unique("height")
print("""
Percent of duplicate values : {duplicates_count_percent},
Percent of distinct values : {distinct_count_percent},
Total count : {total_count},
Number of duplicate values : {duplicates_count},
Number of distinct values : {distinct_count}
""".format(**result_from_unique_values_analyse["summary"]))
spark.stop()
Output after running:
Percent of duplicate values : 0.9,
Percent of distinct values : 0.1,
Total count : 30,
Number of duplicate values : 27,
Number of distinct values : 3
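Because each function also returns the "success" flag, checks can be used to gate further processing. Below is a minimal sketch that reuses df from the example above (it would have to run before spark.stop()); the single column-name argument assumed for expect_column_values_to_not_be_null mirrors the uniqueness check and is an assumption, not taken from the project's documentation:
result = df.expect_column_values_to_be_unique("height")
if not result["success"]:
    # The expectation failed: report the details from "summary" before deciding how to proceed.
    print("Duplicate values in 'height':", result["summary"])

# Assumed to take a single column-name argument, mirroring the check above.
null_check = df.expect_column_values_to_not_be_null("height")
if not null_check["success"]:
    print("Null values in 'height':", null_check["summary"])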
The tests are implemented with the pytest framework. To run the tests, execute the following command in the terminal:
$ pytest
If you run into errors, check the $PYTHONPATH variable of the environment.
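One common fix, assuming the tests are run from the repository root on a Unix shell, is to add the project directory to $PYTHONPATH before invoking pytest:
$ export PYTHONPATH=$(pwd):$PYTHONPATH
$ pytest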
This project is licensed under the MIT License - see the LICENSE file for details