schemadiff

schemadiff is a niche package for situations where a large number of files on a filesystem are expected to have identical schemas, but do not. This can be a problem when working with distributed computing systems like Apache Spark or Google BigQuery, because unexpected schema differences can disrupt data loading and processing.

Consider a scenario where you are processing thousands of files, and a subset of them have schemas that are nearly identical but do not match exactly. This can lead to errors such as:

BigQuery: Error while reading data, error message: Parquet column '<COLUMN_NAME>' has type INT32 which does not match the target cpp_type DOUBLE File: gs://bucket/file.parquet
Spark: Error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
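
To make the "nearly identical" case concrete, here is a minimal sketch that compares two schemas represented as plain dicts of column name to type name. The function `diff_schemas` and the sample column names are hypothetical and are not part of schemadiff; they only illustrate the kind of mismatch behind the errors above.

```python
def diff_schemas(reference, other):
    """Return columns whose type differs or that exist in only one schema."""
    diff = {}
    for column in sorted(set(reference) | set(other)):
        ref_type = reference.get(column)    # None means the column is missing
        other_type = other.get(column)
        if ref_type != other_type:
            diff[column] = (ref_type, other_type)
    return diff


# Hypothetical schemas: one column was written as INT32 instead of DOUBLE.
expected = {"trip_id": "INT64", "fare": "DOUBLE", "pickup": "TIMESTAMP"}
actual = {"trip_id": "INT64", "fare": "INT32", "pickup": "TIMESTAMP"}

print(diff_schemas(expected, actual))  # {'fare': ('DOUBLE', 'INT32')}
```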

schemadiff addresses these issues by reading only file metadata, which lets it identify the files with inconsistent schemas efficiently.
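
The general idea of grouping files by schema can be sketched as follows. This is not schemadiff's implementation: `group_by_schema` and the caller-supplied `read_schema` function are hypothetical names, and the sample data stands in for real file metadata so the sketch runs without any dependencies.

```python
from collections import defaultdict


def group_by_schema(paths, read_schema):
    """Group file paths by the schema that read_schema(path) returns.

    read_schema is supplied by the caller. It should read metadata only and
    return something hashable, e.g. a tuple of (column, type) pairs.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[read_schema(path)].append(path)
    # Largest group first, so the outliers end up at the tail.
    return sorted(groups.values(), key=len, reverse=True)


# Stand-in for real file metadata (hypothetical file names and columns).
fake_metadata = {
    "a.parquet": (("fare", "DOUBLE"),),
    "b.parquet": (("fare", "DOUBLE"),),
    "c.parquet": (("fare", "INT32"),),
}

groups = group_by_schema(sorted(fake_metadata), fake_metadata.__getitem__)
print(groups)  # [['a.parquet', 'b.parquet'], ['c.parquet']]
```

For real Parquet files, one possible `read_schema` would wrap `pyarrow.parquet.read_schema`, which reads the file footer rather than the data pages. Whether schemadiff itself uses pyarrow is an assumption here.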

Installation

Install the package with pip:

pip install schemadiffed # schemadiff taken :p

Usage

The package can be used as a Python library or as a command-line tool.

Python Library

Here's an example of using schemadiff to group files by their schema:

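import os
from schemadiff import compare_schemas

# Path to the Google Cloud key file, used when the files live on GCS.
os.environ['GOOGLE_CLOUD_CREDENTIALS'] = 'key.json'
# Group the files by schema and write the grouping to a JSON report.
grouped_files = compare_schemas('path/to/parquet_files', report_path='/desired/path/to/report.json')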

In this example, compare_schemas groups the Parquet files in the directory path/to/parquet_files by their schema. It writes the results to the file given by report_path and also returns the grouped files as a list for downstream use.
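
As one possible downstream use, the sketch below separates the majority group from the outliers. It assumes the returned list has the shape of a list of lists of file paths, one inner list per distinct schema; the exact return structure is not documented here, so check it before relying on this. `split_outliers` is a hypothetical helper, not part of schemadiff.

```python
def split_outliers(grouped_files):
    """Split groups into (majority_files, outlier_files).

    Assumes grouped_files is a list of lists of paths, one list per schema.
    """
    if not grouped_files:
        return [], []
    ordered = sorted(grouped_files, key=len, reverse=True)
    majority = ordered[0]
    # Every file outside the largest group is treated as an outlier.
    outliers = [path for group in ordered[1:] for path in group]
    return majority, outliers


# Hypothetical result shape, for illustration only.
sample = [["c.parquet"], ["a.parquet", "b.parquet"]]
majority, outliers = split_outliers(sample)
print(majority)  # ['a.parquet', 'b.parquet']
print(outliers)  # ['c.parquet']
```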

Command-Line Interface

schemadiff can also be used as a command-line tool. After installation, the command compare-schemas is available in your shell:

compare-schemas --dir_path 'gs://<bucket>/yellow/*_2020*.parquet' --fs_type 'gcs' --report_path 'report.json' --return_type 'as_list'

Features

Efficient processing by reading the metadata of Parquet files.
Supports local, GCS, and S3 filesystems (you must be authenticated to your cloud service first).
Supports wildcard characters for flexible file selection.
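
To illustrate how a wildcard pattern such as the `*_2020*.parquet` used in the command-line example narrows a file selection, here is a small sketch using shell-style matching from the Python standard library. How schemadiff expands wildcards internally is not stated here, so treat this as an illustration of the pattern semantics only; `select_files` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_files(names, pattern):
    """Keep the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical file names.
names = [
    "yellow_2019-12.parquet",
    "yellow_2020-01.parquet",
    "yellow_2020-02.parquet",
    "green_2020-01.csv",
]

print(select_files(names, "*_2020*.parquet"))
# ['yellow_2020-01.parquet', 'yellow_2020-02.parquet']
```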

Elsayed91/schemadiff
