# Pomps

Poor Man's PySpark

## Description

This is meant to give a simple example of the apparatus needed to load, transform, and merge datasets of indeterminate size without a heavier framework like Spark.

1. `load_and_transform_source_data` - A `load_func()` is passed that accepts a filepath argument and is expected to produce a jsonl file at that filepath. A `transform_func()` tells us how to shape the data.

2. `group_data` - A `group_key_func()` is passed telling us how to group the data. From this, a sort-merge index is created.

3. `merge_data_sources` - Given two datasets grouped by `group_data()`, a `merge_func()` is passed that combines them into a new dataset.

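The three steps above can be sketched end to end. This is a toy, in-memory illustration of the pattern, not the actual pomps API: the real functions work through jsonl files on disk and build a sort-merge index, and the signatures and sample data below are assumptions made for illustration.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def load_and_transform_source_data(load_func, transform_func, filepath):
    # load_func() is expected to leave a jsonl file at filepath;
    # transform_func() reshapes each decoded record.
    load_func(filepath)
    with open(filepath) as f:
        return [transform_func(json.loads(line)) for line in f]

def group_data(rows, group_key_func):
    # In-memory stand-in for the on-disk sort-merge index: group key -> rows.
    index = defaultdict(list)
    for row in rows:
        index[group_key_func(row)].append(row)
    return index

def merge_data_sources(left, right, merge_func):
    # Combine the rows that share a group key across both grouped datasets.
    return [merge_func(left[k], right[k]) for k in left.keys() & right.keys()]

# --- Toy usage (hypothetical data, not the IMDB files) ---
tmp = Path(tempfile.mkdtemp())

def load_names(path):
    Path(path).write_text(json.dumps({"id": 1, "name": "Ada"}) + "\n")

def load_titles(path):
    Path(path).write_text(json.dumps({"id": 1, "title": "Analyst"}) + "\n")

names = load_and_transform_source_data(load_names, lambda r: r, tmp / "names.jsonl")
titles = load_and_transform_source_data(load_titles, lambda r: r, tmp / "titles.jsonl")
merged = merge_data_sources(
    group_data(names, lambda r: r["id"]),
    group_data(titles, lambda r: r["id"]),
    lambda ls, rs: {**ls[0], **rs[0]},
)
print(merged)  # [{'id': 1, 'name': 'Ada', 'title': 'Analyst'}]
```

The real thing grows out of exactly this shape once the grouped data is too big for memory: the index moves to sorted files on disk and the merge walks them bucket by bucket.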
## Example

A toy example lives in test_pomps.py, in test_merge_data_sources: it creates some jsonl files on the fly, groups and merges them, and checks that the result looks as expected.

The example.py file shows a more realistic example that merges together some of the IMDB data files found here. Be warned: this IMDB data appears to have undependable nconst and tconst ids (various runs can load title data pointing to an nconst that is not found in the loaded name data). Still, it proves out the internal workings of Pomps well enough.

## Quickstart

There are no external dependencies. This is tested on Windows and Linux with Python 3.9, but should work just fine on OS X as well.

```
$ git clone git@github.com:eliwjones/pomps.git
$ cd pomps
$ python test_pomps.py  # or python example.py if you have ~42 minutes to spare (~23 minutes with pypy)
```

## Features

Basic resumability is built in. If you are in the middle of a large load or merge of multiple datasets and a bug breaks a transform, load, or group func, you can fix your bug and just re-run the code using the same namespace and execution_date. It will pick up where it last left off.

A couple of simple checks in the code are what enable this behavior.

## TODO

The general aim is to keep this as stupidly simple as possible while clearly showing how to transform, group, and merge data.

With that said, we are leaving a lot of CPU on the table when grouping and sorting the buckets, so I may add a multiprocessing Pool somewhere in here.
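One way that could look, as a sketch: assume each bucket is a list of dicts with its group key under a `"key"` field (the actual bucket layout in pomps may differ), and fan the per-bucket sorts out across worker processes.

```python
from multiprocessing import Pool

def sort_bucket(bucket):
    # Sort one bucket's rows by their group key.
    return sorted(bucket, key=lambda row: row["key"])

def sort_buckets_parallel(buckets, processes=2):
    # Fan the independent per-bucket sorts out across worker processes.
    with Pool(processes=processes) as pool:
        return pool.map(sort_bucket, buckets)

if __name__ == "__main__":
    buckets = [[{"key": 3}, {"key": 1}], [{"key": 2}, {"key": 0}]]
    print(sort_buckets_parallel(buckets))
```

Since the buckets are independent, this parallelizes cleanly; the main cost is pickling the rows across the process boundary, which is why it may or may not be worth the added complexity here.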