# Poor Man's PySpark
This is meant to give a simple example of the apparatus needed to load, transform and merge datasets of indeterminate size without the use of a more complex framework like Spark.
- `load_and_transform_source_data` - A `load_func()` is passed that accepts a `filepath` arg. It is expected this results in a jsonl file at the filepath specified. A `transform_func()` tells us how to shape the data.
- `group_data` - A `group_key_func()` is passed telling us how to group the data. With this, a sort-merge index is created.
- `merge_data_sources` - Given two datasets grouped by the `group_data()` function, we pass a `merge_func()` that is used to combine them into a new dataset. (A sketch of these callbacks follows below.)
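To make the division of labor concrete, here is a minimal sketch of the callbacks a caller might supply. The record shapes and function bodies are toy assumptions for illustration; see `test_pomps.py` for real usage.

```python
import json

def load_func(filepath):
    # Expected contract: leave a jsonl file at the given filepath.
    # These rows are made-up toy data.
    rows = [{"user_id": 2, "name": "beta"}, {"user_id": 1, "name": "alpha"}]
    with open(filepath, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def transform_func(record):
    # Shape each loaded record into the form we want downstream.
    return {"user_id": record["user_id"], "name": record["name"].title()}

def group_key_func(record):
    # Name the key the sort-merge index is built on.
    return record["user_id"]

def merge_func(left_record, right_record):
    # Combine matching records from two grouped datasets into one.
    return {**left_record, **right_record}
```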
A toy example can be seen in the `test_pomps.py` file. `test_merge_data_sources` creates some jsonl files on the fly, groups and merges them together, and tests that the result looks as expected.
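For a rough sense of how those pieces wire together end to end, here is a hedged reconstruction of the flow the test describes. The call signatures and keyword arguments below are guesses, not the actual pomps API; defer to `test_pomps.py` for the real calls.

```python
# Assumed wiring: load two sources, group each on a shared key, then merge.
# Reuses the toy callbacks sketched above; every signature here is a guess.
from pomps import load_and_transform_source_data, group_data, merge_data_sources

run_args = {"namespace": "toy_example", "execution_date": "2024-01-01"}  # assumed shape

left = load_and_transform_source_data(load_func=load_func, transform_func=transform_func, **run_args)
right = load_and_transform_source_data(load_func=load_func, transform_func=transform_func, **run_args)

grouped_left = group_data(left, group_key_func=group_key_func, **run_args)
grouped_right = group_data(right, group_key_func=group_key_func, **run_args)

merged = merge_data_sources(grouped_left, grouped_right, merge_func=merge_func, **run_args)
```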
The `example.py` file shows a more realistic example that merges together some of the IMDB data files found [here](https://datasets.imdbws.com/). Be warned, this IMDB data appears to have undependable `nconst` and `tconst` ids (various runs can load title data that points to an `nconst` that is not found in the loaded name data). But it proves out the internal workings of Pomps well enough.
There are no external dependencies. This is tested on Windows and Linux with Python 3.9, but should work just fine on OS X as well.
```
$ git clone [email protected]:eliwjones/pomps.git
$ cd pomps
$ python test_pomps.py  # or python example.py if you have ~42 minutes to spare (~23 minutes with pypy)
```
Basic resumability is built in. If you are in the middle of a large load and merge of multiple datasets and a bug breaks a transform, load, or group func, you can fix your bug and just re-run the code using the same `namespace` and `execution_date`. It will pick up where it last left off.
Checks in the source for already-completed work are what enable this behavior (sketched below).
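The pattern is roughly the one below: derive each step's output path from the `namespace` and `execution_date`, and skip any step whose output already exists. This is an illustrative reconstruction, not the exact check from the source.

```python
import os

def run_step(step_name, namespace, execution_date, work_func):
    # Hypothetical resumability check: output paths are keyed on
    # (namespace, execution_date), so re-running with the same pair
    # skips every step that already finished.
    out_path = os.path.join(namespace, execution_date, step_name + ".jsonl")
    if os.path.exists(out_path):
        return out_path  # Done on a previous run; nothing to redo.
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    work_func(out_path)  # Only re-executes work that never completed.
    return out_path
```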
The general aim is to keep this as stupidly simple as possible while clearly showing how to transform, group and merge data.
With that said, we are leaving a lot of CPU on the table when grouping and sorting the buckets, so I may add a `multiprocessing.Pool` somewhere in here.
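For illustration, one way that could look: hand each bucket file to a worker process to be sorted. The bucket layout and the `group_key` field below are assumptions, not pomps internals.

```python
from multiprocessing import Pool
import json

def sort_bucket(bucket_path):
    # Sort one bucket's jsonl records by an assumed group_key field and
    # write them back, so a sort-merge pass can stream them in order.
    with open(bucket_path) as f:
        records = [json.loads(line) for line in f]
    records.sort(key=lambda r: r["group_key"])
    with open(bucket_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    bucket_paths = ["bucket_0.jsonl", "bucket_1.jsonl", "bucket_2.jsonl"]
    with Pool() as pool:
        pool.map(sort_bucket, bucket_paths)  # one bucket per worker
```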