ModuPipe : A modular and extensible ETL-like pipeline builder

Benefits

Entirely typed
Abstract, so it fits any use case
Class-based for easy configurations and injections

Usage

Extract-Transform-Load (ETL) pipelines are a classic form of data-processing pipelines used in the industry. It consists of 3 main elements:

An Extractor, which returns data in a stream-like structure (Iterator in Python) using a pull strategy.
Some Mapper (optional), which transforms (parse, converts, filters, etc.) the data obtained from the source(s). Mappers can be chained together and chained to an extractor (with +) in order to form a new extractor.
A Loader, which receives the maybe-transformed data using a push strategy. Loaders can be multiple (with LoaderList) or chained together (with +).

Therefore, those 3 processes are offered as interfaces, easily chainable and interchangeable at any time.

An interface Runnable is also offered in order to interface the concept of "running a pipeline". This enables a powerfull composition pattern for wrapping the execution behaviour of runnables.

Examples

Usage examples are present in the examples folder.

Discussion

Optimizing pushing to multiple loaders

If you have multiple loaders (using the LoaderList class or many chained PushTo mappers), but performance is a must, then you should use a multi-processing approach (with modupipe.runnable.MultiProcess), and push to 1 queue per loader. Each queue will also become a direct extractor for each loader, all running in parallel. This is especially usefull when at least one of the loaders takes a long processing time.

As an example, let's take a Loader 1 which is very slow, and a Loader 2 which is normally fast. You'll be going from :

┌────── single pipeline ──────┐        ┌──────────────── single pipeline ───────────────┐
 Extractor ┬─⏵ Loader 1 (slow)    OR    Extractor ──⏵ Loader 1 (slow) ──⏵ Loader 2 (late)
           └─⏵ Loader 2 (late)

to :

┌────── pipeline 1 ──────┐               ┌────────── pipeline 2 ─────────┐
 Extractor ┬─⏵ PutToQueue ──⏵ Queue 1 ⏴── GetFromQueue ──⏵ Loader 1 (slow)
           └─⏵ PutToQueue ──⏵ Queue 2 ⏴── GetFromQueue ──⏵ Loader 2 (not late)
                                         └──────────── pipeline 3 ───────────┘

This will of course not accelerate the Loader 1 processing time, but all the other loaders performances will be greatly improved by not waiting for each other.

Name	Name	Last commit message	Last commit date
Latest commit semantic-release and actions-user chore(release): v1.0.2 Aug 23, 2022 92da674 · Aug 23, 2022 History 29 Commits
.github/workflows	.github/workflows	fix: semantic versioning (#7 )	Aug 23, 2022
ci	ci	fix: semantic versioning (#7 )	Aug 23, 2022
examples	examples	refact: New interfaces and namings (#4 )	Aug 8, 2022
modupipe	modupipe	chore(release): v1.0.2	Aug 23, 2022
tests	tests	refact: New interfaces and namings (#4 )	Aug 8, 2022
.gitignore	.gitignore	updated gitignore	Jul 26, 2022
CHANGELOG.md	CHANGELOG.md	chore(release): v1.0.2	Aug 23, 2022
LICENSE	LICENSE	Create LICENSE	Jul 26, 2022
README.md	README.md	refact: New interfaces and namings (#4 )	Aug 8, 2022
poetry.lock	poetry.lock	moved to mockito	Jul 26, 2022
pyproject.toml	pyproject.toml	chore(release): v1.0.2	Aug 23, 2022
setup.cfg	setup.cfg	more tests	Jul 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ModuPipe : A modular and extensible ETL-like pipeline builder

Benefits

Usage

Examples

Discussion

Optimizing pushing to multiple loaders

About

Releases 1

Packages

Languages

License

vigenere23/modupipe-python

Folders and files

Latest commit

History

Repository files navigation

ModuPipe : A modular and extensible ETL-like pipeline builder

Benefits

Usage

Examples

Discussion

Optimizing pushing to multiple loaders

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages