
initial pass at a pipelining transform #424

Draft: wants to merge 10 commits into base: dev
Conversation

@daw3rd daw3rd commented Jul 18, 2024

Why are these changes needed?

We would like to define a sequence of transforms using python and run the sequence as any other transform would be run.

Related issue number (if any).

#374

@daw3rd daw3rd marked this pull request as draft July 18, 2024 18:01
@daw3rd daw3rd requested a review from blublinsky July 18, 2024 18:01
@blublinsky (Collaborator) left a comment:

This is really trivialization.

raise ValueError(f"Missing configuration key {transform_key} specifying the list of transforms to run")
for transform in self.transforms:
    if not isinstance(transform, AbstractBinaryTransform):
        raise ValueError(f"{transform} is not an instance of AbstractBinaryTransform")
blublinsky (Collaborator):

Every transform here can have its own config parameters. Where are transforms initialized?

daw3rd (Member Author):

Good question. Not sure we need that in the first pass, but open to suggestions. Initially this may be for the non-launched/embedded/notebook use cases.

# Capture the list of outputs from this transform as inputs to the next (or as the return values).
for transformation in transformation_tuples:
    transformed, extension = transformation
    fname = transform_name + "-output" + extension
blublinsky (Collaborator):

This will break transform logic if transforms need the input file name, for example codetoparquet or pdf conversion.

daw3rd (Member Author):

Don't they all operate only on the extension?

blublinsky (Collaborator):

Not all of them.

daw3rd (Member Author):

I could use the original name, but then using the same name for all seems incorrect. Suggestion?

r_bytes = []
r_metadata = {}
for transform in self.transforms:
    transformation_tuples, metadata = transform.flush_binary()
blublinsky (Collaborator):

This is completely wrong. The flush from the first transform has to be processed by the rest of them, then the flush from the second transform has to be processed by the remaining ones, and so on.

daw3rd (Member Author):

Ah, good. Yes, for A->B->C, A.flush() has to be fed to B.transform(). Ugly, but true.
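
The chaining blublinsky describes can be sketched with toy transforms. This is a hedged illustration only: `Upper` and `pipeline_flush` are hypothetical names, not the data-prep-kit API, and the real transforms operate on byte arrays rather than strings.

```python
class Upper:
    """Toy transform: upper-cases strings; emits one buffered item on flush."""

    def __init__(self, tag: str):
        self.tag = tag

    def transform_binary(self, data: list[str]) -> list[str]:
        return [d.upper() for d in data]

    def flush_binary(self) -> list[str]:
        return [f"{self.tag}-flushed"]


def pipeline_flush(transforms: list[Upper]) -> list[str]:
    # For A->B->C: A.flush() must be run through B.transform() and C.transform(),
    # B.flush() through C.transform(), and C.flush() is returned as-is.
    out: list[str] = []
    for i, t in enumerate(transforms):
        flushed = t.flush_binary()
        for downstream in transforms[i + 1:]:
            flushed = downstream.transform_binary(flushed)
        out.extend(flushed)
    return out
```

Note that the last transform's flush is the only one that bypasses all downstream processing, which is why a single loop over `transforms[i + 1:]` covers every case.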

# which use/expect parquet files.
fixtures = []
noop0 = NOOPTransform({"sleep": 0})
noop1 = NOOPTransform({"sleep": 0})
blublinsky (Collaborator):

This is complete cheating

daw3rd (Member Author):

It's a start. That said, we're not testing the application of the underlying transforms as much as the structure of the output. But yes, it would be nice to have a better test; that would require having transforms other than NOOP in the test_support packages.

daw3rd (Member Author):

Ah, you are referring to the configuration part. We have always said that transforms can be configured outside of the CLI/runtime mechanics; I'm doing that here. However, it is true that running a pipeline transform in a runtime may require more work. This is aimed at python-only, non-runtime users, at least initially.

blublinsky (Collaborator):

I repeat my sentiment. It's circumventing the "normal" execution.

blublinsky (Collaborator):

These issues just magnify why this approach is a non-starter (in my opinion).


daw3rd commented Jul 19, 2024

from typing import Any
import pyarrow as pa
from pyarrow import Table
from data_processing.transform import AbstractTableTransform
from data_processing.transform.binary_pipeline import PipelinedBinaryTransform
from data_processing.utils import TransformUtils

class HelloTransform(AbstractTableTransform):
    """Adds a column of greetings."""
    def __init__(self, config:dict):
        self.who = config.get("who", "World")
        self.column_name = config.get("column_name", "greeting")

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
        # Create a new column with each row holding the who value
        new_column = ["Hello " + self.who + "!"] * table.num_rows
        # Append the column to create a new table with the configured name.
        table = TransformUtils.add_column(table=table, name=self.column_name, content=new_column)
        return [table], {}

if __name__ == "__main__":

    ids = pa.array([0 ])
    contents = pa.array(["Some content"])
    names = ["doc_id", "content"]
    table = pa.Table.from_arrays([ids, contents], names=names)

    hello0 = HelloTransform({})
    hello1 = HelloTransform({"column_name": "greating2", "who": "David"})
    hello2 = HelloTransform({"column_name": "greating3", "who": "Boris"})
    pipeline = PipelinedBinaryTransform({"transforms": [hello0, hello1, hello2]})
    tables, metadata = pipeline.transform(table)
    table = tables[0]
    print(f"table={table}")

produces

table=pyarrow.Table
doc_id: int64
content: string
greeting: string
greating2: string
greating3: string
----
doc_id: [[0]]
content: [["Some content"]]
greeting: [["Hello World!"]]
greating2: [["Hello David!"]]
greating3: [["Hello Boris!"]]


def __init__(self, config: dict[str, Any]):
"""
Create the pipeline using a list of initialize transforms
blublinsky (Collaborator):

initialized?

self.input_extension = None
self.transforms = config.get(transform_key, None)
if self.transforms is None:
    raise ValueError(f"Missing configuration key {transform_key} specifying the list of transforms to run")
blublinsky (Collaborator):

Also make sure the list is not empty and actually contains transforms, not some garbage
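
The up-front validation asked for here could look like the following sketch. `AbstractBinaryTransform` is stubbed out so the snippet is self-contained; `validate_transforms` is a hypothetical helper name, not part of the PR.

```python
class AbstractBinaryTransform:
    """Stand-in for the real data-prep-kit base class."""


def validate_transforms(config: dict, transform_key: str = "transforms") -> list:
    """Reject a missing, empty, or wrongly-typed transform list before running."""
    transforms = config.get(transform_key)
    if not transforms:  # catches both None and an empty list
        raise ValueError(f"Missing or empty configuration key {transform_key}")
    for t in transforms:
        if not isinstance(t, AbstractBinaryTransform):
            raise ValueError(f"{t} is not an instance of AbstractBinaryTransform")
    return transforms
```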

r_bytes, r_metadata = self._apply_transforms_to_datum(self.transforms, (file_name, byte_array))
return r_bytes, r_metadata

def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
blublinsky (Collaborator):

Why is this a table? It should be binary. It should check the extension and then decide whether this is a table.

daw3rd (Member Author):

I'm trying to avoid having someone who wants to transform a table first convert it to bytes just to call transform_binary(). I think, as you said, we may need to restrict this to transform_binary(), or allow transform() but check that the input is a Table, or have transform_binary() do the type checking. I do think it would be useful to avoid unnecessary conversion of Table to bytes and back for each transform.

transform: Union[AbstractTableTransform, AbstractBinaryTransform],
datum: Union[pa.Table, bytearray],
file_name: str,
):
blublinsky (Collaborator):

This seems to be wrong. A transform is always binary; a table is just one case of binary, so there is no need for Unions. Take a look at https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/python/src/data_processing/transform/table_transform.py: it takes binary and returns binary. Table is just an intermediate format.

daw3rd (Member Author):

Understood. I was/am trying to find a way to avoid requiring the user to call transform_binary() if they are starting with a Table. And if the list of transforms is all TableTransforms, we should try to avoid two byte/Table conversions on each transform.

transformation_tuples, metadata = transform.transform_binary(file_name, datum)
return transformation_tuples, metadata

def _call_flush(self, transform: Union[AbstractTableTransform, AbstractBinaryTransform]):
blublinsky (Collaborator):

Here is this Union stuff again.

for table in tables:
transformation_tuples.append((table, ".parquet"))
else:
transformation_tuples, metadata = transform.flush_binary()
blublinsky (Collaborator):

This should be the only method called. Why is it more complex than it needs to be?

daw3rd (Member Author):

Again, trying to avoid byte/Table conversion, but not sure it's ready yet.

# which use/expect parquet files.
fixtures = []
noop0 = NOOPTransform({"sleep": 0})
noop1 = NOOPTransform({"sleep": 0})
blublinsky (Collaborator):

I repeat my sentiment. It's circumventing the "normal" execution.

blublinsky (Collaborator):

This is breaking the CLI.


daw3rd commented Sep 19, 2024

This is breaking the CLI.

I have not completed that part yet. This is still a Draft, after all. :)
