Redesign dataset format and interface #567
Proposed design

The component spec

The consumes and produces section will be flattened, to only contain fields:
Data storage

The output produced by each component is written as a new parquet dataset.
The complete dataset after the third component can be found by looking from the right (format parquet-dataset:field):
The manifest

The manifest provides this view at each point in the pipeline. After step 4, it will represent the view above:

{
"metadata": {
"base_path": "gs://bucket",
"pipeline_name": "test_pipeline",
"run_id": "test_pipeline_12345",
"component_id": "component4"
},
"index": {
"location": "/component4"
},
"fields": {
"field1": {
"location": "/component1",
"type": "..."
},
"field2": {
"location": "/component1",
"type": "..."
},
"field3": {
"location": "/component3",
"type": "..."
},
"field4": {
"location": "/component2",
"type": "..."
},
"field5": {
"location": "/component2",
"type": "..."
}
}
}

Additional fields

We still need a mechanism to remove additional fields from the output dataset of a component if it changes the index (e.g. LaionRetrieval components, which go from a prompt id to a Laion id). For example, if component 3 above would define:

produces:
field3:
type: ...
additionalFields: false

The data storage would still look exactly the same as above, but now the manifest only looks back until the output of component3:

{
"metadata": {
...
},
"index": {
"location": "/component4"
},
"fields": {
"field3": {
"location": "/component3",
"type": "..."
}
}
}

User interface

We still present all the data in a manifest as a single dataframe to the user. Both the
We can create this dataframe by creating a reverse mapping of the locations and fields in the manifest:

{
"/component4": ["index"],
"/component3": ["field3"],
"/component2": ["field4", "field5"],
"/component1": ["field1", "field2"],
}

And reading the fields from each location and merging them together:

import dask.dataframe as dd

dataframe = None
for location, fields in field_mapping.items():
    partial_df = dd.read_parquet(location, columns=fields)
    dataframe = partial_df if dataframe is None else dd.merge(dataframe, partial_df, how="left", left_index=True, right_index=True)
|
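As a minimal sketch (hypothetical helper, not the actual Fondant implementation), the reverse field_mapping above could be derived from a manifest shaped like the example earlier:

def field_mapping_from_manifest(manifest: dict) -> dict:
    """Reverse the manifest's field -> location mapping into location -> [fields]."""
    mapping = {}
    for name, field in manifest["fields"].items():
        mapping.setdefault(field["location"], []).append(name)
    return mapping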
As an example, let's see what the component specs, manifests, and data storage would look like for the following pipeline based on our ControlNet example:
1. Prompt generation

fondant_component.yaml

produces:
text:
type: string

data storage
manifest

{
"index": {
"location": "/generate_prompts"
},
"fields": {
"text": {
"location": "/generate_prompts",
"type": "string"
}
}
}

2. Laion Retrieval

fondant_component.yaml

consumes:
text:
type: string
produces:
image_urls:
type: string
additionalFields: false

data storage
manifest

{
"index": {
"location": "/laion_retrieval"
},
"fields": {
"image_urls": {
"location": "/laion_retrieval",
"type": "string"
}
}
}

3. Download images

fondant_component.yaml

consumes:
image_urls:
type: string
produces:
image:
type: bytes
width:
type: int32
height:
type: int32

data storage
manifest

{
"index": {
"location": "/download_images"
},
"fields": {
"image_urls": {
"location": "/laion_retrieval",
"type": "string"
},
"image": {
"location": "/download_images",
"type": "bytes"
},
"width": {
"location": "/download_images",
"type": "int32"
},
"height": {
"location": "download_images",
"type": "int32"
},
}
}

4. Filter resolution

fondant_component.yaml
data storage
manifest

{
"index": {
"location": "/filter_resolution"
},
"fields": {
"image_urls": {
"location": "/laion_retrieval",
"type": "string"
},
"image": {
"location": "/download_images",
"type": "bytes"
},
"width": {
"location": "/download_images",
"type": "int32"
},
"height": {
"location": "download_images",
"type": "int32"
},
}
}

5. Crop images

fondant_component.yaml
data storage
manifest

{
"index": {
"location": "/filter_resolution"
},
"fields": {
"image_urls": {
"location": "/laion_retrieval",
"type": "string"
},
"image": {
"location": "/crop_images",
"type": "bytes"
},
"width": {
"location": "/download_images",
"type": "int32"
},
"height": {
"location": "download_images",
"type": "int32"
},
}
}

This is quite a logical and simple flow, I believe. The last step shows the only issue I see with this approach: the data is cropped and overwritten, but the width and height still contain the old values. We had the same issue with the previous approach though (although |
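To make the field evolution concrete, here is a rough sketch of how the fields section of the manifest could evolve per component. This is a hypothetical helper assuming the manifest and component spec structures shown above, not actual Fondant code:

def evolve_fields(fields: dict, component_id: str, produces: dict, additional_fields: bool = True) -> dict:
    """Produced fields point to the new component's location; with additionalFields: false,
    all previously tracked fields are dropped."""
    location = f"/{component_id}"
    evolved = dict(fields) if additional_fields else {}
    for name, spec in produces.items():
        evolved[name] = {"location": location, "type": spec["type"]}
    return evolved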
|
I think this approach tackles a lot of the recent feedback as you have mentioned. A few questions from my side to clarify my understanding.
Isn't it more an issue related to the component design (not updating the width and height after image cropping) instead of an issue of your approach?
I still think that this is one of the biggest blockers regarding the reusability of components. From a user's perspective, we are offering different building blocks which can operate on custom dataframes. I can imagine a lot of users creating custom LoadComponents and defining a custom data schema. Now we should offer capabilities to choose a component that applies a specific transformation, and define a column to operate on. Naively, I would like to have an interface to choose columns to operate on and optional column names which will be written to the dataframe. We could use the custom mappings to access the data within the component and write the custom mappings to the manifest too. This should be possible without changing the ComponentSpec. Am I overlooking something? |
Thanks Robbe! Looks really promising so far. A few questions:
|
I'm not completely sure what you mean. We indeed write all columns that are produced by a component to a single parquet dataset. Parquet is a columnar storage format, so the data is still stored column-wise.
Partially, yes. However, it might be difficult for a component to know up front all possible fields that might be related. For instance, there might have been a I think the user will have to handle this. We might want to provide a mechanism for the user to invalidate certain columns on the pipeline level.
So you mean that the user should explicitly map the fields to a component every time? I think that could make sense if we can keep the interface simple enough. I'm open to proposals on this front, I think it's the main open step before we can decide to move forward with this.
Not sure. We already merge the index and different subsets now, and I haven't really noticed an impact from this. We will be merging more in this new proposal, but both sides should always contain indexes, and maybe even be sorted, so the impact might be limited. It would probably be good to test this before implementation if we choose to go this way.
I don't think we can detect this automatically, since this is about the semantic meaning of the index. I believe a component that changes the semantic meaning of the index needs to be able to mark this in its component spec, since it will invalidate all older data with 100% certainty. I'm open to other ways of marking this though. |
Actually, the resulting file structure looks like this:

├── component_1/
│ ├── index/
│ │ ├── part.0.parquet
│ │ ├── part.1.parquet
│ │ └── ...
│ └── text/
│ ├── part.0.parquet
│ ├── part.1.parquet
│ └── ...
└── component_2/
├── index/
│ ├── part.0.parquet
│ ├── part.1.parquet
│ └── ...
└── image_urls/
├── part.0.parquet
├── part.1.parquet
└── ...

I thought we were achieving this by using our custom write-to-parquet approach:

...
write_tasks = [
dd.to_parquet(index_df, "/component_1/index", compute=False),
dd.to_parquet(url_df, "/component_2/text", compute=False)
]
dd.compute(*write_tasks, ...)

I was mainly wondering if your proposed approach has an effect on this. Essentially, should we continue with our approach or combine this into a single write task, which would result in something like this:

├── component_1/
│ ├── part.0.parquet
│ ├── part.1.parquet
│ └── ...
└── component_2/
├── part.0.parquet
├── part.1.parquet
└── ... |
This is indeed what I propose, and what I meant with "a single parquet dataset". I don't see any reason to split them, since we can select which columns to read from a parquet dataset. |
Oh, I initially thought that we were planning on storing them separately (individual folder per column). I think this is better, especially for the merging, since there is a higher probability that some of the columns to consume are in the same "single parquet dataset". |
Interesting, I like it so far. Do we always need to write the index in every component? Or only in the ones that modify the number of rows (filter/expand)? |
We need to write it in every component, but we are no longer writing it separately every time. If a component only updates the index, only the index will be written. Otherwise, it will just be written as part of the data, which is necessary anyway. This means that we write the index less often, while we still keep the advantage of only updating the index for filtering components. So #70 would be fixed automatically by adopting this approach. |
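As an illustration only, the write behaviour described here could look roughly like the sketch below, assuming dask and parquet datasets that preserve the index; this is not the actual Fondant DataIO code:

import dask.dataframe as dd

def write_component_output(output_df, produced_columns, location):
    # A pure filter component produces no data columns, so only the index is persisted;
    # otherwise the index is written as part of the produced data.
    columns = produced_columns if produced_columns else []
    dd.to_parquet(output_df[columns], location, write_index=True)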
I am not sure if there is a better alternative than just adding a mapping directly to the ComponentOp. So let's say the user starts with this pipeline in place:

load_data (generic)
produces:
- image_array
- text_data
embed_text (custom)
consumes:
- text_data
produces:
- image_embedding

And then later on the user decides to add a reusable captioning component because they loaded other images that don't have captions:

load_data (generic)
produces
- image_array
caption_data (reusable)
consumes:
- images_data
produces:
- image_captions
embed_text (custom)
consumes:
- text_data
produces:
- image_embedding

What we can do is add a mapper:

mapper = FieldMapper()
mapper.add_input_mapping(from_input_dataset_field="image_array", to_consumed_component_field="images_data")
mapper.add_output_mapping(from_produced_component_field="image_captions", to_output_dataset_field="text_data")
caption_op = ComponentOp(dir=..., mapper=mapper)

or alternatively

caption_op = ComponentOp(dir=..., mapper=mapper) \
    .map_input(from_input_dataset_field="image_array", to_consumed_component_field="images_data") \
    .map_output(from_produced_component_field="image_captions", to_output_dataset_field="text_data")

It might not seem like the smoothest experience, but I think what we can do is make sure to provide clear instructions in the docs and error catching during static pipeline validation that could then point the user to the correct action to take (depending on input or output mismatch). We can also always plot the pipeline manifest evolution and highlight in red the source of the mismatch. Open to hearing other suggestions.
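For illustration, a minimal FieldMapper along these lines could be sketched as follows (hypothetical, not an existing Fondant class):

from dataclasses import dataclass, field

@dataclass
class FieldMapper:
    input_mapping: dict = field(default_factory=dict)    # dataset field -> component field
    output_mapping: dict = field(default_factory=dict)   # component field -> dataset field

    def add_input_mapping(self, from_input_dataset_field: str, to_consumed_component_field: str) -> "FieldMapper":
        self.input_mapping[from_input_dataset_field] = to_consumed_component_field
        return self

    def add_output_mapping(self, from_produced_component_field: str, to_output_dataset_field: str) -> "FieldMapper":
        self.output_mapping[from_produced_component_field] = to_output_dataset_field
        return self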
Do you mean to say that it should be mandatory for every component? This might force us to define it everywhere at every op, even when it's not needed (component spec fields align), and might have some overlap with the information in the component spec. Or it might mean that we would get rid of the |
I can think of some other approaches. These are not fully fleshed out, but just examples to indicate some other directions we could take. They might not be better :)

Chaining ComponentOps consumes directly

from fondant.pipeline import ComponentOp
load_op = ComponentOp(
component_dir="load_data",
)
caption_op = ComponentOp.from_registry(
name="caption_images",
consumes={
"images_data": load_op.image_array
}
)
embed_op = ComponentOp(
component_dir="embed_text",
consumes={
"text_data": caption_op.image_captions
}
)

We should be able to calculate the output manifest of each component as we go, so each You should only ever need to access the previous

crop_op = ComponentOp.from_registry(
name="crop_images",
consumes={
"images_data": embed_op.image_array, # Correct
"images_data": load_op.image_array, # Incorrect
}
)

Chaining consumes directly in pipeline

This can probably be solved by moving the explicit consumption chaining to the pipeline:

from fondant.pipeline import ComponentOp
load_op = ComponentOp(
component_dir="load_data",
)
caption_op = ComponentOp.from_registry(
name="caption_images",
)
embed_op = ComponentOp(
component_dir="embed_text",
)
crop_op = ComponentOp.from_registry(
name="crop_images",
)
pipeline = pipeline.add_op(load_op)
pipeline = pipeline.add_op(
caption_op,
consumes={
"images_data": pipeline.image_array
}
)
pipeline = pipeline.add_op(
embed_op,
consumes={
"text_data": pipeline.image_captions
}
)
pipeline = pipeline.add_op(
crop_op,
consumes={
"images_data": pipeline.image_array,
}
)

Creating operations and chaining consumes directly in pipeline

Compressing everything into the pipeline:

pipeline = pipeline.add_op(
component_dir="load_data"
)
pipeline = pipeline.add_op(
name="caption_images",
consumes={
"images_data": pipeline.image_array
}
)
pipeline = pipeline.add_op(
component_dir="embed_text",
consumes={
"text_data": pipeline.image_captions
}
)
pipeline = pipeline.add_op(
name="crop_images",
consumes={
"images_data": pipeline.image_array,
}
)

These all only map inputs, but if mapping the inputs is mandatory, I don't think there's a lot of reason to map the outputs, apart maybe from the ability to overwrite data produced in previous steps. Is this closer to what you meant @mrchtr? |
How would this work exactly? The estimation of the attributes can only happen at compile time. I am not familiar with a way to provide dynamic typed attributes.
I think it might be needed in order to not break a whole pipeline (having to remap everywhere) if a component is added to a pipeline at a later stage, as discussed here. I do agree though that it might be confusing. The examples that you mentioned above have a nice interface and could be nice if we manage to make the dynamic attributes work. Do you see the |
The code below works, PyCharm recognizes the attributes:

import typing as t

class ComponentOp:
    def __init__(self, produces: t.List[str]):
        for field in produces:
            setattr(self, field, field)

component_op = ComponentOp(produces=["images_data"])
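A quick check of what this gives you (at runtime only, as clarified below):

print(component_op.images_data)  # -> "images_data"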
Not sure, both are technically feasible. I haven't really thought this through yet; I just wanted to showcase that there are other ways to tackle this. I would need some time to think about the different options and flesh them out completely. I don't think this is blocking though. It's clear that there are ways to solve this, so we can already start breaking down the work needed to remove the subsets. |
It is super close to my thoughts. I like the last two approaches. But I would still try to include the produce step as well.

...
consumes:
text:
type: string
produces:
text_normalised:
type: string

and we would use the last approach, it would look like this:
We would utilise these components as operators applied to the pipeline dataframe. This approach gives us a kind of global pipeline schema to work with. I believe this would decouple the components and their transformations from a particular pipeline.
I think we can indeed move this discussion to a different issue. |
Whoops, I tested it in a Python console, which is of course at runtime :) So it probably won't work statically indeed, but it would work in a notebook.

produces={
"text_normalised": pipeline.custom_text_column_normalized
}
|
My idea was that the custom produces would here add the column. If we collect all the consumes and produces fields (either the default ones or the custom ones on the right side), we would know the final schema of the dataset after the pipeline runs successfully. The ComponentOp or Pipeline would define the dataframe schema, which we could use for the validations. When we add a |
Yes, you can do that, but you cannot access it as an attribute on the pipeline. So then you should be working with string mappings again.

produces={
"text_normalised": "custom_text_column_normalized"
}

And then we probably don't want to use attributes for
|
If we want to go with the last proposed interface above, it might make more sense to work with datasets as the interface (it matches our design principle 😎):

pipeline = Pipeline(
pipeline_name="my_pipeline",
pipeline_description="description of my pipeline",
base_path="/foo/bar",
)
dataset = pipeline.apply(
component_dir="load_data"
)
dataset = dataset.apply(
name="caption_images",
consumes={
"images_data": dataset.image_array
}
)
dataset = dataset.apply(
component_dir="embed_text",
consumes={
"text_data": dataset.image_captions
}
)
dataset = dataset.apply(
name="crop_images",
consumes={
"images_data": dataset.image_array,
}
)
This is the same interface that |
The dataset interface could also solve this problem:
If components operate on datasets, we could add a schema to the dataset. Let the user define the schema of the dataset at the beginning, or even let the first load component initialise the schema. We could apply something like this:

produces={
"text_normalised": dataset.schema.custom_text_column_normalized
}

If we would call it A |
The last remaining part of the interface which I'd like to simplify is that of generic components. I don't like that users have to overwrite the component spec, especially since they need to repeat the arguments etc. With the interface above (with pipelines or datasets, doesn't matter), we get a mechanism that simplifies the interface for generic write components for free:

dataset = dataset.apply(
name="write_data",
consumes={
"image": dataset.image_array,
"embedding": dataset.embeddings,
"caption": dataset.image_captions,
}
)

We know the schema of every field in the dataset, so this contains all the information we get from an overwritten component specification. In the component specification, we should indicate that the component can handle a generic input dataset:

consumes:
"*" If this is present, the Two reservations on this:
|
The dataset is created by the previous component (or the pipeline at the start), so I still think we run into this issue:
In essence, it's the same issue that we need to solve as the one I mention in my previous comment:
The cleanest solution I can think of indeed creates the schema in the generic read component:

from fondant import type
pipeline = pipeline.add_op(
name="load_data",
produces={
"text": type.string,
}
)

But this should only be used for generic read components to overwrite the component_spec. It can't be used by any non-generic components. |
I feel like we might benefit from explicitly defining I/O operators on the ComponentOp/Dataset class. Maybe we can somehow tie them explicitly to specific components with the newly introduced I also sometimes find the fact that a load component "produces" and a write component "consumes" to be a bit unintuitive.

from fondant.schema import type
dataset = Dataset()
dataset = dataset.read_data(
name="load_data",
input_dataset_schema={ # clearly mention that only a schema is needed t.Dict[str, type]
"text": type.string
}
)
dataset = dataset.transform( # consumes (and maybe produces)
name="process_text",
consumes={
"text": dataset.text
}
)
dataset = dataset.write_dataset( # columns_to_write is "consumes" behind the scenes but it makes it more explicit
name="write_data",
columns_to_write={
"caption": dataset.text,
}
) |
Yes, @mrchtr and I had a call and concluded the same thing. I will try to summarize our conclusions here. Let me know if I missed anything @mrchtr.

We create a pipeline and use a read component to create the first dataset. The user needs to provide a schema for the reader, which replaces overwriting the component specification.

pipeline = Pipeline(
pipeline_name="my_pipeline",
pipeline_description="description of my pipeline",
base_path="/foo/bar",
)
dataset = pipeline.read(
name="load_images",
schema={
"image": type.binary # or pa.binary()
}
)

For the schema, we need to decide if we want to roll our own types, or if we want to use the ones from PyArrow. If the PyArrow types suffice, I would go with those myself.

From then on, transform components can be applied to the dataset. They can provide a mapping for both

dataset = dataset.apply(
name="caption_images",
consumes={
"images_data": dataset.image # or "image"
},
produces={
"captions": "text"
}
)

Note that the write components also accept a schema. We can make the same choice here between a string-to-dataset-attribute mapping or a string-to-string mapping. Linking to a dataset attribute here might make more sense, since we need to know both the name of the columns to map and their schema. However, there is no reason we can't just look this up on the dataset using the string value.

dataset.write(
name="write_data",
schema={
"image": dataset.image, # or "image"
"caption": dataset.text, # or "text"
}
) |
Just as validation that we don't block ourselves, a small thought experiment on how branching would work. Branching into multiple branches would be straightforward:

branch1 = dataset.apply(...)
branch2 = dataset.apply(...)

We can choose which interface to use for merging. E.g.:

dataset = branch1.merge(branch2)

dataset = Dataset.from(branch1, branch2)

Most important here will be to select a merge strategy or support multiple merge strategies, but I can't think of anything here that would complicate this compared to our current design.

A pipeline reading from multiple sources can be created by adding multiple readers to the pipeline:

branch1 = pipeline.read(...)
branch2 = pipeline.read(...)

Again, we can get a lot of ideas from Beam here. |
Coming back to this part of my proposal above:
I would actually propose a different interface here. Instead of For example in the

consumes:
text:
type: string
produces:
image_urls:
type: string
previous_index: prompt_id

In a chunking component, this could be

This has two advantages:
|
I think this makes sense; the only reason we have a type class is to be able to write it to JSON schema and retrieve it from it.
I'm in favor of aligning both to text. Am I correct in assuming that the attribute only makes sense when you're executing cell by cell? Otherwise it might return an unrecognized attribute in your IDE.
I would keep it string to string just for consistency. I think we can make this optional: if not specified, it will write all the |
In the |
Yes it indeed only really makes sense in an interactive mode. We could still support both if we want to enable this while running interactively.
Ok, but just to be clear, then the schema attribute is optional. If you do provide a schema, you need to provide all fields. Otherwise there's no way to exclude fields to write.
Yes indeed, the component should return it as a column. We can easily validate that it is part of the produced schema. |
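A minimal sketch of what that validation could look like (hypothetical, assuming a pandas dataframe as component output):

def check_previous_index(dataframe, previous_index: str) -> None:
    # A component that changes the semantic meaning of the index must return
    # the old index as a regular column.
    if previous_index not in dataframe.columns:
        raise ValueError(f"Expected column '{previous_index}' in the component output")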
@RobbeSneyders has nailed down our discussion perfectly. I think it is fine using string values for both sections. As mentioned, we would lose the benefit of type safety, but I think that is negligible, because we can validate at the latest during compilation.

I wanted to note down an additional thought. I've mentioned a concern during @PhilippeMoussalli's implementation of the subset field mapping. We have to make sure that it isn't confusing for the user which side of the dictionary entry represents the schema of the dataset and which one the schema of the component dataframe. Back then I proposed to have a dataclass that allows us to use explicit names.

dataset = dataset.apply(
name="caption_images",
consumes=[
ColumnMapping(component_column="images_data", dataset_column="image"),
...
],
produces=[
ColumnMapping(component_column="captions", dataset_column="text")
]
)

This only has an effect on the interface, but might make it clearer when you are looking at a pipeline/component definition for the first time. I didn't find it straightforward in the initial draft to see which column names belong to which dataframe. If we use dataclasses, I can even think of a different interface:

dataset = dataset.apply(
name="caption_images",
schema_mapping=[
Consumes(component_column="images_data", dataset_column="image"),
Consumes(component_column="...", dataset_column="..."),
Produces(component_column="captions", dataset_column="text"),
Index(previous_index="...")
]
)

Not sure about the dataclasses myself anymore; it either complicates things or improves the understanding. |
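For reference, the ColumnMapping dataclass used above could be as simple as this (hypothetical sketch, not existing Fondant code):

from dataclasses import dataclass

@dataclass
class ColumnMapping:
    component_column: str
    dataset_column: str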
I'm not a big fan of the dataclasses; they make the interface look a lot more complex. I agree that the direction of the mapping can be confusing, but we can validate this at compilation and provide clear error messages. |
Working on the RAG use case, it's become clear that we need to support "generic" behavior not just for read and write components, but also for transform components. Think about the following components:
I therefore propose to provide an option for the user to overwrite

pipeline = Pipeline("my-pipe", base_path="/base_path")
dataset = pipeline.read(
"read_component",
arguments={},
produces={ # A read component can only overwrite produces.
"image": pa.binary()
}
)
dataset = dataset.apply(
"generic_captioning",
arguments={},
consumes={ # A transform component can overwrite consumes
"image": pa.binary()
},
produces={ # And produces
"caption": pa.string()
}
)
dataset.write(
"write_component",
consumes={ # A write component can only overwrite consumes
"text": pa.string(),
}
)

Generic components could define a schema for the produces:
additionalProperties: true

Or for a generic captioning component:

consumes:
additionalProperties:
type: binary
produces:
additionalProperties:
type: string

This way we could validate the

Then of course we are still left with the question on how to map fields between components. We could overload the

pipeline = Pipeline("my-pipe", base_path="/base_path")
dataset = pipeline.read(
"read_component",
arguments={},
produces={
"image": pa.binary(), # A read component can only specify types in produces since we need the schema info.
}
)
dataset = dataset.apply(
"generic_captioning",
arguments={},
consumes={
"image": "image", # A generic transform component should specify a string. The type is inferred from the column on the dataset and the column is renamed.
},
produces={
"caption": pa.string(), # A generic transform component can only specify types in produces since we need the schema info.
}
)
dataset = dataset.apply(
"text_beautifier",
"consumes={
"text": "caption", # A non-generic transform component can specify strings in consumes to rename columns.
},
produces={
"text": "caption", # A non-generic transform component can specify a string in produces to rename columns
}
)
dataset.write(
"write_component",
consumes={
"text": "text", # A write component should specify a string. The type is inferred from the column on the dataset and the column is renamed.
}
)

The rules can be summarized succinctly as follows, so I think (hope) this won't be too complex:
I thought about other ways to do this, e.g. by using a |
I feel like this information would still be better placed in the component spec rather than overwriting it at the dataset level, since it better describes what the possible fields are that this component can take, either

consumes:
additionalProperties:
type: binary Instead it would make more sense if it looks something like this (not sure about the final format yet) consumes:
anyOf:
- text:
type: str
- text_embedding:
type: binary

The component would need to have something like this:

class Retrieval(PandasTransformComponent):
    def __init__(self, *args) -> None:
        ...

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        if "text" in dataframe.columns:
            embeddings = dataframe["text"].apply(get_embeddings)
        else:
            embeddings = dataframe["text_embeddings"]
Same comment as above, except for a slightly different component spec:

consumes:
- answers:
type: str
optional: true
produces:
- additional_metric:
type: int
optional: true
depends_on: answers
There it makes sense to maybe include:

consumes:
additionalProperties: true provided that all the consumed columns will be aggregated.
It seems like this would mean that we're defining some of the behavior of the components purely on the the how the dataset interface is defined rather than the component spec itself. The new fields that you're defining here would still need to be defined in the component itself and some of the behavior is not captured here (e.g. additional produced column depending on optional input column, one of two columns that need to be provided) which can make it difficult to validate. I still think it might make more sense to override the spec only for read/write components and some exceptional transform components (aggregation). The rest of the behavior and cases should be outlined in the spec. For the generic transform component, we could maybe have a method
The new overriden
I generally agree with the behavior but I think it might be a bit difficult for end users. If we have a separate method for the generic components as I mentioned above we could provide better information for this via docstrings. We could define how to use the component with which method in the hub:
All in all I think having additional columns and fields is desirable but it does indeed come with additional complexity (component spec definition, additional considerations in the pipeline interface, ...). But I think this should be more of advanced use cases rather than the common definition of a component. |
On how to define generic stuff in the component spec, I agree that the ways you suggest might be better. But I think we can summarize it as: "Will be able to be defined by OpenAPI spec", and keep it out of scope for now.
This is exactly why we need to define this on the dataset. If we make the component spec generic, the arguments on the dataset need to make it specific. So we know exactly which columns will be part of the intermediate dataset at each point. Otherwise we cannot evolve our manifest.
I wouldn't want to make generic components a special type. Components could be a combination of specific & generic. Eg. a specific
I'm not sure this will only be for advanced use cases. As mentioned, for the RAG use case, I see a lot of components that would benefit from being (partly) generic. Probably more than components that would not benefit from it. |
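A rough sketch of the kind of compile-time check this enables; the helper and the JSON-schema-like field layout are assumptions for illustration, not the actual Fondant validation code:

def validate_override(spec_section: dict, override: dict) -> None:
    """Allow an op-level consumes/produces override only if the field is declared
    in the component spec or the section accepts additional properties."""
    declared = spec_section.get("properties", {})
    for name in override:
        if name not in declared and not spec_section.get("additionalProperties"):
            raise ValueError(f"Field '{name}' is not accepted by this component spec")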
PR that introduces functionality for the new pipeline interface as discussed [here](#567 (comment)):

* The component spec now accepts **OneOf** additionalFields or Fields in its consumes and produces sections.
* The new `consumes` and `produces` are defined at the Op level similarly to the ones in the component spec; if they are present, they will override the default `consumes` and `produces` defined in the component spec (manifest, DataIO).
* Some changes were added to `DataIO` just to resolve test issues, but the new functionality of the custom consumes and produces is not implemented yet (will be tackled in a separate PR).

Co-authored-by: Robbe Sneyders <[email protected]>
We should revisit the design of our subsets & fields:
Pass unconsumed columns when additionalFields is True #244
Enable mapping subsets and fields #352