# Add documentation for dataset factories feature #2670
## Load multiple datasets with similar configuration using YAML anchors

Different datasets might use the same file format, load and save arguments, and be stored in the same folder. [YAML has a built-in syntax](https://yaml.org/spec/1.2.1/#Syntax) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the `catalog.yml` file.
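As a minimal sketch of the anchor syntax (the `load_args` values here are illustrative), a shared block is declared once with an anchor (`&`) and merged into each entry with `<<:`:

```yaml
# Shared defaults, declared once and reused via a YAML anchor
_csv: &csv
  type: pandas.CSVDataSet
  load_args: &csv_load_args
    header: true

airplanes:
  <<: *csv                             # insert the default csv configuration
  filepath: data/01_raw/airplanes.csv
  load_args:                           # overriding load_args replaces the whole block...
    <<: *csv_load_args                 # ...so re-insert the defaults before extending them
    sep: ";"
```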
In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted.
## Load multiple datasets with similar configuration using dataset factories

For catalog entries that share configuration details, you can also use the dataset factories introduced in Kedro 0.18.11. This syntax allows you to generalise the configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.
### Example 1: Generalise datasets with similar names and types into one dataset factory

Consider the following catalog entries:

```yaml
factory_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/factory_data.csv

process_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/process_data.csv
```

The datasets in this catalog can be generalised to the following dataset factory:

```yaml
"{name}_data":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{name}_data.csv
```

When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in quotes to avoid YAML parsing errors.
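To illustrate the substitution, any other dataset whose name fits the pattern resolves the same way; for a hypothetical dataset named `sensor_data`, the pattern behaves as if this explicit entry existed in the catalog:

```yaml
# "sensor_data" matches "{name}_data" with name = "sensor"
sensor_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/sensor_data.csv
```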
### Example 2: Generalise datasets of the same type into one dataset factory

You can also combine all the datasets with the same type and configuration details. For example, consider the following catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`:

```yaml
boats:
  type: pandas.CSVDataSet
  filepath: data/01_raw/boats.csv

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/cars.csv

planes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/planes.csv
```

These datasets can be combined into the following dataset factory:

```yaml
"{dataset_name}#csv":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{dataset_name}.csv
```

You will then have to update the pipelines in your project located at `src/<project_name>/pipelines/<pipeline_name>/pipeline.py` to refer to these datasets as `boats#csv`, `cars#csv` and `planes#csv`. Adding a suffix or a prefix to the dataset names and the dataset factory patterns, like `#csv` here, ensures that the dataset names are matched with the intended pattern.
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

from .nodes import (
    create_model_input_table,
    preprocess_boats,
    preprocess_cars,
    preprocess_planes,
)


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_boats,
                inputs="boats#csv",
                outputs="preprocessed_boats",
                name="preprocess_boats_node",
            ),
            node(
                func=preprocess_cars,
                inputs="cars#csv",
                outputs="preprocessed_cars",
                name="preprocess_cars_node",
            ),
            node(
                func=preprocess_planes,
                inputs="planes#csv",
                outputs="preprocessed_planes",
                name="preprocess_planes_node",
            ),
            node(
                func=create_model_input_table,
                inputs=[
                    "preprocessed_boats",
                    "preprocessed_planes",
                    "preprocessed_cars",
                ],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```
### Example 3: Generalise datasets using namespaces into one dataset factory

You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. Consider the following pipeline, which takes in a `model_input_table` and outputs two regressors belonging to the `active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces:
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

from .nodes import split_data, train_model


def create_pipeline(**kwargs) -> Pipeline:
    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "y_train"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )

    return ds_pipeline_1 + ds_pipeline_2
```
You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor` and `candidate_modelling_pipeline.regressor`, as below:

```yaml
"{namespace}.regressor":
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor_{namespace}.pkl
  versioned: true
```
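For illustration, this pattern resolves to the equivalent of the two explicit entries it replaces:

```yaml
# Resolved equivalents (illustration only)
active_modelling_pipeline.regressor:
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor_active_modelling_pipeline.pkl
  versioned: true

candidate_modelling_pipeline.regressor:
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor_candidate_modelling_pipeline.pkl
  versioned: true
```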
### Example 4: Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders

You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset entries share `type`, `file_format` and `save_args`:

```yaml
processing.factory_data:
  type: spark.SparkDataSet
  filepath: data/processing/factory_data.pq
  file_format: parquet
  save_args:
    mode: overwrite

processing.process_data:
  type: spark.SparkDataSet
  filepath: data/processing/process_data.pq
  file_format: parquet
  save_args:
    mode: overwrite

modelling.metrics:
  type: spark.SparkDataSet
  filepath: data/modelling/metrics.pq
  file_format: parquet
  save_args:
    mode: overwrite
```

This could be generalised to the following pattern:

```yaml
"{layer}.{dataset_name}":
  type: spark.SparkDataSet
  filepath: data/{layer}/{dataset_name}.pq
  file_format: parquet
  save_args:
    mode: overwrite
```

All the placeholders used in the catalog entry body must exist in the factory pattern name.
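As a hypothetical counter-example, the following pattern would be invalid, because the `{file_format}` placeholder appears in the entry body but not in the pattern name:

```yaml
# Invalid: {file_format} is used in the body but missing from the pattern name
"{layer}.{dataset_name}":
  type: spark.SparkDataSet
  filepath: data/{layer}/{dataset_name}.{file_format}
```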
### Example 5: Generalise datasets using multiple dataset factories

You can have multiple dataset factories in your catalog. For example:

```yaml
"{namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{namespace}/{dataset_name}.pq
  file_format: parquet

"{dataset_name}@csv":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{dataset_name}.csv
```

Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline matches multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match. The matches are ranked according to the following criteria (a worked example follows the list):
1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`.
2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`.
3. Alphabetical order
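As a worked example (the patterns and the dataset name here are hypothetical), suppose the dataset `dev.factory_data@spark` is requested by a pipeline and both of the following patterns match it. The first pattern wins on criterion 1, because the literal substring `.factory_data@spark` gives it more exact character matches than `.` and `@spark` alone:

```yaml
"{namespace}.factory_data@spark":    # preferred: more exact character matches
  type: spark.SparkDataSet
  filepath: data/{namespace}/factory_data.pq
  file_format: parquet

"{namespace}.{dataset_name}@spark":  # matches too, but with fewer exact characters
  type: spark.SparkDataSet
  filepath: data/{namespace}/{dataset_name}.pq
  file_format: parquet
```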
### Example 6: Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet`

You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation.

```yaml
"{default_dataset}":
  type: pandas.CSVDataSet
  filepath: data/{default_dataset}.csv
```

Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog as `pandas.CSVDataSet`.
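For illustration, a dataset named `rockets` (a hypothetical name) that has no explicit entry and matches no more specific pattern would resolve to:

```yaml
# Resolved equivalent of the catch-all pattern for a dataset named "rockets"
rockets:
  type: pandas.CSVDataSet
  filepath: data/rockets.csv
```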
## Transcode datasets
---

**Review comment:** Since this is the first example, I'm wondering if it would be good to add an "info" admonition saying "If you don't use a suffix the default dataset will be overridden, see [this section below]". Wdyt?

**Reply:** I kept the first example basic enough to get started, but the second example actually shows a situation where you might need to add a suffix/prefix. I added a small explanation there as to why.