Eliminate the need of having all datasets defined in the catalog #2423

merelcht · 2023-03-15T15:56:01Z

Description

Users struggle with ever increasing catalog files:
#891
#1847

Current "solutions":

Use yaml anchors: https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet#anchors
The TemplatedConfigLoader
Use the TemplatedConfigLoader with Jinja2: https://docs.kedro.org/en/stable/kedro_project_setup/configuration.html#jinja2-support

Instead of allowing anchors or loops in config let's just tackle the need of having so many datasets defined in the catalog in the first place.

Context

The new OmegaConfigLoader doesn't currently allow the use of yaml anchors (the omegaconf team discourages people to use this with omegaconf: omry/omegaconf#93) or Jinja2. While Jinja2 is very powerful, we've always had the standpoint as a team that it's better to not use it, because it makes files hard to read and reason about. Jinja2 is meant for HTML templating, not for configuration.

In order to make OmegaConfigLoader a suitable replacement for the ConfigLoader and TemplatedConfigLoader we need to find another solution to reduce the size of catalog files.

Possible Implementation

This is an idea from @idanov (written up by me, so I hope I have interpreted it correctly 😅 ) :

In every runner we have a method to create a default dataset for datasets that are used in the pipeline but not defined in the catalog:

kedro/kedro/runner/sequential_runner.py

Lines 32 to 43 in 8acfb2a

    
               def create_default_data_set(self, ds_name: str) -> AbstractDataSet: 
        
                   """Factory method for creating the default data set for the runner. 
        
                   Args: 
        
                       ds_name: Name of the missing data set 
        
                   Returns: 
        
                       An instance of an implementation of AbstractDataSet to be used 
        
                       for all unregistered data sets. 
        
                   """ 
        
                   return MemoryDataSet()

We can allow users to define what a "default" dataset should be so that in the catalog they only need to specify these default settings and Kedro would then create those without needing them to be defined in the catalog. The default dataset definition could then look like something like this:

default_dataset: 
  type: spark.SparkDataSet
  file_format: parquet
  load_args:
    header: true
  save_args:
    mode: overwrite

And any dataset used in the pipeline, but not mentioned in the catalog, would get the above specifications.

Extra info

I collected several catalog files from internal teams to see what patterns I could find in there. I think these will be useful when designing our solution. https://miro.com/app/board/uXjVMZM7IFY=/?share_link_id=288914952503

The text was updated successfully, but these errors were encountered:

datajoely · 2023-03-15T16:10:01Z

Would you only allow a singular default_dataset?

merelcht · 2023-03-15T16:30:35Z

Would you only allow a singular default_dataset?

I think that's up for discussion. I'm going to go through config files and see what patterns I see and how this proposal would help.

deepyaman · 2023-03-15T18:14:53Z

We can allow users to define what a "default" dataset should be so that in the catalog they only need to specify these default settings and Kedro would then create those without needing them to be defined in the catalog.

Non-memory datasets need more information, like where you're going to store datasets and with what name. For example, Kubeflow Pipelines has the idea of pipeline_root ("The root path where this pipeline’s outputs are stored. This can be a MinIO, Google Cloud Storage, or Amazon Web Services S3 URI. You can override the pipeline root when you run the pipeline.") But if users want control over specifically what their file name is, they don't, and instead it becomes something like the catalog entry name.

idanov · 2023-03-15T20:19:33Z

@datajoely Having the option to have more than one default datasets is also on the table, e.g. one can imagine a pattern matching solution where if your dataset name starts with xxxxx, it creates MemoryDataSet, and if it starts with yyyyy, it creates SparkDataSet. This should ultimately be configurable in settings.py and potentially removed from the runner (although different runners will need different default datasets, e.g. ParallelRunner vs SequentialRunner) and moved to the DataCatalog as responsibility.

Currently we have CONFIG_LOADER_CLASS and CONFIG_LOADER_ARGS, but we only have DATA_CATALOG_CLASS in the settings.py. We could get inspired from what we have in the OmegaConfLoader to specify file patterns for different config keys:

kedro/kedro/config/omegaconf_config.py

Lines 102 to 107 in 8acfb2a

    
           self.config_patterns = { 
        
               "catalog": ["catalog*", "catalog*/**", "**/catalog*"], 
        
               "parameters": ["parameters*", "parameters*/**", "**/parameters*"], 
        
               "credentials": ["credentials*", "credentials*/**", "**/credentials*"], 
        
               "logging": ["logging*", "logging*/**", "**/logging*"], 
        
           }

And allow a similar solution for our "default" datasets:

# % settings.py

def create_spark_dataset(dataset_name: str, *chunks):
    # e.g. here chunks=["root_namespace", "something-instead-the-*", "spark"]
    return  SparkDataSet(filepath=f"data/{chunks[0]}/{chunks[1]}.parquet", file_format="parquet")

def create_memory_dataset(dataset_name: str, *chunks):
    return MemoryDataSet()

DATA_CATALOG_ARGS = {
    "datasets_factories": {
        "root_namespace.*@spark": create_spark_dataset,
        "root_namespace.*@memory": create_memory_dataset, 
    }
}

So when a dataset name matching some of these patterns is provided, the dataset is created accordingly. We might not necessarily have all of that in code, but it could reside somewhere in catalog.yaml, since you can have almost anything as key in yaml, if surrounded by "", e.g.:

"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet

We also need to decide on a pattern-matching language, simpler than regex, but powerful enough to be able to chunk up the dataset names and those chunks to be reused in the body of the configuration for the reasons pointed out by @deepyaman. The using of the chunks in the body could be done through either f-string replacement as in the above yaml example, OmegaConf resolvers or something else entirely (depending where it will happen, because we don't want to add dependencies on OmegaConf in the DataCatalog).

The pipeline_root problem is the easiest to solve, since it can be an environment variable, which can be included through OmegaConf, another DATA_CATALOG_ARGS key or something of the sorts.

datajoely · 2023-03-16T09:43:43Z

"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet

@idanov - this is a work of art, I'm sold.

What I would ask is that we consider introducing an equivalent of dbt compile so the users have a mechanism for reviewing what is rendered at runtime versus what they configure at rest.

idanov · 2023-03-22T14:46:29Z

@AntonyMilneQB shared a nice alternative pattern-matching syntax here, which uses reverse Python f-strings format, which is definitely going to do the job and help us avoid creating a homebrew patter-matching language: https://github.com/r1chardj0n3s/parse

@datajoely we could do something like that, inspired from what @WaylonWalker created sometime ago: https://github.com/WaylonWalker/kedro-auto-catalog

Another option is a CLI command giving you the dataset definition per dataset name, something like kedro catalog get <dataset-name-of-interest>.

noklam · 2023-03-22T14:50:39Z

Just reading through this, so there are two things discussed here:

Provide a flexible way to allow users to replace default datasets - I have seen this and am glad it is mentioned here
- https://github.com/getindata/kedro-sagemaker/blob/develop/kedro_sagemaker/runner.py#L41-L67 This could be an example (1)
A pattern-matching solution for replacing the YAML anchor (2)

What would the catalog looks like for (2) catalog.yml? Do the user still need to put all entry in the catalog (but shorter because you only define the top-level key), or do user still need Jinja to create entry dynamically (say a list of 1000 datasets)

antonymilne · 2023-03-27T10:40:37Z

Before we get carried away with using parse to do the pattern matching we should definitely do some careful thinking about how this would work in practice. It seems like a great solution in theory (it's very lightweight and simple, gives you named chunks, everyone who knows Python should know the mini-language already), but when I've used it before it was a bit tricky to get right. Here's what the above example would look like:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

# would parse as follows
germany.companies@spark:
  type: spark.SparkDataSet
  filepath: data/germany/companies.parquet
  file_format: parquet

All good so far, but what happens in this case?

"{dataset_name}_{layer}":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  layer: raw

# would parse as follows
big_companies_raw:
  type: spark.SparkDataSet
  filepath: data/companies_raw/big.parquet
  layer: companies_raw

...which is probably not what you wanted. The reason for this is that "parse() will always match the shortest text necessary (from left to right)". This is fine but would need explaining.

We also need to be wary of how to handle multiple patterns that match. e.g. let's say you want to override the default. Then you should be able to do something like this:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

france.companies@spark:
  type: spark.SomethingElse
  filepath: other/data/france/companies.parquet
  file_format: parquet

Here france.companies should take precedence over the general pattern, so we can have a rule like "if two patterns match and there's no {} in one pattern, then that one is more specific and so wins". But how about in cases where multiple {} patterns match?

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

"{dataset_name}@spark":
  type: spark.SomethingElse
  filepath: data/{dataset_name}.parquet
  file_format: parquet

Now france.companies@spark matches both entries, and it's not really clear which should win. Do we use the order the patterns are defined in the file to choose which wins? What about if the patterns come from a different file (since we merge multiple catalog.yml files together)?

All this is not to say that this solution is unworkable. I do think parse feels like a good fit here (after all, regex will suffer from similar problems), but that we really need to think about how it would work in practice. As soon as you start defining general patterns to match lots of things, you're going to want to override that for exceptions, and so we need to know exactly how that would work.

datajoely · 2023-03-27T10:46:12Z

@AntonyMilneQB - thank you for this, it's really an excellent write-up. One push I would make is that we could manage this complexity by providing a live editor / prototyping space in Viz.

I think it would massively improve the learning journey and help users get up and running quickly.

datajoely · 2023-03-27T10:47:47Z

Something like this https://textfsm.nornir.tech/#

merelcht · 2023-03-27T10:49:47Z

Thanks for diving a bit deeper into parse @AntonyMilneQB and writing up your thoughts!

I have a question about the final bit though. You're saying that if the catalog contains:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

"{dataset_name}@spark":
  type: spark.SomethingElse
  filepath: data/{dataset_name}.parquet
  file_format: parquet

france.companies@spark would match both. How come? Is the dot not a specific pattern that should be matched?

antonymilne · 2023-03-27T10:58:46Z

@merelcht The . is not a specific pattern that should be matched - it's just a character in the string like anything else. Think of replacing it with the letter A:

In [2]: parse("{dataset_name}@spark", "france.companies@spark")
Out[2]: <Result () {'dataset_name': 'france.companies'}>

In [4]: parse("{dataset_name}@spark", "franceAcompanies@spark")
Out[4]: <Result () {'dataset_name': 'franceAcompanies'}>

One way around this is to add a format specification like this:

In [6]: parse("{dataset_name:w}@spark", "france.companies@spark")
# No matches because . doesn't come under format w, which means Letters, numbers and underscore

In [4]: parse("{dataset_name}@spark", "franceAcompanies@spark")
Out[4]: <Result () {'dataset_name': 'franceAcompanies'}>
# Still matches because the string is of format w

But, by default, if you don't specify a format then it will just match anything. This still feels better to me than regex but it is definitely not as simple as it might first appear.

@idanov also asked me to post the make_like example here - I'll do it in a second.

antonymilne · 2023-03-27T11:21:11Z

Alternative proposal: omegaconf custom resolvers

This is also originally @idanov's idea. The above example would look something like this:

# Define the "template" entry - needs to be an actual valid catalog entry I think (?)
germany.companies@spark:
  type: spark.SparkDataSet
  filepath: ${make_filepath:${..dataset_name}}  # this won't work - I don't know what the right syntax is or even if it's possible
  file_format: parquet

# Define other entries that follow the same pattern
france.companies@spark: ${make_like: ${..germany.companies@spark}} 
uk.companies@spark: ${make_like: ${..germany.companies@spark}} 
italy.companies@spark: ${make_like: ${..germany.companies@spark}}

Somewhere (presumably user defined, but maybe we could build some common ones in?) the resolvers would be defined as something like:

OmegaConf.register_new_resolver("make_filepath", lambda dataset_name: f"data{dataset_name.split(".")[0]}/{dataset_name.split("."[1])}.parquet
OmegaConf.register_new_resolver("make_like", lambda catalog_entry: catalog_entry)

This is all pretty vague because I'm not immediately sure whether these things are possible using omegaconf resolvers...

how to make the germany.companies@spark string accessible inside the catalog entry?
how to get make_like to work?

Pros:

no ambiguity with multiple pattern matching
no need to introduce another paradigm for yaml manipulation on top of omegaconf which we are already using: we have one consistent mechanism for variable interpolation, environment variables, pattern matching, etc.

Cons:

doesn't actually reduce the number of catalog entries, just makes them much shorter - you still need to define the france, italy, uk entries explicitly (is there some clever way in omegaconf to avoid this? I'm not sure since it seems you can only do variable interpolation in values, not keys?)
not actually sure if it's possible - needs some further thought on how these resolvers would work!
needs careful thought about implementation if we are to not make omegaconf a dependency of the data catalog

Pro or con depending on your perspective: more powerful and flexible than general pattern matching syntax because it allows for arbitrary Python functions, not just string parsing.

merelcht · 2023-03-27T14:23:02Z

@AntonyMilneQB

I don't think it's possible to pass a value that needs to be interpolated to a resolver. In this prototype example below:

# Define the "template" entry - needs to be an actual valid catalog entry I think (?)
germany.companies@spark:
  type: spark.SparkDataSet
  filepath: ${make_filepath:${..dataset_name}}  # this won't work - I don't know what the right syntax is or even if it's possible
  file_format: parquet

# Define other entries that follow the same pattern
france.companies@spark: ${make_like: ${..germany.companies@spark}} 
uk.companies@spark: ${make_like: ${..germany.companies@spark}} 
italy.companies@spark: ${make_like: ${..germany.companies@spark}}

${make_filepath:${..dataset_name}} isn't possible as far as I found. Only if you add the dataset_name as a key in the file as well, but that would defeat the purpose of generalising the value, because you'd then just have the same file path value for all countries.

merelcht · 2023-03-30T14:53:22Z

Summary of the technical design discussions we had on 22/3 and 30/3:

Through the discussions we've identified there are in fact three problems that are all related.

Problems to address

Users don't like working with large + hard to maintain catalog files.
Users want to be able to have another default than MemoryDataSet for datasets not defined in the catalog.
Users need a way to keep constant/global values in one place to minimise mistakes e.g. typos and pass these in configuration files. Add templating functionality to allow "global" values in the catalog in OmegaConfigLoader

Proposed solutions:

1. Dataset factories: addresses problem 1 & 2
2. Omegaconf "make-like" resolver: turns out that this isn't a feasible solution.

Problem 1: points discussed

General agreement that large catalog files is a valid problem and not only for QuantumBlack vertical teams.
Users are reluctant to make use of modular pipelines because it increases configuration.
General impression is that when users start working with Kedro they prefer readable (more verbose) catalog files, but the more they get used the Kedro the greater the need for features like templating and other tools that make the catalog more compact.
Concern with any solution to make the catalog more compact is readability. With existing solutions like Jinja2 and yaml-anchors it's also hard to understand what the "compiled" catalog looks like that's used by Kedro.
- Solution: add a mechanism to compile the catalog. Could be a CLI command similar to dbt compile
Whatever solution we go for, we will always allow the full fledged dataset definitions in the catalog, so users do not have to use the more compact syntax.

Problem 2: points discussed

@WaylonWalker created a tool that allows to change the default: https://github.com/WaylonWalker/kedro-auto-catalog
- This solution saves the datasets into a separate "default" catalog file, so it addresses problem 2, but not problem 1 because datasets still end up in the catalog.
- AFAIK you can only add 1 default
The datasets factories solution would address this problem.

Problem 3

Not yet addressed.

Other comments

Omegaconf resolvers do allow for some reduction of the catalog content just like yaml anchors.
With custom resolvers power users can find ways to reduce their catalogs more by themselves.
Perhaps it's not worth introducing this extra syntax if the OmegaConfigLoader leverages all resolver syntax properly.

Next steps

Discuss problem 3 in the next technical design session: Add templating functionality to allow "global" values in the catalog in OmegaConfigLoader
Through some quick experimenting we found that variable interpolation can be used in place of yaml anchors. What needs to be done to make this work in OmegaConfigLoader? Same solution as for Add templating functionality to allow "global" values in the catalog in OmegaConfigLoader? To be done in Enable variable interpolation in the catalog with OmegaConfigLoader #2507

Note We need to do user testing to verify any solution we decide on.

datajoely · 2023-03-30T14:59:49Z

I know we don't want to - but I also think we need to ask, does our solution support dynamic pipelines? Our users, particularly those who care about this templating stuff are going to try that sort of thing very quickly.

noklam · 2023-03-30T16:18:34Z


	Reduce the number Catalog Entry (How many top-level keys in the file)	Shorten Catalog Entry (How many lines for each Dataset)	Ability to Change Default DataSet	Constant/Global Value for all Configurations
DataSet Factory	✅	✅	✅	❌
OmegaConf Interpolation	❌	✅	❌	？
Environment Variable	❌	❌	❌	✅
OmegaConf Custom Resolver	❌	✅	❌	？

(Feel free to edit, the ? are things that I am not sure about)
Updated at 2023-04-11

datajoely · 2023-03-30T16:30:15Z

For a slightly 🌶️ comparison


	Reduce the number Catalog Entry (How many top-level keys in the file)	Shorten Catalog Entry (How many lines for each Dataset)	Ability to Change Default DataSet	Constant/Global Value for all Configurations
Jinja2 (with .j2 template support)	✅	✅	❌	✅

antonymilne · 2023-03-30T17:01:12Z

@noklam I filled out two of the questions marks but don't understand what the last column means?

Also I think the dataset factory should have ✅ for "Reduce the number Catalog Entry", no? Since you would have a general pattern matching syntax rather than needing to define each entry explicitly.

merelcht · 2023-05-17T14:04:33Z

Feedback on the prototypes (#2560 + #2559) to be taken into account for the proper implementation:

Do not call catalog.list() in the runner checks to determine if there's any missing inputs/datasets that need the default dataset, but instead create a new method to check if a dataset name exists in the catalog that finds any exact matches as well as pattern matches.
Clarify to users that the default dataset creation will be replaced by any pattern with the following format:

{dataset}:
   .....:
   .....:

Add new catalog CLI commands for dataset factories #2603
We should design a way of testing this functionality before releasing: User testing for dataset factories syntax #2602

merelcht · 2023-05-30T13:11:38Z

This is the clean version of the dataset factories prototype excluding the parsing rules: https://github.com/kedro-org/kedro/tree/feature/dataset-factories-for-catalog

merelcht added the Component: Configuration label Mar 15, 2023

merelcht added this to the Make `OmegaConfigLoader` ready for 0.19.0 milestone Mar 15, 2023

merelcht added this to Kedro Framework Mar 15, 2023

merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Mar 15, 2023

antonymilne mentioned this issue Mar 27, 2023

Decide on syntax to allow adding new attributes #2439

Closed

2 tasks

merelcht mentioned this issue Apr 12, 2023

Enable variable interpolation in the catalog with OmegaConfigLoader #2507

Closed

merelcht added the Type: Parent Issue label Apr 12, 2023

This was referenced Apr 12, 2023

Prototype parsing rules for dataset factories #2508

Closed

Prototype dataset factories #2510

Closed

merelcht mentioned this issue May 5, 2023

[PROTOTYPE NOT TO BE MERGED] Dataset factories prototype #2560

Closed

5 tasks

This was referenced May 22, 2023

User testing for dataset factories syntax #2602

Closed

Add new catalog CLI commands for dataset factories #2603

Closed

merelcht moved this to To Do in Kedro Framework May 30, 2023

ankatiyar self-assigned this May 30, 2023

ankatiyar moved this from To Do to In Progress in Kedro Framework May 30, 2023

This was referenced Jun 2, 2023

[Draft] Dataset factories - Eager resolving approach #2632

Closed

Dataset factories #2635

Merged

Add documentation for the dataset factories feature #2666

Closed

ankatiyar mentioned this issue Jun 12, 2023

Use dataset factories to register default datasets #2668

Closed

ankatiyar moved this from In Progress to In Review in Kedro Framework Jun 12, 2023

noklam mentioned this issue Jun 19, 2023

Migration guide for switching to OmegaConfigLoader #2699

Closed

2 tasks

ankatiyar mentioned this issue Jul 3, 2023

Show warning when using a catch-all pattern with dataset factories #2760

Closed

ankatiyar closed this as completed in #2635 Jul 6, 2023

github-project-automation bot moved this from In Review to Done in Kedro Framework Jul 6, 2023

amandakys mentioned this issue Jul 12, 2023

Update starters to use new /conf structure #2753

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eliminate the need of having all datasets defined in the catalog #2423

Eliminate the need of having all datasets defined in the catalog #2423

merelcht commented Mar 15, 2023 •

edited

Loading

datajoely commented Mar 15, 2023

merelcht commented Mar 15, 2023

deepyaman commented Mar 15, 2023

idanov commented Mar 15, 2023 •

edited by noklam

Loading

datajoely commented Mar 16, 2023

idanov commented Mar 22, 2023

noklam commented Mar 22, 2023 •

edited

Loading

antonymilne commented Mar 27, 2023 •

edited

Loading

datajoely commented Mar 27, 2023 •

edited

Loading

datajoely commented Mar 27, 2023

merelcht commented Mar 27, 2023

antonymilne commented Mar 27, 2023 •

edited

Loading

antonymilne commented Mar 27, 2023 •

edited

Loading

merelcht commented Mar 27, 2023

merelcht commented Mar 30, 2023 •

edited

Loading

datajoely commented Mar 30, 2023

noklam commented Mar 30, 2023 •

edited

Loading

datajoely commented Mar 30, 2023

antonymilne commented Mar 30, 2023

merelcht commented May 17, 2023 •

edited

Loading

merelcht commented May 30, 2023

Eliminate the need of having all datasets defined in the catalog #2423

Eliminate the need of having all datasets defined in the catalog #2423

Comments

merelcht commented Mar 15, 2023 • edited Loading

Description

Context

Possible Implementation

Extra info

datajoely commented Mar 15, 2023

merelcht commented Mar 15, 2023

deepyaman commented Mar 15, 2023

idanov commented Mar 15, 2023 • edited by noklam Loading

datajoely commented Mar 16, 2023

idanov commented Mar 22, 2023

noklam commented Mar 22, 2023 • edited Loading

antonymilne commented Mar 27, 2023 • edited Loading

datajoely commented Mar 27, 2023 • edited Loading

datajoely commented Mar 27, 2023

merelcht commented Mar 27, 2023

antonymilne commented Mar 27, 2023 • edited Loading

antonymilne commented Mar 27, 2023 • edited Loading

Alternative proposal: omegaconf custom resolvers

merelcht commented Mar 27, 2023

merelcht commented Mar 30, 2023 • edited Loading

Problems to address

Proposed solutions:

Problem 1: points discussed

Problem 2: points discussed

Problem 3

Other comments

Next steps

datajoely commented Mar 30, 2023

noklam commented Mar 30, 2023 • edited Loading

datajoely commented Mar 30, 2023

antonymilne commented Mar 30, 2023

merelcht commented May 17, 2023 • edited Loading

merelcht commented May 30, 2023

merelcht commented Mar 15, 2023 •

edited

Loading

idanov commented Mar 15, 2023 •

edited by noklam

Loading

noklam commented Mar 22, 2023 •

edited

Loading

antonymilne commented Mar 27, 2023 •

edited

Loading

datajoely commented Mar 27, 2023 •

edited

Loading

antonymilne commented Mar 27, 2023 •

edited

Loading

antonymilne commented Mar 27, 2023 •

edited

Loading

merelcht commented Mar 30, 2023 •

edited

Loading

noklam commented Mar 30, 2023 •

edited

Loading

merelcht commented May 17, 2023 •

edited

Loading