
Eliminate the need of having all datasets defined in the catalog #2423

Closed
merelcht opened this issue Mar 15, 2023 · 21 comments · Fixed by #2635
Labels
Component: Configuration Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation Type: Parent Issue

Comments

@merelcht
Member

merelcht commented Mar 15, 2023

Description

Users struggle with ever-increasing catalog files:
#891
#1847

Current "solutions": YAML anchors and Jinja2 templating (via TemplatedConfigLoader).

Instead of allowing anchors or loops in config, let's just tackle the need to have so many datasets defined in the catalog in the first place.

Context

The new OmegaConfigLoader doesn't currently allow the use of YAML anchors (the omegaconf team discourages using them with omegaconf: omry/omegaconf#93) or Jinja2. While Jinja2 is very powerful, our standpoint as a team has always been that it's better not to use it, because it makes files hard to read and reason about. Jinja2 is meant for HTML templating, not for configuration.

In order to make OmegaConfigLoader a suitable replacement for the ConfigLoader and TemplatedConfigLoader we need to find another solution to reduce the size of catalog files.

Possible Implementation

This is an idea from @idanov (written up by me, so I hope I have interpreted it correctly 😅 ) :

In every runner we have a method to create a default dataset for datasets that are used in the pipeline but not defined in the catalog:

def create_default_data_set(self, ds_name: str) -> AbstractDataSet:
    """Factory method for creating the default data set for the runner.

    Args:
        ds_name: Name of the missing data set.

    Returns:
        An instance of an implementation of AbstractDataSet to be used
        for all unregistered data sets.
    """
    return MemoryDataSet()

We could allow users to define what the "default" dataset should be, so that they only need to specify these default settings once and Kedro would create any datasets not defined in the catalog from them. The default dataset definition could then look something like this:

default_dataset: 
  type: spark.SparkDataSet
  file_format: parquet
  load_args:
    header: true
  save_args:
    mode: overwrite

And any dataset used in the pipeline, but not mentioned in the catalog, would get the above specifications.
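
A minimal sketch of how the runner hook above could honour such an entry, assuming the parsed default_dataset config is passed in. AbstractDataSet.from_config is Kedro's existing factory; the function signature and wiring here are hypothetical:

from typing import Any, Dict, Optional

from kedro.io import AbstractDataSet, MemoryDataSet

def create_default_data_set(ds_name: str, default_config: Optional[Dict[str, Any]] = None) -> AbstractDataSet:
    """Build the dataset for an unregistered name from the user-supplied defaults."""
    if default_config:
        # Construct the configured dataset type from the catalog-style dict.
        return AbstractDataSet.from_config(ds_name, default_config)
    # Fall back to today's behaviour when no default entry is given.
    return MemoryDataSet()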

Extra info

I collected several catalog files from internal teams to see what patterns I could find in there. I think these will be useful when designing our solution. https://miro.com/app/board/uXjVMZM7IFY=/?share_link_id=288914952503

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Mar 15, 2023
@datajoely
Contributor

Would you only allow a singular default_dataset?

@merelcht
Member Author

Would you only allow a singular default_dataset?

I think that's up for discussion. I'm going to go through config files and see what patterns I see and how this proposal would help.

@deepyaman
Member

We can allow users to define what a "default" dataset should be so that in the catalog they only need to specify these default settings and Kedro would then create those without needing them to be defined in the catalog.

Non-memory datasets need more information, like where datasets will be stored and under what name. For example, Kubeflow Pipelines has the idea of pipeline_root ("The root path where this pipeline’s outputs are stored. This can be a MinIO, Google Cloud Storage, or Amazon Web Services S3 URI. You can override the pipeline root when you run the pipeline."). But users then don't get control over the specific file name; it ends up being derived from something like the catalog entry name.

@idanov
Member

idanov commented Mar 15, 2023

@datajoely Having the option to have more than one default dataset is also on the table, e.g. one can imagine a pattern-matching solution where if your dataset name starts with xxxxx, it creates a MemoryDataSet, and if it starts with yyyyy, it creates a SparkDataSet. This should ultimately be configurable in settings.py, with the responsibility potentially removed from the runner and moved to the DataCatalog (although different runners will need different default datasets, e.g. ParallelRunner vs SequentialRunner).

Currently we have CONFIG_LOADER_CLASS and CONFIG_LOADER_ARGS, but we only have DATA_CATALOG_CLASS in settings.py. We could take inspiration from what we have in OmegaConfigLoader to specify file patterns for different config keys:

self.config_patterns = {
    "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
    "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
    "logging": ["logging*", "logging*/**", "**/logging*"],
}

And allow a similar solution for our "default" datasets:

# settings.py
from kedro.extras.datasets.spark import SparkDataSet
from kedro.io import MemoryDataSet

def create_spark_dataset(dataset_name: str, *chunks):
    # e.g. here chunks=["root_namespace", "something-instead-of-the-*", "spark"]
    return SparkDataSet(filepath=f"data/{chunks[0]}/{chunks[1]}.parquet", file_format="parquet")

def create_memory_dataset(dataset_name: str, *chunks):
    return MemoryDataSet()

DATA_CATALOG_ARGS = {
    "datasets_factories": {
        "root_namespace.*@spark": create_spark_dataset,
        "root_namespace.*@memory": create_memory_dataset, 
    }
}

So when a dataset name matching one of these patterns is provided, the dataset is created accordingly. We might not necessarily have all of that in code; it could reside somewhere in catalog.yaml, since you can have almost anything as a key in YAML if it's surrounded by "", e.g.:

"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet

We also need to decide on a pattern-matching language: simpler than regex, but powerful enough to chunk up the dataset names and let those chunks be reused in the body of the configuration, for the reasons pointed out by @deepyaman. Using the chunks in the body could be done through f-string-style replacement as in the YAML example above, OmegaConf resolvers, or something else entirely (depending on where it happens, because we don't want to add a dependency on OmegaConf in the DataCatalog).
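
As a rough sketch of the "f-string replacement" idea, here is how captured chunks could be substituted into an entry body, using the parse library purely as one candidate matcher (nothing here is a decided design):

from parse import parse

pattern = "{root_namespace}.{dataset_name}@spark"
body = {
    "type": "spark.SparkDataSet",
    "filepath": "data/{root_namespace}/{dataset_name}.parquet",
    "file_format": "parquet",
}

# Match the dataset name against the pattern, then substitute the captured
# chunks into every string value of the entry body.
chunks = parse(pattern, "germany.companies@spark").named
entry = {k: v.format(**chunks) if isinstance(v, str) else v for k, v in body.items()}
assert entry["filepath"] == "data/germany/companies.parquet"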

The pipeline_root problem is the easiest to solve, since it can be an environment variable, which can be included through OmegaConf, another DATA_CATALOG_ARGS key, or something of the sort.

@datajoely
Contributor

"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet

@idanov - this is a work of art, I'm sold.

What I would ask is that we consider introducing an equivalent of dbt compile so the users have a mechanism for reviewing what is rendered at runtime versus what they configure at rest.

@idanov
Member

idanov commented Mar 22, 2023

@AntonyMilneQB shared a nice alternative pattern-matching syntax here, which uses a reverse Python f-string format; it would definitely do the job and help us avoid creating a homebrew pattern-matching language: https://github.com/r1chardj0n3s/parse

@datajoely we could do something like that, inspired by what @WaylonWalker created some time ago: https://github.com/WaylonWalker/kedro-auto-catalog

Another option is a CLI command giving you the dataset definition per dataset name, something like kedro catalog get <dataset-name-of-interest>.

@noklam
Contributor

noklam commented Mar 22, 2023

Just reading through this, so there are two things discussed here: (1) the configurable default dataset, and (2) the catalog.yml syntax.

What would the catalog look like for (2), catalog.yml? Do users still need to put every entry in the catalog (just shorter, because you only define the top-level key), or do users still need Jinja to create entries dynamically (say, a list of 1000 datasets)?

@antonymilne
Contributor

antonymilne commented Mar 27, 2023

Before we get carried away with using parse to do the pattern matching we should definitely do some careful thinking about how this would work in practice. It seems like a great solution in theory (it's very lightweight and simple, gives you named chunks, everyone who knows Python should know the mini-language already), but when I've used it before it was a bit tricky to get right. Here's what the above example would look like:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

# would parse as follows
germany.companies@spark:
  type: spark.SparkDataSet
  filepath: data/germany/companies.parquet
  file_format: parquet

All good so far, but what happens in this case?

"{dataset_name}_{layer}":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  layer: raw

# would parse as follows
big_companies_raw:
  type: spark.SparkDataSet
  filepath: data/companies_raw/big.parquet
  layer: companies_raw

...which is probably not what you wanted. The reason for this is that "parse() will always match the shortest text necessary (from left to right)". This is fine but would need explaining.
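
For concreteness, the shortest-match behaviour can be reproduced directly with the parse library:

from parse import parse

result = parse("{dataset_name}_{layer}", "big_companies_raw")
print(result.named)
# {'dataset_name': 'big', 'layer': 'companies_raw'} -- not {'dataset_name': 'big_companies', 'layer': 'raw'}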

We also need to be wary of how to handle multiple patterns that match. e.g. let's say you want to override the default. Then you should be able to do something like this:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

france.companies@spark:
  type: spark.SomethingElse
  filepath: other/data/france/companies.parquet
  file_format: parquet

Here france.companies should take precedence over the general pattern, so we can have a rule like "if two patterns match and there's no {} in one pattern, then that one is more specific and so wins". But how about in cases where multiple {} patterns match?

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

"{dataset_name}@spark":
  type: spark.SomethingElse
  filepath: data/{dataset_name}.parquet
  file_format: parquet

Now france.companies@spark matches both entries, and it's not really clear which should win. Do we use the order the patterns are defined in the file to choose which wins? What about if the patterns come from a different file (since we merge multiple catalog.yml files together)?

All this is not to say that this solution is unworkable. I do think parse feels like a good fit here (after all, regex will suffer from similar problems), but that we really need to think about how it would work in practice. As soon as you start defining general patterns to match lots of things, you're going to want to override that for exceptions, and so we need to know exactly how that would work.
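
One possible tie-breaking rule (purely a sketch of the idea, not a decided design) would be to rank matching patterns by specificity, e.g. more literal characters outside the placeholders first, then fewer placeholders:

import re

from parse import parse

def best_match(patterns, ds_name):
    """Pick the most specific pattern that matches ds_name, or None."""
    matches = [p for p in patterns if parse(p, ds_name) is not None]
    if not matches:
        return None
    def specificity(p):
        literals = re.sub(r"\{[^}]*\}", "", p)  # strip the {placeholder} chunks
        return (len(literals), -p.count("{"))  # more literal characters first, then fewer placeholders
    return max(matches, key=specificity)

best_match(["{root_namespace}.{dataset_name}@spark", "{dataset_name}@spark"], "france.companies@spark")
# -> "{root_namespace}.{dataset_name}@spark" (".@spark" beats "@spark" on literal characters)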

@datajoely
Contributor

datajoely commented Mar 27, 2023

@AntonyMilneQB - thank you for this, it's really an excellent write-up. One push I would make is that we could manage this complexity by providing a live editor / prototyping space in Viz.

I think it would massively improve the learning journey and help users get up and running quickly.

@datajoely
Contributor

Something like this https://textfsm.nornir.tech/#

@merelcht
Member Author

Thanks for diving a bit deeper into parse @AntonyMilneQB and writing up your thoughts!

I have a question about the final bit though. You're saying that if the catalog contains:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

"{dataset_name}@spark":
  type: spark.SomethingElse
  filepath: data/{dataset_name}.parquet
  file_format: parquet

france.companies@spark would match both. How come? Is the dot not a specific pattern that should be matched?

@antonymilne
Contributor

antonymilne commented Mar 27, 2023

@merelcht The . is not a specific pattern that should be matched - it's just a character in the string like anything else. Think of replacing it with the letter A:

In [2]: parse("{dataset_name}@spark", "france.companies@spark")
Out[2]: <Result () {'dataset_name': 'france.companies'}>

In [4]: parse("{dataset_name}@spark", "franceAcompanies@spark")
Out[4]: <Result () {'dataset_name': 'franceAcompanies'}>

One way around this is to add a format specification like this:

In [6]: parse("{dataset_name:w}@spark", "france.companies@spark")
# No matches because . doesn't come under format w, which means Letters, numbers and underscore

In [7]: parse("{dataset_name:w}@spark", "franceAcompanies@spark")
Out[7]: <Result () {'dataset_name': 'franceAcompanies'}>
# Still matches because the string is of format w

But, by default, if you don't specify a format then it will just match anything. This still feels better to me than regex but it is definitely not as simple as it might first appear.

@idanov also asked me to post the make_like example here - I'll do it in a second.

@antonymilne
Contributor

antonymilne commented Mar 27, 2023

Alternative proposal: omegaconf custom resolvers

This is also originally @idanov's idea. The above example would look something like this:

# Define the "template" entry - needs to be an actual valid catalog entry I think (?)
germany.companies@spark:
  type: spark.SparkDataSet
  filepath: ${make_filepath:${..dataset_name}}  # this won't work - I don't know what the right syntax is or even if it's possible
  file_format: parquet

# Define other entries that follow the same pattern
france.companies@spark: ${make_like: ${..germany.companies@spark}} 
uk.companies@spark: ${make_like: ${..germany.companies@spark}} 
italy.companies@spark: ${make_like: ${..germany.companies@spark}} 

Somewhere (presumably user defined, but maybe we could build some common ones in?) the resolvers would be defined as something like:

from omegaconf import OmegaConf

OmegaConf.register_new_resolver("make_filepath", lambda ds: f"data/{ds.split('.')[0]}/{ds.split('.')[1]}.parquet")
OmegaConf.register_new_resolver("make_like", lambda catalog_entry: catalog_entry)

This is all pretty vague because I'm not immediately sure whether these things are possible using omegaconf resolvers...

  • how to make the germany.companies@spark string accessible inside the catalog entry?
  • how to get make_like to work?

Pros:

  • no ambiguity with multiple pattern matching
  • no need to introduce another paradigm for yaml manipulation on top of omegaconf which we are already using: we have one consistent mechanism for variable interpolation, environment variables, pattern matching, etc.

Cons:

  • doesn't actually reduce the number of catalog entries, just makes them much shorter - you still need to define the france, italy, uk entries explicitly (is there some clever way in omegaconf to avoid this? I'm not sure since it seems you can only do variable interpolation in values, not keys?)
  • not actually sure if it's possible - needs some further thought on how these resolvers would work!
  • needs careful thought about implementation if we are to not make omegaconf a dependency of the data catalog

Pro or con depending on your perspective: more powerful and flexible than general pattern matching syntax because it allows for arbitrary Python functions, not just string parsing.

@merelcht
Member Author

@AntonyMilneQB

I don't think it's possible to pass a value that needs to be interpolated to a resolver. In this prototype example below:

# Define the "template" entry - needs to be an actual valid catalog entry I think (?)
germany.companies@spark:
  type: spark.SparkDataSet
  filepath: ${make_filepath:${..dataset_name}}  # this won't work - I don't know what the right syntax is or even if it's possible
  file_format: parquet

# Define other entries that follow the same pattern
france.companies@spark: ${make_like: ${..germany.companies@spark}} 
uk.companies@spark: ${make_like: ${..germany.companies@spark}} 
italy.companies@spark: ${make_like: ${..germany.companies@spark}} 

${make_filepath:${..dataset_name}} isn't possible as far as I could find. It only works if you add the dataset_name as a key in the file as well, but that defeats the purpose of generalising the value, because you'd then just have the same file path value for all countries.
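
For reference, a quick check of the workaround described above, with dataset_name spelled out as a sibling key and passed to the resolver via a relative interpolation (names are illustrative):

from omegaconf import OmegaConf

# Resolver turns "germany.companies" into "data/germany/companies.parquet".
OmegaConf.register_new_resolver("make_filepath", lambda ds: "data/" + ds.replace(".", "/") + ".parquet")

cfg = OmegaConf.create(
    {
        "germany_companies": {
            "dataset_name": "germany.companies",  # has to be spelled out explicitly, which defeats the generalisation
            "filepath": "${make_filepath:${.dataset_name}}",  # relative interpolation to the sibling key
        }
    }
)
print(cfg.germany_companies.filepath)  # data/germany/companies.parquet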

@merelcht
Member Author

merelcht commented Mar 30, 2023

Summary of the technical design discussions we had on 22/3 and 30/3:

Through the discussions we've identified there are in fact three problems that are all related.

Problems to address

  1. Users don't like working with large + hard to maintain catalog files.
  2. Users want to be able to have another default than MemoryDataSet for datasets not defined in the catalog.
  3. Users need a way to keep constant/global values in one place (to minimise mistakes, e.g. typos) and pass these into configuration files. See: Add templating functionality to allow "global" values in the catalog in OmegaConfigLoader.

Proposed solutions:

1. Dataset factories: addresses problem 1 & 2
2. Omegaconf "make-like" resolver: turns out that this isn't a feasible solution.

Problem 1: points discussed

  • General agreement that large catalog files are a valid problem, and not only for QuantumBlack vertical teams.
  • Users are reluctant to make use of modular pipelines because it increases the amount of configuration.
  • General impression is that when users start working with Kedro they prefer readable (more verbose) catalog files, but the more they get used to Kedro, the greater the need for features like templating and other tools that make the catalog more compact.
  • A concern with any solution that makes the catalog more compact is readability. With existing solutions like Jinja2 and YAML anchors it's also hard to understand what the "compiled" catalog that Kedro actually uses looks like.
    • Solution: add a mechanism to compile the catalog. This could be a CLI command similar to dbt compile.
  • Whatever solution we go for, we will always allow full-fledged dataset definitions in the catalog, so users do not have to use the more compact syntax.

Problem 2: points discussed

  • @WaylonWalker created a tool that allows changing the default: https://github.com/WaylonWalker/kedro-auto-catalog
    • This solution saves the datasets into a separate "default" catalog file, so it addresses problem 2, but not problem 1, because datasets still end up in the catalog.
    • AFAIK you can only add 1 default
  • The datasets factories solution would address this problem.

Problem 3

Not yet addressed.

Other comments

  • Omegaconf resolvers do allow for some reduction of the catalog content just like yaml anchors.
  • With custom resolvers power users can find ways to reduce their catalogs more by themselves.
  • Perhaps it's not worth introducing this extra syntax if the OmegaConfigLoader leverages all resolver syntax properly.

Next steps

Note: We need to do user testing to verify any solution we decide on.

@datajoely
Contributor

I know we don't want to, but I also think we need to ask: does our solution support dynamic pipelines? Our users, particularly those who care about this templating stuff, are going to try that sort of thing very quickly.

@noklam
Contributor

noklam commented Mar 30, 2023

| | Reduce the number of Catalog Entries (how many top-level keys in the file) | Shorten Catalog Entry (how many lines for each dataset) | Ability to Change Default DataSet | Constant/Global Value for all Configurations |
| --- | --- | --- | --- | --- |
| DataSet Factory | | | | |
| OmegaConf Interpolation | | | | |
| Environment Variable | | | | |
| OmegaConf Custom Resolver | | | | |

(Feel free to edit, the ? are things that I am not sure about)
Updated at 2023-04-11

@datajoely
Contributor

For a slightly 🌶️ comparison

| | Reduce the number of Catalog Entries (how many top-level keys in the file) | Shorten Catalog Entry (how many lines for each dataset) | Ability to Change Default DataSet | Constant/Global Value for all Configurations |
| --- | --- | --- | --- | --- |
| Jinja2 (with .j2 template support) | | | | |

@antonymilne
Contributor

@noklam I filled out two of the question marks but don't understand what the last column means?

Also, I think the dataset factory should have ✅ for "Reduce the number of Catalog Entries", no? Since you would have a general pattern-matching syntax rather than needing to define each entry explicitly.

@merelcht
Member Author

merelcht commented May 17, 2023

Feedback on the prototypes (#2560 + #2559) to be taken into account for the proper implementation:

  • Do not call catalog.list() in the runner checks to determine whether there are any missing inputs/datasets that need the default dataset; instead, create a new method that checks whether a dataset name exists in the catalog, finding exact matches as well as pattern matches (see the sketch below).
  • Clarify to users that the default dataset creation will be replaced by any pattern with the following format:
{dataset}:
   .....:
   .....:
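
A hypothetical sketch of such an existence check (the function name and signature are illustrative, not the final API): an exact catalog entry wins, otherwise any registered dataset-factory pattern that matches counts.

from parse import parse

def dataset_exists(catalog_entries: dict, dataset_patterns: dict, ds_name: str) -> bool:
    """Return True if ds_name is an exact entry or matches a dataset factory pattern."""
    if ds_name in catalog_entries:
        return True
    return any(parse(pattern, ds_name) is not None for pattern in dataset_patterns)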

@merelcht
Member Author

This is the clean version of the dataset factories prototype excluding the parsing rules: https://github.com/kedro-org/kedro/tree/feature/dataset-factories-for-catalog
