# Eliminate the need of having all datasets defined in the catalog #2423
Would you only allow a singular default dataset definition?
I think that's up for discussion. I'm going to go through config files to see what patterns emerge and how this proposal would help.
Non-memory datasets need more information, like where you're going to store them and under what name. For example, Kubeflow Pipelines has the idea of
@datajoely Having the option to have more than one default dataset is also on the table, e.g. one can imagine a pattern-matching solution where, if your dataset name starts with a certain prefix, a corresponding default definition is used. Currently we have:

kedro/kedro/config/omegaconf_config.py Lines 102 to 107 in 8acfb2a
And allow a similar solution for our "default" datasets:

```python
# settings.py
from kedro.extras.datasets.spark import SparkDataSet
from kedro.io import MemoryDataSet

def create_spark_dataset(dataset_name: str, *chunks):
    # e.g. here chunks=["root_namespace", "something-instead-the-*", "spark"]
    return SparkDataSet(filepath=f"data/{chunks[0]}/{chunks[1]}.parquet", file_format="parquet")

def create_memory_dataset(dataset_name: str, *chunks):
    return MemoryDataSet()

DATA_CATALOG_ARGS = {
    "datasets_factories": {
        "root_namespace.*@spark": create_spark_dataset,
        "root_namespace.*@memory": create_memory_dataset,
    }
}
```
So when a dataset name matching one of these patterns is provided, the dataset is created accordingly. We might not necessarily have all of that in code; it could instead reside somewhere in configuration:

```yaml
"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet
```

We also need to decide on a pattern-matching language: simpler than regex, but powerful enough to chunk up the dataset names so that those chunks can be reused in the body of the configuration, for the reasons pointed out by @deepyaman. Using the chunks in the body could be done through e.g. f-string-style replacement, as in the example above.
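To make the chunking mechanics concrete, here is a minimal sketch (my own illustration, not an agreed design) of how such a pattern could be matched with plain `re`, with the captured chunks then substituted into the config body:

```python
import re

def match_chunks(pattern, dataset_name):
    """Match a pattern like '{root_namespace}.{*}@{spark}' against a dataset
    name and return the chunks captured by each {...} group, or None.

    {*} matches any run of characters; {literal} matches that literal text
    but is still returned as a chunk, mirroring the factory example above.
    """
    regex = "^"
    # Alternate between {...} groups and the literal separators around them.
    for token in re.split(r"(\{[^}]*\})", pattern):
        if token.startswith("{") and token.endswith("}"):
            inner = token[1:-1]
            regex += "(.+?)" if inner == "*" else f"({re.escape(inner)})"
        else:
            regex += re.escape(token)
    regex += "$"
    match = re.match(regex, dataset_name)
    return list(match.groups()) if match else None

chunks = match_chunks("{root_namespace}.{*}@{spark}", "root_namespace.companies@spark")
print(chunks)  # ['root_namespace', 'companies', 'spark']
# The chunks can then be dropped into the config body, f-string style:
filepath = f"data/{chunks[0]}/{chunks[1]}.parquet"  # data/root_namespace/companies.parquet
```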
"{root_namespace}.{*}@{spark}":
type: spark.SparkDataSet
filepath: data/{chunks[0]}/{chunks[1]}.parquet
file_format: parquet @idanov - this is a work of art, I'm sold. What I would ask is that we consider introducing an equivalent of |
@AntonyMilneQB shared a nice alternative pattern-matching syntax here, which uses reverse Python f-string formatting; it would definitely do the job and help us avoid creating a homebrew pattern-matching language: https://github.com/r1chardj0n3s/parse

@datajoely we could do something like that, inspired by what @WaylonWalker created some time ago: https://github.com/WaylonWalker/kedro-auto-catalog

Another option is a CLI command giving you the dataset definition per dataset name, something like
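For reference, this is how the parse library behaves on a dataset-name pattern (a quick illustration; the pattern itself is made up):

```python
# pip install parse
from parse import parse

result = parse("{namespace}.{dataset}@{layer}", "root_namespace.companies@spark")
print(result.named)
# {'namespace': 'root_namespace', 'dataset': 'companies', 'layer': 'spark'}

# Names that don't fit the pattern return None rather than raising:
print(parse("{namespace}.{dataset}@{layer}", "companies"))  # None
```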
Just reading through this, so there are two things discussed here:

What would the catalog look like for (2)?
Before we get carried away with using `parse`, we should check carefully how it behaves on realistic dataset names.

All good so far, but what happens when a dataset name contains more separators than the pattern expects?

...which is probably not what you wanted. The reason for this is that a bare `{}` field matches any characters at all, including the separators.

We also need to be wary of how to handle multiple patterns that match, e.g. let's say you want to override the default. Then you should be able to do something like this: the more specific pattern takes precedence over the catch-all.

All this is not to say that this solution is unworkable. I do think it can work, but it is not as simple as it might first appear.
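To illustrate the failure mode being described (my own example, standing in for the snippets in the original comment):

```python
from parse import parse

# A catch-all pattern happily matches names it was never meant to cover:
print(parse("{name}", "france.companies@spark").named)
# {'name': 'france.companies@spark'}

# And a two-field pattern still matches a three-part name, because each
# bare {} field can swallow separators too:
print(parse("{namespace}.{dataset}", "a.b.c").named)
# {'namespace': 'a', 'dataset': 'b.c'}
```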
@AntonyMilneQB - thank you for this, it's really an excellent write-up. One push I would make is that we could manage this complexity by providing a live editor / prototyping space in Viz. I think it would massively improve the learning journey and help users get up and running quickly.
Something like this: https://textfsm.nornir.tech/#
Thanks for diving a bit deeper into `parse`! I have a question about the final bit though. You're saying that if the catalog contains:
@merelcht The catch is that, without a format specification, each field will match anything at all, including the separators.
One way around this is to add a format specification like this:
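As a sketch of what such a format specification can look like in parse, using the `:w` spec (letters, digits and underscore only):

```python
from parse import parse

# ':w' stops a field from swallowing separators like '.' or '/':
print(parse("{namespace:w}.{dataset:w}", "a.b.c"))      # None - no longer matches
print(parse("{namespace:w}.{dataset:w}", "a.b").named)  # {'namespace': 'a', 'dataset': 'b'}
```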
But, by default, if you don't specify a format then it will just match anything. This still feels better to me than regex, but it is definitely not as simple as it might first appear. @idanov also asked me to post the alternative proposal below.
**Alternative proposal: omegaconf custom resolvers**

This is also originally @idanov's idea. The above example would look something like this:
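A sketch of what the catalog might then look like (the resolver name `spark_dataset` is my invention for illustration):

```yaml
# Each entry collapses to a one-liner that delegates to a custom resolver:
companies: ${spark_dataset:companies}
reviews: ${spark_dataset:reviews}
```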
Somewhere (presumably user defined, but maybe we could build some common ones in?) the resolvers would be defined as something like:
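A minimal runnable sketch of the idea with omegaconf, again with invented names:

```python
from omegaconf import OmegaConf

def spark_dataset(name: str) -> dict:
    # Expand a bare dataset name into a full dataset definition.
    return {
        "type": "spark.SparkDataSet",
        "filepath": f"data/01_raw/{name}.parquet",
        "file_format": "parquet",
    }

OmegaConf.register_new_resolver("spark_dataset", spark_dataset)

cfg = OmegaConf.create({"companies": "${spark_dataset:companies}"})
print(OmegaConf.to_container(cfg, resolve=True))
# {'companies': {'type': 'spark.SparkDataSet',
#                'filepath': 'data/01_raw/companies.parquet',
#                'file_format': 'parquet'}}
```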
This is all pretty vague because I'm not immediately sure whether these things are possible using omegaconf resolvers...
Pros:
Cons:
Pro or con depending on your perspective: more powerful and flexible than a general pattern-matching syntax, because it allows for arbitrary Python functions, not just string parsing.
@AntonyMilneQB I don't think it's possible to pass a value that needs to be interpolated to a resolver. In this prototype example below:
Summary of the technical design discussions we had on 22/3 and 30/3: through the discussions we've identified that there are in fact three related problems.

**Problems to address**

**Proposed solutions**

1. Dataset factories: addresses problems 1 & 2

**Problem 1: points discussed**

**Problem 2: points discussed**

**Problem 3**

Not yet addressed.

**Other comments**

**Next steps**
I know we don't want to, but I also think we need to ask: does our solution support dynamic pipelines? Our users, particularly those who care about this templating stuff, are going to try that sort of thing very quickly.
(Feel free to edit; the ?s are things that I am not sure about.)
For a slightly 🌶️ comparison:
@noklam I filled out two of the question marks but don't understand what the last column means? Also I think the dataset factory should have ✅ for "Reduce the number Catalog Entry", no? Since you would have a general pattern-matching syntax rather than needing to define each entry explicitly.
Feedback on the prototypes (#2560 + #2559) to be taken into account for the proper implementation:
This is the clean version of the dataset factories prototype, excluding the parsing rules: https://github.com/kedro-org/kedro/tree/feature/dataset-factories-for-catalog
**Description**
Users struggle with ever-increasing catalog files:
#891
#1847
Current "solutions":
TemplatedConfigLoader
TemplatedConfigLoader
with Jinja2: https://docs.kedro.org/en/stable/kedro_project_setup/configuration.html#jinja2-supportInstead of allowing anchors or loops in config let's just tackle the need of having so many datasets defined in the catalog in the first place.
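For context, the templating these loaders provide looks roughly like this (a sketch of the documented `globals.yml` mechanism; file paths and keys are illustrative):

```yaml
# conf/base/globals.yml
bucket: s3://my-bucket

# conf/base/catalog.yml - ${bucket} is substituted by TemplatedConfigLoader
companies:
  type: pandas.CSVDataSet
  filepath: ${bucket}/01_raw/companies.csv
```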
**Context**
The new `OmegaConfigLoader` doesn't currently allow the use of YAML anchors (the omegaconf team discourages using them with omegaconf: omry/omegaconf#93) or Jinja2. While Jinja2 is very powerful, we've always had the standpoint as a team that it's better not to use it, because it makes files hard to read and reason about. Jinja2 is meant for HTML templating, not for configuration.

In order to make `OmegaConfigLoader` a suitable replacement for the `ConfigLoader` and `TemplatedConfigLoader`, we need to find another solution to reduce the size of catalog files.

**Possible Implementation**
This is an idea from @idanov (written up by me, so I hope I have interpreted it correctly 😅):
In every runner we have a method to create a default dataset for datasets that are used in the pipeline but not defined in the catalog:
kedro/kedro/runner/sequential_runner.py
Lines 32 to 43 in 8acfb2a
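For context, the linked method is roughly this (paraphrased, not copied verbatim from the permalink):

```python
from kedro.io import AbstractDataSet, MemoryDataSet

def create_default_data_set(self, ds_name: str) -> AbstractDataSet:
    """Any dataset used in the pipeline but missing from the catalog
    is created as a plain in-memory dataset."""
    return MemoryDataSet()
```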
We can allow users to define what a "default" dataset should be, so that in the catalog they only need to specify these default settings and Kedro would then create the datasets without them needing to be defined individually. The default dataset definition could then look something like this:
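As a sketch, such a definition might look like this (the `{default_dataset}` placeholder syntax is illustrative only, not a decided design):

```yaml
# A single catch-all entry, applied to every dataset used in the pipeline
# but not explicitly listed in the catalog:
"{default_dataset}":
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/{default_dataset}.parquet
```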
And any dataset used in the pipeline, but not mentioned in the catalog, would get the above specifications.
**Extra info**
I collected several catalog files from internal teams to see what patterns I could find in there. I think these will be useful when designing our solution. https://miro.com/app/board/uXjVMZM7IFY=/?share_link_id=288914952503