Dynamically create catalog entries #138
Comments
@roumail Thank you for your feedback. Kedro does not support this feature out of the box at the moment. You can programmatically modify the DataCatalog and add new datasets to it after instantiating one from YAML; however, you would also need to dynamically change a) the node definition that produces those datasets, and b) all nodes that consume them, which leads to a somewhat clunky solution. If this is critical for your use case, we would need more information from you about what your particular need is and why it is not feasible to achieve the same thing using a single dataset or a constant set of them.
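A minimal sketch of what that programmatic route might look like, assuming the `kedro.io` API of the time (`DataCatalog.add`, `CSVLocalDataSet`, `DataCatalog.save`); the helper names, file paths and the `summarise` function are illustrative, not part of the original discussion:

```python
from kedro.io import DataCatalog, CSVLocalDataSet
from kedro.pipeline import Pipeline, node


def register_exploded_datasets(catalog: DataCatalog, d: dict, prefix: str) -> list:
    """Add one CSVLocalDataSet per dictionary entry and return the new dataset names."""
    names = []
    for key, df in d.items():
        name = "{}_{}".format(prefix, key)
        # Hypothetical target location for the exploded CSVs.
        catalog.add(name, CSVLocalDataSet(filepath="data/03_primary/{}.csv".format(name)))
        catalog.save(name, df)  # persist the DataFrame immediately
        names.append(name)
    return names


def summarise(df):
    """Placeholder downstream computation."""
    return df.describe()


def build_downstream_nodes(names: list) -> Pipeline:
    """Every node that consumes the exploded datasets also has to be created in code."""
    return Pipeline([node(summarise, inputs=n, outputs=n + "_summary") for n in names])
```

This illustrates the clunkiness mentioned above: both the catalog entries and the downstream node definitions end up being generated at runtime rather than declared in YAML.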
Hi @DmitriiDeriabinQB, thank you for your response. For my particular use case, I was able to fix the problem in the end by sticking to the YAML API. I was previously trying to mix the two, which led to a clunky implementation and confusion. I would close with a request for more information about the Code vs YAML API, for example: 1) can we mix the YAML and code APIs, and 2) is doing so even a good idea from a design perspective? Based on your last comment, it seems we should stick with one of the two APIs and not mix them.
This is a good question to ask :) Generally, I would say the recommendation is to avoid mixing the Code and YAML APIs. Most of the time it just boils down to not doing in code something that YAML already supports (e.g. templating the datasets, passing parameters to nodes, etc.). However, there are definitely some genuine use cases that are not fully covered by the YAML API, so it makes sense to apply some logic in code if it becomes apparent that there is no other way to get what you want. Even then, I would personally refrain from porting all dataset definitions into code, since 95% of them are just regular configuration. If that makes sense.
Closing this as answered, but feel free to re-open/open a new issue if you need further clarification. |
Description
I have two Python dictionaries that I'm saving locally using kedro.io.PickleLocalDataSet.
These dictionaries are created using the snippet below:
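The original snippet is not preserved here; a minimal reconstruction consistent with the description below (d1 holds two DataFrames, d2 holds one; the keys and values are placeholders):

```python
import pandas as pd

# Placeholder dictionaries of DataFrames: d1 holds two, d2 holds one,
# matching the description further down in the issue.
d1 = {
    "table_a": pd.DataFrame({"x": [1, 2], "y": [3, 4]}),
    "table_b": pd.DataFrame({"x": [5, 6], "y": [7, 8]}),
}
d2 = {
    "table_c": pd.DataFrame({"x": [9, 10], "y": [11, 12]}),
}
```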
Therefore, my catalog.yml looks like this:
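The original catalog.yml is likewise missing from the archive; one pickled entry per dictionary would look roughly like this (type and filepath values are illustrative):

```yaml
d1:
  type: PickleLocalDataSet
  filepath: data/02_intermediate/d1.pkl

d2:
  type: PickleLocalDataSet
  filepath: data/02_intermediate/d2.pkl
```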
I wish there were an easy way to "explode" the two dictionaries, d1 and d2, into CSVLocalDataSets, i.e. one catalog entry per dictionary key, without writing them out by hand, roughly as sketched below:
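A hypothetical illustration of the desired result, using the placeholder keys from the sketch above:

```yaml
d1_table_a:
  type: CSVLocalDataSet
  filepath: data/03_primary/d1_table_a.csv

d1_table_b:
  type: CSVLocalDataSet
  filepath: data/03_primary/d1_table_b.csv

d2_table_c:
  type: CSVLocalDataSet
  filepath: data/03_primary/d2_table_c.csv
```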
I first tried to do this by creating a node that would read a dictionary and output a list of dataframes using a small function like the following:
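The function itself was not preserved in the archive; a plausible reconstruction of what is described:

```python
from typing import Dict, List

import pandas as pd


def explode_dict(d: Dict[str, pd.DataFrame]) -> List[pd.DataFrame]:
    """Return the DataFrames stored in a dictionary as a list of potential node outputs."""
    return list(d.values())
```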
However, I get stuck when trying to specify the config.yml for the above node, because I don't know up front how many DataFrames will be generated. As seen in the example, d1 has two DataFrames as values and d2 has only one.
It should be clear already, but I've been using the YAML-file-based way of declaring pipelines/nodes/config, etc. I understand that the code API is equivalent, but I'm not as familiar with using that approach to declare pipelines.
Context
There are many times when we have a node that creates a list of outputs. We can't always pre-specify how many outputs will be generated. Therefore, some documentation around such use cases would be really helpful.
I'm not exactly sure how to proceed here.
Possible Alternatives
When I encountered a similar situation in the past, I resolved the dilemma by implementing a separate node for each of the twenty cases I had. Without a looping construct there was, of course, a lot of code duplication, but at least it worked. I cannot use a similar approach in this case since I don't know the total number of cases up front.