[DRAFT] Dataset factory parsing rules demo #2559

ankatiyar · 2023-05-05T09:19:07Z

Description

Related to #2508
This PR is just to demonstrate/experiment with parsing rules - not the actual dataset factory prototype.

Development notes

I've added a kedro catalog resolve command (similar to what @merelcht did in #2560).
For each named dataset used by the pipelines, it loops over all the catalog entries, finds the patterns that match, and then selects the best-matching pattern (the pick_best_match function).

Problem Statement

The matches list contains every pattern that matches the dataset name. For example, for france.companies, all of the patterns listed below would match:

# Catalog Entry 1
"{namespace}.{dataset}":
  type: pandas.CSVDataSet
  filepath: path1/{namespace}_{dataset}.csv

# Catalog Entry 2
"{dataset}":
  type: pandas.CSVDataSet
  filepath: path2/{dataset}.csv

# Catalog Entry 3
"{namespace}.{a}":
  type: pandas.CSVDataSet
  filepath: path3/{namespace}_{a}.csv

# Catalog Entry 4
"{namespace}.companies":
  type: pandas.CSVDataSet
  filepath: path4/{namespace}_companies.csv

# Catalog Entry 5
france.companies:
  type: pandas.CSVDataSet
  filepath: path5/france_companies.csv

# Catalog Entry 6
"{dataset}s":
  type: pandas.CSVDataSet
  filepath: path6/{dataset}s.csv

We need to decide the ranked order of these entries to find the best match.

Proposed Solution

Sort according to "specificity" (for lack of a better word?): the number of characters outside the brackets. The order would then be:
#5 -> #4 -> #6 / #3 / #1 -> #2
When "specificity" is the same:
- Sort them by the number of bracket pairs? Between #6, #3 and #1, pick the one with the most bracket pairs. Example: {namespace}.{dataset} would be chosen over {dataset}s.
- Either after sorting by bracket pairs, or instead of it, we can also sort alphabetically. Example: f{*} should rank above {*}s.
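The three rules above can be sketched as a single sort key. This is only an illustration, not the code in this PR: it uses a regex from the standard library instead of parse, and the names PLACEHOLDER and sort_key are hypothetical. It assumes placeholders are never nested.

```python
import re

# Matches one placeholder such as {namespace}; assumes no nested braces.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def specificity(pattern: str) -> int:
    """Count the characters that sit outside the {...} placeholders."""
    return len(PLACEHOLDER.sub("", pattern))


def sort_key(pattern: str):
    """Most literal characters first, then most placeholders, then alphabetical."""
    return (-specificity(pattern), -len(PLACEHOLDER.findall(pattern)), pattern)


matches = [
    "{namespace}.{dataset}",  # entry 1
    "{dataset}",              # entry 2
    "{namespace}.{a}",        # entry 3
    "{namespace}.companies",  # entry 4
    "france.companies",       # entry 5
    "{dataset}s",             # entry 6
]
ranked = sorted(matches, key=sort_key)
for pattern in ranked:
    print(specificity(pattern), pattern)
```

With plain string ordering as the last tie-breaker, entries 3 and 1 are separated only by their placeholder names, and f{x} sorts before {x}s because "f" compares lower than "{".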

Notes / Challenges

When the dataset entry is too generic, for example:

{dataset}:
  type: pandas.CSVDataSet
  filepath: path/path

Datasets that are supposed to fall back to the default MemoryDataSet, and parameters (e.g. params:model_input), will falsely match the pattern. This will lead to runtime errors, and we will need to document it.

Catalog entries 1 and 3 are effectively the same pattern; we could check for duplicate patterns and disallow them.
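Both checks mentioned in these notes could be done on the pattern strings alone. The sketch below is an assumption about how that might look, not something this PR implements; canonical, find_duplicates and is_catch_all are hypothetical helper names.

```python
import re
from collections import defaultdict

# Matches one placeholder such as {namespace}; assumes no nested braces.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def canonical(pattern: str) -> str:
    """Drop the placeholder names so that only the shape of the pattern remains."""
    return PLACEHOLDER.sub("{}", pattern)


def find_duplicates(patterns):
    """Group patterns that differ only in their placeholder names."""
    groups = defaultdict(list)
    for pattern in patterns:
        groups[canonical(pattern)].append(pattern)
    return [group for group in groups.values() if len(group) > 1]


def is_catch_all(pattern: str) -> bool:
    """True when the pattern is a single placeholder with no literal text."""
    return canonical(pattern) == "{}"


patterns = [
    "{namespace}.{dataset}",
    "{dataset}",
    "{namespace}.{a}",
    "{namespace}.companies",
    "france.companies",
    "{dataset}s",
]
duplicates = find_duplicates(patterns)
catch_alls = [pattern for pattern in patterns if is_catch_all(pattern)]
print(duplicates)
print(catch_alls)
```

A catch-all such as {dataset} is the kind of entry that would also capture names like params:model_input, so flagging it (or warning about it) is one possible way to surface the problem early.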

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Ankita Katiyar <[email protected]>

merelcht · 2023-05-10T16:07:30Z

kedro/framework/cli/catalog.py

+    return matches[0]
+
+
+def specificity(pattern):


Is this method just counting the number of characters that are not in brackets? Wouldn't it be easier to just remove the text in {} and then get the length of the string?

That is what is happening here but using parse instead of a regex.

merelcht · 2023-05-10T16:36:57Z

kedro/framework/cli/catalog.py

+        for key, value in best_match_config.items():
+            string_value = str(value)
+            try:
+                formatted_string = string_value.format_map(result.named)


This should only happen for dataset fields with {} otherwise you get weird things like:

shuttles:
  filepath: data/01_raw/shuttles.xlsx
  load_args: data/01_raw/shuttles.xlsx
  type: pandas.ExcelDataSet

But that's not really part of the matching logic, so not super important here.

This can be changed to only match against patterns containing {}, but the logic still works for dataset entries that are not patterns (exact matches) because of how parse works:
parse("xyz", "abc") returns None (not a match at all, so it is ignored)
parse("xyz", "xyz") returns <Result () {}> (exact match, not a pattern; the result is not None, so it is still added to the matches list)
parse("hello {name}", "hello world") returns <Result () {'name': 'world'}> (potential match, added to the matches list)

So the exact match is then chosen as the best match, and since it contains no brackets anywhere in the catalog entry, it is left unchanged.
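To make the behaviour in this thread concrete without depending on parse, here is a rough standard-library approximation. It is a sketch under assumptions, not the PR's code: match_pattern and fill are hypothetical names, placeholder names must be valid and unique within a pattern, and only string values containing a brace are formatted, which is one way to address the review comment above.

```python
import re

# Captures the field name inside one placeholder such as {namespace}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def match_pattern(pattern: str, name: str):
    """Return the captured fields as a dict, or None when the name does not match."""
    # split() with one capture group alternates literal text and field names.
    parts = PLACEHOLDER.split(pattern)
    regex = ""
    for index, part in enumerate(parts):
        regex += re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+?)"
    found = re.fullmatch(regex, name)
    return None if found is None else found.groupdict()


def fill(value, fields):
    """Substitute fields into strings that contain braces; leave other values alone."""
    if isinstance(value, str):
        return value.format_map(fields) if "{" in value else value
    if isinstance(value, dict):
        return {key: fill(item, fields) for key, item in value.items()}
    if isinstance(value, list):
        return [fill(item, fields) for item in value]
    return value


config = {
    "type": "pandas.CSVDataSet",
    "filepath": "path1/{namespace}_{dataset}.csv",
    "load_args": {"sep": ","},
}
fields = match_pattern("{namespace}.{dataset}", "france.companies")
resolved = fill(config, fields)
print(resolved)
```

As with parse, an exact match returns an empty (but not None) result, so it stays in the list of candidates and its config passes through fill unchanged.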

antonymilne · 2023-05-11T22:03:43Z

I think these rules look generally good 👍 I also think we need all three of them, and the order you suggest is right, i.e. count non-bracketed characters, then count the number of brackets, then sort alphabetically.

We shouldn't need to consider the specificity of your example #5 (france.companies) at all because it doesn't have {} in it. This isn't a pattern and so if you catalog.load a dataset with this name you're not interested in looking at patterns that match.

Also I don't think it's a problem that #1 and #3 are the same or that we need to go through and remove duplicates. These mean exactly the same thing so we don't really care how they rank relative to each other. In practice they would be distinguished based on the final alphabetical order, and then it will always be the same one that gets matched against first and the other one will just get ignored.

noklam · 2023-05-11T22:19:22Z

I also like the fact that the rule is simple. I suspect there will be edge cases, but I am not too nervous about them. In general we should encourage people to write simple catalogs; if a catalog requires very obscure rules, it's likely that they can simplify it by changing the structure.

One missing example that I can think of is whether we should consider namespace as a priority.

Consider the name France.companies

Which pattern will win?

France.{something}
{namespace}.companies

Previous conversations favoured the former, as it's more specific about the namespace. This, however, requires a more sophisticated rule. It shouldn't be hard though: we can simply split on the dot and derive the score from there.

antonymilne · 2023-05-11T22:39:01Z

Hmm, that's an interesting example about the namespaces. I haven't thought it through fully, but I can't immediately think of a rule that would order these the way you want without being quite complex. To me it's also not immediately obvious that france.{something} should necessarily beat {namespace}.companies in the first place. So I think it's probably fine to not handle namespaces using any special logic and just treat . like any other character. I suspect it will just get too complex otherwise for not that much benefit. But very happy to be proved wrong here - I hadn't thought of the example you give here before, and it's a good one.

antonymilne · 2023-05-12T08:30:04Z

Something else I just thought of. Let's say you have these patterns:

1. italy.{dataset}
2. france.{dataset}
3. switzerland.{dataset}
4. {namespace}.companies

Then the order of these would be 3 > 4 > 2 > 1 which is not really obvious, just because "switzerland" is a longer word than "companies". So a dataset name switzerland.companies would match switzerland.{dataset} whereas italy.companies and france.companies would match {namespace}.companies.

Maybe it would be less arbitrary to count the number of unbracketed "chunks" rather than the number of characters? This would rank all of them equally, so we would then need another rule to order them. Or maybe we should consider the order in which unbracketed chunks appear relative to bracketed ones in the pattern? Or maybe this is just an edge case that won't come up in practice and we shouldn't care about it?

Note this is nothing specific to do with namespacing: if you replace . with _ in all the above examples then the same arguments hold. So I still think we should probably not treat . specially in any way.
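The difference between the two counting schemes for these four patterns can be checked with a small sketch. The helper names literal_chars and literal_chunks are hypothetical, and a regex stands in for parse here.

```python
import re

# Matches one placeholder such as {dataset}; assumes no nested braces.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def literal_chars(pattern: str) -> int:
    """Number of characters outside the placeholders."""
    return len(PLACEHOLDER.sub("", pattern))


def literal_chunks(pattern: str) -> int:
    """Number of non-empty runs of literal text between placeholders."""
    return len([chunk for chunk in PLACEHOLDER.split(pattern) if chunk])


patterns = [
    "italy.{dataset}",        # 1
    "france.{dataset}",       # 2
    "switzerland.{dataset}",  # 3
    "{namespace}.companies",  # 4
]
by_chars = sorted(patterns, key=literal_chars, reverse=True)
chunk_counts = {pattern: literal_chunks(pattern) for pattern in patterns}
print(by_chars)
print(chunk_counts)
```

Counting characters gives 12, 10, 7 and 6 for patterns 3, 4, 2 and 1, which is the 3 > 4 > 2 > 1 order described above, while counting chunks gives 1 for every pattern, so a further tie-breaker would be needed.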

noklam · 2023-05-12T08:50:47Z

I agree we don't need to treat the namespace specially; however, as a good convention, people should most likely split these patterns using some kind of delimiter.

It's also important to note that the patterns aren't meant for refactoring. There is still value in creating explicit entries, and it's probably not worth using a pattern just to replace two or three similar entries. (The same applies to Jinja.)

Can we try to do this exercise with a realistic example? I think @merelcht's example could be useful for documentation anyway, so it won't be a waste. It would be interesting to have at least two people go through the same exercise and see how the results differ.

I see two main use cases:

1. Shortening a catalog with long, repetitive entries (maybe 5 or 10; it varies because you will most likely use it with variable interpolation).
2. Cases where enumerating is not possible and you have to loop through a dynamic list.

I should add that I think we can start with just counting characters and see how well it goes.

noklam · 2023-05-12T10:17:20Z

Just checking my understanding: for 2., with Jinja you loop through the list inside the YAML file. In contrast, with Dataset Factory you loop through it directly inside the nodes file instead?

ankatiyar · 2023-05-12T16:39:12Z

My assumption is that any user ideally would not have a lot of "dataset patterns" that would match a given dataset name. As long as we have a set of simple + deterministic rules to order the patterns, we could leave it up to the user to deal with cases where multiple matches exist.

ankatiyar · 2023-06-14T10:08:26Z

Closing in favour of #2635

Add catalog resolve fn

000dd62

Signed-off-by: Ankita Katiyar <[email protected]>

ankatiyar requested a review from merelcht May 5, 2023 09:25

Add specificity fn

5c04432

Signed-off-by: Ankita Katiyar <[email protected]>

ankatiyar requested review from antonymilne and noklam May 5, 2023 16:37

remove print statement

a9f633c

Signed-off-by: Ankita Katiyar <[email protected]>

merelcht reviewed May 10, 2023

merelcht mentioned this pull request May 17, 2023

Eliminate the need of having all datasets defined in the catalog #2423

Closed

ankatiyar mentioned this pull request Jun 12, 2023

Dataset factories #2635

Merged

9 tasks

ankatiyar closed this Jun 14, 2023

ankatiyar deleted the dataset-factories-parsing-rules branch June 29, 2023 10:44
