Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Dataset factory parsing rules demo #2559

Closed
wants to merge 3 commits into from

Conversation

ankatiyar
Copy link
Contributor

@ankatiyar ankatiyar commented May 5, 2023

Description

Related to #2508
This PR is just to demonstrate/experiment with parsing rules - not the actual dataset factory prototype.

Development notes

I've added a kedro catalog resolve (Similar to what @merelcht did in #2560)
For each named dataset used by the pipelines ->
It loops over all the catalog entries, finds patterns that would be a match and then selects the best match pattern (pick_best_match function)

Problem Statement

The matches list contains all the patterns that match the dataset name - for example for france.companies, all the patterns listed below would be a match -

# Catalog Entry 1
"{namespace}.{dataset}":
  type: pandas.CSVDataSet
  filepath: path1/{namespace}_{dataset}.csv

# Catalog Entry 2
"{dataset}":
  type: pandas.CSVDataSet
  filepath: path2/{dataset}.csv

# Catalog Entry 3
"{namespace}.{a}":
  type: pandas.CSVDataSet
  filepath: path3/{namespace}_{a}.csv

# Catalog Entry 4
"{namespace}.companies":
  type: pandas.CSVDataSet
  filepath: path4/{namespace}_companies.csv

# Catalog Entry 5
france.companies:
  type: pandas.CSVDataSet
  filepath: path5/france_companies.csv

# Catalog Entry 6
"{dataset}s":
  type: pandas.CSVDataSet
  filepath: path6/{dataset}s.csv

We need to decide the ranked order of these entries to find the best match.

Proposed Solution

  • Sort according to "specificity" (for the lack of a better word?) - The number of characters outside of the brackets that match. The order then would be ->
    #5 -> #4 -> #6/ #3/ #1 -> #2
  • When "specificity" is the same -
    • sort them according to the number of brackets? Between #6, #3 & #1 - pick the highest number of bracket pairs. Example: {namespace}.{dataset} would be chosen over {dataset}s.
    • either after sorting them according to bracket pairs or instead we can also sort them alphabetically. Example: f{*} should rank above {*}s

Notes / Challenges

  • When the dataset entry is too generic, for example :
{dataset}:
  type: pandas.CSVDataSet
  filepath: path/path

The datasets that are supposed to be default/MemoryDataSet and parameters (eg. params:model_input) will falsely match against the pattern. This will lead to runtime errors and we will need to document this.

  • Catalog entry 1 & 3 are basically the same thing and we could check for duplicating patterns and disallow it.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar requested a review from merelcht May 5, 2023 09:25
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar requested review from antonymilne and noklam May 5, 2023 16:37
Signed-off-by: Ankita Katiyar <[email protected]>
return matches[0]


def specificity(pattern):
Copy link
Member

@merelcht merelcht May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method just counting the number of characters that are not in brackets? Wouldn't it be easier to just remove the text in {} and then get the length of the string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what is happening here but using parse instead of a regex.

for key, value in best_match_config.items():
string_value = str(value)
try:
formatted_string = string_value.format_map(result.named)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only happen for dataset fields with {} otherwise you get weird things like:

shuttles:
  filepath: data/01_raw/shuttles.xlsx
  load_args: data/01_raw/shuttles.xlsx
  type: pandas.ExcelDataSet

But that's not really part of the matching logic, so not super important here.

Copy link
Contributor Author

@ankatiyar ankatiyar May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be changed to only match with patterns containing {} but this logic will still work for dataset entries that are not patterns(exact matches) because of how parse works -
parse("xyz", "abc") returns None (not a match at all - ignore)
parse("xyz", "xyz") returns <Result () {}> (exact match, not a pattern - result is not empty so still added to matches list)
parse("hello {name}", "hello world") returns <Result () {'name': 'world'}> (potential match- added to matches list)

So the exact match then gets chosen as the best match and since it wouldn't contain any brackets anywhere in the catalog entry, it would not be changed at all.

@antonymilne
Copy link
Contributor

I think these rules look generally good 👍 I also think we need all three of them, and the order you suggest them is right, i.e. count non-bracketed characters, then count number of brackets, then sort alphabetically.

We shouldn't need to consider the specificity of your example #5 (france.companies) at all because it doesn't have {} in it. This isn't a pattern and so if you catalog.load a dataset with this name you're not interested in looking at patterns that match.

Also I don't think it's a problem that #1 and #3 are the same or that we need to go through and remove duplicates. These mean exactly the same thing so we don't really care how they rank relative to each other. In practice they would be distinguished based on the final alphabetical order, and then it will always be the same one that gets matched against first and the other one will just get ignored.

@noklam
Copy link
Contributor

noklam commented May 11, 2023

I also like the fact that the rule is simple. I suspect there will be edge cases but I am not too nervous about it. In general we should encourage people to write simple catalog, if it requires very obscure rules it's likely that they can just simplify it by changing the structure.

One missing example that I can think of is whether we should consider namespace as a priority.

Consider the name France.companies

Which pattern will win?

  • France.{something}
  • {namespace}.companies

Previous conversation favor the former one as it's more specific about namespace. This however require some more sophisticated rule. It shouldn't be hard tho, we can simply split on the dot and derive the score from there.

@antonymilne
Copy link
Contributor

antonymilne commented May 11, 2023

Hmm, that's an interesting example about the namespaces. I haven't thought it through fully, but I can't immediately think of a rule that would would order these the way you want without it being quite complex though? To me it's also not immediately obvious that france.{something} should necessarily beat {namespace}.companies in the first place. So I think it's probably fine to not handle namespaces using any special logic and just treat . like any other character. I suspect it will just get too complex otherwise for not that much benefit. But very happy to be proved wrong here - I hadn't thought of the example you give here before, and it's a good one.

@antonymilne
Copy link
Contributor

antonymilne commented May 12, 2023

Something else I just thought of. Let's say you have these patterns:

1. italy.{dataset}
2. france.{dataset}
3. switzerland.{dataset}
4. {namespace}.companies

Then the order of these would be 3 > 4 > 2 > 1 which is not really obvious, just because "switzerland" is a longer word than "companies". So a dataset name switzerland.companies would match switzerland.{dataset} whereas italy.companies and france.companies would match {namespace}.companies.

Maybe it would be less arbitrary to count number of unbracketed "chunks" outside brackets rather than number of characters? This would rank all the equally so would then need another rule to order them. Or maybe we should consider the order of where unbracketed chunks appear relative to bracketed ones in the pattern? Or maybe this is just an edge case that won't come up in practice and we shouldn't care about it?

Note this is nothing specific to do with namespacing: if you replace . with _ in all the above examples then the same arguments hold. So I still think we should probably not treat . specially in any way.

@noklam
Copy link
Contributor

noklam commented May 12, 2023

I agree we don't need to treat the namespace specially, however for a good convention it's most likely that people should split it using some kind of delimeter for these pattern.

It's also important to note that the patterns aren't mean for refactoring, there are still value to create explicit entry, it's probably not worth it to use a pattern to just replace two or three similar entries. (same apply for Jinja)

Can we try to do this exercise with a realistic example? I think @merelcht example could be useful for documentation anyway, so it won't be a waste. It would be interesting to have at least two people to go through the same exercise and see what's the difference.

I see the main use cases are two:

  1. Shorten the catalog with long repetitive entries (maybe 5 or 10, it varies because you will most likely use it with variable interpolation )
  2. Enumerating is not possible, where you have to loop through a dynamic list.

I should add that I think we can start with just counting characters and see how well it goes.

@noklam
Copy link
Contributor

noklam commented May 12, 2023

Just checking with my understanding, for 2. with jinja you will loop through the list inside the YAML file. In contrast, with Dataset Factory you will loop through this directly inside the nodes file instead?

@ankatiyar
Copy link
Contributor Author

My assumption is that any user ideally would not have a lot of "dataset patterns" that would match a given dataset name. As long as we have a set of simple + deterministic rules to order the patterns, we could leave it up to the user to deal with cases where multiple matches exist.

@ankatiyar
Copy link
Contributor Author

Closing in favour of #2635

@ankatiyar ankatiyar closed this Jun 14, 2023
@ankatiyar ankatiyar deleted the dataset-factories-parsing-rules branch June 29, 2023 10:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants