Zentity not honoring nesting #115

Open
LydiaBartholomew opened this issue Jan 13, 2023 · 6 comments

Comments

@LydiaBartholomew

Environment

  • zentity version: 1.8.1
  • Elasticsearch version: 7.10.2

Describe the bug

Zentity does not appear to honor a nested index structure. I have an index definition with person data fields (names, addresses, phones, etc.). The addresses are nested in the index definition (e.g. Addresses.FullStreet, Addresses.City). The matcher in the model definition specifies that they are nested, and the indices section uses dot notation to map to the index definition (e.g. Addresses.FullStreet). When I search a street address, city, state, and zip, any one of the street addresses will match against any one of the cities, states, and zips rather than grouping a street address with a particular city, state, and zip. I also can't seem to get my firstname, lastname, state resolver to match, but I'm assuming I have a typo somewhere.

Expected behavior

I expected that each full address (street, city, state, zip) would be evaluated separately from other full addresses.

Steps to reproduce

  1. PUT index definition with nested addresses: {"settings":{"index":{"number_of_shards":1,"number_of_replicas":0,"analysis":{"filter":{"street_suffix_map":{"pattern":"(st)","type":"pattern_replace","replacement":"street"},"phonetic":{"type":"phonetic","encoder":"nysiis"},"punct_white":{"pattern":"\\p{Punct}","type":"pattern_replace","replacement":" "},"remove_non_digits":{"pattern":"[^\\d]","type":"pattern_replace","replacement":""},"remove_punct":{"pattern":"[^a-zA-Z0-9]","type":"pattern_replace","replacement":""}},"analyzer":{"name_clean":{"filter":["icu_normalizer","icu_folding","punct_white"],"tokenizer":"standard"},"name_phonetic":{"filter":["icu_normalizer","icu_folding","punct_white","phonetic"],"tokenizer":"standard"},"street_clean":{"filter":["icu_normalizer","icu_folding","punct_white","trim"],"tokenizer":"keyword"}}}}},"mappings":{"properties":{"Names":{"type":"nested","properties":{"FirstName":{"type":"text","fields":{"clean":{"type":"text","analyzer":"name_clean"},"phonetic":{"type":"text","analyzer":"name_phonetic"}}},"MiddleName":{"type":"text","fields":{"clean":{"type":"text","analyzer":"name_clean"},"phonetic":{"type":"text","analyzer":"name_phonetic"}}},"LastName":{"type":"text","fields":{"clean":{"type":"text","analyzer":"name_clean"},"phonetic":{"type":"text","analyzer":"name_phonetic"}}}}},"Addresses":{"type":"nested","properties":{"FullStreet":{"type":"text","fields":{"clean":{"type":"text","analyzer":"street_clean"}}},"City":{"type":"text","fields":{"clean":{"type":"text","analyzer":"name_clean"}}},"State":{"type":"text","fields":{"clean":{"type":"keyword"}}},"Zip":{"type":"text"},"Zip4":{"type":"text"}}}}}}
  2. PUT model definition indicating addresses are nested: {"attributes":{"name.first_name":{"type":"string","score":0.6125},"name.middle_name":{"type":"string","score":0.6125},"name.last_name":{"type":"string","score":0.65},"address.full_street":{"type":"string","score":0.75},"address.city":{"type":"string","score":0.55},"address.state":{"type":"string","score":0.5125},"address.zip":{"type":"string","score":0.75}},"resolvers":{"full_name":{"attributes":["name.first_name","name.last_name"]},"first_last_city":{"attributes":["name.first_name","name.last_name","address.city"]},"first_last_state":{"attributes":["name.first_name","name.last_name","address.state"]},"first_last_zip":{"attributes":["name.first_name","name.last_name","address.zip"]},"street_city":{"attributes":["address.full_street","address.city"]},"street_state":{"attributes":["address.full_street","address.state"]},"street_zip":{"attributes":["address.full_street","address.zip"]},"last_street":{"attributes":["name.last_name","address.full_street"]}},"matchers":{"simple_nested_name":{"clause":{"nested":{"path":"Names","query":{"match":{"{{ field }}":"{{ value }}"}}}},"quality":0.975},"fuzzy_nested_name":{"clause":{"nested":{"path":"Names","query":{"match":{"{{ field }}":{"query":"{{ value }}","fuzziness":"1"}}}}},"quality":0.95},"simple_nested_addresses":{"clause":{"nested":{"path":"Addresses","query":{"match":{"{{ field }}":"{{ value }}"}}}},"quality":0.95},"exact_nested_address":{"clause":{"nested":{"path":"Addresses","query":{"term":{"{{ field }}":"{{ value }}"}}}},"quality":1.0},"fuzzy_nested_address":{"clause":{"nested":{"path":"Addresses","query":{"match":{"{{ field }}":{"query":"{{ value }}","fuzziness":"1"}}}}},"quality":0.95}},"indices":{"person_list":{"fields":{"Names.FirstName.clean":{"attribute":"name.first_name","matcher":"fuzzy_nested_name","quality":0.975},"Names.FirstName.phonetic":{"attribute":"name.first_name","matcher":"simple_nested_name","quality":0.925},"Names.MiddleName.clean":{"attribute":"name.middle_name","matcher":"fuzzy_nested_name","quality":0.975},"Names.MiddleName.phonetic":{"attribute":"name.middle_name","matcher":"simple_nested_name","quality":0.925},"Names.LastName.clean":{"attribute":"name.last_name","matcher":"fuzzy_nested_name","quality":0.975},"Names.LastName.phonetic":{"attribute":"name.last_name","matcher":"simple_nested_name","quality":0.925},"Addresses.FullStreet.clean":{"attribute":"address.full_street","matcher":"simple_nested_addresses","quality":0.975},"Addresses.City.clean":{"attribute":"address.city","matcher":"simple_nested_addresses","quality":0.975},"Addresses.State.keyword":{"attribute":"address.state","matcher":"exact_nested_address"},"Addresses.Zip":{"attribute":"address.zip","matcher":"simple_nested_addresses","quality":0.975}}}}}
  3. Index data: {"index": {"_id": "1", "_index": "person_list"}}
    {"Names":[{"FirstName":"Jane","MiddleName":"D","LastName":"Smith"}],"Addresses":[{"FullStreet":"123 Main St","City":"Barnstable","State":"MA","Zip":"02632","Zip4":""},{"FullStreet":"567 North St","City":"Arlington","State":"VA","Zip":"22201"}]}
    {"index": {"_id": "2", "_index": "person_watch_list"}}
  4. Search: POST _zentity/resolution/person_model?_score=true&pretty&_explanation=true&max_hops=0
    {
    "attributes": {
    "name.last_name": ["Smith"],
    "name.first_name": ["Jane"],
    "address.full_street":["567 North St"],
    "address.city":["Barnstable"],
    "address.state":["MA"],
    "address.zip":["02632"]
    }
    }

Additional context

The search above matches with the street_city and street_zip resolvers. I would hope this would not match, and would only match if I changed address.full_street to "123 Main St". Thank you in advance for your help!

@LydiaBartholomew LydiaBartholomew added the bug Something isn't working label Jan 13, 2023
@LydiaBartholomew LydiaBartholomew changed the title [Bug] Zentity not honoring nesting Jan 17, 2023
@susan-shu-c

susan-shu-c commented Jan 30, 2023

I'm running into the same issue.
Using the sandbox via https://zentity.io/sandbox/
[Edit: Seems to be something else causing the issue, I'm experimenting with both nested and unnested fields. Will report back]

zentity version: 1.6.1
Elasticsearch version: 7.10.1

@mjlachman

Commenting to follow this issue.

@susan-shu-c

Update:
It turns out my issue was with object fields and not nested objects, so I've since resolved it.

Along the way, I found some past solutions for nested fields - in the matchers clause it needs to be specified that what you're looking at is nested:

@davemoore-
Member

I believe I've found the issue. I don't think it's a bug, because zentity is executing the queries correctly. What's needed is an enhancement for supporting the "inner_hits" field of nested queries.

When I execute your resolution job with the "street_zip" resolver, the explanation shows the following:

"_explanation": {
    "resolvers": {
        "street_zip": {
            "attributes": [
                "address.full_street",
                "address.zip"
            ]
        }
    },
    "matches": [
        {
            "attribute": "address.full_street",
            "target_field": "Addresses.FullStreet.clean",
            "target_value": [
                "123 Main St",
                "567 North St"
            ],
            "input_value": "567 North St",
            "input_matcher": "simple_nested_address_street",
            "input_matcher_params": {}
        },
        {
            "attribute": "address.zip",
            "target_field": "Addresses.Zip",
            "target_value": [
                "02632",
                "22201"
            ],
            "input_value": "02632",
            "input_matcher": "simple_nested_address_zip",
            "input_matcher_params": {}
        }
    ]
}

We can see how the "address.full_street" attribute matched. The input value of "567 North St" matches two target values: "123 Main St" and "567 North St". We know this shouldn't be correct. Clearly only one of those values is what we want to match. But we can also inspect the query that zentity constructed from your matchers to see that it's running the query we told it to.

You can see what queries zentity is constructing by passing the queries parameter in the URL:

POST _zentity/resolution/person_model?_score=true&pretty&_explanation=true&max_hops=0&queries

Let's copy the query clause that was constructed in this resolution job for the address.full_street attribute, and we'll run the query directly on the person_list index:

GET person_list/_search
{
    "query": {
        "bool": {
            "_name": "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0",
            "filter": {
                "nested": {
                    "path": "Addresses",
                    "query": {
                        "match": {
                            "Addresses.FullStreet.clean": "567 North St"
                        }
                    }
                }
            }
        }
    }
}

Here's the response:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.0,
        "hits": [
            {
                "_index": "person_list",
                "_id": "1",
                "_score": 0.0,
                "_source": {
                    "Names": [
                        {
                            "FirstName": "Jane",
                            "MiddleName": "D",
                            "LastName": "Smith"
                        }
                    ],
                    "Addresses": [
                        {
                            "FullStreet": "123 Main St",
                            "City": "Barnstable",
                            "State": "MA",
                            "Zip": "02632",
                            "Zip4": ""
                        },
                        {
                            "FullStreet": "567 North St",
                            "City": "Arlington",
                            "State": "VA",
                            "Zip": "22201"
                        }
                    ]
                },
                "matched_queries": [
                    "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0"
                ]
            }
        ]
    }
}

When zentity receives this hit, it sees a single document that has two values for address.full_street: "123 Main St" and "567 North St". It can't distinguish the fact that there are two nested objects in this hit.
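To make the loss of grouping concrete, here is a minimal Python sketch. It is not zentity's actual code, and the function names are hypothetical. It compares a per-field view of the hit above, where values from all nested objects are pooled, with a check that requires a single nested object to satisfy every criterion.

```python
# Hypothetical illustration only; zentity's real implementation differs.
source = {
    "Addresses": [
        {"FullStreet": "123 Main St", "City": "Barnstable", "State": "MA", "Zip": "02632", "Zip4": ""},
        {"FullStreet": "567 North St", "City": "Arlington", "State": "VA", "Zip": "22201"},
    ]
}

def flattened_values(doc, path, field):
    """Collect every value of `field` under `path`, ignoring which nested object it came from."""
    return [obj[field] for obj in doc.get(path, []) if field in obj]

def matches_flattened(doc, path, criteria):
    """True if each wanted value appears somewhere under `path` (grouping is lost)."""
    return all(value in flattened_values(doc, path, field) for field, value in criteria.items())

def matches_same_object(doc, path, criteria):
    """True only if one nested object satisfies every criterion at once."""
    return any(
        all(obj.get(field) == value for field, value in criteria.items())
        for obj in doc.get(path, [])
    )

criteria = {"FullStreet": "567 North St", "Zip": "02632"}
print(matches_flattened(source, "Addresses", criteria))    # True
print(matches_same_object(source, "Addresses", criteria))  # False
```

The first result mirrors what the "street_zip" resolver reported in this issue; the second is the behavior the reporter expected.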

However, zentity could determine which nested object matched if we added "inner_hits" to the query clause:

GET person_list/_search
{
    "query": {
        "bool": {
            "_name": "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0",
            "filter": {
                "nested": {
                    "path": "Addresses",
                    "query": {
                        "match": {
                            "Addresses.FullStreet.clean": "567 North St"
                        }
                    },
                    "inner_hits": {}
                }
            }
        }
    }
}

Here's the response:

{
    "took": 15,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.0,
        "hits": [
            {
                "_index": "person_list",
                "_id": "1",
                "_score": 0.0,
                "_source": {
                    "Names": [
                        {
                            "FirstName": "Jane",
                            "MiddleName": "D",
                            "LastName": "Smith"
                        }
                    ],
                    "Addresses": [
                        {
                            "FullStreet": "123 Main St",
                            "City": "Barnstable",
                            "State": "MA",
                            "Zip": "02632",
                            "Zip4": ""
                        },
                        {
                            "FullStreet": "567 North St",
                            "City": "Arlington",
                            "State": "VA",
                            "Zip": "22201"
                        }
                    ]
                },
                "matched_queries": [
                    "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0"
                ],
                "inner_hits": {
                    "Addresses": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 0.6931471,
                            "hits": [
                                {
                                    "_index": "person_list",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "Addresses",
                                        "offset": 1
                                    },
                                    "_score": 0.6931471,
                                    "_source": {
                                        "FullStreet": "567 North St",
                                        "City": "Arlington",
                                        "State": "VA",
                                        "Zip": "22201"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

Notice the response lists only the street address of "567 North St" in the "inner_hits" field. That's what we want.
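As a rough sketch of how a client could read that information, the Python below pulls the nested offset and source of each matching nested object out of one "inner_hits" entry. The helper name is hypothetical, and the `hit` dictionary is a trimmed copy of the response above.

```python
def matched_nested_objects(hit, inner_hits_name):
    """Return (offset, _source) pairs for the nested objects listed under one inner_hits entry."""
    inner = hit.get("inner_hits", {}).get(inner_hits_name, {})
    return [
        (h["_nested"]["offset"], h["_source"])
        for h in inner.get("hits", {}).get("hits", [])
    ]

# Trimmed copy of the hit shown above.
hit = {
    "_id": "1",
    "inner_hits": {
        "Addresses": {
            "hits": {
                "hits": [
                    {
                        "_nested": {"field": "Addresses", "offset": 1},
                        "_source": {
                            "FullStreet": "567 North St",
                            "City": "Arlington",
                            "State": "VA",
                            "Zip": "22201",
                        },
                    }
                ]
            }
        }
    },
}

for offset, nested_source in matched_nested_objects(hit, "Addresses"):
    print(offset, nested_source["FullStreet"])  # 1 567 North St
```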

Unfortunately, you can't safely add "inner_hits": {} to your matcher clause, because if a query contains multiple clauses with "inner_hits": {} then Elasticsearch will respond with:

[inner_hits] already contains an entry for key [Addresses]

You can see that by trying to run this query, which attempts to search with "inner_hits" for both attributes in your resolver: address.full_street and address.zip:

GET person_list/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "bool": {
                        "_name": "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0",
                        "filter": {
                            "nested": {
                                "path": "Addresses",
                                "query": {
                                    "match": {
                                        "Addresses.FullStreet.clean": "567 North St"
                                    }
                                },
                                "inner_hits": {}
                            }
                        }
                    }
                },
                {
                    "bool": {
                        "_name": "address.zip:Addresses.Zip:simple_nested_addresses:MDI2MzI=:1",
                        "filter": {
                            "nested": {
                                "path": "Addresses",
                                "query": {
                                    "match": {
                                        "Addresses.Zip": "02632"
                                    }
                                },
                                "inner_hits": {}
                            }
                        }
                    }
                }
            ]
        }
    }
}
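The collision happens because an unnamed "inner_hits" in a nested query takes the nested path as its name, so both clauses above end up as "Addresses". A small Python sketch (the helper names are hypothetical) that walks a query body and reports names that would collide:

```python
from collections import Counter

def inner_hits_names(query):
    """Yield the effective inner_hits name of every nested clause in a query body."""
    if isinstance(query, dict):
        nested = query.get("nested")
        if isinstance(nested, dict) and "inner_hits" in nested:
            # Without an explicit "name", a nested query's inner_hits is keyed by its path.
            yield nested["inner_hits"].get("name", nested["path"])
        for value in query.values():
            yield from inner_hits_names(value)
    elif isinstance(query, list):
        for item in query:
            yield from inner_hits_names(item)

def duplicate_inner_hits_names(query):
    """Return the sorted inner_hits names that occur more than once."""
    counts = Counter(inner_hits_names(query))
    return sorted(name for name, count in counts.items() if count > 1)

def nested_clause(field, value, name=None):
    """Build a nested match clause on the Addresses path, optionally naming its inner_hits."""
    inner_hits = {} if name is None else {"name": name}
    return {"nested": {"path": "Addresses", "query": {"match": {field: value}}, "inner_hits": inner_hits}}

unnamed = {"bool": {"filter": [
    nested_clause("Addresses.FullStreet.clean", "567 North St"),
    nested_clause("Addresses.Zip", "02632"),
]}}
named = {"bool": {"filter": [
    nested_clause("Addresses.FullStreet.clean", "567 North St", name="street"),
    nested_clause("Addresses.Zip", "02632", name="zip"),
]}}

print(duplicate_inner_hits_names(unnamed))  # ['Addresses']
print(duplicate_inner_hits_names(named))    # []
```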

What would safely work is giving each "inner_hits" clause a unique "name". See below, in which I give each "inner_hits" clause the same "name" as the "_name" of its respective "bool" query:

GET person_list/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "bool": {
                        "_name": "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0",
                        "filter": {
                            "nested": {
                                "path": "Addresses",
                                "query": {
                                    "match": {
                                        "Addresses.FullStreet.clean": "567 North St"
                                    }
                                },
                                "inner_hits": {
                                    "name": "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0"
                                }
                            }
                        }
                    }
                },
                {
                    "bool": {
                        "_name": "address.zip:Addresses.Zip:simple_nested_addresses:MDI2MzI=:1",
                        "filter": {
                            "nested": {
                                "path": "Addresses",
                                "query": {
                                    "match": {
                                        "Addresses.Zip": "02632"
                                    }
                                },
                                "inner_hits": {
                                    "name": "address.zip:Addresses.Zip:simple_nested_addresses:MDI2MzI=:1"
                                }
                            }
                        }
                    }
                }
            ]
        }
    }
}

Here's the response:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.0,
        "hits": [
            {
                "_index": "person_list",
                "_id": "1",
                "_score": 0.0,
                "_source": {
                    "Names": [
                        {
                            "FirstName": "Jane",
                            "MiddleName": "D",
                            "LastName": "Smith"
                        }
                    ],
                    "Addresses": [
                        {
                            "FullStreet": "123 Main St",
                            "City": "Barnstable",
                            "State": "MA",
                            "Zip": "02632",
                            "Zip4": ""
                        },
                        {
                            "FullStreet": "567 North St",
                            "City": "Arlington",
                            "State": "VA",
                            "Zip": "22201"
                        }
                    ]
                },
                "matched_queries": [
                    "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0",
                    "address.zip:Addresses.Zip:simple_nested_addresses:MDI2MzI=:1"
                ],
                "inner_hits": {
                    "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 0.6931471,
                            "hits": [
                                {
                                    "_index": "person_list",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "Addresses",
                                        "offset": 1
                                    },
                                    "_score": 0.6931471,
                                    "_source": {
                                        "FullStreet": "567 North St",
                                        "City": "Arlington",
                                        "State": "VA",
                                        "Zip": "22201"
                                    }
                                }
                            ]
                        }
                    },
                    "address.zip:Addresses.Zip:simple_nested_addresses:MDI2MzI=:1": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 0.6931471,
                            "hits": [
                                {
                                    "_index": "person_list",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "Addresses",
                                        "offset": 0
                                    },
                                    "_score": 0.6931471,
                                    "_source": {
                                        "FullStreet": "123 Main St",
                                        "City": "Barnstable",
                                        "State": "MA",
                                        "Zip": "02632",
                                        "Zip4": ""
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

Notice that there are two sets of "inner_hits". Each one contains just one nested object that correctly matched our criteria. That's what we want. We also see that the two nested objects are different: the street address matched the object at offset 1, while the zip code matched the object at offset 0. That tells us that neither nested object should be considered a match, because neither nested object contains both the street address and the zip code that we were hoping to find in a single nested object.

Unfortunately, this isn't possible in zentity yet. The matcher clause would need to support a variable reference to the clause_name, which is generated at query time. An implementation would look something like this:

{
    ...,
    "simple_nested_addresses": {
        "clause": {
            "nested": {
                "path": "Addresses",
                "query": {
                    "match": {
                        "{{ field }}": "{{ value }}"
                    }
                },
                "inner_hits": {
                    "name": "{{ clause_name }}" // or simply "_name" as it's referred to in the "bool" query
                }
            }
        },
        "quality": 0.95
    },
    ...
}
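A minimal sketch of how such a substitution could work, written in Python for illustration. This is not zentity's implementation: `render_clause` is a hypothetical helper, and `{{ clause_name }}` is the proposed variable, which does not exist in zentity today.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render_clause(clause, variables):
    """Substitute {{ name }} placeholders in a matcher clause given as a dict."""
    template = json.dumps(clause)

    def replace(match):
        # json.dumps(...)[1:-1] escapes the value so it is safe inside a JSON string.
        return json.dumps(str(variables[match.group(1)]))[1:-1]

    return json.loads(PLACEHOLDER.sub(replace, template))

matcher_clause = {
    "nested": {
        "path": "Addresses",
        "query": {"match": {"{{ field }}": "{{ value }}"}},
        "inner_hits": {"name": "{{ clause_name }}"},  # hypothetical variable
    }
}

clause_name = "address.zip:Addresses.Zip:simple_nested_addresses:MDI2MzI=:1"
rendered = render_clause(
    matcher_clause,
    {"field": "Addresses.Zip", "value": "02632", "clause_name": clause_name},
)
print(json.dumps(rendered, indent=4))
```

Reusing the clause name keeps each "inner_hits" key unique and lets the response be tied back to the attribute that produced it.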

Additionally, zentity would need to check the query results for any instances of "inner_hits" and then discard hits that don't appear in every instance of "inner_hits" of a given resolver before considering the hit to be a match.
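The filtering step could look roughly like the Python sketch below. It is a simplification under stated assumptions, not zentity's design: every attribute of the resolver is assumed to live under the same nested path, each "inner_hits" name is assumed to start with the attribute name followed by a colon (as in the clause names above), and the function names are hypothetical.

```python
def nested_offsets(hit, name):
    """Return the set of (nested field, offset) pairs listed under one named inner_hits entry."""
    hits = hit.get("inner_hits", {}).get(name, {}).get("hits", {}).get("hits", [])
    return {(h["_nested"]["field"], h["_nested"]["offset"]) for h in hits}

def hit_satisfies_resolver(hit, resolver_attributes):
    """True if at least one nested object appears in the inner_hits of every resolver attribute."""
    common = None
    for attribute in resolver_attributes:
        # Assumption: clause names look like "<attribute>:<field>:<matcher>:<value>:<n>".
        names = [n for n in hit.get("inner_hits", {}) if n.split(":", 1)[0] == attribute]
        offsets = set().union(*(nested_offsets(hit, n) for n in names))
        common = offsets if common is None else common & offsets
    return bool(common)

def inner(offset):
    """Build a minimal inner_hits entry pointing at one Addresses object."""
    return {"hits": {"hits": [{"_nested": {"field": "Addresses", "offset": offset}}]}}

street = "address.full_street:Addresses.FullStreet.clean:simple_nested_addresses:NTY3IE5vcnRoIFN0:0"
zip_code = "address.zip:Addresses.Zip:simple_nested_addresses:MDI2MzI=:1"

split_hit = {"_id": "1", "inner_hits": {street: inner(1), zip_code: inner(0)}}  # as in this issue
same_hit = {"_id": "2", "inner_hits": {street: inner(0), zip_code: inner(0)}}   # made-up contrast

resolver = ["address.full_street", "address.zip"]
kept = [h["_id"] for h in (split_hit, same_hit) if hit_satisfies_resolver(h, resolver)]
print(kept)  # ['2']
```

A resolver that mixes nested paths (for example "last_street", which spans Names and Addresses) would need the intersection computed per path rather than across all attributes.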

I think this describes the problem well, and I think this can be targeted for inclusion in zentity 1.9.0.

@davemoore- davemoore- added enhancement New feature or request and removed bug Something isn't working labels May 6, 2024
@susan-shu-c

Awesome, @davemoore- !!

@davemoore-
Member

This feature is being worked on in the branch feature-inner-hits (branch history).

Given that nested objects often represent separate entities with multiple attributes (e.g. an address identified by a street, city, state, and zip), I may need to implement compound attributes (#28) before completing this feature. Both features will add quite a bit of complexity to the project. Starting with compound attributes makes sense because it could be applied either to "flat" indices or to indices with nested objects. Once zentity shows that it can handle the complexity of compound attributes with flat indices, I'll feel more comfortable introducing yet another layer of complexity with nested objects.

@davemoore- davemoore- self-assigned this May 7, 2024