-
-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
override locality
and localadmin
values from admin lookup
#93
Conversation
Requires merge of pelias/model#38 |
Here's an example of the problem and how this will benefit us. There's a record in geonames named
The pseudo-record for this is:
The This PR performs the admin lookup as expected but adds a stream afterward that discards the
the correction stream throws away the
Now when the libpostal updates queries multimatch for |
document.addParent('locality', document.getName('default'), document.getId()); | ||
} | ||
else if (document.getLayer() === 'localadmin') { | ||
document.clearParent('localadmin'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be cleaner to make this a simple function that takes in the layer name and then have a single if statement that checks if layer is locality
or localadmin
and call it with the layer value if that's the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can fix that
We should be sure to document this well in our |
...also how come functional tests are failing? 😢 |
@dianashk I saw a comment come thru about creating an infinite loop because a record is a parent of itself but I can't seem to find it anymore. This is existing behavior; records are already considered parents of themselves. Here's production: |
It's failing tests because it requires changes to pelias-model to be available. |
LGTM |
1dd2546
to
5912340
Compare
LGTM |
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited.
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited.
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited. That said, there are no regressions in our acceptance tests, and quite a few improvements.
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited. That said, there are no regressions in our acceptance tests, and quite a few improvements.
Fixed #92
This PR uses the geonames record
name
value instead of the wof-admin-lookup values when layer is eitherlocality
orlocaladmin
.