Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

override locality and localadmin values from admin lookup #93

Merged
merged 3 commits into from
Jul 21, 2016

Conversation

trescube
Copy link
Contributor

Fixed #92

This PR uses the geonames record name value instead of the wof-admin-lookup values when layer is either locality or localadmin.

@trescube
Copy link
Contributor Author

Requires merge of pelias/model#38

@trescube
Copy link
Contributor Author

Here's an example of the problem and how this will benefit us. There's a record in geonames named Gwendale with feature code PPL that we treat as a locality and lat/lon 39.9876,-76.74469. The geonames importer performs WOF admin lookup for the fields:

['country', 'macroregion', 'region', 'macrocounty', 'county', 'localadmin', 'locality']

The pseudo-record for this is:

layer: locality,
name: Gwendale, 
gid: geonames:locality:4050038,
locality: York,
locality_gid: whosonfirst:locality:101717279,
county: York County,
county_gid: whosonfirst:county:102080953,
region: Pennsylvania,
region_gid: whosonfirst:region:85688481,
country: United States,
country_gid: whosonfirst:country:85633793

The name and locality fields are in direct contradiction. The libpostal updates to pelias queries multimatch on the coarse fields not the name field, so when searching for York, PA, the record for Gwendale will be returned because it's locality field matches on York and region on PA. This is bad behavior because it's not what the user asked for. It also creates strange labels; in this case Gwendale, York, PA, USA.

This PR performs the admin lookup as expected but adds a stream afterward that discards the locality or localadmin parent if the geonames layer was locality or localadmin, respectively. The effect is that whereas admin-lookup will create (same as above):

layer: locality,
name: Gwendale, 
gid: geonames:locality:4050038,
locality: York,
locality_gid: whosonfirst:locality:101717279,
county: York County,
county_gid: whosonfirst:county:102080953,
region: Pennsylvania,
region_gid: whosonfirst:region:85688481,
country: United States,
country_gid: whosonfirst:country:85633793

the correction stream throws away the locality parent (including _id and _a) and adds a new parent using the geonames record name and id, resulting in:

layer: locality,
name: Gwendale, 
gid: geonames:locality:4050038,
locality: Gwendale,
locality_gid: geonames:locality:4050038,
county: York County,
county_gid: whosonfirst:county:102080953,
region: Pennsylvania,
region_gid: whosonfirst:region:85688481,
country: United States,
country_gid: whosonfirst:country:85633793

Now when the libpostal updates queries multimatch for York on locality, the value won't match and Gwendale won't be returned. This also fixes the label generation issue since name and locality will be correctly deduped to Gwendale, PA, USA.

@trescube
Copy link
Contributor Author

trescube commented Jul 15, 2016

6 of the following are geonames records that demonstrate the problem:

screen shot 2016-07-15 at 3 27 18 pm

document.addParent('locality', document.getName('default'), document.getId());
}
else if (document.getLayer() === 'localadmin') {
document.clearParent('localadmin');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be cleaner to make this a simple function that takes in the layer name and then have a single if statement that checks if layer is locality or localadmin and call it with the layer value if that's the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can fix that

@dianashk
Copy link
Contributor

We should be sure to document this well in our pelias-doc repo. It's a weird condition and could be confusing. It also means that people can't expect to lookup the geometries of this item, so we should point that out.

@dianashk
Copy link
Contributor

...also how come functional tests are failing? 😢

@trescube
Copy link
Contributor Author

trescube commented Jul 18, 2016

@dianashk I saw a comment come thru about creating an infinite loop because a record is a parent of itself but I can't seem to find it anymore. This is existing behavior; records are already considered parents of themselves. Here's production:

screen shot 2016-07-18 at 9 57 41 am

@trescube
Copy link
Contributor Author

It's failing tests because it requires changes to pelias-model to be available.

@missinglink
Copy link
Member

LGTM

@trescube trescube force-pushed the override-locality-and-localadmin branch from 1dd2546 to 5912340 Compare July 21, 2016 20:39
@orangejulius
Copy link
Member

LGTM

@trescube trescube merged commit f0a0353 into master Jul 21, 2016
@orangejulius orangejulius deleted the override-locality-and-localadmin branch July 25, 2016 20:35
orangejulius added a commit to pelias/api that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.
orangejulius added a commit to pelias/api that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.
orangejulius added a commit to pelias/api that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a
few improvements.
orangejulius added a commit to pelias/api that referenced this pull request Mar 10, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a
few improvements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants