Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format semicolon-delimited value lists in labels #666

Merged
merged 4 commits into from
Jan 6, 2023
Merged

Conversation

1ec5
Copy link
Member

@1ec5 1ec5 commented Jan 5, 2023

Whenever a feature’s name contains multiple values separated by semicolons, the label now puts each value on a separate line or separates them with an elegant bullet, for up to nine values, honoring the ;; escape sequence.

This implementation includes a general-purpose polyfill for a string replacement expression operator. We can use it for any purpose in an expression or filter but need to be careful about avoiding a combinatoric explosion in the size of an expression’s code. I’ve included a comprehensive battery of unit tests.

Kaser Kaser legend
Kaser and New Square, New York

Machahe Machahe legend
Machahe/Caiyuan/Liucandong/Houyukou/Wangshilou/Lichahe/Chahexin/Fukangxin/Qianyukou villages in Liangshan County, Jining City, Shandong, China

Viet Heritage Garden
Viet Heritage Garden, San José, California

Columbus Cincinnati Road
Columbus Cincinnati Road/Cincinnati–Columbus Road in West Chester Township, Butler County, Ohio

63A&B 63A&B legend
Southbound Interstate 275 at exits 63A/B in Union Township, Clermont County, Ohio

Fixes #665.

@1ec5 1ec5 added enhancement New feature or request internationalization labels Jan 5, 2023
@1ec5 1ec5 self-assigned this Jan 5, 2023
Comment on lines 172 to 178
/**
* Maximum number of values in a semicolon-delimited list of values.
*
* Increasing this constant deepens recursion for replacing delimiters in the
* list, potentially affecting style loading performance.
*/
const maxValueListLength = 9;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please compare this branch to main and note any perceptible degradation in style loading speed, tile loading speed, panning and zooming performance, or memory or CPU usage. If it’s severe enough, we can tune this limit downward until we’re satisfied with the balance between handling enough features and maintaining reasonable performance.

Comment on lines +196 to +198
// Replace the ;; escape sequence with a placeholder sequence unlikely to
// legitimately occur inside a value or separator.
const objReplacementChar = "\x91\ufffc\x92"; // https://overpass-turbo.eu/s/1pJx
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s amazing how much junk there is in name that can impersonate a strip marker, but this sequence does the trick for now. If someone opens a pub for text processing geeks and names it “The �� & ;”, I’ll kindly ask someone to offer up a prized password as sacrifice.

@@ -217,7 +338,7 @@ export const localizedNameWithLocalGloss = [
["var", "localizedCollator"],
],
// ...just pick one.
["var", "localizedName"],
["format", listValuesExpression(["var", "localizedName"], "\n")],
Copy link
Member Author

@1ec5 1ec5 Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value of this method is so complex that the style-spec package (used by the unit tests) is apparently unable to upcast it to a format expression, leading to an error when it sees the direct format expression in the else slot below. Interestingly, GL JS has no such problem at runtime. This explicit cast reassures the expression parser that each branch of the case expression will resolve to a format expression.

Comment on lines 274 to +277
if (typeof evaluated === "string") {
return [evaluated, null];
return [evaluated];
}
return [evaluated.sections[0].text, evaluated.sections[4].text];
return [evaluated.sections[0].text, evaluated.sections[4]?.text];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a result of the manual cast above, this expression no longer returns a literal string, always a format object.

@ZeLonewolf
Copy link
Member

Test location below.

Performance measurement in Chrome on my laptop (multiple refreshes based on network tab in the Chrome debugger console
Current master (remote github): 1.9-2.1s
Current master (localhost): 1.9-2.1s
This branch (localhost): 2.6-2.9s

Approximately 37% slowdown.

Current/New:
image
image
(localhost link)

Cursory testing seems to indicate that these numbers are independent of location (high zoom on an ocean tile yields the same results).

I'm curious to understand if the performance difference is due to execution or building the style.json.

@1ec5
Copy link
Member Author

1ec5 commented Jan 5, 2023

Cursory testing seems to indicate that these numbers are independent of location (high zoom on an ocean tile yields the same results).

Yes, the expression is only evaluated at runtime when loading a tile or reevaluating a symbol layer, so the impact is uniform everywhere. That said, it would be possible to limit this semicolon replacement to certain layers that are more likely to have semicolons in them. Or we could create a separate copy of each layer prefiltered to just the features that contain semicolons in them, but at the cost of additional layers that participate in symbol collision.

I'm curious to understand if the performance difference is due to execution or building the style.json.

The new code probably isn’t a significant contributor to the time it takes to build the JSON. (It happens in memory; there’s no file involved.) However, the step right after, loading the JSON, requires more memory than before. Under the hood, GL JS compiles the expression to a type-safe, object-oriented data structure, which becomes more expensive as the expression becomes more deeply nested. Whenever a new tile needs to be displayed, a Web Worker needs to evaluate more massive expressions. The case statements should allow the evaluator to short-circuit most of the expression unless there are semicolons, but it still requires work to ferry around such a large data structure.

With a built-in string replacement operator, the performance concern would essentially go away. The operator’s performance would be similar to the upcase and downcase operators, which no one bats an eye about. Deduplicating values would require slightly more involved string splitting and array joining operators, but those probably wouldn’t have this performance impact either.

@ZeLonewolf
Copy link
Member

Pending anyone else's review of the code changes, I think it's a good idea to proceed with the proposed change to handle semicolon-delimited lists. I'm concerned about the 1/3 performance drop for processing up to 9 semi-colons, though I understand that future improvements to MapLibre may make it possible for use to be more efficient about this. Perhaps we could reduce the 9 down to a number that captures the vast majority of the sane delimiter use cases while ignoring the silly ones that are probably tagging errors.

@claysmalley
Copy link
Member

Should spaces before and after semicolons be trimmed, or is that considered a tagging error?

Screenshot from 2023-01-05 11-27-30

@1ec5
Copy link
Member Author

1ec5 commented Jan 5, 2023

Should spaces before and after semicolons be trimmed, or is that considered a tagging error?

I believe that’s a tagging error. In fact, editors such as iD automatically remove the space when you touch the name.

@zekefarwell
Copy link
Collaborator

zekefarwell commented Jan 5, 2023

Should spaces before and after semicolons be trimmed, or is that considered a tagging error?

The wiki mentions a single space after a semicolon being valid:
https://wiki.openstreetmap.org/wiki/Semi-colon_value_separator#Space_character_padding

I'm sure handling either ; or ; would make for more complex code, but it could be a strategically good choice if it's doable. On maps that don't replace semicolons with something nicer looking, mappers will probably find "First Name; Second Name" a bit more tolerable than "First Name;Second Name". Although they may not like the look of the semicolon, at least there is also a space separating the two names.

@1ec5
Copy link
Member Author

1ec5 commented Jan 5, 2023

Editors strip out the space in general, but preserve it for certain syntaxes like opening_hours and *:conditional. At one point, it was common to intentionally leave a space after the semicolon, but removing it has been standard practice for many years at this point.

@zekefarwell
Copy link
Collaborator

Editors strip out the space in general

Well iD does anyway. I didn't have any problem including a space here with JOSM: https://www.openstreetmap.org/way/971815925

@ZeLonewolf
Copy link
Member

For place nodes, stats on prevalence:

# Semi-colons Prevalence
1+ semicolon 444
2+ semicolons 9
3+ semicolons 6
4+ semicolons 3
5+ semicolons 2
6+ semicolons 1

@1ec5
Copy link
Member Author

1ec5 commented Jan 5, 2023

For reference, the semicolon delimiter was first documented (without a space) in 2007, when it replaced a recommendation to use a comma.1 There continued to be some uncertainty about the delimiter until 2008.

Editors strip out the space in general

Well iD does anyway. I didn't have any problem including a space here with JOSM: https://www.openstreetmap.org/way/971815925

You can do anything you want in JOSM. 😉 But it too prefers a semicolon whenever it constructs a value list for you. The main reason space-padded semicolons appear in the database today is that Potlatch used to insert this delimiter when merging two ways. It was usually a mistake caused by its confusing shortcut for merging ways. Based on that behavior, mappers (including yours truly) used to manually insert the same delimiter for consistency, but it has been rare for a long time: openstreetmap/iD#941.

Documentation about space padding is pretty stale, dating back to when the same page pointed out that characters other than a semicolon were being used as value delimiters. This page is the product of past edit-warring and also flamewarring on the tagging list, so people have probably been hesitant to do much cleanup on it since. (The controversy was not about the choice of delimiter but rather about multiple-value lists in the first place.) I updated the documentation to reflect longstanding reality.

If there’s a strong need to support the space after the semicolon, then there will be an additional performance hit as long as we build upon this magnificent workaround for the lack of a string replacement operator. This extra padding is more noticeable with the inline treatment than with the newline treatment. We could replace the bullet with a series of spaces, which would make the imbalance imperceptible, but at the cost of making it even more likely for a dual-named label to get collided out.

Given this complexity, I’d suggest tracking support for space-padded semicolons as tail work, which I would consider blocked by a string replacement operator. I’m not opposed to it in principle, since a value’s leading and trailing spaces are always extraneous, but the cost-benefit ratio isn’t superb.

For place nodes, stats on prevalence:

Thanks, this argues strongly in favor of setting the limit to just one or two semicolons for now. I suppose it would be difficult to gather similar statistics about ad-hoc delimiters that would be candidates for conversion to semicolons – which is the point of using semicolons in the first place. But out of curiosity, is there a similar distribution on places if we naïvely assume all space-padded slashes and hyphens are intended to be value delimiters?

Footnotes

  1. A comma continues to be used very commonly in addr:housenumber to this day as part of the Karlsruhe schema, authored by the same mapper who originally documented the comma as a more general delimiter.

@zekefarwell
Copy link
Collaborator

You can do anything you want in JOSM. 😉

Indeed 💪💪

Given this complexity, I’d suggest tracking support for space-padded semicolons as tail work, which I would consider blocked by a string replacement operator. I’m not opposed to it in principle, since a value’s leading and trailing spaces are always extraneous, but the cost-benefit ratio isn’t superb.

Sounds good to me. Even without explicitly handling the extra space after a semicolon, the bullet delimiter is far superior to to displaying the raw value with the semicolon. I don't really find the extra space on one side of the bullet all that noticeable. I just didn't think we ought to necessarily consider a space after a semicolon to be a tagging error. At least not while the raw, semicolon delimited values are displayed on the osm.org standard layer.

@ZeLonewolf
Copy link
Member

Query for all types of objects this time.

  • 1 semi-colon 31,953
  • 2 semi-colons 2,163
  • 3 semi-colon 539
  • 4 semi-colon 34
  • 5 semi-colon 10
  • 6+ semi-colon 6

Copy link
Collaborator

@zekefarwell zekefarwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I did some performance profiling and there's definitely an increase to JS execution time and jank on initial load and while panning/zooming around the map. However, performance is already not great and without profiling I'm not sure the increase from this changes is really all that noticeable (at least on my hardware). Given the stats on numbers of semicolons it seems like dropping maxValueListLength to 4 would be perfectly reasonable if that would improve performance.

1ec5 added 3 commits January 6, 2023 00:20
Added a function that returns an expression to find and replace a fixed number of occurrences of a substring. Added a function to format a list of semicolon-delimited tag values. Separate multiple names by newlines or bullets as necessary.
@1ec5 1ec5 force-pushed the 1ec5-name-semicolon-665 branch from 2091677 to cf6c351 Compare January 6, 2023 08:20
@@ -175,7 +175,7 @@ export function replaceExpression(
* Increasing this constant deepens recursion for replacing delimiters in the
* list, potentially affecting style loading performance.
*/
const maxValueListLength = 9;
const maxValueListLength = 3;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three values separated by two semicolons is plenty enough for the names currently using semicolons. In the future, with a more efficient string replacement implementation, we can add support for more semicolons, which will make this option more attractive to mappers currently using informal delimiters like spaces.

@1ec5
Copy link
Member Author

1ec5 commented Jan 11, 2023

Should spaces before and after semicolons be trimmed, or is that considered a tagging error?

#670 collapses up to one space following a semicolon. If a mapper does intentionally use to make the label look nicer in some other renderers, it’ll still work fine in this style. However, iD does strip it out, so if this becomes an important tagging style, iD will need a one-line fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pretty-print semicolon delimiters in compound names
4 participants