Add an expressions string lookup operator that returns the script of the string? #5807
Comments
is_char_in_unicode_block.js comes from the Unicode Character Database’s Scripts.txt. It would be straightforward to add a function that returns the matching key instead of a Boolean, but we’d need to first uncomment all the code blocks that are currently commented out. (We commented them out because we didn’t need them for the purposes of vertical text and ideographic line breaking detection.) It’s worth noting that GL JS is currently unable to display anything in the supplementary planes: #4001. We could still add those blocks to the list for the purpose of this expression operator, though.
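As a rough illustration of "returning the matching key instead of a Boolean", here is a minimal sketch. The `unicodeBlocks` table and the `unicodeBlockOf` function are hypothetical names, and the table is a small hand-picked subset of ranges rather than the contents of is_char_in_unicode_block.js.

```javascript
// Hypothetical subset of Unicode block ranges (inclusive start and end).
// A real table would list every block, including the supplementary planes.
const unicodeBlocks = {
    'Basic Latin': [0x0000, 0x007F],
    'Latin-1 Supplement': [0x0080, 0x00FF],
    'Cyrillic': [0x0400, 0x04FF],
    'Arabic': [0x0600, 0x06FF],
    'Devanagari': [0x0900, 0x097F],
    'Hiragana': [0x3040, 0x309F],
    'Katakana': [0x30A0, 0x30FF],
    'CJK Unified Ideographs': [0x4E00, 0x9FFF]
};

// Returns the name of the block containing the code point,
// or null if the code point is not covered by the table.
function unicodeBlockOf(codePoint) {
    for (const name of Object.keys(unicodeBlocks)) {
        const [start, end] = unicodeBlocks[name];
        if (codePoint >= start && codePoint <= end) return name;
    }
    return null;
}

// Usage: look up the block of the first character of a label.
console.log(unicodeBlockOf('я'.codePointAt(0)));
```

A linear scan is enough for a sketch; with the full block list, a sorted array and binary search would be the more natural choice.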
Scripts and Blocks are two different things, and I expect that there's not a 1-1 correspondence. I'm not super familiar with the state of the art for script detection, but I expect that, like most things related to human language, it's pretty tricky and full of subtle nuances. My guess is that a categorization of "primary script" as described by @nickidlugash is something that's better done as a processing step when the datasource is built, and included as a feature property. I guess what I'm getting at is that I think we should step back and capture what the underlying requirements here are, as I'm not sure the feature as described is a good fit.
Script detection thankfully isn’t quite as difficult a problem as language detection, but you’re right that script detection requires more nuance than what is_char_in_unicode_block.js provides. For the most part, a script corresponds to one or more Unicode blocks, but a block doesn’t necessarily correspond to a single script. If you follow ISO 15924’s definition of a script:
We probably could expose the data from Scripts.txt without much effort or size increase. However, I agree that we may end up needing something a bit different depending on the intended use cases.
For this use case specifically, it would be straightforward to detect codepoints that require unsupported typographic features. Instead of exposing an open-ended script lookup operator, how about a simpler operator that returns whether GL thinks it can render a given string? Then a style could combine that with the |
Here's a possible implementation of "can we render this character" which captures about what our expectations are but hopefully also shows the limitations of this approach:
|
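To make the idea concrete, here is an illustrative sketch of a "can we render this string" check. It is not the implementation referred to above, and it makes no claim about which scripts GL JS actually handles. The names `unsupportedRanges`, `rtlRanges`, `canRenderChar`, `canRenderString`, and the `canRenderRTL` flag are all hypothetical, and the choice of ranges is an assumption made for the example.

```javascript
// Assumption for illustration: code point ranges whose scripts need
// complex shaping that the renderer does not perform.
const unsupportedRanges = [
    [0x0900, 0x0DFF], // Devanagari through Sinhala
    [0x0F00, 0x109F], // Tibetan and Myanmar
    [0x1780, 0x17FF]  // Khmer
];

// Assumption for illustration: ranges that are only renderable when
// right-to-left text support is available.
const rtlRanges = [
    [0x0590, 0x08FF] // Hebrew through Arabic Extended-A
];

function inRanges(codePoint, ranges) {
    return ranges.some(([start, end]) => codePoint >= start && codePoint <= end);
}

// Returns false if the code point falls in a range assumed to be unrenderable.
function canRenderChar(codePoint, canRenderRTL) {
    if (!canRenderRTL && inRanges(codePoint, rtlRanges)) return false;
    return !inRanges(codePoint, unsupportedRanges);
}

// A string is considered renderable only if every code point in it is.
function canRenderString(text, canRenderRTL = false) {
    for (const char of text) { // for...of iterates by code point
        if (!canRenderChar(char.codePointAt(0), canRenderRTL)) return false;
    }
    return true;
}

// Usage: fall back to another label when the local name cannot be rendered.
const name = 'नमस्ते';
console.log(canRenderString(name) ? name : 'Hello');
```

The limitation is visible in the structure: the check works on fixed code point ranges, so it says nothing about fonts, combining sequences, or scripts that render acceptably only some of the time.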
Closing with #6260: we went with the more limited |
It could be useful to have an expressions string lookup operator that returns the script name of a string. We could use it to style different scripts differently, or only display text for certain scripts.
We currently pull in Unicode block data to aid in some script detection checks for text layout – perhaps we can add additional Unicode data here for assigning the name of the written script? I think we would need something along the lines of this: http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
We may need logic for deciding what the overall script of a string is (for mixed-script strings), but it might be good to develop similar logic anyway for text layout considerations like horizontal text runs within vertical text?
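One possible heuristic for the mixed-script case is sketched below: count the characters belonging to each script and pick the most frequent one. It relies on Unicode property escapes (`\p{Script=…}`), a standard ECMAScript regular expression feature, rather than on bundled Scripts.txt data. The names `candidateScripts`, `scriptOfChar`, and `primaryScript` are hypothetical, and "most frequent wins" is an assumption, not an agreed design.

```javascript
// Scripts to test for; a real operator would cover every script in Scripts.txt.
const candidateScripts = [
    'Latin', 'Cyrillic', 'Arabic', 'Devanagari',
    'Han', 'Hiragana', 'Katakana', 'Hangul'
];

// One non-global regex per script, so test() keeps no state between calls.
const scriptMatchers = candidateScripts.map((name) => ({
    name,
    regex: new RegExp(`\\p{Script=${name}}`, 'u')
}));

// Returns the script of a single character, or null for characters that
// belong to Common/Inherited (digits, punctuation) or to an unlisted script.
function scriptOfChar(char) {
    for (const {name, regex} of scriptMatchers) {
        if (regex.test(char)) return name;
    }
    return null;
}

// Returns the most frequent script in the string, or null if none is found.
// Ties go to the script that appears first in the string.
function primaryScript(text) {
    const counts = new Map();
    for (const char of text) { // for...of iterates by code point
        const script = scriptOfChar(char);
        if (script) counts.set(script, (counts.get(script) || 0) + 1);
    }
    let best = null;
    let bestCount = 0;
    for (const [script, count] of counts) {
        if (count > bestCount) {
            best = script;
            bestCount = count;
        }
    }
    return best;
}

console.log(primaryScript('北京市 (BJ)'));
```

The same per-character classification could in principle feed the text layout case mentioned above, since consecutive characters with the same script form a run.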
In addition, we could also return a flag for whether we can display a script accurately (based on complex text shaping needs, and possibly other factors). This could help us and our customers create better internationalized maps by making it simpler to use our local language name field (or other localized data sources) while allowing an alternative display for poorly rendered scripts (e.g. displaying English labels instead).
I discussed this briefly with @ChrisLoer, and initial thoughts were that this seems like a reasonable feature to discuss adding, both in terms of implementation and usefulness. Looking forward to hearing other thoughts on this!
/cc @anandthakker @jfirebaugh @kkaefer @jcsg @1ec5 @ajashton