Allow characters in the emoji tag sequence in file names #1899

Merged 4 commits on Nov 19, 2021

Conversation

@mattgarrish
Member

mattgarrish commented Nov 10, 2021

Fixes #1885


@dauwhe
Contributor

dauwhe left a comment

I suppose there is an inherent conflict between the desire to restrict file names to promote interoperability, and the desire to allow anything in file names to promote the expressivity of content authors. But if you name a file 🏴󠁧󠁢󠁷󠁬󠁳󠁿.opf, not all reading systems will be able to handle your EPUB.

@iherman
Member

iherman commented Nov 10, 2021

I suppose there is an inherent conflict between the desire to restrict file names to promote interoperability, and the desire to allow anything in file names to promote the expressivity of content authors. But if you name a file 🏴󠁧󠁢󠁷󠁬󠁳󠁿.opf, not all reading systems will be able to handle your EPUB.

I wonder whether this warning should not be included in the text (as a note, of course).

@mattgarrish
Member Author

I wonder whether this warning should not be included in the text (as a note, of course).

But then shouldn't we just make this list of characters to avoid a best practice? I know we're trying to help authors avoid interop problems, but if that's critical, then stick to the printable ASCII character set.

@wareid
Contributor

wareid commented Nov 10, 2021

I wonder if this is something we need to actually test before making any concrete comments on, because in my experience our system, at least, falls over as soon as we get into special characters like punctuation in file names, let alone emoji.

@mattgarrish
Member Author

Right, it just feels like we're doing something a bit tangential to EPUB itself, namely working out how all operating systems and applications will handle Unicode. We seem to have issues with this list every revision.

@xfq
Member

xfq commented Nov 12, 2021

We discussed this issue in the Internationalization Working Group Teleconference yesterday, and we had a few concerns.

First, a lot of Variation Selectors are excluded. Some of them are used in ideographic variation sequences and should be allowed.

Second, there are unassigned code points in the tags block and they should not be excluded.

We recommend either deleting this line (i.e., allowing all Tags and Variation Selectors Supplement code points including the deprecated characters) or only excluding the deprecated characters (i.e., U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG).

@mattgarrish
Member Author

For now, I've changed this to:

The Deprecated Characters in the Tags and Variation Selectors Supplement (U+E0001 and U+E007F)
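
As a rough illustration of what this narrower rule would mean for a checker, the sketch below flags only the two code points named in this comment. The function name `find_disallowed_tag_chars` is hypothetical, and the set mirrors the wording in this thread at this point in the discussion, not necessarily the final specification text.

```python
# Code points named in this thread as the deprecated characters to exclude.
# Assumption: this set mirrors the comment above, not the published spec.
DEPRECATED_TAG_CHARS = {0xE0001, 0xE007F}  # LANGUAGE TAG, CANCEL TAG


def find_disallowed_tag_chars(file_name: str) -> list[str]:
    """Return the excluded code points found in file_name, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in file_name if ord(ch) in DEPRECATED_TAG_CHARS]


# A CJK ideograph followed by a variation selector (U+E0100) is not flagged.
print(find_disallowed_tag_chars("\u845b\U000E0100.xhtml"))  # []
# A name containing LANGUAGE TAG is flagged.
print(find_disallowed_tag_chars("cover\U000E0001.xhtml"))   # ['U+E0001']
```

Under this sketch, the rest of the Tags and Variation Selectors Supplement ranges pass through untouched, which is the change the i18n group asked for.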

@r12a r12a added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label Nov 18, 2021
@xfq
Member

xfq commented Nov 19, 2021

Looks good to me. Thank you!

@iherman
Member

iherman commented Nov 19, 2021

The issue was discussed in a meeting on 2021-11-19


2. Allow characters in the emoji tag sequence in file names (pr epub-specs#1899)

See github pull request epub-specs#1899.

Dave Cramer: historically EPUB has been focused on interop, and we've had some limits on the characters allowed in file names.
… that makes EPUBs portable across different OSes.
… as part of i18n review we've gotten a lot of feedback about allowing authors to use their own languages in file names.
… this has exposed more edge cases, e.g. allowing Unicode chars in the emoji tag sequence.
… a sequence of Unicode characters that renders as the icon of a flag.
… this PR allows this, while still excluding some even more problematic characters.
… tested this with a file named as the Welsh flag, and ADE wasn't happy with the result.
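
For readers unfamiliar with emoji tag sequences, here is a minimal sketch of how a file name like the Welsh-flag one is composed at the code point level. The helper `emoji_tag_sequence` is a hypothetical name introduced for illustration; the code points themselves (U+1F3F4, tag characters at offset U+E0000, and the U+E007F terminator) come from the Unicode emoji specification.

```python
# Emoji tag sequence: base emoji + tag characters + CANCEL TAG terminator.
BLACK_FLAG = "\U0001F3F4"
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at this offset
CANCEL_TAG = "\U000E007F"


def emoji_tag_sequence(base: str, tag: str) -> str:
    """Build an emoji tag sequence from a base emoji and an ASCII tag string."""
    return base + "".join(chr(TAG_OFFSET + ord(c)) for c in tag) + CANCEL_TAG


# "gbwls" is the region tag for Wales.
file_name = emoji_tag_sequence(BLACK_FLAG, "gbwls") + ".opf"
print([f"U+{ord(ch):04X}" for ch in file_name[:7]])
# ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0077', 'U+E006C', 'U+E0073', 'U+E007F']
print(len(file_name))  # 11 code points for what renders as one flag plus ".opf"
```

The single visible glyph is seven code points long, six of them from the Tags block, which is why a blanket exclusion of that block ruled such names out.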

Matt Garrish: it wasn't just emoji characters, it was also languages with variation selectors (e.g. Mongolian script).
… not sure how often authors use this, but if it's possible to author it, I suppose we should allow it.
… after consulting with the i18n group, we've identified 2 chars that are deprecated and which we still exclude.

Dan Lazin: seems like the primary concern here is that we want to support this, but we're wary that it's not supported today and we don't want to give authors bad advice.
… physical readers either can't or in practice don't get updates, so it's not practical for an author to make an ebook using these emojis in file names.
… but then there's the issue of these characters displaying in the stores.
… even if they work in RS.
… recommend that we make a MUST statement in the RS spec and a SHOULD NOT statement in core, with the possibility that we change to MAY in core in future.

Murata Makoto: We might want to have a look at https://unicode.org/reports/tr51/#EmojiVersions.

Ivan Herman: how would we test this?

Dan Lazin: in practice i think RS will support UTF-8 or not, but expecting that UTF-8 support will exist throughout the store ecosystem and legacy readers is hard.
… we can test, but we might not get 100%.

Matt Garrish: are we restricting this because these file names might not work in certain North American stores? Can't the stores themselves decide their own policies for what they will allow?

Ben Schroeter: +1 to Matt.

Dave Cramer: agree, let's let unicode be unicode.

Romain Deltour: there are quite a few standards/APIs that define file names; some only exist as editor's drafts or CG documents.
… the web generally lacks a unified model of a file system, but that means we don't have precedent on the web to rely on.
… and most of these APIs don't restrict file names a lot.
… they just say path-specific characters can't be part of the file name.
… to be safe, until we have a unified file system model for the web, we should loosen the restriction by changing the MUST NOT to SHOULD NOT, at least in the authoring system spec.
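
The permissive model described here can be sketched as a check that rejects only path syntax and leaves everything else to Unicode. Both the function name `is_acceptable_name` and the choice of `/` and `\` as the "path-specific characters" are assumptions made for illustration; the minutes do not say which characters the surveyed APIs exclude.

```python
# Assumption: "path-specific characters" means the two common path separators.
PATH_CHARS = {"/", "\\"}


def is_acceptable_name(file_name: str) -> bool:
    """Accept any non-empty name that contains no path separator."""
    return bool(file_name) and not any(ch in PATH_CHARS for ch in file_name)


print(is_acceptable_name("第一章.xhtml"))     # True
print(is_acceptable_name("text/ch1.xhtml"))  # False
```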

Romain Deltour: for the record, some of the standards I looked at:

Rick Johnson: we seem to be saying this is a supply chain issue, can we pass this over to the business group? Meanwhile we let unicode be unicode.

Avneesh Singh: after getting such nice feedback from i18n, I think this is a sign that we should not be restrictive here. Maybe a note that these characters are now allowed, but that some RS may not support it. At least for this revision.

Matt Garrish: i'm almost positive that there's a note about zip tools that authors should stay within the ASCII range.
… to avneeshsingh's point, perhaps we could generalize this note.

Murata Makoto: mgarrish, where is this note you just referred to?

Matt Garrish: it's at the bottom of 6.1.3.

Dave Cramer: I think we should merge the PR. This part is uncontroversial. It satisfies i18n and keeps with our philosophy.
… do we need an additional note about the supply chain?

Matt Garrish: or can we just expand the existing note we were talking about just now?

Dan Lazin: i think we need a note that says caution when using unicode characters.
… if you're distributing to only one store and you know that your store supports it, then go ahead, otherwise stay away.

Murata Makoto: i don't like that. It discourages non-ASCII characters.

Dan Lazin: i want to encourage the use, just not sure it is safe to do so today.

Murata Makoto: i've heard that argument for 20 years, haha. That argument endangers the use of non-ASCII characters.

Matt Garrish: if we change the restriction from MUST NOT to SHOULD NOT, would that work, MURATA?

Murata Makoto: this issue is about emoji characters, if we start talking about non-ASCII we may be opening a can of worms.

Matt Garrish: i think the issue is just about what is allowed in file naming, and how restrictive the spec should be.

Dave Cramer: The current note says "Some commercial ZIP tools do not support the full Unicode range for File Names. EPUB Creators who want to use ZIP tools that have these restrictions may find it best to restrict their File Names to the [US-ASCII] range."
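
To make the ZIP concern in that note concrete, the sketch below uses Python's standard `zipfile` module to show which entry names fall outside ASCII and are therefore stored with the ZIP UTF-8 file name flag (general purpose bit 11, `0x800`). The helper name `utf8_flagged_names` is hypothetical, and this only demonstrates Python's own behaviour; it says nothing about how any particular commercial tool or reading system handles such names.

```python
import io
import zipfile


def utf8_flagged_names(names):
    """Write empty entries to an in-memory ZIP and return the names that
    were stored with the UTF-8 file name flag (general purpose bit 11)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    with zipfile.ZipFile(buf) as zf:
        return [info.filename for info in zf.infolist() if info.flag_bits & 0x800]


print(utf8_flagged_names(["chapter1.xhtml", "第一章.xhtml"]))  # ['第一章.xhtml']
```

A tool that ignores this flag will misread the non-ASCII name, which is the kind of failure the existing note warns about.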

Dave Cramer: can we rewrite the above to avoid referring to US-ASCII?

Murata Makoto: this issue is about emoji only, so why are we writing a note about non-ASCII?

Proposed resolution: merge #1899. (Dave Cramer)

Dave Cramer: +1.

Ivan Herman: +1.

Dan Lazin: +1.

Romain Deltour: +1.

Matthew Chan: +1.

Matt Garrish: +1.

Avneesh Singh: +1.

Wendy Reid: +1.

Ben Schroeter: +1.

Victoria Lee: +1.

Rick Johnson: +1.

Dan Lazin: #1899 just changes "don't use wide range of unicode" to "don't use the two deprecated characters".
… not controversial, I don't think.

Murata Makoto: +1.

Resolution #2: merge #1899.

Dave Cramer: mgarrish, do you want to try to reword that note a little?

Matt Garrish: we'll open an issue about this note?

Dave Cramer: yes, please.

Successfully merging this pull request may close these issues.

Tags and Variation Selectors Supplement in the file name