Add option to strip Unicode from entry filenames #1135
@@ -60,6 +60,12 @@ public_folder: "/images/uploads"

Based on the settings above, if a user used an image widget field called `avatar` to upload and select an image called `philosoraptor.png`, the image would be saved to the repository at `/static/images/uploads/philosoraptor.png`, and the `avatar` field for the file would be set to `/images/uploads/philosoraptor.png`.

## Slug Type

By default, filenames (slugs) for entries created in the CMS are sanitized according to RFC3987 and the WHATWG URL spec. This spec allows non-ASCII (or non-Latin) characters to exist in URLs. However, for maximum compatibility, you can also set a different slugification option:
Should I mention that RFC3987 is still technically a draft, not an "official standard"?
If it converts all accent chars to ASCII (not strip them out) and if it applies to media filenames as well, I am ok with that.
@vencax Yes, you would set
LGTM: Seems to work as expected. I will run some more to make sure, but nothing glaring on first run.
After looking through IRIs RFC 3987 and WHATWG URL Standard I would say https://github.com/netlify/netlify-cms/pull/1135/files#r170398959
Linking to IRIs RFC 3987 (2.1. Summary of IRI Syntax) and/or WHATWG URL Standard (4.3. URL writing) and/or https://github.com/whatwg/url somewhere in a code comment or docs might be a cool move if you have not done so already (just glanced through the PR quickly).
🔗🎉
@rdela Thanks for the review (I updated the docs to add links). On option naming, the other choice that I could think of would be to name it
@tech4him1 Never a bad idea to follow Jekyll lead because of popularity, sheerly from GH pages use alone.
That said, no other reason not to use
…which is now way easier thanks to doc links

- `iri` (default): Keeps Unicode characters in slugs, according the the IRI draft spec ([RFC3987](https://tools.ietf.org/html/rfc3987)) and the [WHATWG URL spec](https://url.spec.whatwg.org/).
- `latin`: Removes accents/diacritics from slug, then strips out all non-valid URL characters and periods (see `ascii` below).
- `ascii`: Strips out all characters except valid URI chars (RFC3986) or periods (0-9, a-z, A-Z, `_`, `-`, `~`).
If `latin` is effectively `ascii` + accent removal, it'd be clearer to just state that. Also, we should order these options from most to least permissive, which just means swapping the last two.
> If `latin` is effectively `ascii` + accent removal, it'd be clearer to just state that.

Agreed, can you suggest wording?

> Also, we should order these options from most to least permissive, which just means swapping the last two.

I originally ordered them in that way, but I'm not sure. You're going to end up with more chars left in `latin` than in a pure RFC3986 format (`ascii` option), since you're converting accented chars instead of just stripping them out.
> since you're converting accented chars instead of just stripping them out.

Then why does it not read:

> Converts accent/diacritic characters from slug to ASCII equivalents, then strips out all non-valid URL characters and periods (see `ascii` below).

?
@rdela That is what it does -- my wording needs to be cleared up there.
Maybe they are in the right order after all and `latin` just needs a better description, like the example suggestion above?
Replaces accent/diacritic characters from slug with ASCII equivalents, then strips out all non-valid URL characters and periods (see `ascii` below).
it does make sense in that it is removing the diacritics and leaving the base chars intact
Oh yeah I can see that meaning now
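The wording debate above turns on the difference between stripping accented characters and converting them to their base letters. A minimal sketch of that difference follows; both helper names are hypothetical, and the NFD-based folding is an assumption for illustration, not necessarily how the PR implements it.

```javascript
// Hypothetical helpers, for illustration only.

// "Strip": drop every character outside the ASCII range.
const stripNonAsciiSketch = s => s.replace(/[^\x00-\x7F]/g, '');

// "Convert": decompose accented letters (NFD), then drop the combining
// marks so that the base letter survives.
const foldAccentsSketch = s =>
  s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

const input = 'cr\u00e8me br\u00fbl\u00e9e'; // "crème brûlée"

console.log(stripNonAsciiSketch(input)); // "crme brle"
console.log(foldAccentsSketch(input));   // "creme brulee"
```

Converting leaves more characters in the slug than stripping does, which is the point made earlier about `latin` versus a pure `ascii` pass.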
Starting to wonder if we should do this a little different. This boils down to URI/3986 or IRI/3987, with the option to strip accents. Maybe a two-fold config approach:
slug: iri
# or
slug:
- protocol: iri
- strip_accents: true
Would stripping accents ever make sense when using IRI protocol?
@erquhart Actually, I think that's the best option. Stripping accents is really going to be independent of the format.
Cool, so most folks can just do `slug: uri` or `slug: iri`, but others can provide an object if they need tighter control.
Btw I love that pattern and am expecting to use it more as we expand and restructure the APIs - the pattern being config options accepting a primitive value for simplicity, but also an object for fine-grained settings.
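The "primitive or object" pattern described here could be handled with a small normalization step, roughly as sketched below. The function name is hypothetical, the `protocol` and `strip_accents` keys are the ones proposed at this point in the thread (not necessarily the names that shipped), and `iri` is assumed as the default because the docs diff marks it as such.

```javascript
// Hypothetical sketch of normalizing a `slug` config value that may be
// either a string shorthand or an object with fine-grained settings.
function normalizeSlugOptionSketch(slug) {
  // Assumed defaults, based on `iri` being marked "(default)" in the docs diff.
  const defaults = { protocol: 'iri', strip_accents: false };

  if (slug === undefined || slug === null) {
    return { ...defaults };
  }
  if (typeof slug === 'string') {
    // Shorthand form: `slug: iri` or `slug: uri`.
    return { ...defaults, protocol: slug };
  }
  if (typeof slug === 'object' && !Array.isArray(slug)) {
    // Object form: explicit keys override the defaults.
    return { ...defaults, ...slug };
  }
  throw new TypeError('slug must be a string or an object');
}

console.log(normalizeSlugOptionSketch('uri'));
// { protocol: 'uri', strip_accents: false }
console.log(normalizeSlugOptionSketch({ strip_accents: true }));
// { protocol: 'iri', strip_accents: true }
```

Normalizing once at config load time means the rest of the code only ever sees the object form.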
**Example**

``` yaml
slug_type: "latin"
```
`slug_type` is a bit general, and could be confusing when we finally do more with slugs and potentially have more configuration options for slugs. What do you think about `slug_characters` or `slug_chars`?
Maybe `slug_charset`, or is that too much overlap with HTTP/HTML type stuff?
`slug-encoding`?
Any of the above could work, I shied away from really specific terms like encoding because I'm not sure it's technically accurate, and yeah "charset" kind of has a specific meaning too. At any rate, just something accurate that isn't as broad as "type".
The spec refers to itself as a "protocol", so maybe `slug_protocol`. Accurate but possibly completely opaque to most folks 🤷‍♀️.
I see your worry about technical accuracy, but with the prevalence across languages of terms like `url_encoded`, I think `slug-encoding`/`slug_encoding` feels most naturally understandable.
(It also would be nice if we consistently knew whether to use `-` or `_`, but it's too late for that! 😝)
The convention used in Netlify CMS is underscore.
Encoding sounds great, I'm with that.
> The convention used in Netlify CMS is underscore.

...except when it's hyphen, like `yaml-frontmatter`, or camel case, like `valueField`. ;)
Underscore is certainly the most common, though, and good to know that's the standard going forward. (I wonder if that makes sense to document in the contributor docs?)
I think the options will be `unicode` and `ascii` now, instead of `iri`/`uri`. @verythorough @erquhart Do you think `slug: encoding` still makes sense for that?
@verythorough I'd appreciate a docs review on this as well! 😄
src/lib/__tests__/urlHelper.spec.js
Outdated
@@ -81,6 +88,24 @@ describe('sanitizeSlug', ()=> {
).toEqual('This-that-one_or.the~other-123');
});

it('should remove accents if set', () => {
'should remove accents with `clean_accents` set'
export function sanitizeSlug(str, options = Map()) {
const encoding = options.get('encoding', 'unicode');
const stripDiacritics = options.get('clean_accents', false);
const replacement = options.get('sanitize_replacement', '-');
Intentionally undocumented?
I wasn't sure if we wanted to wait until someone actually had a valid use case for it -- validating it in `src/actions/config.js` would take a bit of effort. Thoughts?
Nah, I'm fine with it as is, was just curious.
I added a couple of comments, but this looks solid!
@vencax This has been released as part of v1.4.0.
- Summary

The PR adds `slug` as a global config option.

The `slug` option allows you to change how filenames for entries are created and sanitized. For modifying the actual data in a slug, see the per-collection option below.

`slug` accepts multiple options:

- `encoding`
  - `unicode` (default): Sanitize filenames (slugs) according to RFC3987 and the WHATWG URL spec. This spec allows non-ASCII (or non-Latin) characters to exist in URLs.
  - `ascii`: Sanitize filenames (slugs) according to RFC3986. The only allowed characters are (0-9, a-z, A-Z, `_`, `-`, `~`).
- `clean_accents`: Set to `true` to remove diacritics from slug characters before sanitizing. This is often helpful when using `ascii` encoding.

Closes #1012.
Also sanitizes media file slugs, closing #1196.
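
How the `encoding` and `clean_accents` options described above might interact can be sketched as follows. This is an illustrative approximation, not the code merged in the PR: the function name is hypothetical, it takes a plain object rather than an Immutable `Map`, the `unicode` branch uses Unicode letter/mark/number classes as a stand-in for the exact RFC3987 character ranges, and the `ascii` branch uses only the character set listed above.

```javascript
// Hypothetical sketch; assumptions are noted inline.
function sanitizeSlugSketch(str, options = {}) {
  const encoding = options.encoding || 'unicode';
  const cleanAccents = options.clean_accents === true;

  let slug = String(str);

  if (cleanAccents) {
    // Assumption: fold accents via NFD decomposition, then drop combining marks.
    slug = slug.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  }

  // Characters outside the allowed set become "-".
  const disallowed =
    encoding === 'ascii'
      ? /[^0-9a-zA-Z_~-]/g                // set listed in the summary above
      : /[^\p{L}\p{M}\p{N}_~-]/gu;        // approximation of the IRI ranges

  return slug
    .replace(disallowed, '-')
    .replace(/-+/g, '-')        // collapse runs of the replacement
    .replace(/^-+|-+$/g, '');   // trim it from both ends
}

const title = 'Caf\u00e9 del Mar'; // "Café del Mar"

console.log(sanitizeSlugSketch(title));
// "Café-del-Mar"
console.log(sanitizeSlugSketch(title, { encoding: 'ascii' }));
// "Caf-del-Mar"
console.log(sanitizeSlugSketch(title, { encoding: 'ascii', clean_accents: true }));
// "Cafe-del-Mar"
```

Running accent cleaning before the ASCII pass is what keeps the base letter; without it the accented character is simply replaced, and a title written entirely in a non-Latin script would reduce to an empty string under `ascii`.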
- Test plan
Added tests of functions. Manually tested all three options.
- Description for the changelog
Add option to strip Unicode from entry filenames.
- A picture of a cute animal (not mandatory but encouraged)