Handle Unicode whitespace and invisible characters #15

JasonBarnabe · 2014-01-14T20:26:16Z

Unicode defines a number of whitespace and invisible characters that strip_attributes does not currently strip. This pull request adds support for stripping these.

I created the list based off of this and this. I'm not a Unicode expert so it's possible I missed some or included some I shouldn't have.

rmm5t · 2014-01-14T21:01:34Z

I just learned something today.
I'm going to have to give this some thought. Is this an actual problem that you've run into with user input before?

JasonBarnabe · 2014-01-14T21:17:36Z

Yes. Some users are being overly clever and putting fields containing nothing but zero-width spaces to get by "required" validations.

rmm5t · 2014-01-14T21:19:53Z

Interesting. Damn clever users.

I'm apprehensive about adding this as a feature though. Do you think #14 would accomplish what you want?

JasonBarnabe · 2014-01-14T21:28:34Z

Yes, I think I could accomplish this with #14 by simply giving it the same regexp I added in this pull request. However, I think this is something that should happen by default, as it's exactly what strip_attributes is for.

rmm5t · 2014-01-14T22:01:34Z

it's exactly what strip_attributes is for.

Not exactly. StripAttributes was built to help avoid accidental spaces and normalize user input. Malicious empty space that users use to get around validation is a very different story, in my opinion.

Furthermore whitespace and emptyspace characters are two very different things. I don't really want this behavior to be the default.

I might just be ignorant to the purpose of these invisible characters. I need to research this more:
http://www.cs.tut.fi/~jkorpela/chars/spaces.html

I'd still have to sleep on it, but I think I'd be more open to a new option for strip_attributes.

Option Naming Ideas:

  strip_attributes :strip_invinsible => true

JasonBarnabe · 2014-01-14T22:58:17Z

StripAttributes was built to help avoid accidental spaces and normalize user input. Malicious empty space that users use to get around validation is a very different story, in my opinion.

I don't think there's a reason to make a distinction between malicious and accidental entry, nor do I think you can assume that ASCII whitespace is accidental and Unicode whitespace is malicious. I have other records in my database where there's trailing Unicode whitespace in an otherwise valid field, and users have likely tried to maliciously get around minimum length validation by adding more ASCII whitespace.

Simply put, I'm trying to make StripAttributes align more closely to the simple definition of:

StripAttributes is an ActiveModel extension that automatically strips all attributes of leading and trailing whitespace before validation.

with an emphasis on "automatically" and a lack of qualifier on "whitespace".

Furthermore whitespace and emptyspace characters are two very different things. I don't really want this behavior to be the default.

I might just be ignorant to the purpose of these invisible characters. I need to research this more:
http://www.cs.tut.fi/~jkorpela/chars/spaces.html

I'd still have to sleep on it, but I think I'd be more open to a new option for strip_attributes.

I'm not sure, but you may be assuming all the characters I'm matching are "invisible". That's not the case - most of them are real-deal spaces-that-take-space.

I agree that these Unicode characters throw a wrench into the typical understanding of whitespace, with some not taking space, others not being white, and even others that in some cases aren't white and in others aren't spaces. I'm open to suggestions on what characters to include, but the criteria I'm trying to follow is characters

that are usually white, regardless of whether they take space, and
whose inclusion as the start or end of text has no effect on the meaning of that text

Would the [:space:] POSIX character class match all your invisible characters thereby cleaning up this implementation a bit?

According to this, [:space:] matches the ASCII whitespace characters and the Unicode characters of class Z. This misses some of the characters I'm currently matching, including the one that's causing me issues in the first place (zero width space). I believe matching [:space:] and [:cntrl:] together would cover all the characters I'm including, but I'd be concerned about it removing characters that perhaps can validly be the start or end of string, like bidi control characters.

rmm5t · 2014-01-14T23:31:00Z

Okay. I'll sleep on this. Your arguments and my limited research are
turning me around a bit.

However, shouldn't these same arguments be made to the ruby core team about
String#strip?

JasonBarnabe · 2014-01-15T04:12:59Z

There's an open bug with the Ruby team. Looks to be hung up on the fact that what is considered whitespace is locale-dependant.

After doing a bit more research, I think:

Definitely strip anything of Unicode class Z ("separator", this is what [:space:] matches). This covers the subcategories Zl, Zp, and Zs
Selectively strip things of Unicode class C ("control"). The subcategories are Cc and Cf.

Here's what a couple of sources think are non-Z class spaces:

"Unicode spaces"	"Spaces in Unicode" on Wikipedia
U+180E MONGOLIAN VOWEL SEPARATOR	X	X
U+200B ZERO WIDTH SPACE	X	X
U+200C ZERO WIDTH NON-JOINER		X
U+200D ZERO WIDTH JOINER		X
U+2060 WORD JOINER		X
U+FEFF ZERO WIDTH NO-BREAK SPACE	X	X

From the write-up on Wikipedia about the "joiners", it sounds like they don't make sense at the start or end of text, so I'm comfortable having them stripped.

rmm5t · 2014-01-15T20:15:10Z

Thanks for this additional research. It helps.

Could you also please clean up the regex to use the [:space:] character class plus the 6 class C control characters that you listed?

The build is failing on Ruby 1.8.7. I don't immediately see why. Getting the build passing on 1.8.7 would also be a big help.

JasonBarnabe · 2014-01-15T21:58:44Z

How do you even install StripAttribute's prerequisites on 1.8.7?

jason@jason-thinkpad:~/strip_attributes$ bundle install
Fetching gem metadata from http://rubygems.org/.........
Resolving dependencies...
Installing rake (10.1.1) 
Installing i18n (0.6.9) 
Installing minitest (4.7.5) 
Installing multi_json (1.8.4) 
Installing atomic (1.1.14) 
Installing thread_safe (0.1.3) 
Installing tzinfo (0.3.38) 
Installing activesupport (4.0.2) 
Gem::InstallError: activesupport requires Ruby version >= 1.9.3.
An error occurred while installing activesupport (4.0.2), and Bundler cannot continue.
Make sure that `gem install activesupport -v '4.0.2'` succeeds before bundling.

rmm5t · 2014-01-15T23:30:36Z

See the specific activemodel-3.2.gemfile in the gemfiles directory. That's
what the Travis build matrix uses for 1.8.7

JasonBarnabe · 2014-01-15T23:59:36Z

I'm unsure how to make this work with 1.8 with its lack of Unicode support. \u escape sequences don't work. I've read suggestions to use the form ["200A".to_i(16)].pack("U*"), but ["200A".to_i(16)].pack("U*").gsub(/[[:space:]]+/, '') doesn't do anything.

Can we skip this functionality in Ruby 1.8, and if so, do you have a suggestion on how to test for Ruby 1.8?

JasonBarnabe · 2014-01-16T19:01:03Z

I added a check to the functionality and the test to only run if "\u0020" == " ", which means Ruby 1.9 and above.

Handle Unicode whitespace and invisible characters

rmm5t · 2014-01-16T23:36:49Z

@JasonBarnabe Thanks for all your hard work on this. I just released v1.5.0 which incorporates these changes.

Handle Unicode spaces

6bda66c

JasonBarnabe mentioned this pull request Jan 14, 2014

Don't allow whitespace-only names stylish-userstyles/stylish#33

Closed

rmm5t mentioned this pull request Jan 15, 2014

add regular expression support #14

Merged

Merge from upstream

15a8ff9

Only replace Unicode class Z and a few class C characters

4acecf2

Only handle Unicode spaces if Ruby understands Unicode

52ead73

rmm5t added a commit that referenced this pull request Jan 16, 2014

Merge pull request #15 from JasonBarnabe/master

ea9867e

Handle Unicode whitespace and invisible characters

rmm5t merged commit ea9867e into rmm5t:master Jan 16, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle Unicode whitespace and invisible characters #15

Handle Unicode whitespace and invisible characters #15

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 15, 2014

rmm5t commented Jan 15, 2014

JasonBarnabe commented Jan 15, 2014

rmm5t commented Jan 15, 2014

JasonBarnabe commented Jan 15, 2014

JasonBarnabe commented Jan 16, 2014

rmm5t commented Jan 16, 2014

Handle Unicode whitespace and invisible characters #15

Handle Unicode whitespace and invisible characters #15

Conversation

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 14, 2014

rmm5t commented Jan 14, 2014

JasonBarnabe commented Jan 15, 2014

rmm5t commented Jan 15, 2014

JasonBarnabe commented Jan 15, 2014

rmm5t commented Jan 15, 2014

JasonBarnabe commented Jan 15, 2014

JasonBarnabe commented Jan 16, 2014

rmm5t commented Jan 16, 2014