Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Unicode whitespace and invisible characters #15

Merged
merged 4 commits into from
Jan 16, 2014
Merged

Handle Unicode whitespace and invisible characters #15

merged 4 commits into from
Jan 16, 2014

Conversation

JasonBarnabe
Copy link
Contributor

Unicode defines a number of whitespace and invisible characters that strip_attributes does not currently strip. This pull request adds support for stripping these.

I created the list based off of this and this. I'm not a Unicode expert so it's possible I missed some or included some I shouldn't have.

@rmm5t
Copy link
Owner

rmm5t commented Jan 14, 2014

  1. I just learned something today.
  2. I'm going to have to give this some thought. Is this an actual problem that you've run into with user input before?

@JasonBarnabe
Copy link
Contributor Author

Yes. Some users are being overly clever and putting fields containing nothing but zero-width spaces to get by "required" validations.

@rmm5t
Copy link
Owner

rmm5t commented Jan 14, 2014

Interesting. Damn clever users.

I'm apprehensive about adding this as a feature though. Do you think #14 would accomplish what you want?

@JasonBarnabe
Copy link
Contributor Author

Yes, I think I could accomplish this with #14 by simply giving it the same regexp I added in this pull request. However, I think this is something that should happen by default, as it's exactly what strip_attributes is for.

@rmm5t
Copy link
Owner

rmm5t commented Jan 14, 2014

it's exactly what strip_attributes is for.

Not exactly. StripAttributes was built to help avoid accidental spaces and normalize user input. Malicious empty space that users use to get around validation is a very different story, in my opinion.

Furthermore whitespace and emptyspace characters are two very different things. I don't really want this behavior to be the default.

I might just be ignorant to the purpose of these invisible characters. I need to research this more:
http://www.cs.tut.fi/~jkorpela/chars/spaces.html

I'd still have to sleep on it, but I think I'd be more open to a new option for strip_attributes.

Option Naming Ideas:

  strip_attributes :strip_invinsible => true

@JasonBarnabe
Copy link
Contributor Author

StripAttributes was built to help avoid accidental spaces and normalize user input. Malicious empty space that users use to get around validation is a very different story, in my opinion.

I don't think there's a reason to make a distinction between malicious and accidental entry, nor do I think you can assume that ASCII whitespace is accidental and Unicode whitespace is malicious. I have other records in my database where there's trailing Unicode whitespace in an otherwise valid field, and users have likely tried to maliciously get around minimum length validation by adding more ASCII whitespace.

Simply put, I'm trying to make StripAttributes align more closely to the simple definition of:

StripAttributes is an ActiveModel extension that automatically strips all attributes of leading and trailing whitespace before validation.

with an emphasis on "automatically" and a lack of qualifier on "whitespace".

Furthermore whitespace and emptyspace characters are two very different things. I don't really want this behavior to be the default.

I might just be ignorant to the purpose of these invisible characters. I need to research this more:
http://www.cs.tut.fi/~jkorpela/chars/spaces.html

I'd still have to sleep on it, but I think I'd be more open to a new option for strip_attributes.

I'm not sure, but you may be assuming all the characters I'm matching are "invisible". That's not the case - most of them are real-deal spaces-that-take-space.

I agree that these Unicode characters throw a wrench into the typical understanding of whitespace, with some not taking space, others not being white, and even others that in some cases aren't white and in others aren't spaces. I'm open to suggestions on what characters to include, but the criteria I'm trying to follow is characters

  1. that are usually white, regardless of whether they take space, and
  2. whose inclusion as the start or end of text has no effect on the meaning of that text

Would the [:space:] POSIX character class match all your invisible characters thereby cleaning up this implementation a bit?

According to this, [:space:] matches the ASCII whitespace characters and the Unicode characters of class Z. This misses some of the characters I'm currently matching, including the one that's causing me issues in the first place (zero width space). I believe matching [:space:] and [:cntrl:] together would cover all the characters I'm including, but I'd be concerned about it removing characters that perhaps can validly be the start or end of string, like bidi control characters.

@rmm5t
Copy link
Owner

rmm5t commented Jan 14, 2014

Okay. I'll sleep on this. Your arguments and my limited research are
turning me around a bit.

However, shouldn't these same arguments be made to the ruby core team about
String#strip?

@JasonBarnabe
Copy link
Contributor Author

There's an open bug with the Ruby team. Looks to be hung up on the fact that what is considered whitespace is locale-dependant.

After doing a bit more research, I think:

  • Definitely strip anything of Unicode class Z ("separator", this is what [:space:] matches). This covers the subcategories Zl, Zp, and Zs
  • Selectively strip things of Unicode class C ("control"). The subcategories are Cc and Cf.

Here's what a couple of sources think are non-Z class spaces:

"Unicode spaces""Spaces in Unicode" on Wikipedia
U+180E MONGOLIAN VOWEL SEPARATORXX
U+200B ZERO WIDTH SPACEXX
U+200C ZERO WIDTH NON-JOINERX
U+200D ZERO WIDTH JOINERX
U+2060 WORD JOINERX
U+FEFF ZERO WIDTH NO-BREAK SPACEXX

From the write-up on Wikipedia about the "joiners", it sounds like they don't make sense at the start or end of text, so I'm comfortable having them stripped.

@rmm5t
Copy link
Owner

rmm5t commented Jan 15, 2014

Thanks for this additional research. It helps.

Could you also please clean up the regex to use the [:space:] character class plus the 6 class C control characters that you listed?

The build is failing on Ruby 1.8.7. I don't immediately see why. Getting the build passing on 1.8.7 would also be a big help.

@JasonBarnabe
Copy link
Contributor Author

How do you even install StripAttribute's prerequisites on 1.8.7?

jason@jason-thinkpad:~/strip_attributes$ bundle install
Fetching gem metadata from http://rubygems.org/.........
Resolving dependencies...
Installing rake (10.1.1) 
Installing i18n (0.6.9) 
Installing minitest (4.7.5) 
Installing multi_json (1.8.4) 
Installing atomic (1.1.14) 
Installing thread_safe (0.1.3) 
Installing tzinfo (0.3.38) 
Installing activesupport (4.0.2) 
Gem::InstallError: activesupport requires Ruby version >= 1.9.3.
An error occurred while installing activesupport (4.0.2), and Bundler cannot continue.
Make sure that `gem install activesupport -v '4.0.2'` succeeds before bundling.

@rmm5t
Copy link
Owner

rmm5t commented Jan 15, 2014

See the specific activemodel-3.2.gemfile in the gemfiles directory. That's
what the Travis build matrix uses for 1.8.7

@JasonBarnabe
Copy link
Contributor Author

I'm unsure how to make this work with 1.8 with its lack of Unicode support. \u escape sequences don't work. I've read suggestions to use the form ["200A".to_i(16)].pack("U*"), but ["200A".to_i(16)].pack("U*").gsub(/[[:space:]]+/, '') doesn't do anything.

Can we skip this functionality in Ruby 1.8, and if so, do you have a suggestion on how to test for Ruby 1.8?

@JasonBarnabe
Copy link
Contributor Author

I added a check to the functionality and the test to only run if "\u0020" == " ", which means Ruby 1.9 and above.

rmm5t added a commit that referenced this pull request Jan 16, 2014
Handle Unicode whitespace and invisible characters
@rmm5t rmm5t merged commit ea9867e into rmm5t:master Jan 16, 2014
@rmm5t
Copy link
Owner

rmm5t commented Jan 16, 2014

@JasonBarnabe Thanks for all your hard work on this. I just released v1.5.0 which incorporates these changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants