Handle Unicode whitespace and invisible characters #15
Yes. Some users are being overly clever and putting fields containing nothing but zero-width spaces to get by "required" validations. |
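To illustrate the problem, here is a minimal sketch in plain Ruby (1.9 or later). The `naive_present?` helper is a hypothetical stand-in for a "required" check, not the validator actually used in the application discussed here; it only shows that `String#strip` leaves zero-width spaces in place.

```ruby
# Hypothetical stand-in for a "required" validation (an assumption for
# illustration, not code from strip_attributes or Rails).
def naive_present?(value)
  !value.to_s.strip.empty?
end

ascii_only      = "   "            # three ordinary spaces
zero_width_only = "\u200B\u200B"   # two ZERO WIDTH SPACE characters

# String#strip removes the ASCII spaces, so this field counts as empty.
ascii_result = naive_present?(ascii_only)

# String#strip does not touch U+200B, so this field counts as filled in
# even though nothing is visible to the reader.
zero_width_result = naive_present?(zero_width_only)

puts "ASCII spaces present?      #{ascii_result}"
puts "zero-width spaces present? #{zero_width_result}"
```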
Interesting. Damn clever users. I'm apprehensive about adding this as a feature though. Do you think #14 would accomplish what you want? |
Yes, I think I could accomplish this with #14 by simply giving it the same regexp I added in this pull request. However, I think this is something that should happen by default, as it's exactly what strip_attributes is for. |
Not exactly. StripAttributes was built to help avoid accidental spaces and normalize user input. Malicious empty space that users use to get around validation is a very different story, in my opinion. Furthermore, whitespace and empty-space characters are two very different things. I don't really want this behavior to be the default. I might just be ignorant of the purpose of these invisible characters. I need to research this more. I'd still have to sleep on it, but I think I'd be more open to a new option for strip_attributes. Option Naming Ideas: strip_attributes :strip_invinsible => true |
I don't think there's a reason to make a distinction between malicious and accidental entry, nor do I think you can assume that ASCII whitespace is accidental and Unicode whitespace is malicious. I have other records in my database where there's trailing Unicode whitespace in an otherwise valid field, and users have likely tried to maliciously get around minimum length validation by adding more ASCII whitespace. Simply put, I'm trying to make StripAttributes align more closely to the simple definition of:
with an emphasis on "automatically" and a lack of qualifier on "whitespace".
I'm not sure, but you may be assuming all the characters I'm matching are "invisible". That's not the case - most of them are real-deal spaces-that-take-space. I agree that these Unicode characters throw a wrench into the typical understanding of whitespace, with some not taking space, others not being white, and even others that in some cases aren't white and in others aren't spaces. I'm open to suggestions on what characters to include, but the criteria I'm trying to follow is characters
According to this, |
Okay. I'll sleep on this. Your arguments and my limited research are However, shouldn't these same arguments be made to the ruby core team about |
There's an open bug with the Ruby team. Looks to be hung up on the fact that what is considered whitespace is locale-dependent. After doing a bit more research, I think:
Here's what a couple of sources think are non-
From the write-up on Wikipedia about the "joiners", it sounds like they don't make sense at the start or end of text, so I'm comfortable having them stripped. |
Thanks for this additional research. It helps. Could you also please clean up the regex to use the The build is failing on Ruby 1.8.7. I don't immediately see why. Getting the build passing on 1.8.7 would also be a big help. |
How do you even install StripAttributes' prerequisites on 1.8.7?
|
See the specific activemodel-3.2.gemfile in the gemfiles directory. That's |
I'm unsure how to make this work with 1.8 with its lack of Unicode support. Can we skip this functionality in Ruby 1.8, and if so, do you have a suggestion on how to test for Ruby 1.8? |
I added a check to the functionality and the test to only run if "\u0020" == " ", which means Ruby 1.9 and above. |
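A rough sketch of that kind of guard follows. The constant and method names are made up for illustration and are not taken from the actual commit; the fallback branch simply keeps the old ASCII-only behaviour.

```ruby
# Feature check described above: on Ruby 1.9+ the "\u0020" escape is parsed
# as U+0020, so the comparison is true. Per the comment above, it is not
# true on Ruby 1.8.
UNICODE_ESCAPES_SUPPORTED = ("\u0020" == " ")

# Hypothetical helper (name is an assumption): use the Unicode-aware
# [[:space:]] class only when the interpreter passed the check.
def strip_with_guard(value)
  if UNICODE_ESCAPES_SUPPORTED
    value.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
  else
    value.strip  # ASCII-only fallback
  end
end

puts strip_with_guard("\u00A0hello\u00A0").inspect
```

The same constant can wrap the Unicode test cases so they are skipped where the check fails.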
@JasonBarnabe Thanks for all your hard work on this. I just released v1.5.0 which incorporates these changes. |
Unicode defines a number of whitespace and invisible characters that strip_attributes does not currently strip. This pull request adds support for stripping these.
I created the list based off of this and this. I'm not a Unicode expert so it's possible I missed some or included some I shouldn't have.
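As a minimal sketch of the idea (not the exact regexp from this pull request), the following strips Unicode whitespace plus a few invisible characters from both ends of a string on Ruby 1.9+. The character list and all names here are assumptions chosen for illustration; the list actually used in the pull request may differ.

```ruby
# Illustrative list of invisible characters that [[:space:]] does not cover:
# ZERO WIDTH SPACE, ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER, WORD JOINER,
# and ZERO WIDTH NO-BREAK SPACE (BOM). This selection is an assumption.
EXAMPLE_INVISIBLE = "\u200B\u200C\u200D\u2060\uFEFF"

# Match a run of whitespace or listed invisible characters at either end.
EXAMPLE_EDGE_BLANKS =
  /\A[[:space:]#{EXAMPLE_INVISIBLE}]+|[[:space:]#{EXAMPLE_INVISIBLE}]+\z/

# Hypothetical helper: strip both ends and turn an all-blank value into nil.
def strip_unicode_edges(value)
  return value unless value.is_a?(String)
  stripped = value.gsub(EXAMPLE_EDGE_BLANKS, "")
  stripped.empty? ? nil : stripped
end

puts strip_unicode_edges("\u3000 name\u200B").inspect
puts strip_unicode_edges("\u200B\u200B").inspect
```

Only leading and trailing characters are removed; a joiner or zero-width space in the middle of a value is left alone.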