Add check_string function that is more generic, thanks to Encodings #2
@tkelman Could you take a look at this please?
This looks like it's just pasting in the UTF8 version of
I do think optional arguments as boolean keyword arguments is better than C-style flags in a
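To make the contrast in the comment above concrete, here is a minimal sketch of the two option styles. The flag constants, keyword names, and both functions are hypothetical, chosen only for illustration; they are not taken from this PR or from #11575.

```julia
# C-style: options packed into an integer bit mask (hypothetical flag names).
const ACCEPT_LONG_NULL  = 0x01
const ACCEPT_SURROGATES = 0x02

# Decode the bit mask into the individual options it carries.
describe_flags(flags::Integer) =
    (long_null  = (flags & ACCEPT_LONG_NULL)  != 0,
     surrogates = (flags & ACCEPT_SURROGATES) != 0)

# Keyword style: each option is a named Bool with a visible default.
describe_options(; accept_long_null::Bool=false, accept_surrogates::Bool=false) =
    (long_null = accept_long_null, surrogates = accept_surrogates)

# The same request written both ways.
a = describe_flags(ACCEPT_LONG_NULL | ACCEPT_SURROGATES)
b = describe_options(accept_long_null=true, accept_surrogates=true)
```

With keywords the call site names each option, and a misspelled option is a method error rather than a silently wrong bit.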
@tkelman As I said, if Encodings.jl can be put in Base (which helps everybody trying to deal with conversions... it would be a basis for people adding support for other encodings, such as CP1252, EUC, or SJIS, in packages), then this version would replace what's currently in #11575.
Right. I am comparing this strictly relative to #11575, and this doesn't look any better. And it requires encodings to be added to base first, which as I've said before I'm neutral on. I haven't seen a strong enough use case for needing to look at different endianness of strings in base.
Put another way, we can fix the bugs and try to improve performance (subject to code size tradeoffs) in the functionality that exists in base right now, but for new stuff with esoteric encodings, experimenting with it in a package (i.e. right here) for a while is the right way to do things.
I'm sorry, but I don't think you are quite getting it. The only encodings used by this new version of check_string are the same as before, so there is no increase in base.
I'm looking at the two versions of the code, and this doesn't look more generic to me. It looks like you added extra encoding inputs, and pasted the UTF8 version of the code into the same method instead of having one UTF8 method and one non-UTF8 method. The latter is fine for base, especially since it doesn't need to add any extra code first. Generality is not about reducing the method count, it's about doing more with less. Here you've just moved code around, and I don't see the benefit.
You're missing it then.
I did see that part, which smells like a feature, but so far there is no use case for it. This is a good place to work on that, to come up with a convincing case.
This allows it to handle things like a repeated string or substring efficiently.
OK, I've fixed up the comments.
using Encodings

export check_string
Let's just call it check; I'm a big fan of avoiding _ where possible and letting the arguments play a part in indicating what a method does, i.e. check(e::Encoding, dat) == I'm checking the encoding/validity of dat.
I worry though that check might be too general a name... this is really just to check string encodings, stored in various ways...
Ideally in the future, we'd have a Strings module in Base and we could just have Strings.check() and leave it unexported.
Yes, I saw a PR about making a Strings module in Base... although I have some problems with that (not with the concept, but maybe with the implementation, mainly because I've got a branch where I'm experimenting with moving a lot of the stuff in strings.jl into separate files, which have just the most basic string functionality, enough to support """ strings for documentation earlier in the build).
It's not that I don't want to export it. I think it should be exported, because people could extend it for many other string encodings.
@quinnj Fine by me... I'd planned on asking if the name should be more generic as well...
ch, pos = next(dat, pos)
totalchar += 1
if ch > 0x7f
    if E <: UTF8
Any time I find myself doing if X <: Y, it screams for use of dispatch. I think this is probably why @tkelman isn't super impressed by this vs. JuliaLang/julia#11575, since you've basically just added this if statement here and combined the two methods. A better solution would be to find a way to further split out the internals of check that could be dispatched on according to Encoding. This leads to a nice hierarchy of methods with the most generic at the top level, and Encoding-specific methods at the lowest and hopefully smallest level.
OK, I put it as one method, because reviewers had previously asked to do just that... (the back and forth makes my head hurt!) 😀
I'm not saying there should be two separate check methods, I'm saying that the if E <: UTF8 is an immediate flag to me that whatever is within that if block should actually be pulled out as a method check_internal(::Type{UTF8}, ...), and, it looks like in this case, a generic fallback for check_internal{T}(::Type{T}, ...). That becomes more extensible and general when another encoding doesn't have to reimplement (i.e. copy) all the other code from check, they just have to implement check_internal. This is the beauty of multiple dispatch + types as values in Julia.
Yes, that sounds like a very good idea, I'll try to experiment with how to do that, and make sure it is still efficient...
@ScottPJones, here's something to think about. I really kind of hate how much string code we have in Base that interacts with Indeed, the new What are your thoughts on consolidating all methods to operating on UInt8s and what would be a minimal abstraction to interact with those bytes?
I'm not sure how that would be all that good... I would think you'd want the compiler to know the size of the code units (1, 2, or 4 bytes), typically with aligned access to 2 or 4 byte values...
3c83c17 to 040b647
@quinnj I've been trying to adapt the checkstring code to not just use Encodings, but also to directly use the String types.
I was also wondering, what do you think about String having a "validated" bit?
Bump @quinnj Have you had a chance to think about any of the questions I raised this weekend? (I realize you've been very busy with mmap, which is a great thing also)
Hey @ScottPJones, yeah, I've been busy with getting stuff ready for JuliaCon, so I'm trying to refrain from doing much more work on this (hopefully we can have a larger discussion or working session at JuliaCon on this! Maybe on Wednesday during the Hackathon). 1&2) I actually took some time to finally soak in the long, but where-much-progress-was-made JuliaLang/julia#9297 and I think things finally started to settle there. I.e. So far, I've only done
For the validated bit, I'm wondering how it will get used. I imagine you could have the normal string constructors do validation on construction, with "unsafe" methods that would allow direct construction without validation (if you know you're getting valid code points from somewhere) for performance. But then how does the bit get used later? Can certain operations only be done on a validated string? Do you "show" them differently?
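One way to picture the idea being discussed, as a rough sketch only: the `MaybeValidated` wrapper and the `checked`, `unchecked`, `validate`, and `charcount` functions are hypothetical names, not an existing Base or package API. The only Base call relied on is `isvalid(String, bytes)`.

```julia
# Hypothetical wrapper carrying a "validated" bit alongside the raw bytes.
struct MaybeValidated
    data::Vector{UInt8}
    validated::Bool
end

# Checked construction: validate once and record that it was done.
function checked(data::Vector{UInt8})
    isvalid(String, data) || throw(ArgumentError("invalid UTF-8 data"))
    return MaybeValidated(data, true)
end

# "Unsafe" construction: wrap the bytes without inspecting them.
unchecked(data::Vector{UInt8}) = MaybeValidated(data, false)

# Validate lazily; returns the argument itself if the bit is already set.
validate(s::MaybeValidated) = s.validated ? s : checked(s.data)

# An operation that uses the bit: in valid UTF-8, every byte that is not a
# continuation byte (10xxxxxx) starts a character, so counting needs no
# further checks once validation is known to have happened.
function charcount(s::MaybeValidated)
    v = validate(s)
    return count(b -> (b & 0xc0) != 0x80, v.data)
end
```

In this sketch the bit answers one of the questions above: operations stay available on unvalidated data, but they pay for validation first, while validated data skips it.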
Right, about 3) I'd really want that sort of thing only accessible from the Strings module.
About a validated bit... I'm not sure about it, I'm just bouncing ideas around... there are so many optimizations possible when you know strings are validated, but if you also want to support read-only access to strings that you just point to in memory (that are in the middle of a record you got from ZMQ, or from Java, or ODBC, etc.), you might be able to validate them, and then set the bit, to avoid the overhead of calling