Add Text.IO.Utf8 module #503
The title of issue #472 mentions using If I understand correctly, this is about variants of Wouldn't it be better to add to @haskell/text any objections? |
Yeah, the intention is to make it faster in the future once I finish haskell/bytestring#547. I generally like using qualified imports instead of adding suffixes to everything, so that's why I put it in a separate module. I'm fine with putting it in |
I'd mildly prefer to fold it into Shall we provide the same set of functions for lazy |
This change looks overall good - especially if we can get a zero-copy via ShortByteString!
Is there any intent to deprecate the existing readFile, possibly pointing it to a named function readFileWithSystemLocale? That would describe the most common footgun and allow people to migrate either to the existing behavior, or to this new behavior.
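To make the two behaviors in this comment concrete, here is a minimal sketch, not the code from this PR. It assumes only the bytestring and text packages; readFileUtf8 and writeFileUtf8 are hypothetical names, and readFileWithSystemLocale is the name proposed in the comment above, shown as a plain alias for the existing Data.Text.IO.readFile.

```haskell
import qualified Data.ByteString as B
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import qualified Data.Text.IO as TIO

-- | Hypothetical helper: read raw bytes and decode them as UTF-8,
-- ignoring the system locale. No newline conversion takes place.
-- Note that 'decodeUtf8' throws an exception on invalid UTF-8 input.
readFileUtf8 :: FilePath -> IO Text
readFileUtf8 path = decodeUtf8 <$> B.readFile path

-- | Hypothetical helper: encode as UTF-8 and write the raw bytes.
writeFileUtf8 :: FilePath -> Text -> IO ()
writeFileUtf8 path = B.writeFile path . encodeUtf8

-- | The name suggested in the review comment, sketched as an alias
-- for the existing locale-dependent function.
readFileWithSystemLocale :: FilePath -> IO Text
readFileWithSystemLocale = TIO.readFile
```

Under this sketch, a caller who wants the old behavior switches to readFileWithSystemLocale, and a caller who wants locale independence switches to readFileUtf8.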
How should I group the documentation in the |
When it was just the three file functions it seemed odd to have a dedicated module, but now that I see it's the whole Text.IO API that's affected and being duplicated, it makes more sense to have a new module Data.Text.IO.Utf8. That way, users who actually don't want to be locale dependent can just do a search-replace on the module name.
This reverts commit 306f7fa.
Overall looks good to me, only a couple of remarks.
LGTM. Any chance to write some tests?
It doesn't seem like Data.Text.IO has any tests? Also, the code really just builds on stuff from bytestring.
There are some basic IO tests in a very strange location: text/tests/Tests/Properties/LowLevel.hs Lines 141 to 145 in ff4af4c
@Bodigrim I added tests.
There's a danger with the lack of newline conversion. Windows users are going to be very confused when they read their files on their system and get an unexpected number of characters. It's useful to distinguish two issues here:
We can resolve (1) without changing the API, by adding a special case in the existing IO functions. The matter of newline conversions adds some complexity; currently And (2) turns out to not be accurate because of newline conversion. UTF-8 still leaves open the question of how to encode newlines. I think that, either way, |
Can't we just document that it doesn't convert newlines? The user can convert newlines explicitly if they want, and we can add functions for that. How would we add a special case for the existing functions?
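As an illustration of what explicit newline conversion could look like on the user's side, here is a small sketch built on Data.Text.replace. The names crlfToLf and lfToCrlf are hypothetical and are not part of this PR or of the text API.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical helper: turn Windows-style CRLF into LF.
-- A lone '\r' that is not followed by '\n' is left untouched.
crlfToLf :: Text -> Text
crlfToLf = T.replace (T.pack "\r\n") (T.pack "\n")

-- | Hypothetical helper: turn every line ending into CRLF.
-- Normalising to LF first avoids producing "\r\r\n" when the
-- input already contains CRLF.
lfToCrlf :: Text -> Text
lfToCrlf = T.replace (T.pack "\n") (T.pack "\r\n") . crlfToLf
```

A Windows user reading a CRLF file through a UTF-8 function without newline conversion would apply crlfToLf to the result to get the character count they expect.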
I don't think those functions are really that conventional that they pass the Fairbairn threshold. Even if we document the behavior, it's really not obvious what specific circumstances warrant it. The standard encoding is set by the platform and the locale. As far as I can tell, it's only legitimate to ignore that standard when you downloaded a file from a Unix system, or you're talking to another local application which is ignoring the standard. IMO The niche bit of convenience of including
The encoding is determined by the |
In the modern environment locale-dependent IO is a larger risk than reading UTF-8 by default. You don't really want to lose all data just because someone accidentally changed the system locale to ASCII. E.g., even GHC itself always expects UTF-8, whatever the locale. We actually do warn about this issue already: Lines 70 to 78 in ff4af4c
See also https://www.snoyman.com/blog/2016/12/beware-of-readfile/ I've defined this very set of functions in private projects more than once, so I'm keen to have them available from |
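The risk described here comes from the fact that the existing Data.Text.IO functions decode using the Handle's encoding, which defaults to the locale. The sketch below makes that dependency visible by setting the encoding explicitly; readFileWithEncoding is a hypothetical helper for illustration only and is not something this PR adds.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO

-- | Hypothetical helper: read a whole file, decoding with the given
-- encoding instead of whatever the locale selects by default.
-- 'TIO.hGetContents' is strict, so the handle can be closed safely
-- when 'withFile' returns.
readFileWithEncoding :: TextEncoding -> FilePath -> IO T.Text
readFileWithEncoding enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    TIO.hGetContents h
```

The same bytes decode differently depending on the encoding passed in: the two-byte UTF-8 sequence for "é" comes back as one character with utf8 and as two characters with latin1, which is the kind of silent difference a changed locale can introduce.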
FWIW |
The blog post makes a good point. Maybe I underestimated how many systems do assume UTF-8. Consider myself overruled. Do mention the lack of newline conversion then. It makes sense for bytestring to ignore all of that because it's not necessarily dealing with text.
Thanks, @oberblastmeister!
Solves #472