Add support for arbitrary encoding 🚀 #188
Comments
Hi @Alphare, thanks for getting in touch. Absolutely, let's fix this. I hadn't thought about this previously, but based on your message I think we need to support multiple different encodings in different files/commits within a repo. How does the following sound? cc @fgsch
Recently opened issues #187 and #150 led me to put in the quick fix 9d1aafe to avoid crashing in the face of invalid utf-8, but I was about to modify that in any case, so I think this work fits in. I also think we can do this without impacting performance on the common case (actually, it might improve it). |
I'd like to add that this is not necessarily historical. Utf-8 (or any encoding really) makes sense for text, but there are cases where you want to have arbitrary bytes. Adding something like |
The retry would be at line level (delta only reads a limited number of lines into memory at a time -- basically one diff hunk). I haven't looked into this properly yet: for entirely non-textual files, I was thinking that git (and mercurial?) would often output text like |
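A rough sketch of what a line-level retry could look like, using only the Rust standard library. It assumes "retry" means a second decoding attempt for a line that fails strict UTF-8, and it uses Latin-1 as the fallback purely for illustration, since every byte maps to the code point of the same value. The names `decode_with_retry` and `decode_hunk` are hypothetical and are not taken from delta.

```rust
/// Decode one line: strict UTF-8 first, then retry as Latin-1.
/// The Latin-1 fallback is an assumption made for this sketch.
fn decode_with_retry(line: &[u8]) -> String {
    match std::str::from_utf8(line) {
        // The common case: the line is valid UTF-8.
        Ok(text) => text.to_owned(),
        // Retry: map each byte to the Unicode code point of the same value.
        Err(_) => line.iter().map(|&b| b as char).collect(),
    }
}

/// Split a hunk on newlines and decode each line independently,
/// so one badly encoded line does not affect its neighbours.
fn decode_hunk(hunk: &[u8]) -> Vec<String> {
    hunk.split(|&b| b == b'\n').map(decode_with_retry).collect()
}

fn main() {
    // First line is ASCII, second line holds a Latin-1 encoded "é" (0xE9).
    for line in decode_hunk(b"ok\ncaf\xE9") {
        println!("{}", line);
    }
}
```

Because each line is decoded on its own, a hunk that mixes encodings still produces readable output for the lines that are valid UTF-8.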
@Alphare how does Mercurial handle this? Does it look at the first few bytes of a file to judge whether it appears to be text or not? |
The encoding strategy is documented here and will do a better job than I can: https://www.mercurial-scm.org/wiki/EncodingStrategy If anywhere in the code Mercurial does need to assess whether a hunk is binary, there is an override in the LFS context; otherwise it checks whether the hunk is non-empty and contains a null byte. I unfortunately don't have much time to think about how to solve the issue today, but here's my question: do we have to care about the encoding at all? This is a very candid question, as in "is there an encoding-agnostic way of doing so"? For example: I've been working a lot on the Rust version of |
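The null-byte check described in this comment can be sketched in a few lines of Rust. This only illustrates the heuristic as stated (non-empty and containing a null byte); it is not Mercurial's code, it ignores the LFS override, and the function name `looks_binary` is made up for the example.

```rust
/// Heuristic from the comment above: data counts as binary when it is
/// non-empty and contains at least one NUL byte.
fn looks_binary(hunk: &[u8]) -> bool {
    !hunk.is_empty() && hunk.contains(&0)
}

fn main() {
    println!("{}", looks_binary(b"plain text\n"));      // no NUL byte
    println!("{}", looks_binary(b"\x89PNG\x00\x01"));   // contains a NUL byte
    println!("{}", looks_binary(b""));                  // empty input
}
```

Note that the check says nothing about which text encoding is in use: Windows-1252 text without null bytes is treated as text just like UTF-8.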
I think that we will need to care about the encoding. The operations that delta needs to do on its input lines include parsing, running string alignment algorithms over them to infer within-line edits, and inserting ANSI escape sequences. So although bytelines allows us to iterate over lines of stdin, yielding each line as a pointer to the relevant raw bytes of stdin, it's going to be impossible to implement the rest of Delta without being able to construct heap-allocated |
It looks like it does support bytes: https://github.com/ogham/rust-ansi-term#byte-strings. I don't see how the tasks you've listed could not be done on arbitrary bytes, but maybe I don't understand the specific need for |
Ah, thanks! Well, OK but apart from ansi_term :) ... what about syntect? It looks to me like syntect requires the input to be
I'm new to Rust with this project, and not expert. Honestly, I think that I'd been thinking that, given the prominence of |
@Alphare I've made some of the necessary changes to do this in #191 which provides lossy utf-8 decoding.
Do you happen to have some examples of repos to hand which contain a mix of non-utf-8 encodings which I can test the fix for this issue against? |
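For readers unfamiliar with lossy UTF-8 decoding, here is a minimal standard-library sketch of the general technique. It is not the code from #191; the helper name `decode_line` is hypothetical, and the sample bytes are a hand-written Windows-1252 encoding of an accented name (ë = 0xEB, è = 0xE8).

```rust
use std::borrow::Cow;

/// Lossy decoding: valid UTF-8 is borrowed unchanged, and each invalid
/// byte sequence is replaced with U+FFFD (the replacement character).
fn decode_line(raw: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(raw)
}

fn main() {
    // Valid UTF-8 passes through untouched.
    let utf8 = "Raphaël Gomès".as_bytes();
    // The same name encoded as Windows-1252.
    let cp1252: &[u8] = b"Rapha\xEBl Gom\xE8s";

    println!("{}", decode_line(utf8));
    println!("{}", decode_line(cp1252));
}
```

The trade-off is that this never fails, but the original bytes of the accented characters are lost in the output, so it avoids a crash rather than supporting arbitrary encodings.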
@Alphare would you be able to point me to some example repos that have problematic encodings for me to test against? |
@dandavison sorry for the delay, I should have answered you when I first got the message. I do not personally have any repos that are anything other than UTF8, but I'll ask around.
$ hg init test-repo
$ cd test-repo
$ echo -n "Raphaël Gomès" > foo # assuming UTF8 default
$ hg ci -Am "UTF8"
$ iconv -f UTF-8 -t WINDOWS-1252 foo > foo2
$ mv foo2 foo
$ hg ci -Am "CP1252"
Running I hope that helps, thanks for pinging me. :) |
Early revisions of the nginx repo have non-UTF8 encodings. |
@Alphare that's very useful, thanks a lot. |
Hello, is there any new progress? |
Hi,
I've quickly looked at the project and it looks awesome. Also thanks for supporting Mercurial. :)
The project uses String for handling hunks, which forces the data to be UTF8 or, since 9d1aafe, displays garbage. Some projects historically did not start with UTF8, either because they predated the standard or for other reasons, which makes it impossible for them to use VCS or VCS-related software that does not support handling arbitrary bytes without completely losing their (often decades-long) history.
Have you considered it? Is it a wontfix?
Thanks