-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implemented UTF-8 support #29
Conversation
It seems like a very solid PR for someone who has never used Java before.
Don't worry about the warnings. Those are caused by other parts of RARS. I should probably suppress them though.
I believe those warnings would go away if you used
Good catch. |
This looks good and I would only need to make a few touch ups before merging it in, but I would be uneasy about merging this without the other areas which could use unicode support also being implemented. If you feel up to implementing the other areas (mentioned in #28), that would help me merge this in more quickly. Otherwise it will depend on whenever I can get around to implementing the other areas. Thanks for the contribution. |
I should also mention that much more testing would need to be done obviously. I really just wanted to send you a PR before the long weekend.
I am interested in contributing more so it may be possible but it would have to be in my free time from now on so im also not entirely sure when I would be able to get around to it. Thank you very much for all your feedback. Made my day honestly. |
Would you be able to provide me with a full list of the areas where you would like unicode support extended for you to feel comfortable with merging it all in? I would like to see if it seems within my ability to complete this task and if it seems possible in a reasonable time frame. Thanks! |
I mentioned two other areas in #28 already; they both would probably take about as much effort as the change you have already made. I don't know exactly where other areas might be. I would have to take a look through all of the directives and system calls to make sure there is no other area that string <-> byte conversions happen. I think those three areas are the only places, and if you implement them all, I'd be happy to merge this and fix any additional areas if I find them while reviewing. |
Unicode support for string directives
It looks like you are making progress. The write system call and bug-fixing look like all there is left. |
Bug fixes and error handling
Unrelated to the unicode changes I noticed that when a string directive would end with a Im unsure of whether this is the correct handling of this but I can change it however you'd like. I just thought I would raise your attention to this if it is unintended. With this most recent merge it handles a
by not storing the \ or the directive quote ("). Should it still be storing just the \ which may have been intended by the user or should it raise an error to console considering this is an improper usage of an escape? I have fixed the file creation issue and handled the \u escape errors better (and hopefully completely). I will begin working on the write syscall. |
The behavior of
So I definitely think that the not needing to close quotes is intuitive, but it is inherited from MARS. I think it would be a good idea to add a warning if a string is not closed, but I would rather this not be handled in the escaping code. |
Ah I wasn't familiar with that being how strings are handled in general. This presents a new potential problem though unless I am still misunderstanding. The current release of rars doenst save the last char if you dont end the string directive with a closing quote. For example if you:
then only tes will be saved to memory and if you print later obviously only tes will be printed. This is because the for loop in store strings assumes both a starting and ending quote i believe and starts the loop before the first quote and ends at the last quote. rars/rars/assembler/Assembler.java Line 1033 in 7148112
EDIT: I suppose maybe you were aware of this and thats what you meant by wanting a warning raised. |
I was not aware of that issue. I thought it was the same as if a quote was added on the end to terminate it. In your opinion, should an open string be an error or warning? I would be ok breaking compatibility with MARS on this, but am on the fence about making it an error. |
Well, I think that if we want to accept unclosed quotes we can easily do this by doing a check right before the for loop of whether the last char in the string is a
If we did that I feel like there would no need to raise either a warning or an error because as you said quotes don't need to be closed in general and the behavior, in either case, would then be as expected. Though, would this be a RISC-V defined behavior or is this something that can be left up to the implementation of the assembler to decide? |
Can you clarify for me what needs to be done for the write syscall. In #28 you say:
I notice that with my most recent commit I can write to a file unicode characters and have them appear as unicode within the file. So it appears that it does work as you say so what is left to change? To what exactly are you referring to when you say the default encoding? Do you mean of the file we are writing to or the system the program is run on? Is what is left to change simply forcing the file to have utf-8 encoding? Would something like using an OutputStreamWriter to force the characters to be written in utf-8 encoding be what you are asking for (similar to what is mentioned in: https://stackoverflow.com/a/1001562). |
I know how to fix all of the problems from the code side without much issue. Changing how the tokenizer works can do all of that.
AFAIK, no RISC-V assembler besides RARS does this. The reason RARS does it is because MARS does it and I didn't make the choice to change away from it. If I were implementing from scratch I probably would not have made open strings possible.
I would expect unicode being written to files to work. OutputStreams just send bytes not characters so the utf-8 encoded strings passed in should be directly written to the file.
I was referring to these lines: Lines 265 to 266 in 522ebbd
The GUI console in RARS is posted to by strings rather than bytes and so a conversion must be made. If I definitely should have explained that a little in the original issue. |
UTF8 string construction in write syscall and unsupported encoding handling
I'm still not quite sure about how I handled the unsupported encoding exception in each of the files. I tried throwing the assertionError as you mentioned but dont feel comfortable enough with my java error handling abilities and in general how errors are handled in rars to fully implement it correctly. For now I simply followed what I saw from other files when rars would be forced to close (such as in syscallLoader). My understanding is that if i throw the aeesrtionError i would then need to modify wherever else calls these function and catch the thrown assertionError and handle it there? Would it be possible to leave this to you or would you be able to provide a little more direction on this matter? How does this look and is there anything that needs to be improved? |
Java does allow undeclared exceptions so you can throw an AssertionError and it will propagate up the call stack until it is caught or leaves the root call. If uncaught it will terminate the program and show a stack trace. With threads this works a little differently, but you don't need to worry about it.
I can handle the error handling. I think you've done everything I suggested so I'll take this, read over it, run through some tests and merge it in. |
Thanks for all your help and guidance throughout all of this! |
Unterminated strings were mentioned in #29, I decided to make them an error because it was slightly technically easier. Also I don't know of any riscv assembler which allows unterminated strings. The range warnings for .byte and .half were improved to not warn on values the fit in an unsigned byte/half and to say exactly how the value is truncated. Now it matches how gcc handles it. This was mentioned in
Unterminated strings were mentioned in #29, I decided to make them an error because it was slightly technically easier. Also I don't know of any riscv assembler which allows unterminated strings. The range warnings for .byte and .half were improved to not warn on values the fit in an unsigned byte/half and to say exactly how the value is truncated. Now it matches how gcc handles it. This was mentioned in #5
This was merged in with 3280cc7. I found a few spots where support for utf-8 is missing, and added TODOs for them (1d5cdc3). They are all low priority though so, I felt comfortable merging in without them done. If you want to continue improving unicode support, you could fix them, but its fine if you don't. The main change that I made to your code was using |
No description provided.