Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented UTF-8 support #29

Closed
wants to merge 9 commits into from

Conversation

TaylorZowtuk
Copy link
Contributor

No description provided.

@TaylorZowtuk
Copy link
Contributor Author

This is just an example of why I wanted unicode and what is possible with unicode support:
colorDemo

@TheThirdOne
Copy link
Owner

I have literally never used java in my life except for this so feel free to make any changes and consider this only as a proof of concept?

It seems like a very solid PR for someone who has never used Java before.

When building using ./build-jar.sh I get the following warnings:

Don't worry about the warnings. Those are caused by other parts of RARS. I should probably suppress them though.

results in these warnings in the terminal:

I believe those warnings would go away if you used .byte 0xFFFFFFE2, etc. This is related to #5

Also fixed copyright year ;)

Good catch.

@TheThirdOne
Copy link
Owner

This looks good and I would only need to make a few touch ups before merging it in, but I would be uneasy about merging this without the other areas which could use unicode support also being implemented.

If you feel up to implementing the other areas (mentioned in #28), that would help me merge this in more quickly. Otherwise it will depend on whenever I can get around to implementing the other areas.

Thanks for the contribution.

@TaylorZowtuk
Copy link
Contributor Author

This looks good and I would only need to make a few touch ups before merging it in, but I would be uneasy about merging this without the other areas which could use unicode support also being implemented.

I should also mention that much more testing would need to be done obviously. I really just wanted to send you a PR before the long weekend.

If you feel up to implementing the other areas (mentioned in #28), that would help me merge this in more quickly. Otherwise it will depend on whenever I can get around to implementing the other areas.

I am interested in contributing more so it may be possible but it would have to be in my free time from now on so im also not entirely sure when I would be able to get around to it. Thank you very much for all your feedback. Made my day honestly.

@TaylorZowtuk
Copy link
Contributor Author

TaylorZowtuk commented May 28, 2019

This looks good and I would only need to make a few touch ups before merging it in, but I would be uneasy about merging this without the other areas which could use unicode support also being implemented.

If you feel up to implementing the other areas (mentioned in #28), that would help me merge this in more quickly. Otherwise it will depend on whenever I can get around to implementing the other areas.

Would you be able to provide me with a full list of the areas where you would like unicode support extended for you to feel comfortable with merging it all in? I would like to see if it seems within my ability to complete this task and if it seems possible in a reasonable time frame. Thanks!

@TheThirdOne
Copy link
Owner

I mentioned two other areas in #28 already; they both would probably take about as much effort as the change you have already made.

I don't know exactly where other areas might be. I would have to take a look through all of the directives and system calls to make sure there is no other area that string <-> byte conversions happen.

I think those three areas are the only places, and if you implement them all, I'd be happy to merge this and fix any additional areas if I find them while reviewing.

@TaylorZowtuk
Copy link
Contributor Author

TaylorZowtuk#1 (comment)

@TheThirdOne
Copy link
Owner

It looks like you are making progress. The write system call and bug-fixing look like all there is left.

@TaylorZowtuk
Copy link
Contributor Author

Unrelated to the unicode changes I noticed that when a string directive would end with a '\' the directive would save the ending quote from the directive itself. I assume this is unintended behavior so I did a quick check inside the case '"': which sets a flag that is used to avoid saving the " to memory.

https://github.com/TaylorZowtuk/rars/blob/e3a7c6624b8ae3d13c90b40367e01f5d5f6bb3f4/rars/assembler/Assembler.java#L1057-L1064

https://github.com/TaylorZowtuk/rars/blob/e3a7c6624b8ae3d13c90b40367e01f5d5f6bb3f4/rars/assembler/Assembler.java#L1099

Im unsure of whether this is the correct handling of this but I can change it however you'd like. I just thought I would raise your attention to this if it is unintended. With this most recent merge it handles a

.asciz
"\" 

by not storing the \ or the directive quote ("). Should it still be storing just the \ which may have been intended by the user or should it raise an error to console considering this is an improper usage of an escape?

I have fixed the file creation issue and handled the \u escape errors better (and hopefully completely).

I will begin working on the write syscall.

@TheThirdOne
Copy link
Owner

The behavior of "\" follows how strings are handled in general. Ultimately quotes don't need to be closed. For example, the following is valid.

.asciz "Text without closing parenthesis

So "\" is equivilant to "\"" which when escaped should end up as ".

I definitely think that the not needing to close quotes is intuitive, but it is inherited from MARS. I think it would be a good idea to add a warning if a string is not closed, but I would rather this not be handled in the escaping code.

@TaylorZowtuk
Copy link
Contributor Author

TaylorZowtuk commented Jun 3, 2019

Ah I wasn't familiar with that being how strings are handled in general. This presents a new potential problem though unless I am still misunderstanding. The current release of rars doenst save the last char if you dont end the string directive with a closing quote. For example if you:

.asciz
"test

then only tes will be saved to memory and if you print later obviously only tes will be printed. This is because the for loop in store strings assumes both a starting and ending quote i believe and starts the loop before the first quote and ends at the last quote.

for (int j = 1; j < quote.length() - 1; j++) {

EDIT: I suppose maybe you were aware of this and thats what you meant by wanting a warning raised.

@TheThirdOne
Copy link
Owner

I was not aware of that issue. I thought it was the same as if a quote was added on the end to terminate it.

In your opinion, should an open string be an error or warning? I would be ok breaking compatibility with MARS on this, but am on the fence about making it an error.

@TaylorZowtuk
Copy link
Contributor Author

TaylorZowtuk commented Jun 3, 2019

In your opinion, should an open string be an error or warning? I would be ok breaking compatibility with MARS on this, but am on the fence about making it an error.

Well, I think that if we want to accept unclosed quotes we can easily do this by doing a check right before the for loop of whether the last char in the string is a " or not. Based on that check we can initialize a size variable to be either quote.length - 1 (in the case that the string is ended in a quote) or quote.length (in the case that the string is not closed). This would fix the behavior of not saving the last char to memory in the unclosed case. It would also agree with your original answer of:

So "\" is equivilant to "\"" which when escaped should end up as ".

If we did that I feel like there would no need to raise either a warning or an error because as you said quotes don't need to be closed in general and the behavior, in either case, would then be as expected.

Though, would this be a RISC-V defined behavior or is this something that can be left up to the implementation of the assembler to decide?

@TaylorZowtuk
Copy link
Contributor Author

TaylorZowtuk commented Jun 3, 2019

Can you clarify for me what needs to be done for the write syscall. In #28 you say:

The write syscall looks like it is correct if the default encoding for Strings is UTF-8.

I notice that with my most recent commit I can write to a file unicode characters and have them appear as unicode within the file. So it appears that it does work as you say so what is left to change? To what exactly are you referring to when you say the default encoding? Do you mean of the file we are writing to or the system the program is run on? Is what is left to change simply forcing the file to have utf-8 encoding? Would something like using an OutputStreamWriter to force the characters to be written in utf-8 encoding be what you are asking for (similar to what is mentioned in: https://stackoverflow.com/a/1001562).

@TheThirdOne
Copy link
Owner

Well, I think that if we want to accept unclosed quotes ...

I know how to fix all of the problems from the code side without much issue. Changing how the tokenizer works can do all of that.

Though, would this be a RISC-V defined behavior or is this something that can be left up to the implementation of the assembler to decide?

AFAIK, no RISC-V assembler besides RARS does this. The reason RARS does it is because MARS does it and I didn't make the choice to change away from it.

If I were implementing from scratch I probably would not have made open strings possible.

I notice that with my most recent commit I can write to a file unicode characters and have them appear as unicode within the file. So it appears that it does work as you say so what is left to change?

I would expect unicode being written to files to work. OutputStreams just send bytes not characters so the utf-8 encoded strings passed in should be directly written to the file.

So it appears that it does work as you say so what is left to change? When you say the default encoding in:
if the default encoding for Strings is UTF-8.

I was referring to these lines:

String data = new String(myBuffer);
Globals.getGui().getMessagesPane().postRunMessage(data);

The GUI console in RARS is posted to by strings rather than bytes and so a conversion must be made. If new String(bytes) by default interprets the bytes as utf-8 then it should already work. I don't think that is true, but making the encoding of the string be utf-8 should be as simple as adding a parameter.

I definitely should have explained that a little in the original issue.

UTF8 string construction in write syscall and unsupported encoding handling
@TaylorZowtuk
Copy link
Contributor Author

TaylorZowtuk commented Jun 3, 2019

I'm still not quite sure about how I handled the unsupported encoding exception in each of the files. I tried throwing the assertionError as you mentioned but dont feel comfortable enough with my java error handling abilities and in general how errors are handled in rars to fully implement it correctly. For now I simply followed what I saw from other files when rars would be forced to close (such as in syscallLoader). My understanding is that if i throw the aeesrtionError i would then need to modify wherever else calls these function and catch the thrown assertionError and handle it there? Would it be possible to leave this to you or would you be able to provide a little more direction on this matter?

How does this look and is there anything that needs to be improved?

@TheThirdOne
Copy link
Owner

My understanding is that if i throw the aeesrtionError i would then need to modify wherever else calls these function and catch the thrown assertionError and handle it there?

Java does allow undeclared exceptions so you can throw an AssertionError and it will propagate up the call stack until it is caught or leaves the root call. If uncaught it will terminate the program and show a stack trace. With threads this works a little differently, but you don't need to worry about it.

Would it be possible to leave this to you or would you be able to provide a little more direction on this matter?

How does this look and is there anything that needs to be improved?

I can handle the error handling. I think you've done everything I suggested so I'll take this, read over it, run through some tests and merge it in.

@TaylorZowtuk
Copy link
Contributor Author

Thanks for all your help and guidance throughout all of this!

TheThirdOne added a commit that referenced this pull request Jun 4, 2019
Unterminated strings were mentioned in #29, I decided to make them an
error because it was slightly technically easier. Also I don't know of
any riscv assembler which allows unterminated strings.

The range warnings for .byte and .half were improved to not warn on
values the fit in an unsigned byte/half and to say exactly how the value
is truncated. Now it matches how gcc handles it. This was mentioned in
TheThirdOne added a commit that referenced this pull request Jun 4, 2019
Unterminated strings were mentioned in #29, I decided to make them an
error because it was slightly technically easier. Also I don't know of
any riscv assembler which allows unterminated strings.

The range warnings for .byte and .half were improved to not warn on
values the fit in an unsigned byte/half and to say exactly how the value
is truncated. Now it matches how gcc handles it. This was mentioned in #5
@TheThirdOne
Copy link
Owner

This was merged in with 3280cc7.

I found a few spots where support for utf-8 is missing, and added TODOs for them (1d5cdc3). They are all low priority though so, I felt comfortable merging in without them done. If you want to continue improving unicode support, you could fix them, but its fine if you don't.

The main change that I made to your code was using java.nio.charset.StandardCharsets.UTF_8 as the charset for Strings which removes the need for try catches. If you want to look over what else I changed, take a look at 10e9c6f.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants