Implemented UTF-8 support #29

TaylorZowtuk · 2019-05-17T23:32:38Z

No description provided.

TaylorZowtuk · 2019-05-17T23:52:52Z

This is just an example of why I wanted unicode and what is possible with unicode support:

TheThirdOne · 2019-05-18T00:05:53Z

I have literally never used java in my life except for this so feel free to make any changes and consider this only as a proof of concept?

It seems like a very solid PR for someone who has never used Java before.

When building using ./build-jar.sh I get the following warnings:

Don't worry about the warnings. Those are caused by other parts of RARS. I should probably suppress them though.

results in these warnings in the terminal:

I believe those warnings would go away if you used .byte 0xFFFFFFE2, etc. This is related to #5

Also fixed copyright year ;)

Good catch.

TheThirdOne · 2019-05-18T00:10:37Z

This looks good and I would only need to make a few touch ups before merging it in, but I would be uneasy about merging this without the other areas which could use unicode support also being implemented.

If you feel up to implementing the other areas (mentioned in #28), that would help me merge this in more quickly. Otherwise it will depend on whenever I can get around to implementing the other areas.

Thanks for the contribution.

TaylorZowtuk · 2019-05-19T00:21:05Z

This looks good and I would only need to make a few touch ups before merging it in, but I would be uneasy about merging this without the other areas which could use unicode support also being implemented.

I should also mention that much more testing would need to be done obviously. I really just wanted to send you a PR before the long weekend.

If you feel up to implementing the other areas (mentioned in #28), that would help me merge this in more quickly. Otherwise it will depend on whenever I can get around to implementing the other areas.

I am interested in contributing more so it may be possible but it would have to be in my free time from now on so im also not entirely sure when I would be able to get around to it. Thank you very much for all your feedback. Made my day honestly.

TaylorZowtuk · 2019-05-28T17:57:54Z

This looks good and I would only need to make a few touch ups before merging it in, but I would be uneasy about merging this without the other areas which could use unicode support also being implemented.

If you feel up to implementing the other areas (mentioned in #28), that would help me merge this in more quickly. Otherwise it will depend on whenever I can get around to implementing the other areas.

Would you be able to provide me with a full list of the areas where you would like unicode support extended for you to feel comfortable with merging it all in? I would like to see if it seems within my ability to complete this task and if it seems possible in a reasonable time frame. Thanks!

TheThirdOne · 2019-05-29T03:42:08Z

I mentioned two other areas in #28 already; they both would probably take about as much effort as the change you have already made.

I don't know exactly where other areas might be. I would have to take a look through all of the directives and system calls to make sure there is no other area that string <-> byte conversions happen.

I think those three areas are the only places, and if you implement them all, I'd be happy to merge this and fix any additional areas if I find them while reviewing.

Unicode support for string directives

TaylorZowtuk · 2019-05-30T22:41:55Z

TaylorZowtuk#1 (comment)

TheThirdOne · 2019-05-31T05:40:15Z

It looks like you are making progress. The write system call and bug-fixing look like all there is left.

Bug fixes and error handling

TaylorZowtuk · 2019-05-31T19:57:12Z

Unrelated to the unicode changes I noticed that when a string directive would end with a '\' the directive would save the ending quote from the directive itself. I assume this is unintended behavior so I did a quick check inside the case '"': which sets a flag that is used to avoid saving the " to memory.

https://github.com/TaylorZowtuk/rars/blob/e3a7c6624b8ae3d13c90b40367e01f5d5f6bb3f4/rars/assembler/Assembler.java#L1057-L1064

https://github.com/TaylorZowtuk/rars/blob/e3a7c6624b8ae3d13c90b40367e01f5d5f6bb3f4/rars/assembler/Assembler.java#L1099

Im unsure of whether this is the correct handling of this but I can change it however you'd like. I just thought I would raise your attention to this if it is unintended. With this most recent merge it handles a

.asciz
"\"

by not storing the \ or the directive quote ("). Should it still be storing just the \ which may have been intended by the user or should it raise an error to console considering this is an improper usage of an escape?

I have fixed the file creation issue and handled the \u escape errors better (and hopefully completely).

I will begin working on the write syscall.

TheThirdOne · 2019-06-01T01:08:51Z

The behavior of "\" follows how strings are handled in general. Ultimately quotes don't need to be closed. For example, the following is valid.

.asciz "Text without closing parenthesis

So "\" is equivilant to "\"" which when escaped should end up as ".

I definitely think that the not needing to close quotes is intuitive, but it is inherited from MARS. I think it would be a good idea to add a warning if a string is not closed, but I would rather this not be handled in the escaping code.

TaylorZowtuk · 2019-06-03T01:34:28Z

Ah I wasn't familiar with that being how strings are handled in general. This presents a new potential problem though unless I am still misunderstanding. The current release of rars doenst save the last char if you dont end the string directive with a closing quote. For example if you:

.asciz
"test

then only tes will be saved to memory and if you print later obviously only tes will be printed. This is because the for loop in store strings assumes both a starting and ending quote i believe and starts the loop before the first quote and ends at the last quote.

rars/rars/assembler/Assembler.java

Line 1033 in 7148112

for (int j = 1; j < quote.length() - 1; j++) {

EDIT: I suppose maybe you were aware of this and thats what you meant by wanting a warning raised.

TheThirdOne · 2019-06-03T02:23:20Z

I was not aware of that issue. I thought it was the same as if a quote was added on the end to terminate it.

In your opinion, should an open string be an error or warning? I would be ok breaking compatibility with MARS on this, but am on the fence about making it an error.

TaylorZowtuk · 2019-06-03T15:56:09Z

In your opinion, should an open string be an error or warning? I would be ok breaking compatibility with MARS on this, but am on the fence about making it an error.

Well, I think that if we want to accept unclosed quotes we can easily do this by doing a check right before the for loop of whether the last char in the string is a " or not. Based on that check we can initialize a size variable to be either quote.length - 1 (in the case that the string is ended in a quote) or quote.length (in the case that the string is not closed). This would fix the behavior of not saving the last char to memory in the unclosed case. It would also agree with your original answer of:

So "\" is equivilant to "\"" which when escaped should end up as ".

If we did that I feel like there would no need to raise either a warning or an error because as you said quotes don't need to be closed in general and the behavior, in either case, would then be as expected.

Though, would this be a RISC-V defined behavior or is this something that can be left up to the implementation of the assembler to decide?

TaylorZowtuk · 2019-06-03T16:30:00Z

Can you clarify for me what needs to be done for the write syscall. In #28 you say:

The write syscall looks like it is correct if the default encoding for Strings is UTF-8.

I notice that with my most recent commit I can write to a file unicode characters and have them appear as unicode within the file. So it appears that it does work as you say so what is left to change? To what exactly are you referring to when you say the default encoding? Do you mean of the file we are writing to or the system the program is run on? Is what is left to change simply forcing the file to have utf-8 encoding? Would something like using an OutputStreamWriter to force the characters to be written in utf-8 encoding be what you are asking for (similar to what is mentioned in: https://stackoverflow.com/a/1001562).

TheThirdOne · 2019-06-03T18:47:15Z

Well, I think that if we want to accept unclosed quotes ...

I know how to fix all of the problems from the code side without much issue. Changing how the tokenizer works can do all of that.

Though, would this be a RISC-V defined behavior or is this something that can be left up to the implementation of the assembler to decide?

AFAIK, no RISC-V assembler besides RARS does this. The reason RARS does it is because MARS does it and I didn't make the choice to change away from it.

If I were implementing from scratch I probably would not have made open strings possible.

I notice that with my most recent commit I can write to a file unicode characters and have them appear as unicode within the file. So it appears that it does work as you say so what is left to change?

I would expect unicode being written to files to work. OutputStreams just send bytes not characters so the utf-8 encoded strings passed in should be directly written to the file.

So it appears that it does work as you say so what is left to change? When you say the default encoding in:
if the default encoding for Strings is UTF-8.

I was referring to these lines:

rars/rars/util/SystemIO.java

Lines 265 to 266 in 522ebbd

    
           String data = new String(myBuffer); 
        
           Globals.getGui().getMessagesPane().postRunMessage(data);

The GUI console in RARS is posted to by strings rather than bytes and so a conversion must be made. If new String(bytes) by default interprets the bytes as utf-8 then it should already work. I don't think that is true, but making the encoding of the string be utf-8 should be as simple as adding a parameter.

I definitely should have explained that a little in the original issue.

UTF8 string construction in write syscall and unsupported encoding handling

TaylorZowtuk · 2019-06-03T21:43:55Z

I'm still not quite sure about how I handled the unsupported encoding exception in each of the files. I tried throwing the assertionError as you mentioned but dont feel comfortable enough with my java error handling abilities and in general how errors are handled in rars to fully implement it correctly. For now I simply followed what I saw from other files when rars would be forced to close (such as in syscallLoader). My understanding is that if i throw the aeesrtionError i would then need to modify wherever else calls these function and catch the thrown assertionError and handle it there? Would it be possible to leave this to you or would you be able to provide a little more direction on this matter?

How does this look and is there anything that needs to be improved?

TheThirdOne · 2019-06-04T00:42:43Z

My understanding is that if i throw the aeesrtionError i would then need to modify wherever else calls these function and catch the thrown assertionError and handle it there?

Java does allow undeclared exceptions so you can throw an AssertionError and it will propagate up the call stack until it is caught or leaves the root call. If uncaught it will terminate the program and show a stack trace. With threads this works a little differently, but you don't need to worry about it.

Would it be possible to leave this to you or would you be able to provide a little more direction on this matter?

How does this look and is there anything that needs to be improved?

I can handle the error handling. I think you've done everything I suggested so I'll take this, read over it, run through some tests and merge it in.

TaylorZowtuk · 2019-06-04T15:23:49Z

Thanks for all your help and guidance throughout all of this!

Unterminated strings were mentioned in #29, I decided to make them an error because it was slightly technically easier. Also I don't know of any riscv assembler which allows unterminated strings. The range warnings for .byte and .half were improved to not warn on values the fit in an unsigned byte/half and to say exactly how the value is truncated. Now it matches how gcc handles it. This was mentioned in

Unterminated strings were mentioned in #29, I decided to make them an error because it was slightly technically easier. Also I don't know of any riscv assembler which allows unterminated strings. The range warnings for .byte and .half were improved to not warn on values the fit in an unsigned byte/half and to say exactly how the value is truncated. Now it matches how gcc handles it. This was mentioned in #5

TheThirdOne · 2019-06-04T20:46:21Z

This was merged in with 3280cc7.

I found a few spots where support for utf-8 is missing, and added TODOs for them (1d5cdc3). They are all low priority though so, I felt comfortable merging in without them done. If you want to continue improving unicode support, you could fix them, but its fine if you don't.

The main change that I made to your code was using java.nio.charset.StandardCharsets.UTF_8 as the charset for Strings which removes the need for try catches. If you want to look over what else I changed, take a look at 10e9c6f.

implemented UTF-8 support

d057f77

TaylorZowtuk added 3 commits May 29, 2019 16:40

Implemented utf-8 for string loading in directive

33c9bd2

Added unicode escape parsing

6e0a24f

Merge pull request #1 from TaylorZowtuk/unicode

de9205a

Unicode support for string directives

TaylorZowtuk added 2 commits May 31, 2019 13:31

Bug fixes and error handling

579cad4

Merge pull request #2 from TaylorZowtuk/unicode

e3a7c66

Bug fixes and error handling

Reverted " flag in storeStrings

1b1d962

TaylorZowtuk added 2 commits June 3, 2019 15:13

Handled unsupported encoding

4a75c1c

Merge pull request #3 from TaylorZowtuk/unicode

d7ad1ca

UTF8 string construction in write syscall and unsupported encoding handling

TheThirdOne closed this Jun 4, 2019

TheThirdOne mentioned this pull request Jun 4, 2019

Printing unicode block elements using syscall #28

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implemented UTF-8 support #29

Implemented UTF-8 support #29

TaylorZowtuk commented May 17, 2019

TaylorZowtuk commented May 17, 2019

TheThirdOne commented May 18, 2019

TheThirdOne commented May 18, 2019

TaylorZowtuk commented May 19, 2019

TaylorZowtuk commented May 28, 2019 •

edited

Loading

TheThirdOne commented May 29, 2019

TaylorZowtuk commented May 30, 2019

TheThirdOne commented May 31, 2019

TaylorZowtuk commented May 31, 2019

TheThirdOne commented Jun 1, 2019

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading

TheThirdOne commented Jun 3, 2019

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading

TheThirdOne commented Jun 3, 2019

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading

TheThirdOne commented Jun 4, 2019

TaylorZowtuk commented Jun 4, 2019

TheThirdOne commented Jun 4, 2019

Implemented UTF-8 support #29

Implemented UTF-8 support #29

Conversation

TaylorZowtuk commented May 17, 2019

TaylorZowtuk commented May 17, 2019

TheThirdOne commented May 18, 2019

TheThirdOne commented May 18, 2019

TaylorZowtuk commented May 19, 2019

TaylorZowtuk commented May 28, 2019 • edited Loading

TheThirdOne commented May 29, 2019

TaylorZowtuk commented May 30, 2019

TheThirdOne commented May 31, 2019

TaylorZowtuk commented May 31, 2019

TheThirdOne commented Jun 1, 2019

TaylorZowtuk commented Jun 3, 2019 • edited Loading

TheThirdOne commented Jun 3, 2019

TaylorZowtuk commented Jun 3, 2019 • edited Loading

TaylorZowtuk commented Jun 3, 2019 • edited Loading

TheThirdOne commented Jun 3, 2019

TaylorZowtuk commented Jun 3, 2019 • edited Loading

TheThirdOne commented Jun 4, 2019

TaylorZowtuk commented Jun 4, 2019

TheThirdOne commented Jun 4, 2019

TaylorZowtuk commented May 28, 2019 •

edited

Loading

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading

TaylorZowtuk commented Jun 3, 2019 •

edited

Loading