Check UTF-8 support #2024

Closed
kasemir opened this issue Oct 5, 2021 · 11 comments

@kasemir
Collaborator

kasemir commented Oct 5, 2021

EPICS database files are supposedly UTF-8.
In the attached database file, the EGU field is set to the UTF-8 bytes 0xC2 0xB0, which encode the 'degree' symbol.

With CSS on Mac, it shows up as such:

Screen Shot 2021-10-05 at 11 15 52 AM

x.txt

Does that not work on Windows?
Is something missing to force UTF-8 interpretation of the units, so that when Windows uses some other default encoding, the units are not represented properly?

@kasemir
Collaborator Author

kasemir commented Oct 5, 2021

A hexdump of the file shows that field(EGU, ...) is set to the bytes c2 b0, the UTF-8 encoding of the degree sign (U+00B0): https://www.utf8icons.com/character/176/degree-sign

hexdump -C x.txt 
00000000  72 65 63 6f 72 64 28 61  69 2c 20 22 74 65 73 74  |record(ai, "test|
00000010  22 29 0a 7b 0a 20 20 20  66 69 65 6c 64 28 49 4e  |").{.   field(IN|
00000020  50 2c 20 22 30 78 46 46  22 29 0a 20 20 20 66 69  |P, "0xFF").   fi|
00000030  65 6c 64 28 50 49 4e 49  2c 20 22 59 45 53 22 29  |eld(PINI, "YES")|
00000040  0a 20 20 20 66 69 65 6c  64 28 45 47 55 2c 20 22  |.   field(EGU, "|
00000050  c2 b0 22 29 0a 7d 0a                              |..").}.|
00000057
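
The same byte values can be cross-checked from Java. This is a minimal sketch, not part of the original report; the class name DegreeBytes and the helper toHex are made up for illustration. It encodes the degree sign (U+00B0) as UTF-8 and, for comparison, as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class DegreeBytes
{
    /** Format bytes as lowercase hex pairs separated by spaces, similar to hexdump */
    public static String toHex(final byte[] bytes)
    {
        final StringBuilder buf = new StringBuilder();
        for (byte b : bytes)
        {
            if (buf.length() > 0)
                buf.append(' ');
            // Mask to 0..255 so negative byte values are not sign-extended
            buf.append(String.format("%02x", b & 0xFF));
        }
        return buf.toString();
    }

    public static void main(String[] args)
    {
        // "\u00B0" is the degree sign; the escape avoids depending on the source file encoding
        System.out.println(toHex("\u00B0".getBytes(StandardCharsets.UTF_8)));      // c2 b0
        System.out.println(toHex("\u00B0".getBytes(StandardCharsets.ISO_8859_1))); // b0
    }
}
```

The two-byte sequence c2 b0 matches the hexdump above, while a single-byte Latin-1 encoding would have produced just b0.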

@kasemir
Collaborator Author

kasemir commented Nov 17, 2021

Here's that database executed by an IOC on Linux, where the 'EGU' are displayed correctly just as on Mac.

Screen Shot 2021-11-17 at 1 29 48 PM

Same with showing the units in CS-Studio, reading the PV via CA or PVA:

Screen Shot 2021-11-17 at 1 29 55 PM

@Sarat-Raj

Sarat-Raj commented Dec 1, 2021

While working with Phoebus on Windows with the "°" symbol in EGU, I saw a different problem, shown in the attached images:
image

image

@georgweiss
Collaborator

georgweiss commented Dec 1, 2021

I can reproduce the issue on Windows. However, adding -Dfile.encoding=UTF8 on the command line solves the issue in this particular example. I have not investigated potential side effects.
Capture

@kasemir
Collaborator Author

kasemir commented Dec 1, 2021

"adding -Dfile.encoding=UTF8 on the command line will solve the issue"

Should we set that property somewhere early in the launcher, so there's no need to provide it on the command line?

@georgweiss
Collaborator

Maybe, if we think there are no side effects.
In any case, adding System.setProperty("file.encoding", "UTF8") as the first statement in main has the same effect on Windows; I just verified it.

@kasemir
Collaborator Author

kasemir commented Dec 1, 2021

It's unclear to me where the difference takes effect.
At first glance, "file.encoding" points to, well, reading or writing a file.
When we open the display files, the ModelReader selects "UTF-8":

final ByteArrayInputStream stream = new ByteArrayInputStream(xml.getBytes(XMLUtil.ENCODING));

public static final String ENCODING = "UTF-8";

So the case of opening a display file should already be handled.
But here we're receiving text from a PV, where we just fetch the String that we get from channel access.

Maybe that needs to be something like this:

new String(metadata.getUnits().getBytes(), "UTF-8"),

The problem is that we don't get the raw bytes. We get the units already converted into a String, so how can we reliably go back to the original bytes? Maybe "file.encoding" is much broader than just reading from files and applies to any byte[]-to-String conversion, so that the units string from the channel access client library is already decoded correctly as UTF-8 when we set "file.encoding".
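
To illustrate why going back from an already decoded String to the original bytes is not reliable in general, here is a small sketch. It does not use the channel access library; it only simulates a decoder that picked the wrong charset, and the class and method names are hypothetical. Whether the repair works depends on whether the wrong charset happens to preserve every byte: ISO-8859-1 does, US-ASCII does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo
{
    /** Simulate a library that decoded UTF-8 bytes with the wrong charset,
     *  then attempt the repair: re-encode with that charset, decode as UTF-8.
     */
    public static String decodeWronglyThenRepair(final byte[] utf8, final Charset wrong)
    {
        final String garbled = new String(utf8, wrong);
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        // UTF-8 encoding of the degree sign, as in the database file
        final byte[] degree = { (byte) 0xC2, (byte) 0xB0 };

        // ISO-8859-1 maps every byte to a character, so the bytes survive the round trip
        System.out.println("\u00B0".equals(decodeWronglyThenRepair(degree, StandardCharsets.ISO_8859_1))); // true

        // US-ASCII cannot represent 0xC2 or 0xB0; they turn into replacement
        // characters on decoding and into '?' on re-encoding, so the bytes are lost
        System.out.println("\u00B0".equals(decodeWronglyThenRepair(degree, StandardCharsets.US_ASCII)));   // false
    }
}
```

Note that the no-argument getBytes() in the snippet above re-encodes with the JVM default charset, so it would only recover the original bytes if that same charset was used for the initial decoding and is lossless for the bytes involved.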

@georgweiss
Collaborator

Not sure I have the proper EPICS experience to comment at this point, but the metadata field referenced in DBHelper#L191 is jca magic, and I guess we'd like to avoid changing that.
As for file.encoding: I agree the naming suggests something quite broad, hence my hesitations. What if it means "file" in a broad sense, e.g. any type of data stream (file, socket...)?

@kasemir
Collaborator Author

kasemir commented Dec 1, 2021

https://www.baeldung.com/java-char-encoding suggests that file.encoding is indeed quite broad: it is "the name of the default charset" used by String, input streams, and so on. Some newer APIs that don't use file.encoding, such as java.nio.file.Files, default to UTF-8. So setting file.encoding to UTF-8 just brings everything into alignment with the newer APIs, and since EPICS also uses UTF-8, my vote would be for setting file.encoding early in the launcher code.
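
The difference between relying on the default charset and naming one explicitly can be sketched as follows. This is an illustration only, with hypothetical class and method names; it is not taken from Phoebus or the channel access library. The no-argument String constructor follows the JVM default charset, while the two-argument form is independent of it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo
{
    /** Decode with the JVM default charset (no charset argument given) */
    public static String decodeWithDefault(final byte[] bytes)
    {
        return new String(bytes);
    }

    /** Decode as UTF-8 regardless of the JVM default charset */
    public static String decodeAsUtf8(final byte[] bytes)
    {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        // UTF-8 encoding of the degree sign
        final byte[] degree = { (byte) 0xC2, (byte) 0xB0 };
        System.out.println("Default charset: " + Charset.defaultCharset());
        // One character if the default charset is UTF-8, otherwise possibly two or more
        System.out.println("Default decode length: " + decodeWithDefault(degree).length());
        System.out.println("UTF-8 decode length:   " + decodeAsUtf8(degree).length());
    }
}
```

Running this with and without -Dfile.encoding=UTF-8 on a system whose default is not UTF-8 should show the first length change while the second stays at 1. The lengths are printed instead of the text itself because the console encoding can also affect how "°" is displayed.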

@georgweiss
Collaborator

Sounds reasonable...

@kasemir
Collaborator Author

kasemir commented Mar 2, 2022

It turns out that calling System.setProperty("file.encoding", "UTF8") as the first statement in main does not always work.

When I run this on a Mac, the default charset remains "UTF-8" even though I try selecting UTF-16 via file.encoding:

// Try to compile and run outside an IDE to avoid anything else influencing the default charset:
//
// javac EncodingTest.java 
// java -cp . EncodingTest
import java.nio.charset.Charset;

public class EncodingTest
{
    public static void main(String[] args)
    {
        System.setProperty("file.encoding", "UTF-16");
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Requested file.encoding: " + System.getProperty("file.encoding"));
    }
}

Adding "-Dfile.encoding=..." to the java command line or JAVA_TOOL_OPTIONS does have an effect.
This matches descriptions on https://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding :
".. file.encoding property has to be specified as the JVM starts up; by the time your main method is entered, the character encoding ... has been permanently cached."

I will create a PR that adds -Dfile.encoding=... to the example start scripts and warns in the Launcher if the default charset differs from UTF-8.
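
A startup check along those lines might look roughly like the sketch below. This is not the code from the PR; the class name CharsetCheck, the method checkDefaultCharset and the wording of the message are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CharsetCheck
{
    private static final Logger logger = Logger.getLogger(CharsetCheck.class.getName());

    /** Hypothetical startup check, not the actual Launcher code.
     *  @return true if the default charset is UTF-8, otherwise logs a warning and returns false
     */
    public static boolean checkDefaultCharset()
    {
        final Charset actual = Charset.defaultCharset();
        if (StandardCharsets.UTF_8.equals(actual))
            return true;
        // The property must be set when the JVM starts; setting it later in main has no reliable effect
        logger.log(Level.WARNING,
                   "Default charset is {0}, expected UTF-8. Consider starting the JVM with -Dfile.encoding=UTF-8",
                   actual);
        return false;
    }

    public static void main(String[] args)
    {
        System.out.println("Default charset is UTF-8: " + checkDefaultCharset());
    }
}
```

The check only reports the mismatch. It does not try to change the charset at runtime, since the test above shows that doing so from within main is not dependable.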

@kasemir kasemir mentioned this issue Mar 2, 2022
@kasemir kasemir closed this as completed Mar 29, 2022