Clarification that text-valued variables and attributes can be Unicode `string` or UTF-8 `char` arrays #543

JonathanGregory · 2024-09-17T14:54:41Z

See issue #141 for discussion of these changes.

Release checklist

[NA] Authors updated in cf-conventions.adoc? Add in two places: on line 3 and under .Additional Authors in About the authors.
[NA] Next version in cf-conventions.adoc up to date? Versioning inspired by SemVer.
[Y] history.adoc up to date?
[NA] Conformance document up to date?

…e vlen strings or UTF-8 char arrays

ChrisBarker-NOAA · 2024-09-18T03:57:06Z

This is a challenge! I did a bit more (unsatisfying) research into netcdf, string, and Unicode.

See my comment on #141, but I don't think this is ready to merge :-(

ChrisBarker-NOAA

the only substantial question I have is if we can strengthen "may" to "should" or "must".

(For the UTF-8 encoding part)

maybe split it up:

A text string in a variable or an attribute may be stored in either a variable-length string or a fixed-length char array.
Text (either char arrays or variable-length strings) should be stored as NFC-normalized, UTF-8-encoded Unicode.

ChrisBarker-NOAA · 2024-10-21T18:16:00Z

ch02.adoc

-For example, a character array variable of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
-The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
-If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
+A text string in a variable or an attribute may be represented either as Unicode text in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.


There is no such thing as "Unicode text". I suggest:

A text string in a variable or an attribute may be stored as NFC normalized UTF-8 encoded Unicode data [bytes?] in either a variable-length string or a fixed-length char array.

Question -- should that be "may", rather than "must" or "should" ?
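
A minimal sketch, not part of the PR, of what "NFC normalized UTF-8 encoded" would mean for the bytes actually stored in a file. The helper name `is_nfc_utf8` is hypothetical; it uses only Python's standard `unicodedata` module (`is_normalized` needs Python 3.8 or later).

```python
import unicodedata

def is_nfc_utf8(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 and the decoded text is already in NFC."""
    try:
        text = raw.decode("utf-8")  # strict decoding: raises on invalid byte sequences
    except UnicodeDecodeError:
        return False
    return unicodedata.is_normalized("NFC", text)

composed = "M\u00e4rz".encode("utf-8")     # "März" with a precomposed U+00E4
decomposed = "Ma\u0308rz".encode("utf-8")  # "a" followed by combining diaeresis U+0308
latin1 = "M\u00e4rz".encode("latin-1")     # same text, but not UTF-8 bytes

for label, raw in [("composed", composed), ("decomposed", decomposed), ("latin-1", latin1)]:
    print(label, is_nfc_utf8(raw))
```

Only the first byte string would satisfy the proposed wording; the second is valid UTF-8 but not NFC, and the third is not valid UTF-8 at all.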

ChrisBarker-NOAA · 2024-10-21T18:18:09Z

ch02.adoc

-The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
-If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
+A text string in a variable or an attribute may be represented either as Unicode text in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
+Note that the ASCII one-byte character codes (hexadecimal `00`-`7F`) are a subset of UTF-8.


Note that the ASCII one-byte character codes (decimal 0-127, hexadecimal 00-7F) are a subset of UTF-8.
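
For readers who want to convince themselves of the subset claim, a quick check in Python (an illustrative sketch, standard library only; the variable names are arbitrary):

```python
# Each ASCII code (decimal 0-127) encodes to the same single byte in ASCII and in UTF-8.
ascii_matches_utf8 = all(
    chr(code).encode("utf-8") == chr(code).encode("ascii") == bytes([code])
    for code in range(128)
)
print(ascii_matches_utf8)

# So an ASCII-only string is stored as identical bytes under either encoding.
month = "September"
same_bytes = month.encode("ascii") == month.encode("utf-8")
print(same_bytes)
```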

ChrisBarker-NOAA · 2024-10-21T18:19:40Z

ch02.adoc

+
+Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used.
+If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
+If the data-user has no information about the encoding, we suggest UTF-8 as a first guess.


hmm -- could we say that using anything other than UTF-8 while not specifying the _Encoding is an error? or is that just a free for all anyway :-(

"Before version 1.12, CF did not require text in char arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used."

was there any requirement for strings? Did the NUG specify UTF-8 for strings from the start (of allowing strings...)?

If so, maybe we can specifically require UTF-8 for strings.

NOTE: I'm pretty sure that some tools, e.g. netCDF4-python, do the _Encoding thing for both char arrays and strings, though they default to UTF-8, so maybe any distinction is moot.
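
To illustrate the "UTF-8 as a first guess" advice from the reader's side, here is a hedged sketch in plain Python. It does not use any netCDF library; `decode_text` and its `declared_encoding` parameter are hypothetical and stand in for whatever a file might record (for example in an `_Encoding` attribute, which is not a CF or NUG convention). Marking undecodable bytes with U+FFFD is one possible choice, not something the PR prescribes.

```python
from typing import Optional

def decode_text(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode bytes read from a char array.

    declared_encoding is None when the file records nothing about the encoding.
    """
    if declared_encoding is not None:
        return raw.decode(declared_encoding)
    try:
        return raw.decode("utf-8")  # first guess when nothing is recorded
    except UnicodeDecodeError:
        # The guess was wrong: keep what can be read and mark the bad bytes
        # with U+FFFD so the problem stays visible to the user.
        return raw.decode("utf-8", errors="replace")

utf8_bytes = "M\u00e4rz".encode("utf-8")
latin1_bytes = "M\u00e4rz".encode("latin-1")

print(decode_text(utf8_bytes))               # decodes with the first guess
print(decode_text(latin1_bytes, "latin-1"))  # decodes with the recorded encoding
print(decode_text(latin1_bytes))             # wrong guess: contains U+FFFD
```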

ChrisBarker-NOAA · 2024-10-21T18:27:58Z

ch02.adoc

+If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
+If the data-user has no information about the encoding, we suggest UTF-8 as a first guess.
+
+An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.


The trick with UTF-8 is that it's a multi-byte encoding -- the number of bytes required may be more than the number of characters (code points) in the string. Should we mention that? or buyer beware if you are using non-ASCII code points?

(NOTE: this is why I was hoping we could restrict char arrays to ASCII -- but that boat has sailed :-( -- could we still suggest that?)
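
A minimal sketch, not part of the PR, of the byte-count point: the last dimension of a **`char`** array has to be sized by the longest encoded string, not the longest string in code points. The names `names`, `strlen` and `rows` are illustrative only, and NUL padding is just one of the two padding options the earlier text mentioned.

```python
# "M\u00e4rz" is "März"; the a-umlaut takes two bytes in UTF-8.
names = ["May", "Juni", "M\u00e4rz"]

for name in names:
    n_chars = len(name)                  # number of code points
    n_bytes = len(name.encode("utf-8"))  # number of bytes a char array must hold
    print(f"{name}: {n_chars} code points, {n_bytes} bytes")

# Size the fastest-varying dimension by the longest *encoded* name.
strlen = max(len(name.encode("utf-8")) for name in names)

# Pad each encoded name with trailing NUL bytes so every row has the same width.
rows = [name.encode("utf-8").ljust(strlen, b"\x00") for name in names]
print(strlen, rows)
```

Sizing the dimension from `len(name)` instead would give 4 here and truncate the last name in the middle of a multi-byte character.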

ch02.adoc

history.adoc

ChrisBarker-NOAA · 2024-10-22T16:58:54Z

ch02.adoc

@@ -19,8 +19,9 @@ It is possible to treat the **`byte`** and **`short`** types as unsigned by usin
 In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.

-A text string in a variable or an attribute may be represented either as Unicode text in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
-Note that the ASCII one-byte character codes (hexadecimal `00`-`7F`) are a subset of UTF-8.
+Text strings must be represented in Unicode. Any composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized].


There's still a language issue here -- "represented in Unicode" doesn't have a precise meaning for binary data. In netcdf (and hdf) both string and array of char store bytes. Those bytes must be UTF-8 encoded. Both of these require an encoding.

Netcdf and CF require the UTF-8 encoding.

Also "must be represented in Unicode" is correct, but might be confusing -- some folks think that Unicode means "not ASCII".

How about:

Text strings must be either ASCII or Unicode, encoded as UTF-8, in a variable-length netCDF string or a fixed-length char array. Any composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized]. Note that the one-byte ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their ASCII codes (decimal 0-127, hexadecimal 00-7F).

The ASCII is redundant, but I think it might be helpful for us old timers that still don't quite "get" Unicode -- if someone is currently using ASCII, they can read this and know they are all good, and don't need to figure out what the heck UTF-8 means.

I'm not entirely sure about specifying "composite characters" in:

"Any composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized]"

As a rule, users don't need to think about what is or is not a composite character: they can apply the normalization to all strings, and it will be figured out for them.

But it is correct; I'm just not sure how best to be clear but not confusing.
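
As an illustration of "apply the normalization to all strings", a short Python sketch using the standard `unicodedata` module. The helper name `to_nfc` and the sample strings are hypothetical; the point is that the writer never has to inspect the text for composite characters first.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Normalize any string to NFC; text that is already NFC is returned unchanged."""
    return unicodedata.normalize("NFC", text)

samples = [
    "September",    # plain ASCII
    "caf\u00e9",    # precomposed e-acute (already NFC)
    "cafe\u0301",   # "e" plus combining acute accent (not NFC)
]
normalized = [to_nfc(s) for s in samples]

for before, after in zip(samples, normalized):
    print(len(before), len(after), before == after)
```

After normalization the two spellings of "café" compare equal, and the ASCII string is untouched.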

Clarification that text-valued variables and attributes can be Unicod…

a904814

…e vlen strings or UTF-8 char arrays

JonathanGregory linked an issue Sep 17, 2024 that may be closed by this pull request

Add support for attributes of type string #141

Open

JonathanGregory added this to the 1.12 milestone Sep 17, 2024

ChrisBarker-NOAA mentioned this pull request Sep 18, 2024

Add support for attributes of type string #141

Open

JonathanGregory added 2 commits October 20, 2024 21:24

update

67af9f6

update

0f88787

ChrisBarker-NOAA requested changes Oct 21, 2024

update

181709b

JonathanGregory removed this from the 1.12 milestone Oct 22, 2024

JonathanGregory removed a link to an issue Oct 22, 2024

Add support for attributes of type string #141

Open

ChrisBarker-NOAA reviewed Oct 22, 2024
