Clarification that text-valued variables and attributes can be Unicode string or UTF-8 char arrays #543
base: main
Changes from 3 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,18 +12,24 @@ NetCDF files should have the file name extension "**`.nc`**". | |
|
||
// TODO: Check, should this be a bullet list? | ||
Data variables must be one of the following data types: **`string`**, **`char`**, **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, **`unsigned int64`**, **`float`** or **`real`**, and **`double`** (which are all the link:$$https://docs.unidata.ucar.edu/nug/current/md_types.html$$[netCDF external data types] supported by netCDF-4). | ||
The **`string`** type is only available in files using the netCDF version 4 (netCDF-4) format. | ||
The **`string`** type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format. | ||
The **`char`** and **`string`** types are not intended for numeric data. | ||
One byte numeric data should be stored using the **`byte`** or **`unsigned byte`** data types. | ||
It is possible to treat the **`byte`** and **`short`** types as unsigned by using the NUG convention of indicating the unsigned range using the **`valid_min`**, **`valid_max`**, or **`valid_range`** attributes. | ||
In many situations, any integer type may be used. | ||
When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**. | ||
|
||
Strings in variables may be represented one of two ways - as atomic strings or as character arrays. | ||
An n-dimensional array of strings may be implemented as a variable of type **`string`** with _n_ dimensions, or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable. | ||
For example, a character array variable of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name. | ||
The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled. | ||
If the atomic string option is chosen, each element of the variable can be assigned a string with a different length. | ||
A text string in a variable or an attribute may be represented either as Unicode text in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array. | ||
Note that the ASCII one-byte character codes (hexadecimal `00`-`7F`) are a subset of UTF-8. | ||
Review comment: Note that the ASCII one-byte character codes (decimal 0-127, hexadecimal |
||
|
||
Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used. | ||
If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention. | ||
If the data-user has no information about the encoding, we suggest UTF-8 as a first guess. | ||
Review comment: hmm -- could we say that using anything other than UTF-8 while not specifying the _Encoding is an error? or is that just a free for all anyway :-(
Review comment: "Before version 1.12, CF did not require text in was there any requirement for strings? Did the NUG specify UTF-8 for strings from the start (of allowing strings...)? If so, maybe we can specifically require UTF-* for strings.
NOTE: I'm pretty sure that some tools, e.g. netCDF4-python, does the |
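To illustrate the "UTF-8 as a first guess" suggestion in the proposed text, here is a minimal Python sketch of how a data-user might decode the bytes of one **`char`** array row. The function name `decode_char_bytes`, the `declared_encoding` parameter (standing in for whatever an `_Encoding` attribute might say) and the Latin-1 fallback are assumptions made for this example; none of them is defined by CF or the NUG.

```python
from typing import Optional


def decode_char_bytes(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode one row of a char array (hypothetical helper, not a CF API).

    Trailing NULL and space padding is stripped before decoding. This is safe
    for UTF-8 and single-byte encodings, but not for e.g. UTF-16.
    """
    stripped = raw.rstrip(b"\x00 ")
    if declared_encoding is not None:
        # An encoding was recorded somewhere (for example in _Encoding): trust it.
        return stripped.decode(declared_encoding)
    try:
        # No information available: try UTF-8 first, as the text suggests.
        return stripped.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback for this sketch only: Latin-1 never fails to decode,
        # but the result may not be what the data-producer intended.
        return stripped.decode("latin-1")
```

A strict reader could instead re-raise the `UnicodeDecodeError`, which would correspond to treating undeclared non-UTF-8 text as an error.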
||
|
||
An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable. | ||
Review comment: The trick with UTF-8 is that it's a multi-byte encoding -- the number of bytes required may be more than the number of characters (code points) in the string. Should we mention that? or buyer beware if you are using non-ASCII code points? (NOTE: this is why I was hoping we could restrict char arrays to ASCII -- but that boat has sailed :-( -- could we still suggest that?) |
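The point about byte counts versus code points can be checked with a short Python sketch. The helper name `utf8_width` and the sample strings are illustrative assumptions, not part of the proposed text.

```python
def utf8_width(strings):
    """Smallest last-dimension size (in bytes) that holds every string as UTF-8."""
    return max(len(s.encode("utf-8")) for s in strings)


def codepoint_width(strings):
    """Length of the longest string counted in Unicode code points."""
    return max(len(s) for s in strings)


# Illustrative sample: one ASCII string and one with a non-ASCII letter.
sample = ["May", "März"]

# "März" has 4 code points, but "ä" takes 2 bytes in UTF-8, so 5 bytes are needed.
bytes_needed = utf8_width(sample)
chars_longest = codepoint_width(sample)
```

For pure ASCII text the two widths agree, which is why the month-name example in the diff is unaffected.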
||
For example, a **`char`** variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name. | ||
The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled. | ||
JonathanGregory marked this conversation as resolved.
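As a rough illustration of the (12,9) month-name example in the diff, the following Python sketch builds the twelve fixed-length rows with trailing NULL padding. The helper `to_char_rows` is a hypothetical name for this example, and the rows are plain `bytes` objects rather than an actual netCDF variable.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def to_char_rows(strings, pad=b"\x00"):
    """Encode strings as UTF-8 and pad each one to the longest byte length.

    Returns (rows, width), where every row has exactly `width` bytes.
    `pad` must be a single byte, e.g. b"\x00" (NULL) or b" " (space).
    """
    encoded = [s.encode("utf-8") for s in strings]
    width = max(len(b) for b in encoded)
    rows = [b.ljust(width, pad) for b in encoded]
    return rows, width


rows, width = to_char_rows(MONTHS)
# width is 9 because "September" is the longest name; "May" gets 6 padding bytes.
```

The equivalent **`string`** variable would simply hold the twelve unpadded names.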
|
||
A **`string`** variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length. | ||
The CDL example below shows one variable of each type. | ||
|
||
[[char-and-string-variables-ex]] | ||
|
Review comment: There is no such thing as "Unicode text". I suggest:
A text string in a variable or an attribute may be stored as NFC normalized UTF-8 encoded Unicode data [bytes?] in either a variable-length `string` or a fixed-length `char` array.
Review comment: Question -- should that be "may", rather than "must" or "should"?
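For readers unfamiliar with the NFC normalization mentioned in the suggestion above, a small Python sketch using the standard-library `unicodedata` module shows its effect. The sample string is an assumption chosen for illustration; whether CF should require NFC is the open question in this thread, not something this example settles.

```python
import unicodedata

# "é" written as a base letter "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "Ame\u0301lie"

# NFC composes the pair into the single code point U+00E9 where one exists.
composed = unicodedata.normalize("NFC", decomposed)

# Both forms render the same but differ in code points and in UTF-8 byte length,
# so they would compare unequal and need different char array widths.
decomposed_bytes = decomposed.encode("utf-8")
composed_bytes = composed.encode("utf-8")
```

Normalizing before writing would make string comparisons and the required last-dimension size independent of how the text was typed.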