cf-convention · JonathanGregory · Sep 17, 2024 · Oct 20, 2024 · Oct 20, 2024 · Oct 22, 2024
diff --git a/ch02.adoc b/ch02.adoc
@@ -12,18 +12,24 @@ NetCDF files should have the file name extension "**`.nc`**".
 
 // TODO: Check, should this be a bullet list?
 Data variables must be one of the following data types: **`string`**, **`char`**, **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, **`unsigned int64`**, **`float`** or **`real`**, and **`double`** (which are all the link:$$https://docs.unidata.ucar.edu/nug/current/md_types.html$$[netCDF external data types] supported by netCDF-4).
-The **`string`** type is only available in files using the netCDF version 4 (netCDF-4) format.
+The **`string`** type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format.
 The **`char`** and **`string`** types are not intended for numeric data.
 One byte numeric data should be stored using the **`byte`** or **`unsigned byte`** data types.
 It is possible to treat the **`byte`** and **`short`** types as unsigned by using the NUG convention of indicating the unsigned range using the **`valid_min`**, **`valid_max`**, or **`valid_range`** attributes.
 In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.
 
-Strings in variables may be represented one of two ways - as atomic strings or as character arrays.
-An n-dimensional array of strings may be implemented as a variable of type **`string`** with _n_ dimensions, or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
-For example, a character array variable of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
-The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
-If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
+A text string in a variable or an attribute may be represented either as Unicode text in a variable-length **`string`** or as UTF-8 encoded text in a fixed-length **`char`** array.
+Note that the ASCII one-byte character codes (hexadecimal `00`-`7F`) are a subset of UTF-8.
+
+Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used.
+If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
+If the data user has no information about the encoding, we suggest trying UTF-8 as a first guess.
+
+An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
+For example, a **`char`** variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
+The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled.
+A **`string`** variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length.
 The CDL example below shows one variable of each type.
 
 [[char-and-string-variables-ex]]
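
As an illustration of the **`char`** layout described in the hunk above, the following Python sketch packs the month names into fixed-width rows and recovers them again. It is not part of the patch: the helper names `pack_char_array` and `unpack_char_array` are hypothetical, NULL padding is chosen arbitrarily from the two padding options the text allows, and the sketch assumes the length of the last dimension is counted in bytes of the UTF-8 encoding.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def pack_char_array(strings, pad=b"\x00"):
    """Encode each string as UTF-8 and pad it to the longest encoded length.

    Returns (rows, width), where rows is a list of equal-length bytes
    objects and width is the size of the most rapidly varying dimension.
    """
    encoded = [s.encode("utf-8") for s in strings]
    width = max(len(b) for b in encoded)
    return [b.ljust(width, pad) for b in encoded], width


def unpack_char_array(rows):
    """Strip trailing NULL or space padding and decode each row as UTF-8.

    A string that legitimately ends in a space would lose it here.
    """
    return [row.rstrip(b"\x00 ").decode("utf-8") for row in rows]


rows, width = pack_char_array(MONTHS)
print(len(rows), width)  # shape of the equivalent char variable
print(unpack_char_array(rows) == MONTHS)
```

Because the width is measured in encoded bytes under this assumption, a non-ASCII string can need more bytes than it has characters: `pack_char_array(["Zürich"])` gives a width of 7 for a six-character name.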

diff --git a/history.adoc b/history.adoc
@@ -7,6 +7,7 @@
 
 === Working version (most recent first)
 
+* {issues}141[Issue #141]: Clarification that text-valued variables and attributes can be Unicode vlen strings or UTF-8 char arrays.
 * {issues}403[Issue #403]: Metadata to encode quantization properties
 * {issues}530[Issue #530]: Define "the most rapidly varying dimension", and use this phrase consistently with the clarification "(the last dimension in CDL order)".
 * {issues}163[Issue #163]: Provide a convention for boundary variables for grids whose cells do not all have the same number of sides.