layout | toc_group | link_title | permalink |
---|---|---|---|
docs | truffle | Truffle Strings Guide | /graalvm-as-a-platform/language-implementation-framework/TruffleStrings/ |
Truffle Strings is Truffle's primitive String type, which can be shared between languages. Language implementers are encouraged to use Truffle Strings as their language's string type for easier interoperability and better performance.
TruffleString supports a plethora of string encodings, but is especially optimized for the most commonly used:
- UTF-8
- UTF-16
- UTF-32
- US-ASCII
- ISO-8859-1
- BYTES
All operations exposed by TruffleString are provided as an inner Node, and as static or instance methods.
Users should use the provided nodes where possible, as the static/instance methods are just shorthands for executing their respective node's uncached version.
All nodes are named {NameOfOperation}Node, and all convenience methods are named {nameOfOperation}Uncached.
Some operations support lazy evaluation, such as lazy concatenation or lazy evaluation of certain string properties.
Most of these operations provide a parameter boolean lazy, which allows the user to enable or disable lazy evaluation on a per-callsite basis.
Operations dealing with index values, such as CodePointAtIndex, are available in two variants: codepoint-based indexing and byte-based indexing.
Byte-based indexing is indicated by the ByteIndex suffix or prefix in an operation's name; otherwise, indices are based on codepoints.
For example, the index parameter of CodePointAtIndex is codepoint-based, whereas CodePointAtByteIndex uses a byte-based index.
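To make the difference between the two index variants concrete, the following is a minimal JDK-only sketch. It does not use the TruffleString API: the class IndexVariantsSketch and its helper methods are hypothetical names introduced for illustration, and the input is assumed to be a well-formed string. It translates a codepoint-based index into the corresponding byte-based index of the string's UTF-8 form.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the TruffleString API.
final class IndexVariantsSketch {

    /** Number of bytes UTF-8 needs to encode one codepoint. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) {
            return 1;
        } else if (codePoint < 0x800) {
            return 2;
        } else if (codePoint < 0x10000) {
            return 3;
        }
        return 4;
    }

    /** Translates a codepoint-based index into a byte-based index for the UTF-8 form of s. */
    static int codePointIndexToByteIndex(String s, int codePointIndex) {
        int byteIndex = 0;
        int charIndex = 0;
        for (int i = 0; i < codePointIndex; i++) {
            int cp = s.codePointAt(charIndex);
            byteIndex += utf8Length(cp);
            charIndex += Character.charCount(cp);
        }
        return byteIndex;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), 'z' (1 byte)
        String s = "a\u00e9\u20acz";
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 7
        System.out.println(codePointIndexToByteIndex(s, 3));           // 6
    }
}
```

In this sketch, the codepoint 'z' sits at codepoint index 3 but at byte index 6, which is why the two index variants must not be mixed on variable-width encodings.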
The currently available operations are listed below, grouped by category.
Creating a new TruffleString:
- FromCodePoint: Create a new TruffleString from a given codepoint.
- FromLong: Create a new TruffleString from a given long value.
- FromByteArray: Create a new TruffleString from a given byte array.
- FromCharArrayUTF16: Create a UTF-16 TruffleString from a given char array.
- FromIntArrayUTF32: Create a UTF-32 TruffleString from a given int array.
- FromJavaString: Create a TruffleString from a given java.lang.String.
- FromNativePointer: Create a new TruffleString from a given native pointer.
- Encoding#getEmpty: Get an empty TruffleString in that encoding.
Query string properties:
- isEmpty: Check if a string is empty.
- CodePointLength: Get a string's length in codepoints.
- byteLength: Get a string's length in bytes.
- IsValid: Check whether a string is encoded correctly.
- GetCodeRange: Get coarse information about the string's content (are all codepoints in this string from the ASCII/LATIN-1/BMP range?).
- GetByteCodeRange: Get coarse information about the string's content, without taking 16/32-bit based encodings into account.
- CodeRangeEquals: Check whether a string's code range equals the given code range.
- isCompatibleTo: Check if a string is compatible to / can be viewed in a given encoding.
- isManaged: Check if a string is not backed by a native pointer.
- isNative: Check if a string is backed by a native pointer.
- isImmutable: Check if a string is an instance of TruffleString.
- isMutable: Check if a string is an instance of MutableTruffleString.
Comparison:
- Equal: Check if two strings are equal. Note that this operation is encoding-sensitive!
- RegionEqual: Check if two strings are equal in a given region defined by a codepoint-based offset and length.
- RegionEqualByteIndex: Check if two strings are equal in a given region defined by a byte-based offset and length.
- CompareBytes: Compare two strings byte-by-byte.
- CompareCharsUTF16: Compare two UTF-16 strings char-by-char.
- CompareIntsUTF32: Compare two UTF-32 strings int-by-int.
- HashCode: Get a string's hash code. The hash code is based on the string's bytes, so strings with the same codepoints but different encodings may have different hash codes.
Conversion:
- SwitchEncoding: Convert a string to a given encoding.
- ForceEncoding: Create a string containing the same bytes as the given string, but assigned to the given encoding.
- AsTruffleString: Convert a MutableTruffleString to an immutable TruffleString.
- AsManaged: Convert a TruffleString backed by a native pointer to one backed by a java byte array.
- CopyToByteArray: Copy a string's content into a byte array.
- GetInternalByteArray: Get a string's internal byte array.
- CopyToNativeMemory: Copy a string's content into a native pointer.
- GetInternalNativePointer: Get a native string's pointer object.
- ToJavaString: Convert a string to a java.lang.String.
- ParseInt: Parse a string's content as an int value.
- ParseLong: Parse a string's content as a long value.
- ParseDouble: Parse a string's content as a double value.
Accessing codepoints and bytes:
- Materialize: Use this node to avoid materialization code inside loops iterating over a string's code points or bytes.
- ReadByte: Read a single byte from a string.
- ReadCharUTF16: Read a single char from a UTF-16 string.
- CodePointAtIndex: Read a single codepoint from a string at a given codepoint-based index.
- CodePointAtByteIndex: Read a single codepoint from a string at a given byte-based index.
- CreateCodePointIterator: Return a TruffleStringIterator object suitable for iterating the string's codepoints.
- CreateBackwardCodePointIterator: Return a TruffleStringIterator object suitable for iterating the string's codepoints, starting from the end of the string.
- ByteLengthOfCodePoint: Return the number of bytes occupied by the codepoint starting at a given byte index.
- CodePointIndexToByteIndex: Convert a given codepoint index to a byte index on a given string.
- ByteIndexToCodePointIndex: Convert a given byte index to a codepoint index on a given string.
Search:
- ByteIndexOfAnyByte: Find the first occurrence of any of a set of given bytes in a string and return its byte-based index.
- CharIndexOfAnyCharUTF16: Find the first occurrence of any of a set of given chars in a UTF-16 string and return its char-based index.
- IntIndexOfAnyIntUTF32: Find the first occurrence of any of a set of given ints in a UTF-32 string and return its int-based index.
- IndexOfCodePoint: Find the first occurrence of a given codepoint in a string and return its codepoint-based index.
- ByteIndexOfCodePoint: Find the first occurrence of a given codepoint in a string and return its byte-based index.
- LastIndexOfCodePoint: Find the last occurrence of a given codepoint in a string and return its codepoint-based index.
- LastByteIndexOfCodePoint: Find the last occurrence of a given codepoint in a string and return its byte-based index.
- IndexOfString: Find the first occurrence of a given substring in a string and return its codepoint-based index.
- ByteIndexOfString: Find the first occurrence of a given substring in a string and return its byte-based index.
- LastIndexOfString: Find the last occurrence of a given substring in a string and return its codepoint-based index.
- LastByteIndexOfString: Find the last occurrence of a given substring in a string and return its byte-based index.
Combining:
- Concat: Concatenate two strings.
- Substring: Create a substring from a given string, bounded by a codepoint-based offset and length.
- SubstringByteIndex: Create a substring from a given string, bounded by a byte-based offset and length.
- Repeat: Repeat a given string n times.
A TruffleString can be created from a codepoint, a number, a primitive array or a java.lang.String.
Strings of any encoding can be created with TruffleString.FromByteArrayNode, which expects a byte array containing the already encoded string.
This operation can be non-copying, by setting the copy parameter to false.
Important: TruffleString will assume the array content to be immutable; do not modify the array after passing it to the non-copying variant of this operation.
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static TruffleString someSpecialization(
                    @Cached TruffleString.FromByteArrayNode fromByteArrayNode) {
        byte[] array = {'a', 'b', 'c'};
        // copy = false: the string shares the array, so the array must not be modified afterwards
        return fromByteArrayNode.execute(array, 0, array.length, TruffleString.Encoding.UTF_8, false);
    }
}
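The immutability requirement of the non-copying variant can be illustrated without the Truffle API. The sketch below is an analogy only: it uses the JDK's ByteBuffer.wrap, which also shares the given array rather than copying it, and the class name NonCopyingSketch is hypothetical. It shows how a later write to the array becomes visible through a shared view, but not through a copy.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration class; ByteBuffer stands in for a non-copying string.
final class NonCopyingSketch {

    /** Wraps an array (optionally copying it first), mutates the original, then reads through the wrapper. */
    static char firstCharAfterMutation(boolean copy) {
        byte[] array = {'a', 'b', 'c'};
        byte[] backing = copy ? Arrays.copyOf(array, array.length) : array;
        ByteBuffer view = ByteBuffer.wrap(backing); // wrap() never copies its argument
        array[0] = 'x';                             // modification after hand-over
        return (char) view.get(0);
    }

    public static void main(String[] args) {
        System.out.println(firstCharAfterMutation(false)); // x (shared array was changed underneath the view)
        System.out.println(firstCharAfterMutation(true));  // a (the copy is unaffected)
    }
}
```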
For easier creation of UTF-16 and UTF-32 strings independent of the system's endianness, TruffleString provides TruffleString.FromCharArrayUTF16Node and TruffleString.FromIntArrayUTF32Node.
A TruffleString may also be created via TruffleStringBuilder, which is TruffleString's equivalent to java.lang.StringBuilder.
TruffleStringBuilder provides the following operations:
- AppendByte: Append a single byte to a string builder.
- AppendCharUTF16: Append a single char to a UTF-16 string builder.
- AppendCodePoint: Append a single codepoint to a string builder.
- AppendIntNumber: Append an integer number to a string builder.
- AppendLongNumber: Append a long number to a string builder.
- AppendString: Append a TruffleString to a string builder.
- AppendSubstringByteIndex: Append a substring, defined by a byte-based offset and length, to a string builder.
- AppendJavaStringUTF16: Append a Java String substring, defined by a char-based offset and length, to a string builder.
- ToString: Create a new TruffleString from a string builder.
See the example below:
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;
import com.oracle.truffle.api.strings.TruffleStringBuilder;

abstract static class SomeNode extends Node {

    @Specialization
    static TruffleString someSpecialization(
                    @Cached TruffleStringBuilder.AppendCharUTF16Node appendCharNode,
                    @Cached TruffleStringBuilder.AppendJavaStringUTF16Node appendJavaStringNode,
                    @Cached TruffleStringBuilder.AppendIntNumberNode appendIntNumberNode,
                    @Cached TruffleStringBuilder.AppendStringNode appendStringNode,
                    @Cached TruffleString.FromCharArrayUTF16Node fromCharArrayUTF16Node,
                    @Cached TruffleStringBuilder.AppendCodePointNode appendCodePointNode,
                    @Cached TruffleStringBuilder.ToStringNode toStringNode) {
        TruffleStringBuilder sb = TruffleStringBuilder.create(TruffleString.Encoding.UTF_16);
        sb = appendCharNode.execute(sb, 'a');
        // appends "bc": two chars starting at char index 1
        sb = appendJavaStringNode.execute(sb, "abc", /* fromIndex: */ 1, /* length: */ 2);
        sb = appendIntNumberNode.execute(sb, 123);
        TruffleString string = fromCharArrayUTF16Node.execute(new char[]{'x', 'y'}, /* fromIndex: */ 0, /* length: */ 2);
        sb = appendStringNode.execute(sb, string);
        sb = appendCodePointNode.execute(sb, 'z');
        return toStringNode.execute(sb); // string content: "abc123xyz"
    }
}
Every TruffleString is encoded in a specific internal encoding, which is set during instantiation.
TruffleString is fully optimized for the following encodings:
- UTF-8
- UTF-16
- UTF-32
- US-ASCII
- ISO-8859-1
- BYTES
Many other encodings are supported, but not fully optimized.
To use them, they must be enabled by setting needsAllEncodings = true in the Truffle language registration.
A TruffleString's internal encoding is not exposed.
Instead of querying a string's encoding, languages should pass an expectedEncoding parameter to all methods where the string's encoding matters (which is almost all operations).
This allows re-using string objects when converting between encodings, if a string is byte-equivalent in both encodings.
A string can be converted to a different encoding using SwitchEncodingNode, as shown in the following example:
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
                    @Cached TruffleString.FromJavaStringNode fromJavaStringNode,
                    @Cached TruffleString.ReadByteNode readByteNode,
                    @Cached TruffleString.SwitchEncodingNode switchEncodingNode,
                    @Cached TruffleString.ReadByteNode utf8ReadByteNode) {
        // instantiate a new UTF-16 string
        TruffleString utf16String = fromJavaStringNode.execute("foo", TruffleString.Encoding.UTF_16);
        // read a byte with expectedEncoding = UTF-16.
        // if the string is not byte-compatible with UTF-16, this method will throw an IllegalArgumentException
        System.out.printf("%x%n", readByteNode.execute(utf16String, /* byteIndex */ 0, TruffleString.Encoding.UTF_16));
        // convert to UTF-8.
        // note that utf8String may be reference-equal to utf16String!
        TruffleString utf8String = switchEncodingNode.execute(utf16String, TruffleString.Encoding.UTF_8);
        // read a byte with expectedEncoding = UTF-8
        // if the string is not byte-compatible with UTF-8, this method will throw an IllegalArgumentException
        System.out.printf("%x%n", utf8ReadByteNode.execute(utf8String, /* byteIndex */ 0, TruffleString.Encoding.UTF_8));
    }
}
For UTF-16 and UTF-32, byte-equivalence between encodings takes string compaction into account: for example, a compacted UTF-16 string is byte-equivalent to ISO-8859-1, and if all of its characters are in the ASCII range (see CodeRange), it is also byte-equivalent to UTF-8.
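The idea of byte-equivalence can be checked with plain JDK charsets. The following is a rough sketch, not the TruffleString implementation: the class ByteEquivalenceSketch is a hypothetical name, and java.lang.String#getBytes stands in for the string's encoded forms. It compares the ISO-8859-1 and UTF-8 byte sequences of the same text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of the TruffleString API.
final class ByteEquivalenceSketch {

    /** Returns true if the text has identical byte sequences in ISO-8859-1 and UTF-8. */
    static boolean sameBytesInLatin1AndUtf8(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // ASCII-only content: both encodings produce the same bytes
        System.out.println(sameBytesInLatin1AndUtf8("foo"));       // true
        // U+00E9 is one byte in ISO-8859-1 but two bytes in UTF-8
        System.out.println(sameBytesInLatin1AndUtf8("caf\u00e9")); // false
    }
}
```

Only in the first case could a string object be re-used across both encodings without transcoding.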
To check if your code is switching encodings properly, run your unit tests with the system property truffle.strings.debug-strict-encoding-checks=true.
This disables re-using string objects when switching encodings, and makes encoding checks more strict: all operations working on a single string will enforce an exact match, whereas operations working on two strings will still allow byte-equivalent re-interpretations.
All TruffleString operations with more than one string parameter require the strings to be in an encoding compatible with the result encoding.
So either the strings need to be in the same encoding, or the caller must ensure that both strings are compatible with the resulting encoding.
This enables callers that already know the SwitchEncodingNodes would be no-ops to skip them for footprint reasons.
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static boolean someSpecialization(
                    TruffleString a,
                    TruffleString b,
                    @Cached TruffleString.SwitchEncodingNode switchEncodingNodeA,
                    @Cached TruffleString.SwitchEncodingNode switchEncodingNodeB,
                    @Cached TruffleString.EqualNode equalNode) {
        // convert both strings to a common encoding before comparing them
        TruffleString utf8A = switchEncodingNodeA.execute(a, TruffleString.Encoding.UTF_8);
        TruffleString utf8B = switchEncodingNodeB.execute(b, TruffleString.Encoding.UTF_8);
        return equalNode.execute(utf8A, utf8B, TruffleString.Encoding.UTF_8);
    }
}
TruffleString exposes the following properties:
- byteLength: The string's length in bytes, exposed via the byteLength method.
- codePointLength: The string's length in codepoints, exposed via CodePointLengthNode.
- isValid: Can be queried via IsValidNode to check whether the string is encoded correctly.
- codeRange: Provides coarse information about the string's content, exposed via GetCodeRangeNode. This property can have the following values:
  - ASCII: All codepoints in this string are part of the Basic Latin Unicode block, also known as ASCII (0x00 - 0x7f).
  - LATIN-1: All codepoints in this string are part of the ISO-8859-1 character set (0x00 - 0xff), which is equivalent to the union of the Basic Latin and the Latin-1 Supplement Unicode block. At least one codepoint in the string is greater than 0x7f. Only applicable to ISO-8859-1, UTF-16 and UTF-32.
  - BMP: All codepoints in this string are part of the Unicode Basic Multilingual Plane (BMP) (0x0000 - 0xffff). At least one codepoint in the string is greater than 0xff. Only applicable to UTF-16 and UTF-32.
  - VALID: This string is encoded correctly, and contains at least one codepoint outside the other applicable code ranges (e.g. for UTF-8, this means there is one codepoint outside the ASCII range, and for UTF-16 this means that there is one codepoint outside the BMP range).
  - BROKEN: This string is not encoded correctly. No further information about its contents can be determined.
- hashCode: The string's hash code, exposed via HashCodeNode. The hash code is dependent on the string's encoding; strings must always be converted to a common encoding before comparing their hash codes!
The following example shows how to query all properties exposed by TruffleString:
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    // return type is void: the specialization only prints the properties
    @Specialization
    static void someSpecialization(
                    TruffleString string,
                    @Cached TruffleString.CodePointLengthNode codePointLengthNode,
                    @Cached TruffleString.IsValidNode isValidNode,
                    @Cached TruffleString.GetCodeRangeNode getCodeRangeNode,
                    @Cached TruffleString.HashCodeNode hashCodeNode) {
        System.out.println("byte length: " + string.byteLength(TruffleString.Encoding.UTF_8));
        System.out.println("codepoint length: " + codePointLengthNode.execute(string, TruffleString.Encoding.UTF_8));
        System.out.println("is valid: " + isValidNode.execute(string));
        System.out.println("code range: " + getCodeRangeNode.execute(string));
        System.out.println("hash code: " + hashCodeNode.execute(string, TruffleString.Encoding.UTF_8));
    }
}
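The code range values listed for the codeRange property can be mimicked for UTF-16 content with a short JDK-only classifier. This is a sketch under stated assumptions, not how TruffleString computes the property: the class CodeRangeSketch, its Range enum and the method classifyUtf16 are hypothetical names, and the input is a java.lang.String treated as a sequence of UTF-16 chars.

```java
// Hypothetical illustration class, not part of the TruffleString API.
final class CodeRangeSketch {

    /** Ordered from the narrowest range to the widest; BROKEN overrides everything. */
    enum Range { ASCII, LATIN_1, BMP, VALID, BROKEN }

    /** Classifies UTF-16 content following the value descriptions of the codeRange property. */
    static Range classifyUtf16(String s) {
        Range result = Range.ASCII;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Range current;
            if (Character.isHighSurrogate(c) && i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                current = Range.VALID; // well-formed surrogate pair: codepoint outside the BMP
                i++;                   // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                return Range.BROKEN;   // unpaired surrogate: not correctly encoded
            } else if (c <= 0x7f) {
                current = Range.ASCII;
            } else if (c <= 0xff) {
                current = Range.LATIN_1;
            } else {
                current = Range.BMP;
            }
            if (current.ordinal() > result.ordinal()) {
                result = current;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classifyUtf16("abc"));          // ASCII
        System.out.println(classifyUtf16("caf\u00e9"));    // LATIN_1
        System.out.println(classifyUtf16("\u20ac"));       // BMP
        System.out.println(classifyUtf16("\ud83d\ude00")); // VALID
        System.out.println(classifyUtf16("\ud83d"));       // BROKEN
    }
}
```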
TruffleString objects should be checked for equality using EqualNode.
Just like HashCodeNode, the equality comparison is sensitive to the string's encoding, so before any comparison, strings should always be converted to a common encoding. Object#equals(Object) behaves analogously to EqualNode, but since this method does not have an expectedEncoding parameter, it will determine the strings' common encoding automatically.
If the strings' encodings are not equal, TruffleString will check whether one string is binary-compatible to the other string's encoding, and if so, match their content. Otherwise, the strings are deemed not equal; no automatic conversion is applied.
Note that since TruffleString's hashCode and equals methods are sensitive to string encoding, TruffleString objects must always be converted to a common encoding before, e.g., using them as keys in a HashMap.
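Why a common encoding matters for map keys can be shown with a JDK-only analogy. In the sketch below, ByteBuffer (whose equals and hashCode are based on its byte content) stands in for an encoding-sensitive string; the class EncodingSensitiveKeysSketch and its methods are hypothetical names and do not use the TruffleString API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; ByteBuffer keys compare by byte content.
final class EncodingSensitiveKeysSketch {

    /** Builds a map key from the encoded bytes of the given text. */
    static ByteBuffer key(String text, Charset encoding) {
        return ByteBuffer.wrap(text.getBytes(encoding));
    }

    /** Stores U+00E9 under one encoding and looks it up under another. */
    static boolean lookup(Charset storedAs, Charset queriedAs) {
        Map<ByteBuffer, String> map = new HashMap<>();
        map.put(key("\u00e9", storedAs), "value");
        return map.containsKey(key("\u00e9", queriedAs));
    }

    public static void main(String[] args) {
        // same codepoint, different bytes: the lookup misses
        System.out.println(lookup(StandardCharsets.UTF_8, StandardCharsets.UTF_16LE)); // false
        // both sides converted to a common encoding: the lookup hits
        System.out.println(lookup(StandardCharsets.UTF_8, StandardCharsets.UTF_8));    // true
    }
}
```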
TruffleString also provides three comparison nodes, CompareBytesNode, CompareCharsUTF16Node, and CompareIntsUTF32Node, to compare strings byte-by-byte, char-by-char, and int-by-int, respectively.
Concatenation is done via ConcatNode.
This operation requires both strings to be in expectedEncoding, which is also the encoding of the resulting string. Lazy concatenation is supported via the lazy parameter.
When two strings are concatenated lazily, the allocation and initialization of the new string's internal array is delayed until another operation requires direct access to that array.
Materialization of such "lazy concatenation strings" can be triggered explicitly with a MaterializeNode.
This is useful to do before accessing a string in a loop, such as in the following example:
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    // return type is void: the specialization only prints the string's bytes
    @Specialization
    static void someSpecialization(
                    TruffleString utf8StringA,
                    TruffleString utf8StringB,
                    @Cached TruffleString.ConcatNode concatNode,
                    @Cached TruffleString.MaterializeNode materializeNode,
                    @Cached TruffleString.ReadByteNode readByteNode) {
        // lazy concatenation
        TruffleString lazyConcatenated = concatNode.execute(utf8StringA, utf8StringB, TruffleString.Encoding.UTF_8, /* lazy */ true);
        // explicit materialization
        TruffleString materialized = materializeNode.execute(lazyConcatenated, TruffleString.Encoding.UTF_8);
        int byteLength = materialized.byteLength(TruffleString.Encoding.UTF_8);
        for (int i = 0; i < byteLength; i++) {
            // string is guaranteed to be materialized here, so no slow materialization code can end up in this loop
            System.out.printf("%x%n", readByteNode.execute(materialized, i, TruffleString.Encoding.UTF_8));
        }
    }
}
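The delayed allocation behind lazy concatenation can be pictured with a small stand-alone model. The sketch below is only a conceptual illustration, assuming a simple two-part structure; LazyConcatSketch and its members are hypothetical names and say nothing about how TruffleString represents lazy strings internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the TruffleString API.
final class LazyConcatSketch {
    private final byte[] left;
    private final byte[] right;
    private byte[] flat; // stays null until some operation needs the combined array

    LazyConcatSketch(byte[] left, byte[] right) {
        this.left = left;
        this.right = right;
    }

    /** The length is known without allocating the combined array. */
    int byteLength() {
        return left.length + right.length;
    }

    boolean isMaterialized() {
        return flat != null;
    }

    /** Allocates and fills the combined array on first use, then re-uses it. */
    byte[] materialize() {
        if (flat == null) {
            flat = new byte[left.length + right.length];
            System.arraycopy(left, 0, flat, 0, left.length);
            System.arraycopy(right, 0, flat, left.length, right.length);
        }
        return flat;
    }

    public static void main(String[] args) {
        LazyConcatSketch s = new LazyConcatSketch(
                        "foo".getBytes(StandardCharsets.UTF_8), "bar".getBytes(StandardCharsets.UTF_8));
        System.out.println(s.isMaterialized()); // false
        // materialize once, outside of any loop over the content
        byte[] bytes = s.materialize();
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // foobar
    }
}
```

Calling materialize() before a loop keeps the allocation and copy out of the loop body, which is the same motivation as for MaterializeNode.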
Substrings can be created via SubstringNode and SubstringByteIndexNode, which use codepoint-based and byte-based indices, respectively.
Substrings can also be lazy, meaning that no new array is created for the resulting string, but instead the parent string's array is re-used and just accessed with the offset and length passed to the substring node.
Currently, a lazy substring's internal array is never trimmed (i.e. replaced by a new array of the string's exact length).
Note that this behavior effectively creates a memory leak whenever a lazy substring is created.
An extreme example where this could be problematic: given a string that is 100 megabytes in size, any lazy substring created from this string will keep the 100 megabyte array alive, even when the original string is freed by the garbage collector.
Use lazy substrings with caution.
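The memory retention caused by lazy substrings can be made visible with a tiny model. The sketch below assumes a view consisting of an array reference, an offset and a length; SubstringRetentionSketch and its methods are hypothetical names used for illustration only.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of the TruffleString API.
final class SubstringRetentionSketch {
    private final byte[] array;
    private final int offset;
    private final int length;

    private SubstringRetentionSketch(byte[] array, int offset, int length) {
        this.array = array;
        this.offset = offset;
        this.length = length;
    }

    /** Lazy variant: shares the parent's array and only records offset and length. */
    static SubstringRetentionSketch lazySubstring(byte[] parent, int offset, int length) {
        return new SubstringRetentionSketch(parent, offset, length);
    }

    /** Copying variant: allocates a new array of exactly the requested length. */
    static SubstringRetentionSketch copyingSubstring(byte[] parent, int offset, int length) {
        return new SubstringRetentionSketch(Arrays.copyOfRange(parent, offset, offset + length), 0, length);
    }

    int byteLength() {
        return length;
    }

    /** Size of the array that this substring keeps reachable. */
    int retainedBytes() {
        return array.length;
    }

    public static void main(String[] args) {
        byte[] parent = new byte[1_000_000];
        System.out.println(lazySubstring(parent, 10, 3).retainedBytes());    // 1000000
        System.out.println(copyingSubstring(parent, 10, 3).retainedBytes()); // 3
    }
}
```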
TruffleString provides FromJavaStringNode for converting a java.lang.String to TruffleString.
To convert from TruffleString to java.lang.String, use a ToJavaStringNode.
This node will internally convert the string to UTF-16, if necessary, and create a java.lang.String from that representation.
Object#toString() is implemented using the uncached version of ToJavaStringNode and should be avoided on fast paths.
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
                    @Cached TruffleString.FromJavaStringNode fromJavaStringNode,
                    @Cached TruffleString.SwitchEncodingNode switchEncodingNode,
                    @Cached TruffleString.ToJavaStringNode toJavaStringNode) {
        TruffleString utf16String = fromJavaStringNode.execute("foo", TruffleString.Encoding.UTF_16);
        TruffleString utf8String = switchEncodingNode.execute(utf16String, TruffleString.Encoding.UTF_8);
        // the node converts back to UTF-16 internally if necessary
        System.out.println(toJavaStringNode.execute(utf8String));
    }
}
TruffleString also exposes #toStringDebug() for debugging purposes.
Do not use this method for anything other than debugging, as its return value is unspecified and may change at any time.
The following items should be considered when switching from java.lang.String to TruffleString:
- The static overhead of TruffleString instances is larger than that of java.lang.String objects. A TruffleString object contains 2 pointer fields, 4 int fields and 4 byte fields, which will usually result in a total object size of 40 bytes (object header of 12 bytes, 4 bytes per pointer with compressed oops, 8-byte memory alignment). A java.lang.String object contains one pointer field, one int field and one byte field, which under the same conditions results in a total object size of 24 bytes. This difference in memory footprint may negatively impact some cases where lots of small strings are generated.
- TruffleString does string compaction just like java.lang.String.
- If your language needs to convert strings to other encodings, e.g. UTF-8, which is very common in web applications, TruffleString can turn this operation into a no-op if the string does not contain special characters. For example, ASCII-only strings can be re-interpreted as almost any encoding, and converting an ASCII-only UTF-16 string to UTF-8 is a no-op. In cases where transcoding a string is unavoidable, TruffleString will cache the transcoded string in the original string, so transcoding is only done once per string and encoding.
- In order to use 3rd party libraries, TruffleString objects will have to be converted to java.lang.String and back. In order to make this as cheap as possible, TruffleString re-uses Java String's internal byte arrays when converting from java.lang.String to TruffleString, and caches Java Strings created from TruffleString objects in the object itself.
- TruffleString offers additional features not present in java.lang.String:
  - Lazy concatenation and string views, which can significantly decrease the amount of array-copy operations your language may have to do.
  - String views into native memory, completely avoiding the need to copy native memory into Java arrays before using it.
  - String content classification via the codeRange property, which allows specializations on strings that are ASCII-only et cetera. This can reduce the complexity of some string operations significantly.
- The performance of all TruffleString operations should be on par with or better than their java.lang.String counterparts.
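The object sizes quoted in the first item of the list above follow from simple arithmetic. The sketch below reproduces it under the conditions stated there (12-byte object header, 4-byte compressed pointers, 8-byte alignment); the class ObjectSizeSketch and its methods are hypothetical names, and the results are estimates rather than measurements.

```java
// Hypothetical illustration class; reproduces the size estimate, it does not measure real objects.
final class ObjectSizeSketch {
    static final int HEADER_BYTES = 12;
    static final int POINTER_BYTES = 4; // with compressed oops
    static final int INT_BYTES = 4;
    static final int BYTE_BYTES = 1;
    static final int ALIGNMENT = 8;

    /** Rounds size up to the next multiple of alignment. */
    static int alignUp(int size, int alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    /** 2 pointer fields, 4 int fields and 4 byte fields. */
    static int truffleStringSize() {
        return alignUp(HEADER_BYTES + 2 * POINTER_BYTES + 4 * INT_BYTES + 4 * BYTE_BYTES, ALIGNMENT);
    }

    /** 1 pointer field, 1 int field and 1 byte field. */
    static int javaStringSize() {
        return alignUp(HEADER_BYTES + POINTER_BYTES + INT_BYTES + BYTE_BYTES, ALIGNMENT);
    }

    public static void main(String[] args) {
        System.out.println(truffleStringSize()); // 40 (12 + 8 + 16 + 4 = 40, already aligned)
        System.out.println(javaStringSize());    // 24 (12 + 4 + 4 + 1 = 21, aligned up to 24)
    }
}
```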
TruffleString provides TruffleStringIterator as a means of iterating over a string's codepoints.
This method should be preferred over using CodePointAtIndexNode in a loop, especially on variable-width encodings such as UTF-8, since CodePointAtIndexNode may have to re-calculate the byte index equivalent of the given codepoint index on every call.
See the example:
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;
import com.oracle.truffle.api.strings.TruffleStringIterator;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
                    TruffleString string,
                    @Cached TruffleString.CreateCodePointIteratorNode createCodePointIteratorNode,
                    @Cached TruffleStringIterator.NextNode nextNode,
                    @Cached TruffleString.CodePointLengthNode codePointLengthNode,
                    @Cached TruffleString.CodePointAtIndexNode codePointAtIndexNode) {
        // iterating over a string's codepoints using TruffleStringIterator
        TruffleStringIterator iterator = createCodePointIteratorNode.execute(string, TruffleString.Encoding.UTF_8);
        while (iterator.hasNext()) {
            System.out.printf("%x%n", nextNode.execute(iterator));
        }
        // suboptimal variant: using CodePointAtIndexNode in a loop
        int codePointLength = codePointLengthNode.execute(string, TruffleString.Encoding.UTF_8);
        for (int i = 0; i < codePointLength; i++) {
            // performance problem: codePointAtIndexNode may have to calculate the byte index corresponding
            // to codepoint index i for every loop iteration
            System.out.printf("%x%n", codePointAtIndexNode.execute(string, i, TruffleString.Encoding.UTF_8));
        }
    }
}
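The reason an iterator avoids repeated index calculations is that it remembers its current byte position between calls. The following JDK-only sketch of a forward UTF-8 decoder illustrates this; it is not the TruffleStringIterator implementation, the class Utf8IteratorSketch is a hypothetical name, and the input is assumed to be valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the TruffleString API. Assumes valid UTF-8 input.
final class Utf8IteratorSketch {
    private final byte[] bytes;
    private int byteIndex; // position is kept between calls, so nothing is re-scanned

    Utf8IteratorSketch(byte[] validUtf8) {
        this.bytes = validUtf8;
    }

    boolean hasNext() {
        return byteIndex < bytes.length;
    }

    /** Decodes the codepoint starting at the current byte index and advances past it. */
    int next() {
        int lead = bytes[byteIndex++] & 0xff;
        if (lead < 0x80) {
            return lead; // single-byte (ASCII) codepoint
        }
        // number of continuation bytes, derived from the lead byte
        int extra = lead >= 0xf0 ? 3 : lead >= 0xe0 ? 2 : 1;
        int codePoint = lead & (0x3f >> extra); // payload bits of the lead byte
        for (int i = 0; i < extra; i++) {
            codePoint = (codePoint << 6) | (bytes[byteIndex++] & 0x3f);
        }
        return codePoint;
    }

    /** Decodes all codepoints of a well-formed string in a single pass over its UTF-8 bytes. */
    static int[] decodeAll(String s) {
        Utf8IteratorSketch it = new Utf8IteratorSketch(s.getBytes(StandardCharsets.UTF_8));
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        while (it.hasNext()) {
            result[n++] = it.next();
        }
        return result;
    }

    public static void main(String[] args) {
        for (int codePoint : decodeAll("a\u00e9\u20ac\ud83d\ude00")) {
            System.out.printf("%x%n", codePoint); // prints 61, e9, 20ac, 1f600 on separate lines
        }
    }
}
```

Each call to next() costs time proportional to one codepoint, whereas an index-based lookup on a variable-width encoding may need to scan from the start of the string.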
TruffleString also provides a mutable string variant called MutableTruffleString, which is also accepted in all nodes of TruffleString. MutableTruffleString is not thread-safe and allows overwriting bytes in its internal byte array or native pointer via WriteByteNode.
The internal array or native pointer's content may also be modified externally, but the corresponding MutableTruffleString must be notified of this via notifyExternalMutation().
MutableTruffleString is not a Truffle interop type, and must be converted to an immutable TruffleString via TruffleString.AsTruffleStringNode before passing a language boundary.
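One plausible reason for the notification requirement is that a mutable string may cache properties derived from its content. The sketch below illustrates that idea only; it is an assumption made for illustration, not a description of MutableTruffleString's internals, and the class MutableBytesSketch and its members are hypothetical names.

```java
// Hypothetical illustration class; the caching behavior is an assumption, not the real implementation.
final class MutableBytesSketch {
    private final byte[] bytes; // shared with the caller, so it can change externally
    private boolean asciiKnown;
    private boolean ascii;

    MutableBytesSketch(byte[] shared) {
        this.bytes = shared;
    }

    /** Computes the property once and caches it until the next (notified) mutation. */
    boolean isAscii() {
        if (!asciiKnown) {
            ascii = true;
            for (byte b : bytes) {
                if (b < 0) { // a negative byte value means the high bit is set: not ASCII
                    ascii = false;
                    break;
                }
            }
            asciiKnown = true;
        }
        return ascii;
    }

    /** Internal writes invalidate the cache automatically. */
    void writeByte(int index, byte value) {
        bytes[index] = value;
        asciiKnown = false;
    }

    /** External writes cannot be observed, so the caller has to report them. */
    void notifyExternalMutation() {
        asciiKnown = false;
    }

    public static void main(String[] args) {
        byte[] shared = {'a', 'b'};
        MutableBytesSketch s = new MutableBytesSketch(shared);
        System.out.println(s.isAscii()); // true
        shared[0] = (byte) 0xe9;         // external modification
        System.out.println(s.isAscii()); // true (stale cached value)
        s.notifyExternalMutation();
        System.out.println(s.isAscii()); // false
    }
}
```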