Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use unsafe to speed up string marshaling #6433

Merged
merged 6 commits into from
Jun 3, 2024

Conversation

laurit
Copy link
Contributor

@laurit laurit commented May 6, 2024

To marshal strings without allocating extra memory we need to implement our own methods for estimating the byte size of utf-8 encoded string and encoding the string to utf8. Unfortunately jit compiler isn't able to optimize these too well. This PR explores using unsafe to speed up utf8 encoding. The focus is on strings that use only latin1 and are stored inside a byte array in the java.lang.String class since jdk9. Unsafe is used to access that byte array. Based on the byte array we determine whether the string is composed only of 7-bit characters, if so we can copy they byte array directly to output as no additional encoding is needed. To determine whether the string only has 7-bit characters we count bytes that have negative value, the count of such bytes corresponds to characters that would be encoded as 2 bytes in utf8. Unsafe is used to let this loop process 8 bytes at a time.

Benchmark                                                                      (stringSize)  Mode  Cnt      Score     Error   Units
StringMarshalBenchmark.marshalAsciiStringStateful                                       512  avgt   10      0.032 ±   0.001   us/op
StringMarshalBenchmark.marshalAsciiStringStateful:gc.alloc.rate                         512  avgt   10  17387.600 ± 362.188  MB/sec
StringMarshalBenchmark.marshalAsciiStringStateful:gc.alloc.rate.norm                    512  avgt   10    576.000 ±   0.001    B/op
StringMarshalBenchmark.marshalAsciiStringStateful:gc.count                              512  avgt   10    270.000            counts
StringMarshalBenchmark.marshalAsciiStringStateful:gc.time                               512  avgt   10     99.000                ms
StringMarshalBenchmark.marshalAsciiStringStatelessSafe                                  512  avgt   10      0.363 ±   0.008   us/op
StringMarshalBenchmark.marshalAsciiStringStatelessSafe:gc.alloc.rate                    512  avgt   10      0.006 ±   0.001  MB/sec
StringMarshalBenchmark.marshalAsciiStringStatelessSafe:gc.alloc.rate.norm               512  avgt   10      0.002 ±   0.001    B/op
StringMarshalBenchmark.marshalAsciiStringStatelessSafe:gc.count                         512  avgt   10        ≈ 0            counts
StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe                                512  avgt   10      0.038 ±   0.001   us/op
StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.alloc.rate                  512  avgt   10    596.035 ±   7.136  MB/sec
StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.alloc.rate.norm             512  avgt   10     24.000 ±   0.001    B/op
StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.count                       512  avgt   10     18.000            counts
StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.time                        512  avgt   10     10.000                ms
StringMarshalBenchmark.marshalLatin1StringStateful                                      512  avgt   10      0.214 ±   0.001   us/op
StringMarshalBenchmark.marshalLatin1StringStateful:gc.alloc.rate                        512  avgt   10   4837.182 ±  14.924  MB/sec
StringMarshalBenchmark.marshalLatin1StringStateful:gc.alloc.rate.norm                   512  avgt   10   1088.001 ±   0.001    B/op
StringMarshalBenchmark.marshalLatin1StringStateful:gc.count                             512  avgt   10    145.000            counts
StringMarshalBenchmark.marshalLatin1StringStateful:gc.time                              512  avgt   10     47.000                ms
StringMarshalBenchmark.marshalLatin1StringStatelessSafe                                 512  avgt   10      0.817 ±   0.003   us/op
StringMarshalBenchmark.marshalLatin1StringStatelessSafe:gc.alloc.rate                   512  avgt   10      0.006 ±   0.001  MB/sec
StringMarshalBenchmark.marshalLatin1StringStatelessSafe:gc.alloc.rate.norm              512  avgt   10      0.005 ±   0.001    B/op
StringMarshalBenchmark.marshalLatin1StringStatelessSafe:gc.count                        512  avgt   10        ≈ 0            counts
StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe                               512  avgt   10      0.555 ±   0.002   us/op
StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe:gc.alloc.rate                 512  avgt   10      0.006 ±   0.001  MB/sec
StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe:gc.alloc.rate.norm            512  avgt   10      0.003 ±   0.001    B/op
StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe:gc.count                      512  avgt   10        ≈ 0            counts
StringMarshalBenchmark.marshalUnicodeStringStateful                                     512  avgt   10      0.502 ±   0.004   us/op
StringMarshalBenchmark.marshalUnicodeStringStateful:gc.alloc.rate                       512  avgt   10   3037.872 ±  25.989  MB/sec
StringMarshalBenchmark.marshalUnicodeStringStateful:gc.alloc.rate.norm                  512  avgt   10   1600.003 ±   0.001    B/op
StringMarshalBenchmark.marshalUnicodeStringStateful:gc.count                            512  avgt   10     91.000            counts
StringMarshalBenchmark.marshalUnicodeStringStateful:gc.time                             512  avgt   10     33.000                ms
StringMarshalBenchmark.marshalUnicodeStringStatelessSafe                                512  avgt   10      0.784 ±   0.021   us/op
StringMarshalBenchmark.marshalUnicodeStringStatelessSafe:gc.alloc.rate                  512  avgt   10      0.006 ±   0.001  MB/sec
StringMarshalBenchmark.marshalUnicodeStringStatelessSafe:gc.alloc.rate.norm             512  avgt   10      0.005 ±   0.001    B/op
StringMarshalBenchmark.marshalUnicodeStringStatelessSafe:gc.count                       512  avgt   10        ≈ 0            counts
StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe                              512  avgt   10      0.770 ±   0.029   us/op
StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe:gc.alloc.rate                512  avgt   10      0.006 ±   0.001  MB/sec
StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe:gc.alloc.rate.norm           512  avgt   10      0.005 ±   0.001    B/op
StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe:gc.count                     512  avgt   10        ≈ 0            counts

@laurit laurit requested a review from a team May 6, 2024 08:19
Copy link

codecov bot commented May 6, 2024

Codecov Report

Attention: Patch coverage is 77.27273% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 90.83%. Comparing base (a764419) to head (eb1cc64).
Report is 9 commits behind head on main.

Current head eb1cc64 differs from pull request most recent head 43c82d6

Please upload reports for the commit 43c82d6 to get more accurate results.

Files Patch % Lines
...emetry/exporter/internal/marshal/UnsafeString.java 56.25% 4 Missing and 3 partials ⚠️
...emetry/exporter/internal/marshal/UnsafeAccess.java 72.22% 4 Missing and 1 partial ⚠️
...orter/internal/marshal/StatelessMarshalerUtil.java 89.28% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6433      +/-   ##
============================================
- Coverage     90.87%   90.83%   -0.05%     
- Complexity     6169     6197      +28     
============================================
  Files           678      680       +2     
  Lines         18507    18568      +61     
  Branches       1818     1829      +11     
============================================
+ Hits          16819    16866      +47     
- Misses         1153     1160       +7     
- Partials        535      542       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jkwatson
Copy link
Contributor

jkwatson commented May 7, 2024

What will happen with java 22, when Unsafe is deprecated?

@laurit
Copy link
Contributor Author

laurit commented May 8, 2024

What will happen with java 22, when Unsafe is deprecated?

Looking at https://openjdk.org/jeps/471 I'd say nothing will change before jdk 25. As unsafe usage is optional even then not much will change. Eventually users may need to opt in by allowing access to private members of java.lang.String so we could read the underlying byte array. Java agent can probably get access without requiring users to do anything. In coming jdk releases using vector api will allow us to further improve the performance.

@jkwatson
Copy link
Contributor

jkwatson commented May 8, 2024

Eventually users may need to opt in by allowing access to private members of java.lang.String so we could read the underlying byte array.

We definitely would not want to force people to do any additional work to use the SDK out of the box. For that reason, I'm a bit nervous about making this optimization.

@trask
Copy link
Member

trask commented May 8, 2024

We definitely would not want to force people to do any additional work to use the SDK out of the box. For that reason, I'm a bit nervous about making this optimization.

I believe the optimization only takes effect if Unsafe is available, @laurit can you clarify?

@laurit
Copy link
Contributor Author

laurit commented May 9, 2024

Yes the intent is to use this optimization only when unsafe is available and has the required methods, otherwise just fall back to the version that doesn't require unsafe. If we need to further future proof it we could consider limiting it by jvm version. If 21 is used we'd allow the optimization, but disallow it for future versions like 23 for which we don't know what is going to happen. When 23 is release we could verify that it works and enable the optimization for that version too. @jkwatson would you be more comfortable with something like this?

We definitely would not want to force people to do any additional work to use the SDK out of the box. For that reason, I'm a bit nervous about making this optimization.

Given the way jdk authors have been trying to lock down access to jdk internals since jdk 9 at one point user code won't be able to access private fields in jdk classes unless user explicitly allows it with --add-opens=java.base/java.lang=ALL-UNNAMED. By the time this happens perhaps there are new apis or ideas or advances in the jit compiler that make using unsafe unnecessary.

@jhalliday
Copy link
Contributor

Interesting idea! I'll skip the obvious (Unsafe is unsafe, and has a limited shelf life at this point) and get right into the gritty details:

If I've understood it right, there are at least three distinct things here, which are conflated in the presentation of the benchmark data and which may be interesting to consider separately:

  • Stateful (the Marshaler is a one-shot object wrapping the thing to be marshalled, which is how all the others work?) vs Stateless (more c-style, the thing to be marshalled is passed as an argument). That seems an invasive change, in that any user of the Marshalers has to handle the stateless one differently. Presumably it's aimed at cutting object allocation, but that's distinct from just using Unsafe. Indeed, it's making life harder - the safe/unsafe switch could otherwise be an encapsulated implementation detail of the marshaler. Usability concerns aside, it's also I think the likely culprit causing perf regression in the Safe case, in that size now has to be recalculated instead of cached in marshaler instance. That seems especially painful for the common ascii case - 0.032 us/op growing to 0.363 is hard to ignore, even if it may be offset by gc savings.

  • The test string length has been changed from 16 to 512. Are either of those values based on observed data in production? If not, parametrising the benchmark over a range of values would be interesting. In particular, I'd like to see the numbers for the original 16 length.

  • The batching of bytes into a long feels overly complex. When doing compact strings in the JDK they went with a simple byte at a time loop, presumably relying on the compiler to unroll it and the CPU to pipeline it. Whilst I'm not keen on a combinatorial explosion of benchmark parameters, I'd be more comfortable with the extra code complexity here if it was backed by some evidence it's actually worthwhile.

  • StatelessUnsafe appears to be allocating in the acsii case, but not for latin1 or unicode. That feels somewhat suspicious. What am I missing/

@laurit
Copy link
Contributor Author

laurit commented May 16, 2024

That seems an invasive change, in that any user of the Marshalers has to handle the stateless one differently.

These changes are already done and merged. I think they are out of the scope of this PR.

The test string length has been changed from 16 to 512. Are either of those values based on observed data in production?

They aren't, they were copied from the existing span marshaling benchmark that creates test data with either 16 or 512 spans. This benchmark was added in a previous PR that introduced stateless marshalers. I think I removed 16 just to make the benchmark run faster.

The batching of bytes into a long feels overly complex. When doing compact strings in the JDK they went with a simple byte at a time loop, presumably relying on the compiler to unroll it and the CPU to pipeline it.

If you look at the jdk code https://github.com/openjdk/jdk17u/blob/68caeca1a5f265ca30a7e6d2a62127d458a3473e/src/java.base/share/classes/java/lang/StringCoding.java#L37 you'll see that the method with the simple loop is using @IntrinsicCandidate annotation the actual implementation https://github.com/openjdk/jdk17u/blob/68caeca1a5f265ca30a7e6d2a62127d458a3473e/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L3462 is also quite complicated. Unfortunately jit compiler doesn't generate the fastest code for the simple loop. Hotspot jit vectorizes loops where you add 2 vectors, but as far as I know it does not vectorize loops that have a reduction (for example compute dot product). In future jdk version vector api could allow faster implementation for this method that does not use unsafe.

StatelessUnsafe appears to be allocating in the acsii case, but not for latin1 or unicode. That feels somewhat suspicious.

The 24 bytes should come from writeBinaryTo method that allocates a new ProtoSerializer. My guess is that JIT compiler manages to stack allocate ProtoSerializer. JIT compiler works in mysterious ways.

@jack-berg
Copy link
Member

That seems especially painful for the common ascii case - 0.032 us/op growing to 0.363 is hard to ignore, even if it may be offset by gc savings.

My understanding is that there are two dimensions. 1. Whether or not the stateless marshaling is enabled, which is triggered by setting OTEL_JAVA_EXPERIMENTAL_EXPORTER_MEMORY_MODE=reusable_data. 2. Whether or not unsafe is available. These two dimensions result in the following perf figures for the ascii case:

Stateful Stateless
Safe 0.032 us/op
576 B/op
0.363 us/op
.002 B/op
Unsafe n/a 0.038 us/op
24 B/op

So you only get the 0.032 us/op case if you opt in reusable_data memory mode and unsafe is not available, which seems like it should be an unusual combination. Although it will become more common when we make reusable_data mode the default.

@laurit any indea why the ascii stateless/safe scenario seems to be so much worse that the latin1 or unicode equivalent?

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments but I'm in favor of this. I get that unsafe is unsafe, but:

  • We already have precedent for using it with BatchSpanProcessor
  • This is special case where we can benefit quite a bit from its usage
  • The usage is limited to a small area of code which is well defined / contained, with fallback logic when unsafe is not available

return encodedUtf8Length(string);
}

/** Returns the count of bytes with negative value. */
private static int countNegative(byte[] bytes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a bit to parse what's going on here but I think I get it:

  • We want to use unsafe to read the value of a string's internal byte[] directly when a string is composed of all latin1 characters that fit within 7 bits (first 128 code points)
  • In order to determine if a string meets this condition as fast as possible, we use unsafe and some clever (seriously clever - where did you come up with this??) shifting to read through the string in chunks of 8 bytes, counting the number of times when a byte has a 1 in the most significant bit and therefore doesn't fit in 7 bits. Reading 8 bytes at a time is faster than reading 1 byte at a time, which is what we do currently when trying to compute the size of a string with reusable_data memory mode.
  • Later, when we are serializing the string, we check if the string is latin1 and had a size which was the same as the string length. If true, we can use unsafe to write the string's internal byte[] to the output stream. If false, then fallback to the existing serialization logic.

Does this sound like a correct characterization @laurit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this sound like a correct characterization @laurit?

yes

We want to use unsafe to read the value of a string's internal byte[] directly when a string is composed of all latin1 characters that fit within 7 bits (first 128 code points)

when we read they byte array we don't know whether it is going to be 7bit or not

where did you come up with this?

Typically such loops are sped up by using vector instructions that allow processing multiple bytes simultaneously. Processing bytes on long a time is similar to using vector instructions. It would be interesting to see what effect there would be when this method is written using the vector api preview feature in recent jdks.

If true, we can use unsafe to write the string's internal byte[] to the output stream

That part doesn't use unsafe, just plain System.arraycopy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically such loops are sped up by using vector instructions that allow processing multiple bytes simultaneously. Processing bytes on long a time is similar to using vector instructions. It would be interesting to see what effect there would be when this method is written using the vector api preview feature in recent jdks.

Fascinating!

@@ -14,6 +14,7 @@ dependencies {
api(project(":sdk-extensions:autoconfigure-spi"))

compileOnly(project(":sdk:common"))
compileOnly(project(":exporters:common:compile-stub"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if you can explain why a separate module is needed for this, even tho its not published.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I try to compile a class that uses Unsafe with javac from jdk 17 targeting jdk 8 it fails with

javac --release 8 Test.java
Test.java:3: error: package sun.misc does not exist
		System.err.println(sun.misc.Unsafe.class.getName());

curiously when targeting jdk 11 it gets a warning instead

javac --release 11 Test.java
Test.java:3: warning: Unsafe is internal proprietary API and may be removed in a future release
		System.err.println(sun.misc.Unsafe.class.getName());

Not using --release 8 also results in a warning

javac --source 8 --target 8 Test.java
warning: [options] bootstrap class path not set in conjunction with -source 8
Test.java:3: warning: Unsafe is internal proprietary API and may be removed in a future release
		System.err.println(sun.misc.Unsafe.class.getName());

Creating a copy of Unsafe just for compiling felt like the easiest way to make this work. There is a comment in our version of Unsafe

/**
 * sun.misc.Unsafe from the JDK isn't found by the compiler, we provide out own trimmed down version
 * that we can compile against.
 */

I could add a similar comment here also. Is this what you were looking for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that would be helpful.

@jhalliday
Copy link
Contributor

reusable_data memory mode and unsafe is not available... will become more common when we make reusable_data mode the default.

yes, my concern is the future point where unsafe is retired or discouraged in the JDK and reusable_data is normal. The gc saving is nice if gc is a problem e.g. single core containers, but where gc is low cost e.g. enough spare cores to run it in the background, then the added CPU cost of the stateless serialization on the critical path by default may be unwelcome. As long as stateful is still available as an option it's not a big deal, the default can be tweaked fairly easily.

@jhalliday
Copy link
Contributor

StatelessUnsafe appears to be allocating in the acsii case, but not for latin1 or unicode. That feels somewhat suspicious.

The 24 bytes should come from writeBinaryTo method that allocates a new ProtoSerializer. My guess is that JIT compiler manages to stack allocate ProtoSerializer. JIT compiler works in mysterious ways.

perhaps, but I'm not clear what's allowing it to elide the allocation in some cases but not others. If the code was polymorphic on the ascii vs latin1/unicode choice then it could be optimizing the two functions differently, but I don't see where it is. Log the compiler execution trace perhaps? It's not bad, it's just annoyingly inexplicable.

@laurit
Copy link
Contributor Author

laurit commented May 24, 2024

@laurit any idea why the ascii stateless/safe scenario seems to be so much worse that the latin1 or unicode equivalent?

My understand is that this is because JDK optimizes for the ascii case. See https://github.com/openjdk/jdk17/blob/4afbcaf55383ec2f5da53282a1547bac3d099e9d/src/java.base/share/classes/java/lang/String.java#L1264

        if (!StringCoding.hasNegatives(val, 0, val.length))
            return Arrays.copyOf(val, val.length);

The StringCoding.hasNegatives is annotated with @IntrinsicCandidate and gets replaced with https://github.com/openjdk/jdk17u/blob/68caeca1a5f265ca30a7e6d2a62127d458a3473e/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L3462

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment about providing a property for explicit opt out. Seems like a conservative way to protect against any user concerns or issues.

@jack-berg jack-berg merged commit 56c1879 into open-telemetry:main Jun 3, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants