Package cloudflare-zlib with the Docker image #81208

Closed
jpountz opened this issue Dec 1, 2021 · 7 comments · Fixed by #81245
Labels
:Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts >enhancement Team:Delivery Meta label for Delivery team

Comments

@jpountz
Contributor

jpountz commented Dec 1, 2021

Elasticsearch uses zlib for two purposes:

  • Compression of stored fields with index.codec: best_compression, which we use for observability and security data.
  • Request / response compression.

The original zlib, which is usually the one that is installed, optimizes for portability and misses a number of important optimizations such as leveraging vectorization support for x86 and ARM architectures. Several forks have been created in order to address this, notably an Intel fork, zlib-ng and a Cloudflare fork.

Historically, zlib was packaged within the JDK, so that users wouldn't have to have zlib installed for basic usage of Java. A downside of this approach is that it didn't allow using one of these faster forks, but since version 9 the JDK uses the system's zlib when available and falls back to the zlib that is packaged within the JDK if a system zlib cannot be found.

I performed testing with the Cloudflare zlib, which yielded almost 2x faster compression and decompression for JSON documents that are representative of those produced by the Elastic Observability solution. A run of the solutions/logs track with Rally yielded 2.35% faster indexing, 8.28% less cumulative merge time and 4.83% less cumulative indexing time. One particularity of the Cloudflare zlib is that compression levels retain the same semantics as the original zlib (unlike the Intel fork, which uses a compatible format but gives different semantics for some compression levels), so the space efficiency of the produced indices was essentially the same.

This issue suggests that we update our Docker image to use the Cloudflare fork of zlib instead of the original zlib so that users of the Elastic Cloud service and of the Docker image in general would get better performance out of their Elasticsearch clusters.

@jpountz jpountz added >enhancement :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts labels Dec 1, 2021
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Dec 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@jpountz
Contributor Author

jpountz commented Dec 1, 2021

Note that a possible alternative would be to include the cloudflare-zlib library directly in the tarball. It felt more challenging than doing it in the Docker image (which is a more controlled environment), hence the suggestion to go with the Docker image. But I'd be equally happy with the tarball approach if the Delivery team prefers it.

@pugnascotia
Contributor

Cloudflare helpfully have their own package repository, with instructions for Debian, Ubuntu and RHEL. Unfortunately, their repo doesn't include Ubuntu packages for ARM / aarch64. The lib is quick to build from source, so maybe we can do something cross-platform.

@pugnascotia
Contributor

For the record, here are the steps that were used to test the zlib.

git clone git@github.com:cloudflare/zlib.git cloudflare-zlib
cd cloudflare-zlib
./configure --prefix=/usr/local/cloudflare-zlib
make
sudo make install

After that, set the LD_LIBRARY_PATH environment variable, e.g. LD_LIBRARY_PATH=/usr/local/cloudflare-zlib/lib/ bin/elasticsearch, and the JVM will pick up the library automatically.
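One way to confirm which zlib a running JVM actually mapped is to look at the process's memory maps. This is a minimal sketch, assuming Linux and a readable /proc/<pid>/maps; `mapped_zlib` is a hypothetical helper name, and its optional second argument exists only so it can be pointed at a saved copy of a maps file.

```shell
# Hypothetical helper: print the distinct libz shared objects mapped into a process.
# Assumes Linux, where /proc/<pid>/maps lists each mapping with its backing file.
mapped_zlib() {
    pid="$1"
    # Optional second argument: a saved maps file to read instead of /proc.
    maps_file="${2:-/proc/$pid/maps}"
    # The backing file path is the last field of each line; keep only libz.so*.
    awk '$NF ~ /\/libz\.so/ { print $NF }' "$maps_file" | sort -u
}
```

Called with the PID of the Elasticsearch JVM, it should print a path under /usr/local/cloudflare-zlib/lib if the fork was picked up, or the distribution's own libz path otherwise. If it prints nothing, the JVM is presumably using its bundled zlib or has not loaded zlib yet.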

@jpountz
Contributor Author

jpountz commented Dec 1, 2021

I didn't check the Cloudflare packages, but hopefully they put the lib in a directory that ld gives higher priority than the directory where the original zlib is, so that it would get used naturally without additional measures.

@DJRickyB
Contributor

DJRickyB commented Dec 1, 2021

Or if we do package it with the tarball, bin/elasticsearch itself could check for its presence and set/extend LD_PRELOAD or LD_LIBRARY_PATH before the JVM initialization
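A rough sketch of what such a check could look like in a launcher script. Everything here is illustrative: `zlib_library_path` is a hypothetical helper, the install prefix is the one from the build steps earlier in this thread, and the `libz.so.1` file name is an assumption about how the library gets installed. It is not what bin/elasticsearch does today.

```shell
# Hypothetical helper: print an LD_LIBRARY_PATH value with the directory "$1"
# prepended, but only when that directory actually contains libz.so.1.
# "$2" is the current LD_LIBRARY_PATH value (may be empty).
zlib_library_path() {
    zlib_dir="$1"
    current="${2-}"
    if [ -e "$zlib_dir/libz.so.1" ]; then
        # ${current:+:$current} adds the ":" separator only when there is an
        # existing value, so the result never ends with a stray ":".
        printf '%s\n' "$zlib_dir${current:+:$current}"
    else
        # Library not present: leave the search path unchanged.
        printf '%s\n' "$current"
    fi
}

# Assumed install prefix, matching the ./configure --prefix used above.
LD_LIBRARY_PATH="$(zlib_library_path /usr/local/cloudflare-zlib/lib "${LD_LIBRARY_PATH-}")"
export LD_LIBRARY_PATH
```

Doing this before the JVM starts matters because the dynamic linker reads LD_LIBRARY_PATH when the process is launched. When the library is absent the path is left alone, so the JVM would keep using whatever zlib it used before.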

@pugnascotia
Contributor

TBH I was planning on modifying LD_LIBRARY_PATH. We can mostly do what we like in the Docker images.

pugnascotia added a commit to pugnascotia/elasticsearch that referenced this issue Dec 2, 2021
Closes elastic#81208. Elasticsearch uses zlib for two purposes:

   * Compression of stored fields with `index.codec: best_compression`,
     which we use for observability and security data.
   * Request / response compression.

Historically, zlib was packaged within the JDK, so that users wouldn't
have to have zlib installed for basic usage of Java. However, the
original zlib optimizes for portability and misses a number of important
optimizations such as leveraging vectorization support for x86 and ARM
architectures. Several forks have been created in order to address this.

Since version 9, the JDK uses the system's zlib when available and falls
back to the zlib that is packaged within the JDK if a system zlib cannot
be found.

This commit changes the Docker image to install the Cloudflare fork of
zlib, and run Java using the fork instead of the original zlib, so that
users of the Docker image can get better performance.

Other ES distribution types are out-of-scope, since configuring the JVM
to use an alternative zlib requires an environment config as well as
installing another zlib, and Docker is the only distribution type where
we can control both.
elasticsearchmachine pushed a commit that referenced this issue Dec 3, 2021
pugnascotia added a commit that referenced this issue Dec 3, 2021
pugnascotia added a commit that referenced this issue Dec 3, 2021
4 participants