[RFC] Jigsaw integration #1588

rishabhmaurya · 2021-11-19T02:18:14Z

Jigsaw integration OpenSearch

This proposal addresses the following problems in the OpenSearch codebase and its plugin framework -

Lack of strong encapsulation: the server module in OpenSearch is weakly encapsulated; all of its packages are exposed, both for direct access and through reflection.
Lack of separation of concerns and a monolithic module: the server module needs to be split into components according to the use case each one solves in the system. Also, a plugin developer or consumer of these modules shouldn’t need to be concerned with the business logic of, or changes in, unrelated modules. For example, an AnalyzerPlugin should only depend on the modules of its concern, perhaps analyzer-module, search-module, index-module, etc.

Note: this proposal assumes breaking changes are allowed in OpenSearch 2.0 and that’s when these changes can go in all at once.

Solving the above two problems using Java 9 modules will inherently address other problems as well, such as -

Plugin versions decoupled from the OpenSearch version: plugins will depend on the versions of the modules they use and not on the OpenSearch version. If a module has no major version bump in an OpenSearch release, a plugin that depends on it should continue to run as is.
Code hygiene and better maintainability: a modular codebase is more maintainable and easier to contribute to, since its structure and dependency graph are easier to visualize. That makes more sense to open source contributors who want to focus on a specific part.
Unblock customers using the Jigsaw modular architecture: customers who have already migrated to Java 9 modules are currently blocked because the OpenSearch rest client doesn’t support Java 9 modules. More details here - Make it possible to use the high level rest client with modularized (jigsaw) applications elastic/elasticsearch#38299

Benefits of Java 9 modules?

Strong encapsulation: Java 9 modules enable strong encapsulation of a module’s internal classes, provided the modules are configured properly. By default, no public types are exported; a type becomes accessible to other modules only when its package is exported explicitly.
Reliable configuration: missing dependencies are detected at compile time or launch time, instead of surfacing later while the application is running.
Security: strong encapsulation hides internal packages that are not meant to be exposed to consumers.
Better migration support: concepts like automatic, named and unnamed modules are designed to make it possible for large codebases such as OpenSearch to migrate to a modular architecture incrementally. More details in the following sections. Modular jars remain backward compatible with JDK 8 as the Java runtime environment, so applications running on JDK 8 will not break.
Better start-up performance: since the exported packages and required types are known ahead of time, JVM class loading is more optimized in a modular system. Previously, the classloader had to perform a linear scan of all jars on the classpath to find the origin of a class at load time.

Tradeoffs of Java 9 modules:

No support for module versions: Java 9 modules don’t support versioning. Usual build tools like gradle can be used for this, just as they are today, to handle versioning and to declare a modular jar’s dependency on a specific version.
Split packages: the same package cannot span multiple jars on the module path; in JPMS this is a split package and is rejected. This was not the case before.
Cyclic dependencies are not allowed between modules, and JPMS promotes dependency inversion using ServiceProvider instead. This implies more refactoring work to split modules out of a gigantic module with circular dependencies among its packages.
Additional effort to maintain dependencies and export rules.
Module configurations are immutable, so updating modules at run time will not be supported out of the box.

There is a nice blog on concerns regarding Jigsaw - https://developer.jboss.org/blogs/scott.stark/2017/04/14/critical-deficiencies-in-jigsawjsr-376-java-platform-module-system-ec-member-concerns

Before jumping into the integration, if you’re new to JPMS, below are some of its concepts which will be used throughout the document -

Module descriptor file: this file defines the rules of a module, such as its dependencies and exports, and is named module-info.java. It is placed at the Java source root. This is what differentiates a modular jar from a non-modular one.
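To make the descriptor concrete, here is a minimal sketch that builds one in memory with the JDK's `java.lang.module.ModuleDescriptor` builder and prints what it declares. The module and package names (`org.example.analysis`, `org.example.core`, etc.) are hypothetical and only for illustration; the comment in the code shows the roughly equivalent module-info.java.

```java
import java.lang.module.ModuleDescriptor;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class DescriptorSketch {

    // Builds the descriptor that a module-info.java like the following would declare
    // (all names are hypothetical):
    //
    //   module org.example.analysis {
    //       requires org.example.core;          // dependency on another module
    //       exports org.example.analysis.api;   // the only package visible to consumers
    //   }
    static ModuleDescriptor build() {
        return ModuleDescriptor.newModule("org.example.analysis")
            .requires("org.example.core")
            .exports("org.example.analysis.api")
            // a package that exists in the module but is NOT exported
            .packages(Set.of("org.example.analysis.internal"))
            .build();
    }

    // Collects the names of the exported packages in sorted order.
    static Set<String> exportedPackages(ModuleDescriptor descriptor) {
        return descriptor.exports().stream()
            .map(ModuleDescriptor.Exports::source)
            .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        ModuleDescriptor descriptor = build();
        System.out.println("module:   " + descriptor.name());
        System.out.println("exports:  " + exportedPackages(descriptor));
        System.out.println("packages: " + new TreeSet<>(descriptor.packages()));
    }
}
```

The internal package is part of the module but absent from the exports, which is the encapsulation the descriptor is meant to express.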

Module path: just as the classpath is for plain jars, the module path is for modular jars. It is not mandatory to load every modular jar on the module path, though; if one is loaded onto the classpath, it behaves like a plain jar. JPMS has special rules for readability between the module path and the classpath, which will become clear in the following sections.

Unnamed module: all plain jars, i.e. those that do not contain a module descriptor file and are on the classpath, are categorized as the unnamed module and are all placed in the unnamed module group. These non-modular jars behave just like plain jars in a non-JPMS system: they export everything and can depend on classes of any jar present on the classpath. So any plain jar can depend on other plain jars. Also, unnamed modules have access to all modular jars (to the packages they export) and don’t have to specify any dependency rules to depend on them. So, in a way, if we place all jars (both modular and non-modular) on the classpath, everything would run the same way as it does now.
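A quick way to see which kind of module a class ended up in is to ask it at run time. The sketch below uses only the standard `Class.getModule()` API; the class name `WhichModule` is just an illustrative choice. When it is compiled and launched from the classpath, its own class should report the unnamed module, while JDK classes such as `String` belong to the named module java.base.

```java
public class WhichModule {

    // Describes the module a class was loaded into.
    static String describe(Class<?> type) {
        Module module = type.getModule();
        return module.isNamed() ? "named module " + module.getName() : "unnamed module";
    }

    public static void main(String[] args) {
        // JDK classes always live in named modules.
        System.out.println("String      -> " + describe(String.class));
        // The result for this class depends on whether it was launched
        // from the classpath or from the module path.
        System.out.println("WhichModule -> " + describe(WhichModule.class));
    }
}
```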

Named module: jars that contain a module descriptor file and are placed on the module path are automatically treated as modular jars. Named modules can only depend on other named modules and cannot directly depend on unnamed modules. This is way too strict, isn’t it? JPMS provides 2 ways to access plain jars that are not yet modularized, i.e. unnamed modules on the classpath: by specifying a readability edge between the source module and all unnamed modules, OR by using automatic modules as described in the next section. Adding a readability edge isn’t a pleasant idea, for 2 reasons -

Using a readability edge, we cannot control jars or packages in a fine-grained manner, so all plain jars on the classpath become accessible to the source module irrespective of whether it uses them or not. This is an example of the weak encapsulation we are trying to address here.
If the source modular jar exposes to its consumers some type that is defined in an unnamed module, then consumers of this jar need to add a readability edge to all unnamed modules as well. Think about the case where all lucene jars, which are not yet modularized, are placed on the classpath as unnamed modules and the source module exposes some lucene type to its consumers. Now the consumer is also required to add a readability edge to all unnamed modules, which exposes to the consumer not just the lucene jars but every other plain jar on the classpath that is not yet modularized (like the server jar).

Automatic module: jars without a module descriptor file that are placed on the module path are treated as automatic modules. Automatic modules have access to all named modules as well as to unnamed modules on the classpath. They export all their packages. Also, both modular and non-modular jars have access to everything in automatic modules. They act as a bridge between named and unnamed modules and are a great way for modular jars on the module path to use non-modular jars on the classpath. Modular jars can refer to them by the automatic name that JPMS gives these jars, which is derived from the jar name, more details here. It’s a great way to progressively migrate packages from non-modular jars to their modular versions.

They come with a tradeoff: since they export everything, when such a jar is eventually modularized it may stop exporting some of the packages used by its consumers, and that could be a breaking change for them. For example, if we put the lucene jars on the module path and treat them as automatic modules, they will export all their packages. So if an OpenSearch plugin, or one of OpenSearch’s own modules, depends on these lucene automatic modules, its code might break when lucene actually modularizes its jars and stops exporting some of the packages it uses.
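To see the automatic name JPMS derives from a jar file name, one option is to let the JDK's own `ModuleFinder` inspect a plain jar. The sketch below writes a manifest-only jar into a temporary directory and reads back the descriptor; the file name `lucene-core-8.10.1.jar` is only an illustrative input, not a statement about which lucene version OpenSearch ships.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.lang.module.ModuleDescriptor;
import java.lang.module.ModuleFinder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class AutomaticNameSketch {

    // Creates a jar without a module descriptor and returns the descriptor
    // that JPMS derives for it when it is treated as an automatic module.
    static ModuleDescriptor describeAsAutomatic(String jarFileName) {
        try {
            Path dir = Files.createTempDirectory("automatic-module-sketch");
            Path jar = dir.resolve(jarFileName);
            try {
                Manifest manifest = new Manifest();
                manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
                try (OutputStream out = Files.newOutputStream(jar);
                     JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
                    // Intentionally empty: the manifest is the only entry.
                }
                // No module-info.class in the jar, so the finder derives an automatic module.
                return ModuleFinder.of(jar).findAll().iterator().next().descriptor();
            } finally {
                Files.deleteIfExists(jar);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ModuleDescriptor descriptor = describeAsAutomatic("lucene-core-8.10.1.jar");
        System.out.println("derived name: " + descriptor.name());
        System.out.println("automatic:    " + descriptor.isAutomatic());
        System.out.println("version:      " + descriptor.rawVersion().orElse("(none)"));
    }
}
```

A jar can also pin its name with an `Automatic-Module-Name` manifest attribute, which takes precedence over the file-name derivation and keeps the name stable if the file is renamed.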

Split packages: when one package is present in multiple jars, those jars cannot be loaded as separate modules on the module path. They can all reside on the classpath, though, the way they are supported today. The OpenSearch codebase is full of split packages between the server module and other modules like opensearch-core, opensearch-x-content, opensearch-cli, etc. It also has split packages with the lucene jars, as it contains the org.apache.lucene package. Not just OpenSearch; lucene also has packages spread across multiple modules, more details here - https://lucene.apache.org/core/7_3_0/MIGRATE.html and https://issues.apache.org/jira/browse/LUCENE-9623
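Finding split packages is mostly bookkeeping: list the packages of each jar and report any package that shows up in more than one. A minimal sketch, assuming the package lists have already been extracted from the jars; the jar-to-package mapping in `main` is made up for illustration and is not an inventory of the real codebase.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SplitPackageSketch {

    // For every package that appears in more than one jar, returns the jars containing it.
    static Map<String, Set<String>> findSplitPackages(Map<String, Set<String>> packagesByJar) {
        Map<String, Set<String>> jarsByPackage = new TreeMap<>();
        packagesByJar.forEach((jar, packages) ->
            packages.forEach(pkg ->
                jarsByPackage.computeIfAbsent(pkg, key -> new TreeSet<>()).add(jar)));
        // Keep only packages owned by two or more jars.
        jarsByPackage.values().removeIf(jars -> jars.size() < 2);
        return jarsByPackage;
    }

    public static void main(String[] args) {
        // Hypothetical layout, for illustration only.
        Map<String, Set<String>> packagesByJar = Map.of(
            "server", Set.of("org.opensearch.common", "org.opensearch.index", "org.apache.lucene"),
            "opensearch-core", Set.of("org.opensearch.common"),
            "lucene-core", Set.of("org.apache.lucene"));
        findSplitPackages(packagesByJar)
            .forEach((pkg, jars) -> System.out.println(pkg + " is split across " + jars));
    }
}
```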

There are other important concepts, such as open modules for reflective access, patch module to avoid split packages, and accessing services using the ServiceLoader APIs, which one can explore on their own and which are important for enumerating all possible scenarios and workarounds.

Let’s visualize the existing non-JPMS modules in OpenSearch and their dependency graph -

Strategy to modularize

Starting with everything on the classpath, modularize incrementally and move everything to the module path -

Note: The new modules in the diagram above are hypothetical names used to explain the strategy.

Rules:
The rules below are just a proposal; there might be better alternatives depending on which tradeoffs are acceptable and which we are willing to make.

From unnamed to automatic module -
Automatic modules are work-in-progress modules whose classes and packages are not yet final. So, packages can move from the unnamed module to an automatic module over time. The constraints are -

Cyclic dependencies: packages in an automatic module should not have a cyclic dependency on any other module (named or unnamed) in the system. The OpenSearch codebase is full of them, so a lot of refactoring, like dependency inversion or moving common utilities into a separate module, would be required.
Split packages: two modules on the module path cannot contain a package with the same name. So if the same package is present in the unnamed module or another named module, either the new automatic module should rename that package OR all of the package’s contents should move into one module. Patch module should only be used for exceptions and with careful consideration.

From automatic to named module -
A module should only be treated as a named module when all of its dependencies are either modularized or present in one of the automatic modules. This rule should be enforced in most cases, and a readability edge from a named module to the unnamed module should be avoided. There could be exceptions, though: a readability edge from a named module to all unnamed modules is acceptable only if the dependency cannot be modularized immediately and it is an internal dependency of that named module, not something exposed and published to external consumers. This is to prevent external consumers from depending on unnamed modules. Also, export rules need to be defined before declaring a module named.
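The promotion rule above can be expressed as a small check. This is only a sketch under assumed inputs: the module names, the `Kind` enum and the dependency map are hypothetical bookkeeping, not anything that exists in the build today.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PromotionCheckSketch {

    // Where a jar currently sits in the migration (hypothetical bookkeeping).
    enum Kind { UNNAMED, AUTOMATIC, NAMED }

    // Returns the direct dependencies that are still in the unnamed module.
    // An empty result means the module satisfies the rule for becoming a named module.
    static Set<String> blockers(String module,
                                Map<String, Set<String>> dependencies,
                                Map<String, Kind> kinds) {
        Set<String> result = new TreeSet<>();
        for (String dependency : dependencies.getOrDefault(module, Set.of())) {
            // Anything we have no record of is treated as not yet migrated.
            if (kinds.getOrDefault(dependency, Kind.UNNAMED) == Kind.UNNAMED) {
                result.add(dependency);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical state of the migration, for illustration only.
        Map<String, Set<String>> dependencies =
            Map.of("analysis", Set.of("core", "lucene", "server"));
        Map<String, Kind> kinds = Map.of(
            "core", Kind.NAMED,
            "lucene", Kind.AUTOMATIC,
            "server", Kind.UNNAMED);
        System.out.println("blocked by: " + blockers("analysis", dependencies, kinds));
    }
}
```

The exception described above (an internal, unpublished dependency that cannot be modularized yet) would need an explicit allow-list on top of this check.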

Open Questions:

How to prioritize the initial use cases to migrate to modules?
Should it be the rest client, analyzers, or geo?

How to define base modules for the initial use cases so that they fit the overall modularization strategy?
A good way is to analyze all packages in the server module and look at their dependency graph. There are cyclic dependencies among almost all packages. The number of dependency edges between 2 packages helps in judging whether that dependency is really needed. Thereafter, we can use the role and concern of each package we want to modularize to see whether the dependency makes sense. If not, we need to refactor the code and remove those cycles.
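The analysis described here could look roughly like the sketch below: count class-level references per package pair, then flag every package that can reach itself through the graph. The reference list is assumed to come from some external tool, and the package names in `main` are hypothetical examples rather than measured data.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PackageCycleSketch {

    // Builds: package -> (package it depends on -> number of class-level references).
    // Each input element is a {fromPackage, toPackage} pair.
    static Map<String, Map<String, Integer>> countEdges(List<String[]> references) {
        Map<String, Map<String, Integer>> graph = new TreeMap<>();
        for (String[] reference : references) {
            if (reference[0].equals(reference[1])) continue; // ignore references within a package
            graph.computeIfAbsent(reference[0], key -> new TreeMap<>())
                 .merge(reference[1], 1, Integer::sum);
            graph.computeIfAbsent(reference[1], key -> new TreeMap<>()); // register the target node
        }
        return graph;
    }

    // Iterative depth-first search: is 'to' reachable from 'from'?
    static boolean reaches(Map<String, Map<String, Integer>> graph, String from, String to) {
        Deque<String> stack = new ArrayDeque<>(List.of(from));
        Set<String> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (current.equals(to)) return true;
            if (seen.add(current)) stack.addAll(graph.get(current).keySet());
        }
        return false;
    }

    // A package is part of a cycle if one of its dependencies can reach it again.
    static Set<String> packagesInCycles(Map<String, Map<String, Integer>> graph) {
        Set<String> result = new TreeSet<>();
        for (String pkg : graph.keySet()) {
            for (String dependency : graph.get(pkg).keySet()) {
                if (reaches(graph, dependency, pkg)) result.add(pkg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical references, for illustration only.
        List<String[]> references = List.of(
            new String[] {"org.opensearch.index", "org.opensearch.search"},
            new String[] {"org.opensearch.index", "org.opensearch.search"},
            new String[] {"org.opensearch.search", "org.opensearch.index"},
            new String[] {"org.opensearch.rest", "org.opensearch.search"});
        Map<String, Map<String, Integer>> graph = countEdges(references);
        System.out.println("edge counts: " + graph);
        System.out.println("in cycles:   " + packagesInCycles(graph));
    }
}
```

A pair with a high count in one direction and a single reference in the other is a natural first candidate for breaking the cycle.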

Renaming packages during modularization?
This is required to avoid split packages. For lack of a better solution so far, the only way out is to rename packages while migrating. There are 2 problems with this approach -

It will certainly break all consumers of the package, since it is renamed.
If a type was package-private and its package is only partially moved to a module, then that type will no longer be accessible from the part of the old package not yet migrated, and we may have to do some dirty refactoring in the codebase to make it accessible.

Modularizing libraries like lucene which are not yet modularized and have split packages among their jars
Lucene recommends creating an Uber jar for all modules and then using it as a dependency. We can make use of patch module to create one Uber-lucene jar from all lucene plain jars, then add it to the module path and treat it as an automatic module.
https://lucene.apache.org/core/7_3_0/MIGRATE.html & https://issues.apache.org/jira/browse/LUCENE-9623
Libraries that get modularized later will break the consumers using them; is that acceptable?
It can potentially break the plugins using these dependencies again.

Challenges:

Some of the perceivable challenges in implementing these changes -

Resolving cyclic dependencies among packages in the server module is going to be labor-intensive and time-consuming work.
Resolving split packages among modules will require a lot of refactoring throughout the codebase.
Defining module configuration and what to export.
Java 8 runtime compatibility and multi-release jars.
Gradle and IDE support for the rules in module descriptor files.
Test coverage for all refactored changes.
Special configuration for testing the modular jars.
Migration tool for existing plugins: a tool capable of detecting breaking changes in an existing plugin and possibly proposing a fix.
Time- and labor-intensive work, with the constraint of releasing all migrated modules at once.
Keeping the feature branch in sync with main.

Related requests and open issues in elasticsearch - elastic/elasticsearch#38299 & elastic/elasticsearch#28984

Looking for feedback here.

dblock · 2021-11-19T15:29:14Z

This is pretty ambitious and would be a significant improvement to the codebase, so I 👍 .

bbarani · 2024-02-06T20:32:04Z

@rishabhmaurya @anasalkouz @dblock can you please confirm if this change can be included in 2.x without breaking existing API? Basically can this change be added in a backward compatible manner in 2.x line?

We are evaluating if this change warrants a 3.0 release or can be included in 2.x line so need your inputs.

rishabhmaurya · 2024-02-06T21:43:33Z

@bbarani I'm quite sure this change will have breaking changes so we shouldn't consider it for 2.x.

rishabhmaurya added the enhancement and untriaged labels Nov 19, 2021

anasalkouz assigned rishabhmaurya Nov 23, 2021

anasalkouz removed the untriaged label Nov 23, 2021

anasalkouz changed the title ~~Jigsaw integration~~ [RFC] Jigsaw integration Nov 23, 2021

anasalkouz added the RFC and discuss labels Nov 23, 2021

rishabhmaurya mentioned this issue Jan 3, 2022

Opensearch Modularization #1838

Open

7 tasks

nknize mentioned this issue Jun 16, 2023

[META] Split and modularize the server monolith #8110

Open

9 tasks

andrross mentioned this issue Jan 31, 2024

[PROPOSAL] Finalize 2024 release schedule for OpenSearch opensearch-project/.github#186

Closed

msfroh added the Roadmap:Modular Architecture label May 31, 2024
