This repository contains mishmash.io builds of open source projects and frameworks that are popular in distributed computing.
The projects and frameworks were originally developed by other parties (find the list below) and then customized by mishmash.io.
At mishmash.io, we use a lot of open source, both in our distributed database and in other software that we publish (such as our open source analytics for OpenTelemetry). Sometimes we customize the original project's code to better suit our needs, and we publish our patches here.
In more technical terms, this repository contains a build process that:
- Fetches the original code of a number of open source projects
- Applies our changes
- Rebuilds
- Retests
- Packages smaller, per-feature components that you can stack together on an as-needed basis.
You can also find ready-made stacks for common use case scenarios.
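The build steps listed above can be pictured as an ordered pipeline. The following is only a minimal sketch, not the actual build tooling in this repository: the stage names, the `run_pipeline` function and the choice to stop at the first failing stage are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Stage names mirror the build steps listed above, in order (hypothetical names).
STAGES = ("fetch", "patch", "build", "test", "package")


def run_pipeline(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each stage in order and return the names of the stages that succeeded.

    Stopping at the first failing stage is an assumption of this sketch.
    """
    completed: List[str] = []
    for name in STAGES:
        if not actions[name]():
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; a real build would invoke the actual tools here.
    actions = {name: (lambda: True) for name in STAGES}
    actions["test"] = lambda: False  # simulate a failing test run
    print(run_pipeline(actions))  # ['fetch', 'patch', 'build']
```

In this sketch a failed retest means nothing gets packaged, which matches the intent that only rebuilt and retested code is turned into components.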
Important
This repository is a work in progress!
A number of stacks we've accumulated internally are not published or not fully documented yet.
Use the watch button above to get updates on progress.
In this README you will find:
- The motivation: why do we patch and rebuild?
- The goals and principles: what are we changing?
- The rules we follow
- The stacks
- The repository
- The background
Note
There are three major reasons why we customize and rebuild other open source projects:
- Publishing secure software
  We update code and dependencies to their latest versions, especially when new vulnerabilities are reported and fixes are published.
- Unified set of dependencies
  We modify open source projects to use the same set of dependencies (and the same versions of them).
- Minimal software packages
  We break down larger open source projects with multiple features into smaller, 'per-feature' modules that can be used on an as-needed basis.
For example, our distributed database, mishmash.io, uses some core functionality from Apache HDFS: managing cold- or hot-storage disks, replicating large data blocks across zones and clusters, and providing zero-copy data access. Apache HDFS is part of Apache Hadoop and comes with many more features (such as REST-based management APIs or Web GUI apps), and these extra features bring additional code and dependencies.
By splitting HDFS into smaller, feature-based modules we minimize the image sizes of the software we publish and simultaneously reduce the attack surface.
Also, because we officially support all software that we deliver, including its dependencies, we would like to make our lives easier by supporting as little code as possible. Therefore, we modify open source projects to use a single dependency for logging, another single dependency for networking, and so on.
Details about the changes we've made to each individual open source package are documented separately, but in general our modifications fall into a few categories:
- Upgrading a dependency to its currently maintained version
  When the new dependency has an incompatible API, we modify the code that uses it. For example, upgrading jetty from version 9 to 12 (the latest version of jetty at the time of writing this README) requires changes to import statements, method calls into jetty APIs, web servlet configuration and more.
  Alternatively, if the new dependency version does not include breaking changes, we do not modify code. We only rebuild and retest to be sure the code works.
- Splitting code into per-functionality packages
  Breaking code apart usually requires refactoring: moving classes to new packages and potentially also changing a method's accessibility (making it public, for example). For such changes to work we also refactor related test code.
  For example, we split Apache ZooKeeper into minimal client, minimal server, cli and a few other packages. This requires some classes to be moved to different Java packages to make sure that the minimal client bundle does not use code from minimal server.
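To give a feel for the import changes mentioned in the jetty example, here is a minimal sketch of a helper that rewrites Java import statements. It is not the tooling used in this repository. The `rewrite_imports` function is hypothetical, and the package mapping assumes a move from jetty 9 to jetty 12 with its EE10 environment, which this README does not specify. Changes to method calls and servlet configuration cannot be handled by a text rewrite like this and still need manual work.

```python
import re

# Assumed mapping: jetty 9 packages -> jetty 12 (EE10 environment) packages.
PACKAGE_RENAMES = {
    "javax.servlet": "jakarta.servlet",
    "org.eclipse.jetty.servlet": "org.eclipse.jetty.ee10.servlet",
}

# Matches a Java import line: prefix, imported name (optionally a wildcard), remainder.
IMPORT_LINE = re.compile(r"^(\s*import\s+(?:static\s+)?)([\w.]+(?:\.\*)?)(\s*;.*)$")


def rewrite_imports(java_source: str) -> str:
    """Return java_source with import statements moved to the renamed packages."""
    out = []
    for line in java_source.splitlines():
        match = IMPORT_LINE.match(line)
        if match:
            prefix, name, suffix = match.groups()
            for old, new in PACKAGE_RENAMES.items():
                # Only rename whole package segments, not look-alike prefixes.
                if name == old or name.startswith(old + "."):
                    name = new + name[len(old):]
                    break
            line = prefix + name + suffix
        out.append(line)
    return "\n".join(out)
```

Lines that are not import statements are left untouched, so the helper is safe to run over a whole source file before the manual part of the upgrade begins.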
For more details keep reading this document and follow the links to each artifact's docs.
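The requirement that a client bundle must not use server code can be checked mechanically. The sketch below is an illustration only, not a tool shipped in this repository: `forbidden_imports` is a hypothetical helper, and treating `org.apache.zookeeper.server` as the forbidden package prefix is an assumption. It inspects explicit import statements only, so fully qualified class names used inline would go unnoticed.

```python
import re
from typing import Dict, List, Tuple

# Captures the imported name of every Java import statement in a source file.
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)", re.MULTILINE)


def forbidden_imports(sources: Dict[str, str], forbidden_prefix: str) -> List[Tuple[str, str]]:
    """Return (file, imported name) pairs that reach into the forbidden package.

    `sources` maps a file path to the text of that Java source file.
    """
    violations = []
    for path, text in sorted(sources.items()):
        for name in IMPORT_RE.findall(text):
            name = name.rstrip(".")  # wildcard imports leave a trailing dot
            if name == forbidden_prefix or name.startswith(forbidden_prefix + "."):
                violations.append((path, name))
    return violations
```

A check like this could run as part of retesting: an empty result means no client source file imports from the server package.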
Typically, we begin by experimenting (internally) with the original source code to get an idea of how many changes will be needed. This cost is then weighed against the gains; security improvements, for example, weigh strongly in favor of patching.
Should we decide to proceed with patching the original code, we evaluate our changes and retest them within the broader ecosystem of open source that we use. In other words, we retest all open source distributed computing stacks that we use.
The steps above are still done internally (nothing gets published) until tests show that everything works as expected.
Tip
During a rerun of tests we also collect telemetry data and analyze it to find out if our changes hurt performance.
To find out more about how we do this, and how you might apply a similar practice to your own software development, check out the Analytics tools for OpenTelemetry GitHub repository.
As an extra precaution, at this point we also modify the original source code to report a patched version to avoid confusion in production deployments.
Open source projects often include APIs that can be interrogated when the user (or an admin) needs to verify what software version is running on a particular server. To make sure instances of popular projects modified by mishmash.io are not mistaken for their original versions, we also patch these APIs to respond accordingly (more on this in the project-specific docs).
Once we're confident our code changes are functional and safe to use, we publish the code here. The stacks are rebuilt and retested once more, with all relevant information (such as build and test logs and dependency provenance) saved and made available for everyone to see.
Publishing binaries on other repositories (such as Maven Central or Docker Hub) only happens when:
- A new version of the original open source project is released
- A dependency of the original open source project is upgraded because of a newly discovered vulnerability in it
In other words, we only release binaries when the original open source project releases a new version or when the security of its current version is compromised.
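The release rule above boils down to a simple either-or condition. As a minimal sketch (the function and parameter names are hypothetical and not part of any tooling in this repository):

```python
def should_publish_binaries(upstream_released_new_version: bool,
                            dependency_upgraded_for_vulnerability: bool) -> bool:
    """Return True when either release trigger described above applies."""
    return upstream_released_new_version or dependency_upgraded_for_vulnerability
```

Any other change, such as an internal refactoring with no upstream release and no security fix, would not trigger a binary release under this rule.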
Find out more about how we do versioning of our patched releases below.
(Coming soon)