Open Source Stacks for Distributed Computing, rebuilt by mishmash.io

This repository contains mishmash.io builds of open source projects and frameworks that are popular in distributed computing.

Projects and frameworks are originally developed by other parties (find the list below) and then customized by mishmash.io.

At mishmash.io we use a lot of open source, both in our distributed database and in other software that we publish (such as our open source analytics for OpenTelemetry). Sometimes we customize the original project's code to better suit our needs, and we publish those patches here.

In more technical terms, this repository contains a build process that:

- Fetches the original code of a number of open source projects
- Applies our changes
- Rebuilds
- Retests
- Packages smaller, per-feature components that you can stack together on an as-needed basis.
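
The steps above can be sketched as a short Python function that assembles the commands such a build might run. This is only an illustration under stated assumptions: the function name, the upstream URL, the tag and the patch path are hypothetical placeholders and are not taken from this repository. `git clone`, `git apply` and `mvn verify` are standard commands, but the actual build here may be organized differently.

```python
def plan_build(upstream_url: str, tag: str, patch_file: str, workdir: str) -> list[list[str]]:
    """Return the commands for a fetch -> patch -> rebuild/retest/package run.

    All arguments are hypothetical placeholders used for illustration.
    patch_file should be an absolute path, because `git -C` changes the
    directory that relative paths are resolved against.
    """
    src = f"{workdir}/src"
    return [
        # Fetch the original code at a specific release tag.
        ["git", "clone", "--depth", "1", "--branch", tag, upstream_url, src],
        # Apply our changes on top of the original code.
        ["git", "-C", src, "apply", patch_file],
        # Rebuild, rerun the tests and package: Maven's `verify` phase
        # runs compile, test and package before it.
        ["mvn", "-f", f"{src}/pom.xml", "verify"],
    ]


if __name__ == "__main__":
    for command in plan_build(
        "https://example.org/upstream.git",  # hypothetical URL
        "v1.0",                              # hypothetical tag
        "/patches/example.patch",            # hypothetical patch file
        "/tmp/build",
    ):
        print(" ".join(command))
```

Returning the commands instead of running them keeps the sketch free of side effects; each list could be passed to `subprocess.run(command, check=True)` to execute it.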

You can also find ready-made stacks for common use case scenarios.

Important

This repository is a work in progress!

A number of stacks we've accumulated internally are not yet published or not yet fully documented.

Use the watch button above to get updates on progress.

In this README you will find:

The motivation: why do we patch and rebuild?
The goals and principles: what are we changing?
The rules we follow:
- When patching
- When publishing
The stacks:
The repository:
- How to build your own stack
The background:
- About mishmash.io

Why do we rebuild other open source projects?

Note

There are three major reasons why we customize and rebuild other open source projects:

Publishing secure software

We update code and dependencies to their latest versions, especially when new vulnerabilities are reported and fixes are published.

Unified set of dependencies

We modify open source projects to use the same set of dependencies (and the same versions of them).

Minimal software packages

We break down larger open source projects with multiple features into smaller, 'per-feature' modules that can be used on an as-needed basis.

For example, our distributed database, mishmash.io, uses some core functionality from Apache HDFS: managing cold- and hot-storage disks, replicating large data blocks across zones and clusters, and providing zero-copy data access. Apache HDFS is part of Apache Hadoop and comes with many more features (such as REST-based management APIs or web GUI apps), and these extra features bring their own additional code and dependencies.

By splitting HDFS into smaller, feature-based modules we minimize the image sizes of the software we publish and, at the same time, reduce its attack surface.

Also, because we officially support all software that we deliver, including its dependencies, we would like to make our lives easier by supporting as little code as possible. Therefore, we modify open source projects to use a single dependency for logging, another single dependency for networking, and so on.
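
As a rough illustration of what "the same set of dependencies" means in practice, the sketch below reports every dependency that appears in more than one version across a set of projects. The function name, project names, dependency names and versions are all made up for the example; it is not the check used in this repository.

```python
from collections import defaultdict


def find_version_conflicts(projects: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each dependency used in more than one version to its versions.

    `projects` maps a project name to its {dependency: version} table.
    """
    seen: dict[str, set[str]] = defaultdict(set)
    for dependencies in projects.values():
        for name, version in dependencies.items():
            seen[name].add(version)
    # Only dependencies with two or more distinct versions are conflicts.
    return {name: sorted(versions) for name, versions in seen.items() if len(versions) > 1}


if __name__ == "__main__":
    # Hypothetical input: two projects that disagree on one dependency.
    example = {
        "project-a": {"logging-lib": "1.0", "network-lib": "4.1"},
        "project-b": {"logging-lib": "2.0", "network-lib": "4.1"},
    }
    print(find_version_conflicts(example))
```

An empty result would mean that every shared dependency is pinned to a single version across all projects.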

Summary of what's modified

Details about the changes we've made to each individual open source package are documented separately, but in general our modifications fall into a few categories:

Upgrading a dependency to its currently maintained version

When the new dependency version has an incompatible API, we modify the code that uses it. For example, upgrading Jetty from version 9 to 12 (the latest version of Jetty at the time of writing this README) requires changes to import statements, to method calls into Jetty APIs, to web servlet configuration and more.

Alternatively, if the new dependency version does not include breaking changes, we do not modify code. We only rebuild and retest to be sure the code works.

Splitting code into per-functionality packages

Breaking code apart usually requires refactoring: moving classes to new packages and potentially also changing a method's accessibility (making it public, for example). For such changes to work we also refactor the related test code.

For example, we split Apache ZooKeeper into a minimal client, a minimal server, a CLI and a few other packages. This requires some classes to be moved to different Java packages to make sure that the minimal client bundle does not use code from the minimal server.
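
One simple way to check a split like this is to scan the sources of one bundle for imports from a package that belongs to another bundle. The sketch below does that for a single Java source string. It is a minimal illustration, not the check used in this repository, and the package names in the example (`com.example.client`, `com.example.server`) are hypothetical.

```python
import re

# Matches `import a.b.C;`, `import static a.b.C.d;` and `import a.b.*;`.
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)(?:\.\*)?\s*;", re.MULTILINE)


def forbidden_imports(java_source: str, forbidden_package: str) -> list[str]:
    """Return the imported names that live in `forbidden_package` or below it."""
    hits = []
    for name in IMPORT_RE.findall(java_source):
        # Compare on package boundaries so `server` does not match `serverless`.
        if name == forbidden_package or name.startswith(forbidden_package + "."):
            hits.append(name)
    return hits


if __name__ == "__main__":
    source = (
        "package com.example.client;\n"
        "import com.example.server.Quorum;\n"
        "import java.util.List;\n"
    )
    print(forbidden_imports(source, "com.example.server"))
```

A text scan like this only sees explicit imports; fully qualified class names used inline would need a real Java parser or a bytecode-level dependency analysis.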

For more details keep reading this document and follow the links to each artifact's docs.

Patching process outline

Typically, we begin by experimenting (internally) with the original source code to get an idea of how many changes will be needed. This effort is then weighed against the gains; a security improvement, for example, weighs strongly in favor of patching.

Should we decide to proceed with patching the original code, we evaluate our changes and retest them within the broader ecosystem of open source that we use. In other words, we retest all open source distributed computing stacks that we use.

The steps above are still done internally (nothing gets published) until tests show that everything works as expected.

Tip

During a rerun of tests we also collect telemetry data and analyze it to find out if our changes hurt performance.

To find out more about how we do this, and how you might apply a similar practice to your own software development, check out the Analytics tools for OpenTelemetry GitHub repository.

As an extra precaution, at this point we also modify the original source code to report a patched version, to avoid confusion in production deployments.

Open source projects often include APIs that can be queried when a user (or an admin) needs to verify what software version is running on a particular server. To make sure instances of popular projects modified by mishmash.io are not mistaken for their original versions, we also patch these APIs to respond accordingly (more on this in the project-specific docs).

Publishing patches

Once we're confident our code changes are functional and safe to use, we publish the code here. The stacks are rebuilt and retested once more, with all relevant information (build and test logs, dependency provenance, etc.) saved and made available for everyone to see.

Publishing binaries to other repositories (such as Maven Central or Docker Hub) only happens when:

- A new version of the original open source project is released
- A dependency of the original open source project is upgraded because of a newly discovered vulnerability in it

In other words, we only release binaries when the original open source project releases a new version or when the security of its current version is compromised.
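
The rule above is small enough to write down as a predicate. The sketch below is only a restatement of the two conditions; the function and parameter names are hypothetical and do not come from this repository's build.

```python
def should_publish_binaries(
    new_upstream_release: bool,
    dependency_upgraded_for_vulnerability: bool,
) -> bool:
    """Binaries are published only if at least one of the two triggers holds."""
    return new_upstream_release or dependency_upgraded_for_vulnerability


if __name__ == "__main__":
    # A patch that is neither tied to an upstream release nor to a
    # vulnerability fix is published as code only, not as binaries.
    print(should_publish_binaries(False, False))
```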

Find out more about how we do versioning of our patched releases below.

The Stacks

(Coming soon)

Versioning

(Coming soon)

Using the stacks

(Coming soon)

Modifying the stacks

(Coming soon)

About mishmash.io

(Coming soon)

mishmash-io/for-apache
