This is a collection of scripts I built up while working with ArangoDB. Some of them may be useful to others, so I figured I would share them.
This folder contains a few benchmarks for testing out changes to some ArangoDB features. There's one to test out how our AQL condition normalization behaves, one for working with geo indexes, one for testing out our caching behavior using a small hotset within a large collection of data, and finally one for incremental replication. All but the last are stand-alone JS scripts. The last is meant to be added as a file within the `replication_sync` test suite, and will require changes if it is to be used elsewhere.
This folder contains some scripts that I used when building and maintaining the main codebase. There are a number of `build-*` scripts, which are convenience wrappers that run `cmake` in a given directory to initialize a build folder. I found it much easier to remember to call

```
mkdir build-enterprise; ~/work/scripts/build-enterprise build-enterprise ~/src/arangodb/arangodb
```

than to remember all the different `cmake` flags I used for development. All of these scripts should be configured to use `ccache` along with either `g++` or `clang++`. They may use some hard-coded paths from my system, but those should be pretty easy to spot and modify.
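To give an idea of the shape of these wrappers, here is a minimal sketch of what a `build-*` script might look like. The flags and paths here are assumptions for illustration; the actual scripts in this folder may use different build types, compilers, and options.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a build-* wrapper; the real scripts may use
# different flags and hard-coded paths.
set -u

run_build() {
  local build_dir="$1" source_dir="$2"
  mkdir -p "$build_dir"
  # use ccache as a compiler launcher, with clang/clang++ as the compilers
  cmake -S "$source_dir" -B "$build_dir" \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++
}

# invoke as: <script> <build-dir> <source-dir>
if [ -n "${1:-}" ] && [ -n "${2:-}" ]; then
  run_build "$1" "$2"
fi
```

The point of the wrapper is simply that the flag set lives in one place, so initializing a fresh build directory is a single short command.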
One script here that might be quite useful is `format`. It's a one-liner that applies `clang-format` with the correct project settings, only to your staged changes, prior to a commit. E.g.

```
git add *
format
git add *
git commit
```

It requires that you install the `git clang-format` integration, which is maintained by the LLVM project.
These scripts are simple. They perform some analysis to determine how many files in the codebase depend on a given header. The first, `get-includes.sh`, simply scans all the files to extract which `#include` statements appear in which files. The second, `process-includes.js`, writes this data as a graph into ArangoDB and uses the graph to do the quantitative analysis. This analysis can be used while refactoring, to determine how much code will have to be recompiled if you change a given header. It can also be used to determine what the central header files are, to better target any optimization efforts. For instance, I spent some time removing unnecessary includes (helped by `include-what-you-use`) from various central header files, replacing some includes with forward declarations, moving code from header to implementation files, etc., and was able to achieve a considerable speedup in project compilation time (and greatly reduce the number of files that need to be re-compiled when various headers are changed).
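The extraction step can be sketched roughly as follows. This is a minimal illustration of the kind of scan `get-includes.sh` performs, not the script itself: it emits one "file header" pair per `#include` directive found under a directory, which is exactly the edge data a graph-loading step would consume.

```shell
# Minimal sketch of an include scan (the actual get-includes.sh may differ):
# print a "file header" pair for every #include directive under $1.
extract_includes() {
  local dir="$1"
  grep -RHn --include='*.h' --include='*.cpp' '#include' "$dir" \
    | sed -E 's/^([^:]+):[0-9]+:[[:space:]]*#include[[:space:]]*[<"]([^>"]+)[>"].*/\1 \2/'
}
```

Each emitted pair is an edge from an including file to a header, which is the natural shape for loading into an ArangoDB graph for the dependency analysis.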
These scripts are used to take data generated by `arangodump` from a single-server instance and prepare it to be restored using `arangorestore` into a cluster instance to be used with Pregel. In particular, they change the number of shards for the collections, distribute each of them like a prototype collection, and add a new shard key `vertex` to each edge collection so that each edge ends up in the same shard as its source vertex (a requirement for Pregel). Each script operates on a single file in the dump directory at a time, and will generate a new file with the extension `.fixed`. It shouldn't be too hard to modify these scripts to run in a loop over all relevant files in the directory, and it also shouldn't be hard to write a short script to remove all the original files and rename the new files without the `.fixed` extension.
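The structure-file part of that transformation can be sketched as below. This uses `jq` purely for illustration and assumes it is installed; the actual scripts may work differently, and a full fix would also need to populate a `vertex` attribute (derived from each edge's source) in the data files themselves. The function names and parameter fields here follow ArangoDB's collection properties (`numberOfShards`, `distributeShardsLike`, `shardKeys`).

```shell
# Hypothetical sketch (using jq for illustration; the actual scripts may
# differ): rewrite an edge collection's structure file from a dump so the
# collection is sharded by a new "vertex" key and distributed like a
# prototype vertex collection. Writes the result to "<infile>.fixed".
fix_edge_structure() {
  local infile="$1" proto="$2" shards="$3"
  jq --arg proto "$proto" --argjson shards "$shards" '
      .parameters.numberOfShards = $shards
      | .parameters.distributeShardsLike = $proto
      | .parameters.shardKeys = ["vertex"]' \
    "$infile" > "$infile.fixed"
}
```

Following the convention described above, the original file is left untouched and the adjusted copy gets the `.fixed` extension.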
These are some old benchmarks which were used circa 2019 to evaluate ArangoDB's then-current storage format for timeseries data ingestion, compared to an optimized prototype, and compared to the then-current release of TimescaleDB. To be honest, I don't remember too much about the tests, but I would guess they are straightforward to follow from the source code.