
Proof of concept for custom temporary array allocation #370

Closed · wants to merge 1 commit

Conversation

@ennerf (Collaborator) commented Feb 24, 2021

Problem

During profiling I noticed a significant performance degradation caused by a large number of temporary array allocations in CachedDataPoints. Each trace has a maximum number of points, but the number of visible points depends on the selected time window and can grow and shrink. This defeats the getArrayExact method and results in new arrays being allocated on almost every frame.
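The effect of exact-size allocation can be illustrated in isolation. The sketch below (hypothetical names, not chart-fx code) counts cache misses for an exact-match pool versus an 'at least'-match pool when the requested size jitters between frames:

```java
import java.util.ArrayDeque;

/**
 * Hypothetical illustration (not chart-fx code): counts how often a pool
 * has to allocate a fresh array when the requested size jitters between
 * frames. An exact-match pool misses on every new size, while an
 * 'at least'-match pool can keep reusing one sufficiently large buffer.
 */
class ExactVsMinSizeDemo {

    static int misses(int[] requests, boolean exactMatch) {
        ArrayDeque<double[]> pool = new ArrayDeque<>();
        int misses = 0;
        for (int n : requests) {
            double[] hit = null;
            for (double[] candidate : pool) {
                boolean fits = exactMatch ? candidate.length == n : candidate.length >= n;
                if (fits) { hit = candidate; break; }
            }
            if (hit == null) {
                misses++;            // no reusable buffer -> fresh allocation
                hit = new double[n];
            } else {
                pool.remove(hit);    // take the buffer out of the pool
            }
            pool.push(hit);          // caller is done with it -> release
        }
        return misses;
    }
}
```

For a request sequence such as 1000, 1001, 999, 1000, 1002 the exact-match variant allocates once per distinct size, while the 'at least' variant settles on a large buffer after the first few frames.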

A Flight Recorder capture of 10 charts over 30 seconds looks like this:

[Flight Recorder screenshot]

Custom Buffer Pool

In my case I already know the perfect buffer size, so all of that is unnecessary overhead. I created a little PR that allows users to customize the behavior a bit more, e.g., below is an example for a pool that doesn't require synchronization and doesn't allocate arrays below a certain size:

public class ChartTrace extends ErrorDataSetRenderer {

    {
        setArrayPool(CustomPool.INSTANCE);
    }

    /**
     * Non thread-safe pool that is always called from the JavaFX thread,
     * and allocates arrays with at least MIN_SIZE length. Falls back to
     * the default implementation for larger arrays.
     */
    private enum CustomPool implements DoubleArrayPool {
        INSTANCE;

        @Override
        public double[] allocate(int requiredSize) {
            FxUtils.checkFxThread();
            if (requiredSize <= MIN_SIZE) {
                return smallArrays.isEmpty() ? new double[MIN_SIZE] : smallArrays.pop();
            }
            return DoubleArrayCache.getMinSizeInstance().allocate(requiredSize);
        }

        @Override
        public void release(double[] array) {
            FxUtils.checkFxThread();
            if (array.length == MIN_SIZE) {
                smallArrays.push(array);
            } else {
                DoubleArrayCache.getMinSizeInstance().release(array);
            }
        }

        // ArrayDeque rather than Stack: Stack extends Vector, whose methods are synchronized
        private final ArrayDeque<double[]> smallArrays = new ArrayDeque<>();
        private static final int MIN_SIZE = LiveChart.DEFAULT_MAX_SAMPLES;

    }

}

Result

In my case this allocates a total of only 6 buffers over the entire lifetime of the app, and rendering at higher loads is noticeably smoother.

[screenshot]

@ennerf (Collaborator, Author) commented Feb 24, 2021

I created this PR to serve as a baseline for discussion. I don't think it makes sense to have several different cache implementations, but I don't want to go down a rabbit hole of implementation work without discussing it first.

I think it'd be better to use an interface rather than the existing singleton/static-method approach, primarily because:

  • It allows users to create complex custom implementations that get rid of synchronization and/or exploit application-specific insights
  • It allows creating a barebones implementation that can serve as a baseline in performance tests (e.g., the impact compared to allocating every array)
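As a concrete example of the second point, such a barebones baseline could look like the following sketch (assuming the DoubleArrayPool interface proposed in this PR; the interface definition is repeated here so the snippet is self-contained):

```java
/**
 * Sketch of a barebones baseline, assuming the DoubleArrayPool interface
 * proposed in this PR (repeated here so the snippet is self-contained):
 * it allocates a fresh array on every call and lets the GC reclaim
 * released ones, i.e., it measures the cost of not pooling at all.
 */
interface DoubleArrayPool {
    double[] allocate(int requiredSize);
    void release(double[] array);
}

enum NoPool implements DoubleArrayPool {
    INSTANCE;

    @Override
    public double[] allocate(int requiredSize) {
        return new double[requiredSize]; // always a fresh allocation
    }

    @Override
    public void release(double[] array) {
        // intentionally empty: drop the reference and let the GC collect it
    }
}
```

Comparing a renderer configured with such a no-op pool against the default cache would directly quantify the benefit of pooling.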

@codecov bot commented Feb 24, 2021

Codecov Report

Merging #370 (241e647) into master (c5ea4dc) will decrease coverage by 0.00%.
The diff coverage is 63.33%.


@@             Coverage Diff              @@
##             master     #370      +/-   ##
============================================
- Coverage     52.07%   52.07%   -0.01%     
- Complexity     7151     7155       +4     
============================================
  Files           383      383              
  Lines         40269    40285      +16     
  Branches       6506     6506              
============================================
+ Hits          20971    20979       +8     
- Misses        17782    17793      +11     
+ Partials       1516     1513       -3     
Impacted Files Coverage Δ Complexity Δ
...in/java/de/gsi/dataset/utils/DoubleArrayCache.java 68.57% <20.00%> (-19.43%) 11.00 <0.00> (ø)
...rer/spi/AbstractErrorDataSetRendererParameter.java 97.77% <40.00%> (-2.23%) 54.00 <1.00> (+1.00) ⬇️
...va/de/gsi/chart/renderer/spi/CachedDataPoints.java 62.86% <100.00%> (+0.12%) 61.00 <1.00> (ø)
...e/gsi/chart/renderer/spi/ErrorDataSetRenderer.java 79.48% <100.00%> (ø) 77.00 <0.00> (ø)
...hart/plugins/measurements/DataSetMeasurements.java 75.77% <0.00%> (+0.34%) 153.00% <0.00%> (+2.00%)
...main/java/de/gsi/chart/ui/HiddenSidesPaneSkin.java 44.97% <0.00%> (+0.47%) 36.00% <0.00%> (+1.00%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ennerf (Collaborator, Author) commented Feb 25, 2021

I did some tests on how long the code spends in the render function when I open 6 plots at 640Hz:

@Override
public List<DataSet> render(final GraphicsContext gc, final Chart chart, final int dataSetOffset,
                            final ObservableList<DataSet> datasets) {
    long t0 = System.nanoTime();
    try {
        return super.render(gc, chart, dataSetOffset, datasets);
    } finally {
        long us = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - t0);
        recorder.recordValue(Math.max(1, us));
    }
}
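For reference, a minimal stand-in for the `recorder` used above could look as follows (the real code presumably uses a histogram library such as HdrHistogram; this sketch only shows what recordValue and the percentile read-out do):

```java
import java.util.Arrays;

/**
 * Minimal stand-in for the 'recorder' used in the snippet above (the real
 * code presumably uses a histogram library such as HdrHistogram). It
 * collects microsecond samples and reports the value below which a given
 * percentage of samples fall -- the quantity plotted in a percentile graph.
 */
class SimpleRecorder {
    private long[] values = new long[1024];
    private int count;

    void recordValue(long micros) {
        if (count == values.length) {
            values = Arrays.copyOf(values, count * 2); // grow on demand
        }
        values[count++] = micros;
    }

    /** Value below which 'percentile' percent of the samples fall, e.g. 99.9. */
    long valueAtPercentile(double percentile) {
        long[] sorted = Arrays.copyOf(values, count);
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(percentile / 100.0 * count) - 1;
        return sorted[Math.max(0, Math.min(idx, count - 1))];
    }
}
```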

There are 3 cases:

  • normal: stock behavior of 11.5.2
  • noop-lock: dataset::getLock returns a custom lock that removes any synchronization (all operations are done on the JavaFX thread, so this is all pure overhead)
  • noalloc: a custom array allocation scheme

The top plot shows the maximum render time for each second. The bottom plot is a percentile graph, e.g., 99.9% of measurements are below X.

[renderperf: maximum render time per second (top) and percentile graph (bottom)]

@ennerf (Collaborator, Author) commented Feb 25, 2021

Here is another one running on JDK 15 with ZGC.

  • dark blue: synchronization taken out
  • green: custom array pool, i.e., this PR
  • teal: completely custom renderer that eliminates everything that isn't required for the particular plot (e.g. no error bars, etc.)

So in the vast majority of our cases, changing the array cache to a customizable interface provides pretty much the same speedup as writing an optimized custom renderer.

[renderperf-zgc: render-time comparison on JDK 15 with ZGC]

@RalphSteinhagen (Member):
N.B. (for others reading this): the purpose of the [Double...]ArrayCache is to minimise the frequent reallocation of transient arrays that are needed for screen-coordinate transforms, data reduction, and other algorithms. These arrays are usually only needed temporarily (e.g., when in-place math operations are not possible or not desired). The reason it's global is similar to thread or worker pools: it allows sharing resources across different parts of an application (i.e., data reception, processing, display, etc.) and keeps the overall resource footprint small(er). This is already a great improvement over the stock JVM behaviour, which requires allocation-deallocation gymnastics by the GC, and it is generic, thread-safe, and safe to use for a regular developer/user.

Having said that -- I like your idea of allowing users to customise either the existing parameters (e.g., allocating strictly or not) or to inject another specific allocator/buffer for a given renderer or scope of the lib.
This is a common feature in other high-performance languages (e.g., C++'s polymorphic memory resources, PMR).

While this appears to be a significant improvement, it's important to point out that this custom implementation has some caveats and design assumptions that may not be safe for a non-initiated developer who hasn't bought into the disruptor paradigm of thinking:

  • it bypasses general thread-safety and assumes a very specific single-threaded use case
  • it assumes a relatively small upper bound on the number of samples per DataSet or array-buffer length -- i.e., a few tens of thousands of points, compared to the several million data points the library has to support, and
  • it becomes the apparent bottleneck primarily because your use case updates the chart at 640 Hz (N.B. most applications would rate-limit JavaFX UI updates to less than the screen refresh rate).

To put in my two-penny worth:

  • I think it's an interesting idea/option -- notably beyond the usage in Chart-fx itself.
  • In order to keep the lib maintainable and general enough, it's important that this fits all generic array types, is unit-tested, keeps the default case safe for general usage, and that the new functionality is well documented (including the caveats).
  • I'd suggest first starting with the simple boolean property that selects whether arrays are allocated to the 'exact' or 'at least' size boundary condition. I guess this flag could be kept in the superordinate abstract class most Renderers derive from.
  • As a second step (or, if you have time, in the same PR) one could generalise and add an interface that replaces the default array cache with a custom implementation. Since this behaviour is pretty domain-specific, I think it should be a property that, for example, contains the callback function/interface for the allocate/release calls. Since the array cache is needed at least for doubles, floats, bytes, and ints, either a generic interface definition or the code-generation framework would be advisable.
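As a rough illustration of such a generic interface (a sketch only -- the class name is hypothetical, and java.lang.reflect.Array.getLength is used to avoid one implementation per primitive type):

```java
import java.lang.reflect.Array;
import java.util.ArrayDeque;
import java.util.function.IntFunction;

/**
 * Hypothetical sketch of a single generic pool covering double[], float[],
 * int[], and byte[]. Primitive arrays cannot share a type parameter, so the
 * array type itself is T and java.lang.reflect.Array.getLength is used to
 * read its length; a real implementation might prefer generated per-type
 * pools to avoid reflection. Not thread-safe; 'at least'-size semantics.
 */
class GenericArrayPool<T> {
    private final ArrayDeque<T> buffers = new ArrayDeque<>();
    private final IntFunction<T> factory; // e.g. double[]::new or int[]::new

    GenericArrayPool(IntFunction<T> factory) {
        this.factory = factory;
    }

    T allocate(int requiredSize) {
        for (T buffer : buffers) {
            if (Array.getLength(buffer) >= requiredSize) {
                buffers.remove(buffer); // reuse an existing, large-enough buffer
                return buffer;
            }
        }
        return factory.apply(requiredSize); // nothing suitable pooled yet
    }

    void release(T buffer) {
        buffers.push(buffer);
    }
}
```

A double pool would then be `new GenericArrayPool<>(double[]::new)` and an int pool `new GenericArrayPool<>(int[]::new)`.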

@ennerf (Collaborator, Author) commented Feb 26, 2021

  • it bypasses general thread-safety and assumes a very specific single-threaded use case
  • it assumes a relatively small upper bound on the number of samples per DataSet or array-buffer length -- i.e., a few tens of thousands of points, compared to the several million data points the library has to support, and

This is of course correct. I mainly wanted to show a use case that is very application dependent and couldn't be easily implemented with a true/false flag. The thread-safety overhead is pretty significant if it's not needed.

  • it becomes the apparent bottleneck primarily because your use case updates the chart at 640 Hz (N.B. most applications would rate-limit JavaFX UI updates to less than the screen refresh rate).

To clarify, the measurements capture how long the JavaFX thread spends in the Renderer::render method, so they are independent of the actual update rate.

The data was updated at the 60 Hz frame rate (AnimationTimer), so all reads and writes happen only on the JavaFX thread. This is actually a best-case scenario for synchronization, as the lock is only ever taken by a single thread without contention.

@ennerf (Collaborator, Author) commented Feb 28, 2021

I tried to reproduce it on the RollingBufferSample, but the difference was negligible. This problem seems to be tied to our very specific use case, so there are probably very few other users who will ever run into this.

Given that there are already other extension points we can use to work around this, I'll go ahead and close the PR.

@ennerf ennerf closed this Feb 28, 2021
ennerf added a commit that referenced this pull request Aug 15, 2023
ennerf added a commit that referenced this pull request Aug 16, 2023
ennerf added a commit that referenced this pull request Aug 17, 2023
ennerf added a commit that referenced this pull request Aug 18, 2023
ennerf added a commit that referenced this pull request Aug 24, 2023
wirew0rm pushed a commit that referenced this pull request Sep 18, 2023