
Proof of concept for custom temporary array allocation #370

Closed · wants to merge 1 commit

Conversation

@ennerf (Collaborator) commented Feb 24, 2021

Problem

During profiling I noticed a significant performance degradation caused by a large number of temporary array allocations in CachedDataPoints. Each trace has a maximum number of points, but the number of visible points depends on the selected time window and can grow and shrink. This defeats the getArrayExact method and results in new arrays being allocated on almost every frame.
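The effect of exact-size allocation can be illustrated in isolation. The sketch below (hypothetical names, not chart-fx code) counts cache misses for an exact-match pool versus an 'at least'-match pool when the requested size jitters between frames:

```java
import java.util.ArrayDeque;

/**
 * Hypothetical illustration (not chart-fx code): counts how often a pool
 * has to allocate a fresh array when the requested size jitters between
 * frames. An exact-match pool misses on every new size, while an
 * 'at least'-match pool can keep reusing one sufficiently large buffer.
 */
class ExactVsMinSizeDemo {

    static int misses(int[] requests, boolean exactMatch) {
        ArrayDeque<double[]> pool = new ArrayDeque<>();
        int misses = 0;
        for (int n : requests) {
            double[] hit = null;
            for (double[] candidate : pool) {
                boolean fits = exactMatch ? candidate.length == n : candidate.length >= n;
                if (fits) { hit = candidate; break; }
            }
            if (hit == null) {
                misses++;            // no reusable buffer -> fresh allocation
                hit = new double[n];
            } else {
                pool.remove(hit);    // take the buffer out of the pool
            }
            pool.push(hit);          // caller is done with it -> release
        }
        return misses;
    }
}
```

For a request sequence such as 1000, 1001, 999, 1000, 1002 the exact-match variant allocates once per distinct size, while the 'at least' variant settles on a large buffer after the first few frames.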

A Flight Recorder capture of 10 charts over 30 seconds looks like this:

[Flight Recorder screenshot]

Custom Buffer Pool

In my case I already know the perfect buffer size, so all of that is unnecessary overhead. I created a little PR that allows users to customize the behavior a bit more, e.g., below is an example for a pool that doesn't require synchronization and doesn't allocate arrays below a certain size:

public class ChartTrace extends ErrorDataSetRenderer {

    {
        setArrayPool(CustomPool.INSTANCE);
    }

    /**
     * Non thread-safe pool that is always called from the JavaFX thread,
     * and allocates arrays with at least MIN_SIZE length. Falls back to
     * the default implementation for larger arrays.
     */
    private enum CustomPool implements DoubleArrayPool {
        INSTANCE;

        @Override
        public double[] allocate(int requiredSize) {
            FxUtils.checkFxThread();
            if (requiredSize <= MIN_SIZE) {
                return smallArrays.isEmpty() ? new double[MIN_SIZE] : smallArrays.pop();
            }
            return DoubleArrayCache.getMinSizeInstance().allocate(requiredSize);
        }

        @Override
        public void release(double[] array) {
            FxUtils.checkFxThread();
            if (array.length == MIN_SIZE) {
                smallArrays.push(array);
            } else {
                DoubleArrayCache.getMinSizeInstance().release(array);
            }
        }

        // ArrayDeque rather than Stack: Stack extends Vector, whose methods are synchronized
        private final ArrayDeque<double[]> smallArrays = new ArrayDeque<>();
        private static final int MIN_SIZE = LiveChart.DEFAULT_MAX_SAMPLES;

    }

}

Result

In my case this allocates a total of only 6 buffers over the entire lifetime of the app, and rendering at higher loads is noticeably smoother.

[screenshot]

@ennerf (Collaborator, Author) commented Feb 24, 2021

I created this PR to serve as a baseline for discussion. I don't think it makes sense to have several different cache implementations, but I don't want to go down a rabbit hole of implementation work without discussing it first.

I think it'd be better to use an interface rather than the existing singleton/static-method approach, primarily because:

  • It allows users to create complex custom implementations that get rid of synchronization and/or exploit application-specific insights
  • It allows creating a barebones implementation that can serve as a baseline in performance tests (e.g., the impact compared to allocating every array)
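As a concrete example of the second point, such a barebones baseline could look like the following sketch (assuming the DoubleArrayPool interface proposed in this PR; the interface definition is repeated here so the snippet is self-contained):

```java
/**
 * Sketch of a barebones baseline, assuming the DoubleArrayPool interface
 * proposed in this PR (repeated here so the snippet is self-contained):
 * it allocates a fresh array on every call and lets the GC reclaim
 * released ones, i.e., it measures the cost of not pooling at all.
 */
interface DoubleArrayPool {
    double[] allocate(int requiredSize);
    void release(double[] array);
}

enum NoPool implements DoubleArrayPool {
    INSTANCE;

    @Override
    public double[] allocate(int requiredSize) {
        return new double[requiredSize]; // always a fresh allocation
    }

    @Override
    public void release(double[] array) {
        // intentionally empty: drop the reference and let the GC collect it
    }
}
```

Comparing a renderer configured with such a no-op pool against the default cache would directly quantify the benefit of pooling.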

@codecov bot commented Feb 24, 2021

Codecov Report

Merging #370 (241e647) into master (c5ea4dc) will decrease coverage by 0.00%.
The diff coverage is 63.33%.


@@             Coverage Diff              @@
##             master     #370      +/-   ##
============================================
- Coverage     52.07%   52.07%   -0.01%     
- Complexity     7151     7155       +4     
============================================
  Files           383      383              
  Lines         40269    40285      +16     
  Branches       6506     6506              
============================================
+ Hits          20971    20979       +8     
- Misses        17782    17793      +11     
+ Partials       1516     1513       -3     
Impacted Files Coverage Δ Complexity Δ
...in/java/de/gsi/dataset/utils/DoubleArrayCache.java 68.57% <20.00%> (-19.43%) 11.00 <0.00> (ø)
...rer/spi/AbstractErrorDataSetRendererParameter.java 97.77% <40.00%> (-2.23%) 54.00 <1.00> (+1.00) ⬇️
...va/de/gsi/chart/renderer/spi/CachedDataPoints.java 62.86% <100.00%> (+0.12%) 61.00 <1.00> (ø)
...e/gsi/chart/renderer/spi/ErrorDataSetRenderer.java 79.48% <100.00%> (ø) 77.00 <0.00> (ø)
...hart/plugins/measurements/DataSetMeasurements.java 75.77% <0.00%> (+0.34%) 153.00% <0.00%> (+2.00%)
...main/java/de/gsi/chart/ui/HiddenSidesPaneSkin.java 44.97% <0.00%> (+0.47%) 36.00% <0.00%> (+1.00%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ennerf (Collaborator, Author) commented Feb 25, 2021

I did some tests on how long the code spends in the render function when I open 6 plots at 640Hz:

@Override
public List<DataSet> render(final GraphicsContext gc, final Chart chart, final int dataSetOffset,
                            final ObservableList<DataSet> datasets) {
    long t0 = System.nanoTime();
    try {
        return super.render(gc, chart, dataSetOffset, datasets);
    } finally {
        long us = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - t0);
        recorder.recordValue(Math.max(1, us));
    }
}
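For reference, a minimal stand-in for the `recorder` used above could look as follows (the real code presumably uses a histogram library such as HdrHistogram; this sketch only shows what recordValue and the percentile read-out do):

```java
import java.util.Arrays;

/**
 * Minimal stand-in for the 'recorder' used in the snippet above (the real
 * code presumably uses a histogram library such as HdrHistogram). It
 * collects microsecond samples and reports the value below which a given
 * percentage of samples fall -- the quantity plotted in a percentile graph.
 */
class SimpleRecorder {
    private long[] values = new long[1024];
    private int count;

    void recordValue(long micros) {
        if (count == values.length) {
            values = Arrays.copyOf(values, count * 2); // grow on demand
        }
        values[count++] = micros;
    }

    /** Value below which 'percentile' percent of the samples fall, e.g. 99.9. */
    long valueAtPercentile(double percentile) {
        long[] sorted = Arrays.copyOf(values, count);
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(percentile / 100.0 * count) - 1;
        return sorted[Math.max(0, Math.min(idx, count - 1))];
    }
}
```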

There are 3 cases:

  • normal: stock behavior of 11.5.2
  • noop-lock: dataset::getLock returns a custom lock that removes any synchronization (all operations are done on the JavaFX thread, so this is all pure overhead)
  • noalloc: a custom array allocation scheme

The top plot shows the maximum render time for each second. The bottom plot is a percentile graph, e.g., 99.9% of measurements are below X.

[renderperf: maximum render time per second (top) and percentile graph (bottom)]

@ennerf (Collaborator, Author) commented Feb 25, 2021

Here is another one running on JDK 15 with ZGC.

  • dark blue: synchronization taken out
  • green: custom array pool, i.e., this PR
  • teal: completely custom renderer that eliminates everything that isn't required for the particular plot (e.g. no error bars, etc.)

So in the vast majority of our cases, changing the array cache to a customizable interface provides pretty much the same speedup as writing an optimized custom renderer.

[renderperf-zgc: render-time comparison on JDK 15 with ZGC]

@RalphSteinhagen (Member):
N.B. (for others reading this): the purpose of the [Double...]ArrayCache is to minimise the frequent reallocation of transient arrays that are needed for screen-coordinate transforms, data reduction, and other algorithms. These arrays are usually only needed temporarily (e.g., when in-place math operations are not possible or not desired). The reason it's global is similar to thread or worker pools: it allows sharing resources across different parts of an application (i.e., data reception, processing, display, etc.) and keeps the overall resource footprint small(er). This is already a great improvement over the stock JVM behaviour, which requires allocation-deallocation gymnastics by the GC, and it is generic, thread-safe, and safe to use for a regular developer/user.

Having said that -- I like your idea of allowing users to customise either the existing parameters (e.g., allocating strictly or not) or to inject another specific allocator/buffer for a given renderer or scope of the lib.
This is a common feature in other high-performance languages (e.g., C++'s polymorphic memory resources, PMR).

While this appears to be a significant improvement, it's important to point out that this custom implementation has some caveats and design assumptions that may not be safe for a non-initiated developer who hasn't bought into the disruptor paradigm of thinking:

  • it bypasses general thread-safety and assumes a very specific single-threaded use case
  • it assumes a relatively small upper bound on the number of samples per DataSet or array-buffer length -- i.e., a few tens of thousands of points, compared to the several million data points the library has to support, and
  • it becomes the apparent bottleneck primarily because your use case updates the chart at 640 Hz (N.B. most applications would rate-limit JavaFX UI updates to less than the screen refresh rate).

To put in my two-penny worth:

  • I think it's an interesting idea/option -- notably beyond the usage in Chart-fx itself.
  • In order to keep the lib maintainable and general enough, it's important that this fits all generic array types, is unit-tested, keeps the default case safe for general usage, and that the new functionality is well documented (including the caveats).
  • I'd suggest first starting with the simple boolean property that selects whether arrays are allocated to the 'exact' or 'at least' size boundary condition. I guess this flag could be kept in the superordinate abstract class most Renderers derive from.
  • As a second step (or, if you have time, in the same PR) one could generalise and add an interface that replaces the default array cache with a custom implementation. Since this behaviour is pretty domain-specific, I think it should be a property that, for example, contains the callback function/interface for the allocate/release calls. Since the array cache is needed at least for doubles, floats, bytes, and ints, either a generic interface definition or the code-generation framework would be advisable.
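As a rough illustration of such a generic interface (a sketch only -- the class name is hypothetical, and java.lang.reflect.Array.getLength is used to avoid one implementation per primitive type):

```java
import java.lang.reflect.Array;
import java.util.ArrayDeque;
import java.util.function.IntFunction;

/**
 * Hypothetical sketch of a single generic pool covering double[], float[],
 * int[], and byte[]. Primitive arrays cannot share a type parameter, so the
 * array type itself is T and java.lang.reflect.Array.getLength is used to
 * read its length; a real implementation might prefer generated per-type
 * pools to avoid reflection. Not thread-safe; 'at least'-size semantics.
 */
class GenericArrayPool<T> {
    private final ArrayDeque<T> buffers = new ArrayDeque<>();
    private final IntFunction<T> factory; // e.g. double[]::new or int[]::new

    GenericArrayPool(IntFunction<T> factory) {
        this.factory = factory;
    }

    T allocate(int requiredSize) {
        for (T buffer : buffers) {
            if (Array.getLength(buffer) >= requiredSize) {
                buffers.remove(buffer); // reuse an existing, large-enough buffer
                return buffer;
            }
        }
        return factory.apply(requiredSize); // nothing suitable pooled yet
    }

    void release(T buffer) {
        buffers.push(buffer);
    }
}
```

A double pool would then be `new GenericArrayPool<>(double[]::new)` and an int pool `new GenericArrayPool<>(int[]::new)`.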

@ennerf (Collaborator, Author) commented Feb 26, 2021

  • it bypasses general thread-safety and assumes a very specific single-threaded use case
  • it assumes a relatively small upper bound on the number of samples per DataSet or array-buffer length -- i.e., a few tens of thousands of points, compared to the several million data points the library has to support, and

This is of course correct. I mainly wanted to show a use case that is very application dependent and couldn't be easily implemented with a true/false flag. The thread-safety overhead is pretty significant if it's not needed.

  • it becomes the apparent bottleneck primarily because your use case updates the chart at 640 Hz (N.B. most applications would rate-limit JavaFX UI updates to less than the screen refresh rate).

To clarify, the measurements capture how long the JavaFX thread spends in the Renderer::render method, so they are independent of the actual update rate.

The data was updated at the 60 Hz frame rate (AnimationTimer), so all reads and writes happen only on the JavaFX thread. This is actually a best-case scenario for synchronization, as the lock is only ever taken by a single thread without contention.

@ennerf (Collaborator, Author) commented Feb 28, 2021

I tried to reproduce it on the RollingBufferSample, but the difference was negligible. This problem seems to be tied to our very specific use case, so there are probably very few other users who will ever run into this.

Given that there are already other extension points we can use to work around this, I'll go ahead and close the PR.

@ennerf ennerf closed this Feb 28, 2021
ennerf added a commit that referenced this pull request Aug 15, 2023
ennerf added a commit that referenced this pull request Aug 16, 2023
ennerf added a commit that referenced this pull request Aug 17, 2023
ennerf added a commit that referenced this pull request Aug 18, 2023
ennerf added a commit that referenced this pull request Aug 24, 2023
wirew0rm pushed a commit that referenced this pull request Sep 18, 2023