Add a more efficient PathExtractor Implementation (#28)
The existing implementation loops through every potential search
path at each step of matching. It therefore exhibits exponential time
complexity as the number of extracted fields grows. Some Amazon
usage extracts thousands of attributes, and users had to craft
workarounds for the performance hit.

This change adds a more efficient PathExtractor implementation that
supports a narrower set of SearchPaths and combinations thereof, but
avoids the exponential time cost. This is reflected in the benchmark
results and testing on my own toy dataset. The results show that
the relative performance of the new implementation improves as the
number of extracted attributes increases.

It is built on a key assumption that the existing implementation
does not make: that users will not register paths that would result
in a single element matching more than one terminal matcher.
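
To illustrate the assumption, here is a minimal sketch (not the code in this commit) of a path trie keyed by field name that refuses a second callback on the same path. `StrictPathTrie`, `register`, and `match` are hypothetical names, and the sketch ignores wildcards, ordinals, and annotations. Matching costs one map lookup per path step, regardless of how many paths are registered.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a trie of field names where each node holds at most one callback name. */
final class StrictPathTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String callbackName; // null when no registered path terminates here
    }

    private final Node root = new Node();

    /** Registers a path; throws if another path already terminates at the same node. */
    void register(List<String> fieldNames, String callbackName) {
        Node node = root;
        for (String field : fieldNames) {
            node = node.children.computeIfAbsent(field, k -> new Node());
        }
        if (node.callbackName != null) {
            throw new IllegalArgumentException("two callbacks registered for path " + fieldNames);
        }
        node.callbackName = callbackName;
    }

    /** Follows one map lookup per path step; returns the callback name, or null if nothing matches. */
    String match(List<String> fieldNames) {
        Node node = root;
        for (String field : fieldNames) {
            node = node.children.get(field);
            if (node == null) {
                return null;
            }
        }
        return node.callbackName;
    }
}
```

Because each node holds at most one callback, a value can reach at most one terminal state in this sketch, which is the property the assumption asks for.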

I searched for Amazon internal usages and believe that the supported
usage patterns cover all the usage I could find. I found no usage of
annotations anywhere other than on the top-level values themselves.

Note: I changed the "dom" benchmarks to iterate over Top-Level-Values.
This makes them faster and a fairer comparison target.
```
Control (Iterate over IonValues)
PathExtractorBenchmark.domBinary           thrpt    5   5.060 ±  0.075  ops/s
PathExtractorBenchmark.domText             thrpt    5   1.172 ±  0.040  ops/s

Benchmark (legacy)                          Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   3.846 ±  0.033  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.094 ±  0.008  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  43.020 ±  2.681  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  42.628 ±  2.549  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.461 ±  0.121  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.368 ±  0.121  ops/s

Benchmark (strict)                          Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   6.011 ±  0.107  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.214 ±  0.025  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  57.329 ± 13.585  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  56.598 ±  2.424  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.430 ±  0.073  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.416 ±  0.175  ops/s
```
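
For a quick read of the tables above, this small sketch divides each strict score by its legacy counterpart. `SpeedupRatios` is a hypothetical helper, the numbers are copied from the tables, and the ratios use point estimates only, ignoring the error margins.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: compares the strict and legacy throughput scores listed above. */
final class SpeedupRatios {
    /** Strict throughput divided by legacy throughput; values above 1.0 mean strict is faster. */
    static double ratio(double strictOpsPerSec, double legacyOpsPerSec) {
        return strictOpsPerSec / legacyOpsPerSec;
    }

    /** Scores copied from the two benchmark tables, stored as {legacy, strict}. */
    static Map<String, double[]> scores() {
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("fullBinary", new double[] {3.846, 6.011});
        scores.put("fullText", new double[] {1.094, 1.214});
        scores.put("partialBinary", new double[] {43.020, 57.329});
        scores.put("partialBinaryNoDom", new double[] {42.628, 56.598});
        scores.put("partialText", new double[] {2.461, 2.430});
        scores.put("partialTextNoDom", new double[] {2.368, 2.416});
        return scores;
    }

    public static void main(String[] args) {
        // Print one ratio per benchmark, in table order.
        for (Map.Entry<String, double[]> entry : scores().entrySet()) {
            double[] s = entry.getValue();
            System.out.printf("%-20s %.2fx%n", entry.getKey(), ratio(s[1], s[0]));
        }
    }
}
```

By these point estimates `fullBinary` comes out at roughly 1.56x. `partialText` lands slightly below 1.0x, but the two scores are within each other's error margins.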

Co-authored-by: Tyler Gregg <[email protected]>
rmarrowstone and tgregg authored Sep 18, 2024
1 parent fd88da9 commit 84ef782
Showing 21 changed files with 852 additions and 69 deletions.
31 changes: 22 additions & 9 deletions README.md
@@ -35,7 +35,8 @@ data on reader: {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedVa
(*) - matches ["foo1", "foo2"], "myBarValue" and A::"annotatedValue"
() - matches {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue"}
(bar) - matches "myBarValue" and A::"annotatedValue"
-(A::bar) - matches A::"annotatedValue"
+(A::*) - matches A::"annotatedValue"
+(A::bar) - matches A::"annotatedValue" (is not supported in "strict" mode, see #Optimization below)
```

The `()` matcher matches all values in the stream but you can also use annotations with it, example:
@@ -63,6 +64,17 @@ PathExtractorBuilder.standard()

see PathExtractorBuilder [javadoc](https://static.javadoc.io/com.amazon.ion/ion-java-path-extraction/1.0.1/com/amazon/ionpathextraction/PathExtractorBuilder.html) for more information on configuration options and search path registration.

+### Optimization
+
+There are two implementations: "strict" and "legacy". The strict implementation is more performant, particularly as the
+number of fields extracted grows. By default `PathExtractorBuilder.build()` will try to build you a strict extractor and
+will fall back to the legacy extractor. You may be explicit that you want a specific implementation by calling
+`PathExtractorBuilder.buildStrict()` or `PathExtractorBuilder.buildLegacy()`.
+
+The strict implementation supports basic paths, with field names, index ordinals, and annotations on top-level-values or
+wildcards. It does not support mixing field names and index ordinals, multiple callbacks on the same path or annotations
+on non-wildcard values. Case-insensitive annotations matching is not supported.
+
### Notification
Each time the `PathExtractor` encounters a value that matches a registered search path it will invoke the respective
callback passing the reader positioned at the current value. See `PathExtractorBuilder#withSearchPath` methods for more
@@ -165,6 +177,7 @@ binary file is ~81M and the text file ~95M. There are four benchmarks types:
1. `partial`: materializes a single struct field as `IonValue` using a path extractor.
1. `partialNoDom`: access the java representation directly of a single struct field without materializing an `IonValue`.

+All the path extractor benchmarks are run in "strict" mode.
There is a binary and a text version for all four benchmark types. See the [PathExtractorBenchmark](https://github.com/amzn/ion-java-path-extraction/blob/master/src/jmh/java/com/amazon/ionpathextraction/benchmarks/PathExtractorBenchmark.java) class for
more details.

@@ -173,14 +186,14 @@ Results below, higher is better.

```
 Benchmark                                   Mode  Cnt   Score    Error  Units
-PathExtractorBenchmark.domBinary           thrpt   10   1.128 ±  0.050  ops/s
-PathExtractorBenchmark.domText             thrpt   10   0.601 ±  0.019  ops/s
-PathExtractorBenchmark.fullBinary          thrpt   10   1.227 ±  0.014  ops/s
-PathExtractorBenchmark.fullText            thrpt   10   0.665 ±  0.010  ops/s
-PathExtractorBenchmark.partialBinary       thrpt   10  14.912 ±  0.271  ops/s
-PathExtractorBenchmark.partialBinaryNoDom  thrpt   10  15.650 ±  0.297  ops/s
-PathExtractorBenchmark.partialText         thrpt   10   1.343 ±  0.029  ops/s
-PathExtractorBenchmark.partialTextNoDom    thrpt   10   1.307 ±  0.015  ops/s
+PathExtractorBenchmark.domBinary           thrpt    5   5.060 ±  0.075  ops/s
+PathExtractorBenchmark.domText             thrpt    5   1.172 ±  0.040  ops/s
+PathExtractorBenchmark.fullBinary          thrpt    5   6.011 ±  0.107  ops/s
+PathExtractorBenchmark.fullText            thrpt    5   1.214 ±  0.025  ops/s
+PathExtractorBenchmark.partialBinary       thrpt    5  57.329 ± 13.585  ops/s
+PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  56.598 ±  2.424  ops/s
+PathExtractorBenchmark.partialText         thrpt    5   2.430 ±  0.073  ops/s
+PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.416 ±  0.175  ops/s
```

Using the path extractor has equivalent performance for both text and binary when fully materializing the document and
4 changes: 2 additions & 2 deletions build.gradle
@@ -83,10 +83,10 @@ jmh {
     failOnError = true

     // warmup
-    warmupIterations = 5
+    warmupIterations = 2

     // iterations
-    iterations = 10
+    iterations = 5
 }

checkstyle {
@@ -15,6 +15,7 @@

 import com.amazon.ion.IonReader;
 import com.amazon.ion.IonSystem;
+import com.amazon.ion.IonValue;
 import com.amazon.ion.IonWriter;
 import com.amazon.ion.system.IonBinaryWriterBuilder;
 import com.amazon.ion.system.IonReaderBuilder;
@@ -28,6 +29,7 @@
 import java.io.InputStream;
 import java.io.OutputStream;
 import java.net.URL;
+import java.util.Iterator;
 import java.util.function.Function;
 import java.util.stream.Stream;
 import org.openjdk.jmh.annotations.Benchmark;
@@ -171,15 +173,29 @@ public Object partialTextNoDom(final ThreadState threadState) {
      */
     @Benchmark
     public Object domBinary() {
-        return DOM_FACTORY.getLoader().load(bytesBinary);
+        IonReader reader = newReader(new ByteArrayInputStream(bytesBinary));
+        // iterating over Top-Level-Values is more apples:apples to path extractor
+        // vs loading all as a datagram
+        Iterator<IonValue> iter = DOM_FACTORY.iterate(reader);
+        while (iter.hasNext()) {
+            iter.next();
+        }
+        return reader;
     }

     /**
      * Text version of {@link #domBinary()}.
      */
     @Benchmark
     public Object domText() {
-        return DOM_FACTORY.getLoader().load(bytesText);
+        IonReader reader = newReader(new ByteArrayInputStream(bytesText));
+        // iterating over Top-Level-Values is more apples:apples to path extractor
+        // vs loading all as a datagram
+        Iterator<IonValue> iter = DOM_FACTORY.iterate(reader);
+        while (iter.hasNext()) {
+            iter.next();
+        }
+        return reader;
     }

     /**
43 changes: 43 additions & 0 deletions src/main/java/com/amazon/ionpathextraction/FsmMatcher.java
@@ -0,0 +1,43 @@
+/*
+ * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ * Licensed under the Apache License, Version 2.0 (the "License").
+ * You may not use this file except in compliance with the License.
+ * A copy of the License is located at:
+ *
+ *     http://aws.amazon.com/apache2.0/
+ *
+ * or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+
+package com.amazon.ionpathextraction;
+
+import com.amazon.ion.IonReader;
+import java.util.function.BiFunction;
+import java.util.function.Supplier;
+
+/**
+ * Base class for match states in the Finite State Machine matching implementation.
+ */
+abstract class FsmMatcher<T> {
+    /**
+     * Callback for match state. May be null.
+     */
+    BiFunction<IonReader, T, Integer> callback;
+
+    /**
+     * Indicates there are no possible child transitions.
+     */
+    boolean terminal = false;
+
+    /**
+     * Return the child matcher for the given reader context.
+     * Return null if there is no match.
+     * <br>
+     * @param position will be -1 for top-level-values, otherwise will be the position ordinal
+     *                 of the value in the container, both for sequences and structs.
+     * @param fieldName will be non-null only for struct values.
+     */
+    abstract FsmMatcher<T> transition(String fieldName, int position, Supplier<String[]> annotations);
+}
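
As a rough illustration of how concrete states might extend this base class, the sketch below uses a simplified stand-in without `IonReader` or callbacks. `SketchMatcher`, `SketchTerminal`, and `SketchFieldMatcher` are hypothetical names and are not claimed to match the other classes added in this commit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Simplified stand-in for FsmMatcher: transitions only, no IonReader and no callback. */
abstract class SketchMatcher {
    boolean terminal = false;

    abstract SketchMatcher transition(String fieldName, int position, Supplier<String[]> annotations);
}

/** Hypothetical terminal state: a registered path ends here, so there are no children. */
final class SketchTerminal extends SketchMatcher {
    SketchTerminal() {
        terminal = true;
    }

    @Override
    SketchMatcher transition(String fieldName, int position, Supplier<String[]> annotations) {
        return null;
    }
}

/** Hypothetical state that picks its child by struct field name with a single map lookup. */
final class SketchFieldMatcher extends SketchMatcher {
    private final Map<String, SketchMatcher> children = new HashMap<>();

    /** Adds a child state for the given field name and returns this state for chaining. */
    SketchFieldMatcher child(String fieldName, SketchMatcher next) {
        children.put(fieldName, next);
        return this;
    }

    @Override
    SketchMatcher transition(String fieldName, int position, Supplier<String[]> annotations) {
        // The annotations supplier is never invoked here, so no annotation array is built.
        return fieldName == null ? null : children.get(fieldName);
    }
}
```

A caller would start at a root state and call `transition` once per value it steps into, stopping as soon as it gets `null` back or reaches a state whose `terminal` flag is set.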