Add a more efficient PathExtractor Implementation (#28)

The existing implementation loops through every potential search path at each step of matching. It therefore exhibits exponential time complexity as the number of extracted fields grows. Some of the Amazon usage involves extractions of thousands of attributes and users needed to craft a workaround for the performance hit.

This change adds a more efficient PathExtractor implementation that supports a narrower set of SearchPaths and combinations thereof, but avoids the exponential time cost. This is reflected in the benchmark results and testing on my own toy dataset. The results show that the relative performance of the new implementation improves as the number of extracted attributes increases.

It is built with a key assumption that does not hold for the existing implementation: that users may not register paths that would result in a single element matching more than one terminal matcher.

I searched for Amazon internal usages and believe that the supported usage patterns cover any and all usage I could find. I found no usage of Annotations not on the Top-Level-Values themselves.

Note: I changed the "dom" benchmarks to iterate over Top-Level-Values. This makes them faster and a fairer comparison target.

```
Control (Iterate over IonValues)
PathExtractorBenchmark.domBinary           thrpt    5   5.060 ±  0.075  ops/s
PathExtractorBenchmark.domText             thrpt    5   1.172 ±  0.040  ops/s

Benchmark (legacy)                          Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   3.846 ±  0.033  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.094 ±  0.008  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  43.020 ±  2.681  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  42.628 ±  2.549  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.461 ±  0.121  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.368 ±  0.121  ops/s

Benchmark (strict)                          Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   6.011 ±  0.107  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.214 ±  0.025  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  57.329 ± 13.585  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  56.598 ±  2.424  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.430 ±  0.073  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.416 ±  0.175  ops/s
```

Co-authored-by: Tyler Gregg <[email protected]>
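
As a rough illustration of the per-step cost described in the commit message, the toy model below contrasts scanning every registered path for each field with a single keyed lookup per field. It is a sketch under stated assumptions, not the library's code: `MatchCostSketch`, `scanAll`, and `lookup` are hypothetical names, paths are reduced to single field names, and this flat, single-level model shows only the per-step scan cost, not the overall growth the message reports.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: counts work per field, does not use the real PathExtractor classes. */
final class MatchCostSketch {

    /**
     * Scan-style step (assumed model of the legacy behavior): every field is
     * tested against every registered path. Returns {comparisons, matches}.
     */
    static long[] scanAll(List<String> registered, List<String> fields) {
        long comparisons = 0;
        long matches = 0;
        for (String field : fields) {
            for (String path : registered) {
                comparisons++;
                if (path.equals(field)) {
                    matches++; // no early exit: several paths may match one element
                }
            }
        }
        return new long[] {comparisons, matches};
    }

    /**
     * Keyed step (assumed model of the strict behavior): one map lookup per
     * field, relying on at most one match per element. Returns {lookups, matches}.
     */
    static long[] lookup(Map<String, Integer> transitions, List<String> fields) {
        long lookups = 0;
        long matches = 0;
        for (String field : fields) {
            lookups++;
            if (transitions.containsKey(field)) {
                matches++;
            }
        }
        return new long[] {lookups, matches};
    }

    public static void main(String[] args) {
        int n = 1000; // number of registered single-field paths (illustrative)
        List<String> names = new ArrayList<>();
        Map<String, Integer> transitions = new HashMap<>();
        for (int i = 0; i < n; i++) {
            names.add("field" + i);
            transitions.put("field" + i, i);
        }
        System.out.println("scan comparisons: " + scanAll(names, names)[0]);
        System.out.println("keyed lookups:    " + lookup(transitions, names)[0]);
    }
}
```

With `n` registered paths and `n` fields, the scan performs `n * n` comparisons while the keyed variant performs `n` lookups; both find the same matches only because no field matches more than one path, which is the assumption the strict implementation relies on.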
amazon-ion · Sep 18, 2024 · 84ef782 · 84ef782
1 parent fd88da9
commit 84ef782
Showing 21 changed files with 852 additions and 69 deletions.
diff --git a/README.md b/README.md
@@ -35,7 +35,8 @@ data on reader: {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedVa
 (*)           - matches ["foo1", "foo2"], "myBarValue" and A::"annotatedValue"
 ()            - matches {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue"}
 (bar)         - matches "myBarValue" and A::"annotatedValue"
-(A::bar)      - matches A::"annotatedValue"
+(A::*)        - matches A::"annotatedValue"
+(A::bar)      - matches A::"annotatedValue" (is not supported in "strict" mode, see #Optimization below)
 ```
 
 The `()` matcher matches all values in the stream but you can also use annotations with it, example:
@@ -63,6 +64,17 @@ PathExtractorBuilder.standard()
 
 see PathExtractorBuilder [javadoc](https://static.javadoc.io/com.amazon.ion/ion-java-path-extraction/1.0.1/com/amazon/ionpathextraction/PathExtractorBuilder.html) for more information on configuration options and search path registration.
 
+### Optimization
+
+There are two implementations: "strict" and "legacy". The strict implementation is more performant, particularly as the
+number of fields extracted grows. By default `PathExtractorBuilder.build()` will try to build you a strict extractor and
+will fall back to the legacy extractor. You may be explicit that you want a specific implementation by calling
+`PathExtractorBuilder.buildStrict()` or `PathExtractorBuilder.buildLegacy()`.
+
+The strict implementation supports basic paths, with field names, index ordinals, and annotations on top-level-values or
+wildcards. It does not support mixing field names and index ordinals, multiple callbacks on the same path or annotations
+on non-wildcard values. Case-insensitive annotations matching is not supported.
+
 ### Notification
 Each time the `PathExtractor` encounters a value that matches a registered search path it will invoke the respective
 callback passing the reader positioned at the current value. See `PathExtractorBuilder#withSearchPath` methods for more
@@ -165,6 +177,7 @@ binary file is ~81M and the text file ~95M. There are four benchmarks types:
 1. `partial`: materializes a single struct fields as `IonValue` using a path extractor.a
 1. `partialNoDom`: access the java representation directly of a single struct field without materializing an `IonValue`.
 
+All the path extractor benchmarks are run in "strict" mode.
 There is a binary and a text version for all four benchmark types. See the [PathExtractorBenchmark](https://github.com/amzn/ion-java-path-extraction/blob/master/src/jmh/java/com/amazon/ionpathextraction/benchmarks/PathExtractorBenchmark.java) class for
 more details.
 
@@ -173,14 +186,14 @@ Results below, higher is better.
 
 ```
 Benchmark                                   Mode  Cnt   Score   Error  Units
-PathExtractorBenchmark.domBinary           thrpt   10   1.128 ± 0.050  ops/s
-PathExtractorBenchmark.domText             thrpt   10   0.601 ± 0.019  ops/s
-PathExtractorBenchmark.fullBinary          thrpt   10   1.227 ± 0.014  ops/s
-PathExtractorBenchmark.fullText            thrpt   10   0.665 ± 0.010  ops/s
-PathExtractorBenchmark.partialBinary       thrpt   10  14.912 ± 0.271  ops/s
-PathExtractorBenchmark.partialBinaryNoDom  thrpt   10  15.650 ± 0.297  ops/s
-PathExtractorBenchmark.partialText         thrpt   10   1.343 ± 0.029  ops/s
-PathExtractorBenchmark.partialTextNoDom    thrpt   10   1.307 ± 0.015  ops/s
+PathExtractorBenchmark.domBinary           thrpt    5   5.060 ±  0.075  ops/s
+PathExtractorBenchmark.domText             thrpt    5   1.172 ±  0.040  ops/s
+PathExtractorBenchmark.fullBinary          thrpt    5   6.011 ±  0.107  ops/s
+PathExtractorBenchmark.fullText            thrpt    5   1.214 ±  0.025  ops/s
+PathExtractorBenchmark.partialBinary       thrpt    5  57.329 ± 13.585  ops/s
+PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  56.598 ±  2.424  ops/s
+PathExtractorBenchmark.partialText         thrpt    5   2.430 ±  0.073  ops/s
+PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.416 ±  0.175  ops/s
 ```
 
 Using the path extractor has equivalent performance for both text and binary when fully materializing the document and

diff --git a/build.gradle b/build.gradle
@@ -83,10 +83,10 @@ jmh {
     failOnError = true
 
     // warmup
-    warmupIterations = 5
+    warmupIterations = 2
 
     // iterations
-    iterations = 10
+    iterations = 5
 }
 
 checkstyle {

diff --git a/src/jmh/java/com/amazon/ionpathextraction/benchmarks/PathExtractorBenchmark.java b/src/jmh/java/com/amazon/ionpathextraction/benchmarks/PathExtractorBenchmark.java
@@ -15,6 +15,7 @@
 
 import com.amazon.ion.IonReader;
 import com.amazon.ion.IonSystem;
+import com.amazon.ion.IonValue;
 import com.amazon.ion.IonWriter;
 import com.amazon.ion.system.IonBinaryWriterBuilder;
 import com.amazon.ion.system.IonReaderBuilder;
@@ -28,6 +29,7 @@
 import java.io.InputStream;
 import java.io.OutputStream;
 import java.net.URL;
+import java.util.Iterator;
 import java.util.function.Function;
 import java.util.stream.Stream;
 import org.openjdk.jmh.annotations.Benchmark;
@@ -171,15 +173,29 @@ public Object partialTextNoDom(final ThreadState threadState) {
      */
     @Benchmark
     public Object domBinary() {
-        return DOM_FACTORY.getLoader().load(bytesBinary);
+        IonReader reader = newReader(new ByteArrayInputStream(bytesBinary));
+        // iterating over Top-Level-Values is more apples:apples to path extractor
+        // vs loading all as a datagram
+        Iterator<IonValue> iter = DOM_FACTORY.iterate(reader);
+        while (iter.hasNext()) {
+            iter.next();
+        }
+        return reader;
     }
 
     /**
      * Text version of {@link #domBinary()}.
      */
     @Benchmark
     public Object domText() {
-        return DOM_FACTORY.getLoader().load(bytesText);
+        IonReader reader = newReader(new ByteArrayInputStream(bytesText));
+        // iterating over Top-Level-Values is more apples:apples to path extractor
+        // vs loading all as a datagram
+        Iterator<IonValue> iter = DOM_FACTORY.iterate(reader);
+        while (iter.hasNext()) {
+            iter.next();
+        }
+        return reader;
     }
 
     /**

diff --git a/src/main/java/com/amazon/ionpathextraction/FsmMatcher.java b/src/main/java/com/amazon/ionpathextraction/FsmMatcher.java
@@ -0,0 +1,43 @@
+/*
+ * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ * Licensed under the Apache License, Version 2.0 (the "License").
+ * You may not use this file except in compliance with the License.
+ * A copy of the License is located at:
+ *
+ *     http://aws.amazon.com/apache2.0/
+ *
+ * or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+
+package com.amazon.ionpathextraction;
+
+import com.amazon.ion.IonReader;
+import java.util.function.BiFunction;
+import java.util.function.Supplier;
+
+/**
+ * Base class for match states in the Finite State Machine matching implementation.
+ */
+abstract class FsmMatcher<T> {
+    /**
+     * Callback for match state. May be null.
+     */
+    BiFunction<IonReader, T, Integer> callback;
+
+    /**
+     * Indicates there are no possible child transitions.
+     */
+    boolean terminal = false;
+
+    /**
+     * Return the child matcher for the given reader context.
+     * Return null if there is no match.
+     * <br>
+     * @param position will be -1 for top-level-values, otherwise will be the position ordinal
+     *         of the value in the container, both for sequences and structs.
+     * @param fieldName will be non-null only for struct values.
+     */
+    abstract FsmMatcher<T> transition(String fieldName, int position, Supplier<String[]> annotations);
+}
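
The `transition` contract above can be pictured with a standalone sketch. Everything here is an assumption made for illustration: `FsmSketch`, `Matcher`, `FieldMatcher`, `register`, and `match` are hypothetical names, the callback is reduced to a string label so the example does not depend on ion-java, and only field-name transitions are modeled. It is not the implementation added by this commit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Standalone sketch that mirrors the shape of FsmMatcher without depending on ion-java. */
final class FsmSketch {

    /** Simplified stand-in for FsmMatcher: the callback is reduced to a label. */
    abstract static class Matcher {
        String label;            // stand-in for the callback; may be null
        boolean terminal = true; // true while the state has no child transitions

        abstract Matcher transition(String fieldName, int position, Supplier<String[]> annotations);
    }

    /** Hypothetical state that transitions on struct field names only. */
    static final class FieldMatcher extends Matcher {
        final Map<String, FieldMatcher> children = new HashMap<>();

        @Override
        Matcher transition(String fieldName, int position, Supplier<String[]> annotations) {
            // Annotations are never requested here, so the supplier is not invoked.
            return fieldName == null ? null : children.get(fieldName);
        }
    }

    /** Registers a path of field names; rejects a second label on the same path. */
    static void register(FieldMatcher root, String label, String... path) {
        FieldMatcher current = root;
        for (String name : path) {
            current.terminal = false;
            current = current.children.computeIfAbsent(name, k -> new FieldMatcher());
        }
        if (current.label != null) {
            throw new IllegalArgumentException("multiple callbacks on the same path");
        }
        current.label = label;
    }

    /** Follows one transition per field name; returns the matched label or null. */
    static String match(Matcher root, String... fields) {
        Matcher state = root;
        for (String field : fields) {
            if (state.terminal) {
                return null; // no child transitions are possible
            }
            // Position 0 is a placeholder: this sketch ignores ordinals.
            state = state.transition(field, 0, () -> new String[0]);
            if (state == null) {
                return null;
            }
        }
        return state.label;
    }

    public static void main(String[] args) {
        FieldMatcher root = new FieldMatcher();
        register(root, "callbackA", "foo", "bar");
        register(root, "callbackB", "baz");
        System.out.println(match(root, "foo", "bar")); // callbackA
        System.out.println(match(root, "baz"));        // callbackB
        System.out.println(match(root, "foo"));        // null
    }
}
```

Because each state resolves its child with a single map lookup, the work per value does not depend on how many paths are registered. Passing annotations as a `Supplier` lets states that do not care about annotations avoid reading them at all.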