Add a more efficient PathExtractor Implementation (#28)
The existing implementation loops through every potential search path at each step of matching. It therefore exhibits exponential time complexity as the number of extracted fields grows. Some of the Amazon usage involves extractions of thousands of attributes and users needed to craft a workaround for the performance hit.

This change adds a more efficient PathExtractor implementation that supports a narrower set of SearchPaths and combinations thereof, but avoids the exponential time cost. This is reflected in the benchmark results and testing on my own toy dataset. The results show that the relative performance of the new implementation improves as the number of extracted attributes increases.

It is built with a key assumption that does not hold for the existing implementation: that users may not register paths that would result in a single element matching more than one terminal matcher.

I searched for Amazon internal usages and believe that the supported usage patterns cover any and all usage I could find. I found no usage of Annotations not on the Top-Level-Values themselves.

Note: I changed the "dom" benchmarks to iterate over Top-Level-Values. This makes them faster and a fairer comparison target.

```
Control (Iterate over IonValues)
PathExtractorBenchmark.domBinary            thrpt    5   5.060 ±  0.075  ops/s
PathExtractorBenchmark.domText              thrpt    5   1.172 ±  0.040  ops/s

Benchmark (legacy)                           Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary           thrpt    5   3.846 ±  0.033  ops/s
PathExtractorBenchmark.fullText             thrpt    5   1.094 ±  0.008  ops/s
PathExtractorBenchmark.partialBinary        thrpt    5  43.020 ±  2.681  ops/s
PathExtractorBenchmark.partialBinaryNoDom   thrpt    5  42.628 ±  2.549  ops/s
PathExtractorBenchmark.partialText          thrpt    5   2.461 ±  0.121  ops/s
PathExtractorBenchmark.partialTextNoDom     thrpt    5   2.368 ±  0.121  ops/s

Benchmark (strict)                           Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary           thrpt    5   6.011 ±  0.107  ops/s
PathExtractorBenchmark.fullText             thrpt    5   1.214 ±  0.025  ops/s
PathExtractorBenchmark.partialBinary        thrpt    5  57.329 ± 13.585  ops/s
PathExtractorBenchmark.partialBinaryNoDom   thrpt    5  56.598 ±  2.424  ops/s
PathExtractorBenchmark.partialText          thrpt    5   2.430 ±  0.073  ops/s
PathExtractorBenchmark.partialTextNoDom     thrpt    5   2.416 ±  0.175  ops/s
```

Co-authored-by: Tyler Gregg <[email protected]>
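To illustrate the idea in the message above, here is a minimal sketch of matching with one map lookup per step instead of a scan over every registered path. It is not the code from this commit: `PathTrieSketch`, `Node`, `register`, `match`, and the string `callbackId` are hypothetical names, and paths are assumed to consist only of field names. The duplicate check in `register` is one possible reading of the "no element matches more than one terminal matcher" assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of the library in this commit. */
final class PathTrieSketch {
    /** One state per path prefix; children are keyed by field name. */
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String callbackId; // non-null only where a registered path ends
    }

    private final Node root = new Node();

    /** Registers a path of field names; rejects a second callback on the same path. */
    void register(List<String> path, String callbackId) {
        Node node = root;
        for (String field : path) {
            node = node.children.computeIfAbsent(field, k -> new Node());
        }
        if (node.callbackId != null) {
            // Strictness assumption: at most one terminal matcher per element.
            throw new IllegalArgumentException("path already registered: " + path);
        }
        node.callbackId = callbackId;
    }

    /** Performs one map lookup per step; returns the callback id, or null if nothing matches. */
    String match(List<String> fields) {
        Node node = root;
        for (String field : fields) {
            node = node.children.get(field);
            if (node == null) {
                return null;
            }
        }
        return node.callbackId;
    }
}
```

With this shape, the work per step depends on the depth of the value rather than on the number of registered paths, which is the property the message attributes to the new implementation.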
1 parent fd88da9, commit 84ef782
Showing 21 changed files with 852 additions and 69 deletions.
43 changes: 43 additions & 0 deletions
src/main/java/com/amazon/ionpathextraction/FsmMatcher.java
@@ -0,0 +1,43 @@
/*
 * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 * Licensed under the Apache License, Version 2.0 (the "License").
 * You may not use this file except in compliance with the License.
 * A copy of the License is located at:
 *
 *     http://aws.amazon.com/apache2.0/
 *
 * or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific
 * language governing permissions and limitations under the License.
 */

package com.amazon.ionpathextraction;

import com.amazon.ion.IonReader;
import java.util.function.BiFunction;
import java.util.function.Supplier;

/**
 * Base class for match states in the Finite State Machine matching implementation.
 */
abstract class FsmMatcher<T> {
    /**
     * Callback for match state. May be null.
     */
    BiFunction<IonReader, T, Integer> callback;

    /**
     * Indicates there are no possible child transitions.
     */
    boolean terminal = false;

    /**
     * Return the child matcher for the given reader context.
     * Return null if there is no match.
     * <br>
     * @param position will be -1 for top-level-values, otherwise will be the position ordinal
     *                 of the value in the container, both for sequences and structs.
     * @param fieldName will be non-null only for struct values.
     */
    abstract FsmMatcher<T> transition(String fieldName, int position, Supplier<String[]> annotations);
}
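As an illustration of how the `transition` contract above might be implemented, here is a minimal sketch with two concrete states. It is an assumption-laden stand-in, not the subclasses added by this commit: `FsmSketch`, `State`, `Terminal`, and `Branch` are hypothetical names, the `callback` field is omitted so the sketch compiles without ion-java, and trying the field name before the position is an arbitrary choice made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch; names and dispatch order are assumptions, not the commit's code. */
final class FsmSketch {
    /** Simplified stand-in for FsmMatcher: same transition shape, no callback. */
    abstract static class State {
        boolean terminal = false;

        abstract State transition(String fieldName, int position, Supplier<String[]> annotations);
    }

    /** A state with no possible child transitions. */
    static final class Terminal extends State {
        Terminal() {
            terminal = true;
        }

        @Override
        State transition(String fieldName, int position, Supplier<String[]> annotations) {
            return null; // no children, so nothing can match
        }
    }

    /** Dispatches on field name for struct values, otherwise on position. */
    static final class Branch extends State {
        final Map<String, State> byField = new HashMap<>();
        final Map<Integer, State> byPosition = new HashMap<>();

        @Override
        State transition(String fieldName, int position, Supplier<String[]> annotations) {
            // The annotations supplier is never invoked in this state,
            // so the caller does not pay for materializing annotations.
            if (fieldName != null) {
                State child = byField.get(fieldName);
                if (child != null) {
                    return child;
                }
            }
            return byPosition.get(position);
        }
    }
}
```

Passing annotations as a `Supplier` lets a state that does not inspect annotations skip reading them entirely, which is presumably why the signature takes a supplier rather than an array.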