-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend allowed characters for grok field names (#21745) #31653
Conversation
Pinging @elastic/es-core-infra |
@@ -439,4 +439,13 @@ public void testExponentialExpressions() { | |||
run.set(false); | |||
assertThat(e.getMessage(), equalTo("grok pattern matching was interrupted after [200] ms")); | |||
} | |||
|
|||
public void testUnicodeFieldnames() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add more tests here for when things are expected to fail and/or succeed.
maybe pick random unicode codepoints between a certain range that is known to be :alphanum:
and
have explicit checks for random ones that are out of the alphanum range. I do not know too much about these ranges, so I understand if that is not practical.
at least we can add tests for fieldnames with randomAlphas & randomInts to ensure that the existing A-z0-9 will continue to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@talevy as far as I know the [:alphanum:]
range isn't contiguous (it def. isn't at the boundary between letters and numbers I think).
So:
add tests for fieldnames with randomAlphas & randomInts to ensure that the existing A-z0-9 will continue to work
seems the way to go imo (obviously could choose a different range as well but I don't think it would be logically different from just testing the old range + 2 concrete examples of things that broke before).
Are you sure we want to test the negative case here too? So basically just use some char that we def don't support in the filename like say (field
and match for null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just as a sanity check that declares we do not support arbitrary unicode. I don't think we have that around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@talevy sure, will add that (though come to think of it, the behaviour there kind of sucks in that we're just quietly returning null
for fieldnames that can't even work to begin with).
I wonder if we should maybe add some validation for the grokPattern
to the Grok
constructor (not necessarily in this PR ... more thinking in general)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the behaviour there kind of sucks in that we're just quietly returning null for fieldnames that can't even work to begin
This definitely sounds like something that should be fixed in a followup.
@rjernst Pinging for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, please don't title commits as "be more lenient". Leniency has a negative connotation among the team. Instead, something like "Extend allowed characters for grok field names" is more apt.
|
||
public void testUnicodeFieldnames() { | ||
for(String fieldName : Arrays.asList("@metadata", "@metädata", "@metädat[a]")) { | ||
String line = "foo"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this put into a local var? It's only used once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -439,4 +439,13 @@ public void testExponentialExpressions() { | |||
run.set(false); | |||
assertThat(e.getMessage(), equalTo("grok pattern matching was interrupted after [200] ms")); | |||
} | |||
|
|||
public void testUnicodeFieldnames() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that more tests would be good. A common pattern we have in tests is to add a special assert method for something like this, then call that method from different tests cases. I would have tests for random alphanumeric, special characters (including each that are defined in the regex), and then something outside of the characters defined in the regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjernst done I think :)
Note:
- The
:
is in the regex but not a supported part of a field name because of the possibility of type definition in the Grok section making that impossible. - Added a simple check for the unsupported
()
as an example of a char falling outside of the allowed range.
8c903ef
to
c8397fb
Compare
@rjernst all points addressed I think/hope :) |
test failure looks like #30869 |
Jenkins test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with one caveat: why is colon in the regex if colon is not actually supported?
I think the test failures you have will be fixed by merging in latest master (those tests have been @AwaitsFix)
c8397fb
to
0ef98d6
Compare
thanks, will squash+merge when green then :)
We need to match on the colon there to match two colon cases (with type) like
rebased now, let's see :) |
In the future, please do not use rebases. Merging allows reviewers to more easily incrementally look at changes, and the elasticsearch project is configured to only do squash merges when pushing the merge button, so there is no need to squash on your own. |
@rjernst backport this to |
* master: Do not check for object existence when deleting repository index files (#31680) Remove extra check for object existence in repository-gcs read object (#31661) Support multiple system store types (#31650) [Test] Clean up some repository-s3 tests (#31601) [Docs] Use capital letters in section headings (#31678) [DOCS] Add PQL language Plugin (#31237) Merge AzureStorageService and AzureStorageServiceImpl and clean up tests (#31607) TEST: Fix test task invocation (#31657) Revert "[TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg" Fix RealmInteg test failures Extend allowed characters for grok field names (#21745) (#31653) [DOCS] Fix licensing API details (#31667) [TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg Fix CreateSnapshotRequestTests Failure (#31630) Configurable password hashing algorithm/cost (#31234) [TEST] Mute failing NamingConventionsTaskIT tests [DOCS] Replace CONFIG_DIR with ES_PATH_CONF (#31635) Core: Require all actions have a Task (#31627)
Yes 6.x should be fine. Please also put version labels for each branch this is checked into. For example, this PR would have |
* 6.x: Fix not waiting for Netty ThreadDeathWatcher in IT (#31758) (#31789) [Docs] Correct default window_size (#31582) S3 fixture should report 404 on unknown bucket (#31782) [ML] Limit ML filter items to 10K (#31731) Fixture for Minio testing (#31688) [ML] Return statistics about forecasts as part of the jobsstats and usage API (#31647) [DOCS] Add missing get mappings docs to HLRC (#31765) [DOCS] Starting Elasticsearch (#31701) Fix coerce validation_method in GeoBoundingBoxQueryBuilder (#31747) Painless: Complete Removal of Painless Type (#31699) Consolidate watcher setting update registration (#31762) [DOCS] Adds empty 6.3.1 release notes page ingest: Introduction of a bytes processor (#31733) [test] don't run bats tests for suse boxes (#31749) Add analyze API to high-level rest client (#31577) Implemented XContent serialisation for GetIndexResponse (#31675) [DOCS] Typos DOC: Add examples to the SQL docs (#31633) Add support for AWS session tokens (#30414) Watcher: Reenable start/stop yaml tests (#31754) JDBC: Fix stackoverflow on getObject and timestamp conversion (#31735) Support multiple system store types (#31650) Add write*Blob option to replace existing blob (#31729) Split CircuitBreaker-related tests (#31659) Painless: Add Context Docs (#31190) Docs: Remove missing reference Migrate scripted metric aggregation scripts to ScriptContext design (#30111) Watcher: Fix chain input toXcontent serialization (#31721) Remove _all example (#31711) rest-high-level: added get cluster settings (#31706) Docs: Match the examples in the description (#31710) [Docs] Correct typos (#31720) Extend allowed characters for grok field names (#21745) (#31653) (#31722) [DOCS] Check for Windows and *nix file paths (#31648) [ML] Validate ML filter_id (#31535) Fix gradle4.8 deprecation warnings (#31654) Update numbers to reflect 4-byte UTF-8-encoded characters (#27083)
Fixes #21745
I couldn't make it as lenient as matching every visible char because whatever is matched by the
subname
group is itself used as the name of a matcher group (so things like\
won't work without additional escaping).So I left level of leniency out for now (I can look into making that work as well, but it's going to be a trickier change potentially) and used @clintongormley 's suggestion here (#21745 (comment)) of simply using unicode properties (
:alnum:
matches all letters and numbers here) + allowed punctuation (I defined allowed as exactly what LS grok allows so only[\[\]@]
was added) for now.let me know what you think :)