-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: added parser_options for more control over XML parsing #68
Conversation
|
||
def xml_parse_options | ||
return Nokogiri::XML::ParseOptions::DEFAULT_XML unless @parse_options # (RECOVER | NONET) | ||
@xml_parse_options ||= begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a question @kares, today in the docs we say that supported options are:
- strict
- no_warning
- no_error
the ones provided by Nokojiri parser.
In the future if we update the parser version, and the parser changes/updates the list of flag without we synch the documentation, then we have a misalignment. Do you think is best to have strict check of the flags here instead of relay on the validation done by the parser library, also if it means repeating it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's the reasoning: initially I thought about only having strict => true
option.
but than the options allow finer control, for whatever scenarios we might run into ...
(e.g. might wan to allow network access for the parser to download schemas etc.)
so there's more parser options we can set, I purposefully only documented these few.
maybe even no_error and no_warning might be a bit too much as what is a warning/error
depends on the parser (might not be clear).
being able to (unoficially) set all of the internal ones is the reason why I decided not to validate against a fixed set of options (that would be enumerated here).
going to remove no_error and no_warning from the docs and only document strict for now.
and checking that the official parser_options
work ('strict'
) will be covered by specs.
does that sound okay or do you have concerns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok, go with the strict
as parser_options
. My thoughts was based on the assumption that everything we provide should be documented, if we let the door open so that someone can tinkering with undocumented behaviours, then in case we change library we potentially break more pipeline configurations than expected.
docs/index.asciidoc
Outdated
By default the parser is not strict and thus accepts some invalid content. | ||
Currently supported options are: | ||
|
||
- _strict_ - forces the parser to fail early when content is not valid xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karenzone could you please check these for me.
we might add more to the list later (unofficially other options work) but for now only strict is to be documented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of suggestions for your consideration. Otherwise, LGTM
===== `parser_options` | ||
|
||
* Value type is <<string,string>> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding default value here as a bullet, since it sounds like strict
is not enabled by default. I'm not sure how to notate that. Would something like * Default value is []
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, did the same as with rest ... to mention there's no default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We documented parser_options
, but the actual parameter is parse_options
in the code. I can submit a PR to fix the doc but is it ok to keep parse_options
?
docs/index.asciidoc
Outdated
By default the parser is not strict and thus accepts some invalid content. | ||
Currently supported options are: | ||
|
||
- _strict_ - forces the parser to fail early when content is not valid xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- _strict_ - forces the parser to fail early when content is not valid xml | |
- `strict` - forces the parser to fail early when content is not valid xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"forces the parser to fail early when content is not valid xml"
Maybe include the alternative. Something like "forces the parser to fail early instead of accumulating errors when content is not valid xml."
it 'does fail parsing' do | ||
subject.filter(event) | ||
expect( event.get("tags") ).to_not be nil | ||
expect( event.get("tags") ).to include '_xmlparsefailure' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we document this tag
? In other plugins such as grok
, it can be changed/customized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to the docs the tag isn't documented.
we usually try to keep PRs such as this one minimalistic, so I think you should open an issue as a feature request to be able to configure the failure tag, on top of that (as the feature is delivered) documentation will get improved ...
Hi, can we get the documentation on the website updated, it seems the setting has to be mentioned like below: filter But the documenation on the website states that the setting to be: parser_options (snapshot attached below) |
Hi @uzairarif. Thanks for this comment. I've opened #75 to get this change vetted and merged. If you see other inconsistencies such as this, please open a github issue in the appropriate plugin repo. We rarely interact with PRs after they have been merged, and comments in merged PRs are likely to get lost. It would be a shame for valuable comments such as this to be lost. Thanks for helping us make our plugins and documentation better. |
users sometimes throw non valid XML data at this plugin, which is 💩 but still
in certain cases (e.g. when the document is huge) this could lead to leaks as the parser by default is very forgiving and does not fail on errors but instead tries to build smt out of the provided content really hard.
in those cases internal errors (such as
SAXParseException
) are stored in a list at the native end.(to be mapped and retrievable as
doc.errors
when parsing completes)having more control of the parsing process would allow us to fail early instead of piling up errors.
the proposal supports
xml { parse_options => 'strict' }
resolves #66