Way to disable numerical parsing of s-expressions? #23
Comments
The code which parses numbers from atoms is necessary, so it can't easily be eliminated. However, the case of However, this may not be what your specific flavor of S-Expression requires if You may need to add a reader subclass (similar to the Scheme and SPARQL readers) to deal with any specific variations you need. I'll fix the problem properly handling
Try the version from the develop branch, and if you're satisfied, I'll release a new version containing this fix.
@gkellogg Oh yes this is an improvement! Feel free to release. :)
Your release fixed the problem, thank you @gkellogg!
@gkellogg In case you're interested, I should clarify that from the original s-expression, which I accidentally copied in the first example (with

S_EXPRESSION_REPLACEMENTS = {
  '(``' => '(INITIAL_QUOTE',
  "(''" => '(QUOTE',
  '(.' => '(PUNCTUATION',
  '-LRB- (' => 'LRB -LRB-',
  '-RRB- )' => 'RRB -RRB-',
  '-LRB- {' => 'LRB -LCB-',
  '-RRB- }' => 'RRB -RCB-',
  '-LRB- [' => 'LRB -LSB-',
  '-RRB- ]' => 'RRB -RSB-',
  '(,' => '(NFP', # Non-final punctuation, also known as superfluous punctuation
  /\(-([[:alpha:]]{3})-/ => '(\1', # E.g., (-LRB- to (LRB
  /[^(\\]"/ => ' "\""', # Escape unescaped right-side double quotes, places escaped double quote in quotes.
  'WP$' => 'WP-S', # You can't have a dollar sign as a valid XML element name. WP-S is the alternative POS name.
  'PRP$' => 'PRP-S', # You can't have a dollar sign as a valid XML element name. PRP-S is the alternative POS name.
  '0' => 'ZERO' # An integer that begins as zero is not a valid Integer for SXP to parse. e.g. 038
}.freeze
# There are several left-side atoms from the constituency parser's s-expression that would make invalid DOM
# element names. This replaces those with names that work.
# @example replacement
# (. !) means punctuation with value "!"
# (. !) will be transformed to (PUNCTUATION !)
# @private
# @param s_expression [String]
# @return [String] a sanitized copy of the input; the argument itself is not modified
def sanitize(s_expression)
  str = s_expression.dup
  S_EXPRESSION_REPLACEMENTS.each { |k, v| str.gsub!(k, v) }
  str
end

Which we call with:

sxp = SXP.read(sanitize(s_expression))

There could be possible fixes to SXP based on our work here. For instance, an integer that begins with zero is not a valid Integer for SXP to parse, e.g.
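
To make the effect of the replacement table concrete, here is a minimal sketch that applies a small subset of it to a made-up parse string. The sample input and the names `SAMPLE_REPLACEMENTS` and `sanitize_sample` are illustrative assumptions, not taken from the thread. Only core Ruby is used, so it runs without the SXP gem.

```ruby
# A few entries copied from the replacement table above.
# The constant name is hypothetical, chosen for this example only.
SAMPLE_REPLACEMENTS = {
  '(.'   => '(PUNCTUATION',
  '(,'   => '(NFP',
  'PRP$' => 'PRP-S'
}.freeze

# Same approach as sanitize: copy the input, then apply each replacement
# in hash insertion order. String keys are matched literally by gsub!.
def sanitize_sample(s_expression)
  str = s_expression.dup
  SAMPLE_REPLACEMENTS.each { |pattern, replacement| str.gsub!(pattern, replacement) }
  str
end

# Hypothetical constituency-parser output, invented for this example.
input = '(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ barks)) (. !)))'
puts sanitize_sample(input)
```

Here `(. !)` should come out as `(PUNCTUATION !)` and `PRP$` as `PRP-S`, while the original string is left untouched. One thing to keep in mind with this style of table: a plain string key such as `'0'` replaces every occurrence of that character, so a token like `100` would be rewritten as well, not only integers with a leading zero.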
PRs welcome; you might consider a special reader. Potentially the numeric components of an atom could stay unparsed in the Basic implementation, with the parsing left to subclasses for Common, SPARQL, Scheme, and Common Lisp. That wouldn't be too DRY for the common use case, though. Certainly other over-eager atom parsing behind some of the workarounds you've noted could potentially be addressed in the basic reader.
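
The split described here, a base reader that leaves atoms unparsed and subclasses that opt in to numeric parsing, can be sketched with a toy reader. The classes `ToyBasicReader` and `ToyNumericReader` below are hypothetical and standalone. They do not use or mirror the SXP gem's actual reader classes or method names, and they only handle parentheses and bare atoms.

```ruby
# Toy reader, NOT the SXP gem's implementation: every atom becomes a symbol.
class ToyBasicReader
  def read(input)
    # Split into parentheses and runs of non-space, non-paren characters.
    tokens = input.scan(/[()]|[^\s()]+/)
    read_form(tokens)
  end

  private

  def read_form(tokens)
    token = tokens.shift
    raise ArgumentError, 'unexpected end of input' if token.nil?
    raise ArgumentError, 'unexpected )' if token == ')'
    return read_atom(token) unless token == '('

    list = []
    list << read_form(tokens) until tokens.first == ')' || tokens.empty?
    raise ArgumentError, 'missing )' if tokens.empty?
    tokens.shift # drop the closing paren
    list
  end

  # Base behavior: no numeric interpretation at all.
  def read_atom(token)
    token.to_sym
  end
end

# Hypothetical subclass that opts in to number parsing and otherwise
# falls back to the base behavior.
class ToyNumericReader < ToyBasicReader
  private

  def read_atom(token)
    case token
    when /\A[+-]?\d+\z/      then Integer(token, 10) # base 10, so "038" is accepted
    when /\A[+-]?\d+\.\d+\z/ then Float(token)
    else super
    end
  end
end

p ToyBasicReader.new.read('(CD 038)')
p ToyNumericReader.new.read('(S (CD 038) (. .))')
```

In this sketch the base reader keeps `038` as a symbol, while the subclass reads it as the integer 38 and still treats a lone `.` as a symbol. The cost is the one mentioned above: each numeric-aware subclass has to carry its own atom parsing.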
Describe the bug
I'm working with s-expressions from an NLP parser, and I'm having difficulty parsing items as a string into symbols without raising float errors.
For the following s-expression string:
The SXP parser errors with:
To reproduce
Steps to reproduce the behavior:
Expected behavior
SXP parses the nodes as strings rather than as numbers. I want characters like periods to be treated as strings.
Desktop (please complete the following information):
Additional context
It appears a possible fix for our problem is monkey patching the gem to remove numerical parsing altogether:
I would like your advice on whether that's correct, and your insight into whether there is a way to fix this within the gem, perhaps as a form of configuration.